I have a dataset of strings, and I need to extract the key parts of each string into an array.
data[0]
gives the output:
['<ellipse cx="32.0" cy="8.0" fill="silver" rx="16.0" ry="16.0" /',
'<ellipse cx="32.0" cy="56.0" fill="green" rx="32.0" ry="16.0" /',
'<ellipse cx="8.0" cy="8.0" fill="green" rx="16.0" ry="32.0" /']
What I need to create is the following array:
key_data[0] -> [['ellipse' , 32.0, 8.0, 'silver', 16.0, 16.0], [ 'ellipse', 32.0, 56.0, 'green', 32.0, 16.0], ['ellipse', 8.0, 8.0, 'green', 16.0, 32.0]]
Any suggestions would be greatly appreciated.
Answer 0 (score: 1)
import re
data = ['<ellipse cx="32.0" cy="8.0" fill="silver" rx="16.0" ry="16.0" /',
'<ellipse cx="32.0" cy="56.0" fill="green" rx="32.0" ry="16.0" /',
'<ellipse cx="8.0" cy="8.0" fill="green" rx="16.0" ry="32.0" /']
re_compile = re.compile(r'<(.*?) cx="(.*?)" cy="(.*?)" fill="(.*?)" rx="(.*?)" ry="(.*?)" /')
result = list(map(lambda x: re_compile.search(x).groups(), data))
print(result)
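Note that `groups()` returns tuples of strings, whereas the question asks for lists containing floats. A minimal sketch of that conversion, assuming every captured value that parses as a float should become one (the helper name `to_number` is hypothetical, not part of the original answer):

```python
import re

data = ['<ellipse cx="32.0" cy="8.0" fill="silver" rx="16.0" ry="16.0" /',
        '<ellipse cx="32.0" cy="56.0" fill="green" rx="32.0" ry="16.0" /',
        '<ellipse cx="8.0" cy="8.0" fill="green" rx="16.0" ry="32.0" /']

pattern = re.compile(r'<(.*?) cx="(.*?)" cy="(.*?)" fill="(.*?)" rx="(.*?)" ry="(.*?)" /')

def to_number(value):
    """Hypothetical helper: return value as a float if it parses, else unchanged."""
    try:
        return float(value)
    except ValueError:
        return value

# One inner list per input string, with numeric attributes converted to float
key_data = [[to_number(v) for v in pattern.search(s).groups()] for s in data]
print(key_data)
```

Like the pattern above, this sketch assumes the attributes always appear in the order cx, cy, fill, rx, ry.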
Answer 1 (score: 0)
Here is one approach using a regular expression with re.match.
For example:
import re
data = ['<ellipse cx="32.0" cy="8.0" fill="silver" rx="16.0" ry="16.0" /', '<ellipse cx="32.0" cy="56.0" fill="green" rx="32.0" ry="16.0" /', '<ellipse cx="8.0" cy="8.0" fill="green" rx="16.0" ry="32.0" /']
result = []
for i in data:
    # The quotes sit outside each group, so the final lazy group cannot match
    # an empty string and the last attribute (ry) is no longer dropped.
    m = re.match(r'<([a-z]+) \w+="(.*?)"\s\w+="(.*?)"\s\w+="(.*?)"\s\w+="(.*?)"\s\w+="(.*?)"', i)
    result.append(list(m.groups()))
print(result)
Output:
[['ellipse', '32.0', '8.0', 'silver', '16.0', '16.0'],
['ellipse', '32.0', '56.0', 'green', '32.0', '16.0'],
['ellipse', '8.0', '8.0', 'green', '16.0', '32.0']]
Or use re.match together with re.findall.
For example:
import re
data = ['<ellipse cx="32.0" cy="8.0" fill="silver" rx="16.0" ry="16.0" /', '<ellipse cx="32.0" cy="56.0" fill="green" rx="32.0" ry="16.0" /', '<ellipse cx="8.0" cy="8.0" fill="green" rx="16.0" ry="32.0" /']
result = []
for i in data:
    m = re.match(r"\<([a-z]+)", i)            # tag name, e.g. 'ellipse'
    attribs = re.findall(r'\w+="(.*?)"', i)   # every quoted attribute value, in order
    result.append([m.group(1)] + attribs)
print(result)
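The re.findall idea can also capture each attribute name alongside its value, so the result no longer depends on the order in which attributes appear in the string. A rough sketch under that assumption; the function name `parse` and its `order` parameter are hypothetical, and the values are left as strings:

```python
import re

data = ['<ellipse cx="32.0" cy="8.0" fill="silver" rx="16.0" ry="16.0" /',
        '<ellipse cx="32.0" cy="56.0" fill="green" rx="32.0" ry="16.0" /',
        '<ellipse cx="8.0" cy="8.0" fill="green" rx="16.0" ry="32.0" /']

def parse(s, order=('cx', 'cy', 'fill', 'rx', 'ry')):
    """Hypothetical helper: return [tag, values...] with values in a fixed order."""
    tag = re.match(r'<([a-z]+)', s).group(1)
    # With two groups, findall returns (name, value) tuples, which dict() accepts
    attrs = dict(re.findall(r'(\w+)="(.*?)"', s))
    return [tag] + [attrs[name] for name in order]

result_by_name = [parse(s) for s in data]
print(result_by_name)
```

A missing attribute raises KeyError here, which may or may not be the behaviour you want.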
Answer 2 (score: 0)
Regex is the better approach, but if you don't want to use it, you can do it as follows:
new_l = []
for i in data:
    sub_l = []
    el = i.split()
    sub_l.append(el[0][1:])  # tag name without the leading '<'
    for e in el[1:]:
        x = e.split('=')
        try:
            sub_l.append(x[1].strip().strip('"'))
        except IndexError:
            # tokens without '=' (the trailing '/') have no value
            continue
    new_l.append(sub_l)
Answer 3 (score: 0)
Here is a solution that doesn't use regular expressions.
a = ['<ellipse cx="32.0" cy="8.0" fill="silver" rx="16.0" ry="16.0" ',
'<ellipse cx="32.0" cy="56.0" fill="green" rx="32.0" ry="16.0" ',
'<ellipse cx="8.0" cy="8.0" fill="green" rx="16.0" ry="32.0" ']
d = [i.split() for i in a]
r = [[j.split('=')[-1] for j in i] for i in d]
s = [[i.strip('"').lstrip('<') for i in k] for k in r]
# convert to float where possible
for k in s:
    for i, j in enumerate(k):
        try:
            k[i] = float(j)
        except ValueError:
            pass
>>> print(s)
[['ellipse', 32.0, 8.0, 'silver', 16.0, 16.0],
['ellipse', 32.0, 56.0, 'green', 32.0, 16.0],
['ellipse', 8.0, 8.0, 'green', 16.0, 32.0]]
Answer 4 (score: 0)
If you're not comfortable with regular expressions, you can use the following code.
from ast import literal_eval

def return_list(ellipse_list):
    final_res = []
    for ellipse in ellipse_list:
        # Drop the leading '<' and trailing '/', split on spaces, and
        # unquote each attribute value with literal_eval
        res = [literal_eval(item.split("=")[1]) if "=" in item else item
               for item in ellipse[1:-1].split(" ")[:-1]]
        final_res.append(res)
    return final_res
Answer 5 (score: 0)
The contents of the list look like XML, and XML should be treated as XML. Only the closing '>' is missing, so we can use an XML parser to handle this.
from lxml import etree
data = ['<ellipse cx="32.0" cy="8.0" fill="silver" rx="16.0" ry="16.0" /',
'<ellipse cx="32.0" cy="56.0" fill="green" rx="32.0" ry="16.0" /',
'<ellipse cx="8.0" cy="8.0" fill="green" rx="16.0" ry="32.0" /']
def parse_element(element):
    # adjust faulty XML if needed
    if not element.rstrip().endswith('>'):
        element += '>'
    # create the XML structure
    doc = etree.XML(element)
    # gather all attribute values in a dictionary
    attributes = ['cx', 'cy', 'rx', 'ry', 'fill']
    values = {attribute_name: doc.get(attribute_name) for attribute_name in attributes}
    # construct target
    entry = [doc.tag,
             float(values['cx']),
             float(values['cy']),
             values['fill'],
             float(values['rx']),
             float(values['ry'])]
    return entry
result = [parse_element(element) for element in data]
print(result)
This will still work if the order of the attributes changes.
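If installing lxml is not an option, the same idea should carry over to the standard library's xml.etree.ElementTree. This is a sketch of a swapped-in parser, not the code from the answer above, and it assumes each string only lacks the final '>':

```python
import xml.etree.ElementTree as ET

data = ['<ellipse cx="32.0" cy="8.0" fill="silver" rx="16.0" ry="16.0" /',
        '<ellipse cx="32.0" cy="56.0" fill="green" rx="32.0" ry="16.0" /',
        '<ellipse cx="8.0" cy="8.0" fill="green" rx="16.0" ry="32.0" /']

def parse_element(element):
    # Assumption: the only defect is the missing closing '>'
    element = element.rstrip()
    if not element.endswith('>'):
        element += '>'
    node = ET.fromstring(element)
    # Attributes are looked up by name, so their order in the string is irrelevant
    return [node.tag,
            float(node.get('cx')),
            float(node.get('cy')),
            node.get('fill'),
            float(node.get('rx')),
            float(node.get('ry'))]

result = [parse_element(element) for element in data]
print(result)
```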