We have some data stored in a large number of XML files. An example of how the data begins is given below:
<?xml version="1.0" encoding="utf-8"?>
<PTW xmlns:ptw="http://www.ptw.de/QUICKCHECK">
<Version>1.1.0.0</Version>
<LastModified>2009-10-20 13:45:50</LastModified>
<Content>
<TrendData id="730329885" date="2009-10-19 06:35:21" device="00001">
<Worklist id="472561931">
<Name>new Worklist</Name>
<AdminData id="1295858653">
<AdminValues>
<TreatmentUnit>Linac1</TreatmentUnit>
<Protocol Flat="1" Sym="1">Userdefined</Protocol>
<Modality>Photons</Modality>
<Energy>12</Energy>
<Fieldsize>200x200</Fieldsize>
<SDD>1000</SDD>
<Gantry>0</Gantry>
<Wedge>0</Wedge>
<MU>0</MU>
<My>1</My>
<Info />
<Comment />
</AdminValues>
<AnalyzeParams>
<CAX>
<Min>9.7000E+01</Min>
<Max>1.0300E+02</Max>
<Target>1.0000E+02</Target>
<Norm>1.0081E+02</Norm>
</CAX>........ etc etc
At the moment I have a function that converts some of the contents into a pandas DataFrame. It begins as follows:
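import dateutil.parser
import pandas as pd
import xml.etree.ElementTree as etree  # assumption: lxml.etree offers the same parse/findall calls

def qcw_to_df(filename):
    """Read data from a qcw file and return a pandas DataFrame.
    Use as dataset = qcw_to_df(file)"""
    # open the data file
    xmlData = etree.parse(filename)
    # ".//" matches TrendData elements at any depth below the root
    trendData = xmlData.findall(".//TrendData")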
Then I create an empty list for each parameter I need from each measurement:
    date = []
    time = []
    linac = []
    modality = []
    energy = []
    fieldsize = []
    CAX = []
    flatness = []
    symmetryGT = []
    symmetryAB = []
    bqf = []
    measCAX = []
    G10 = []
    L10 = []
    T10 = []
    R10 = []
    G20 = []
    L20 = []
    T20 = []
    R20 = []
    E1 = []
    E2 = []
    E3 = []
    E4 = []
    Temp = []
    Pressure = []
...and then read through the data file, appending each measurement to the appropriate list:
    for meas in trendData:
        linac_id = meas.findtext("Worklist/AdminData/AdminValues/TreatmentUnit")
        modality_id = meas.findtext("Worklist/AdminData/AdminValues/Modality")
        energy_id = meas.findtext("Worklist/AdminData/AdminValues/Energy")
        fieldsize_id = meas.findtext("Worklist/AdminData/AdminValues/Fieldsize")
        # first, get the data in the "tags" of the record
        # need to split datetime into separate date and time fields
        # names starting read_ are the raw data from the qcw file.
        read_measDate = meas.attrib['date']
        read_date = dateutil.parser.parse(read_measDate)
        date.append(read_date)
        read_TreatmentUnit = meas.findtext("Worklist/AdminData/AdminValues/TreatmentUnit")
        linac.append(read_TreatmentUnit)
        read_Modality = meas.findtext("Worklist/AdminData/AdminValues/Modality")
        modality.append(read_Modality)
        read_Energy = meas.findtext("Worklist/AdminData/AdminValues/Energy")
        energy.append(read_Energy)
        read_Fieldsize = meas.findtext("Worklist/AdminData/AdminValues/Fieldsize")
        fieldsize.append(read_Fieldsize)
        # measured values
        read_CAX = meas.findtext("MeasData/AnalyzeValues/CAX/Value")
        if read_CAX == "0.0000E+00":
            read_CAX = ""
        else:
            read_CAX = float(read_CAX)
        CAX.append(read_CAX)
        read_Flatness = meas.findtext("MeasData/AnalyzeValues/Flatness/Value")
        if read_Flatness == "0.0000E+00":
            read_Flatness = ""
        else:
            read_Flatness = float(read_Flatness)
        flatness.append(read_Flatness)
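The read-and-convert step repeated above for each measured value could be factored into one helper. The following is only a minimal sketch, assuming `xml.etree.ElementTree`; the helper name `read_value`, the constant `ZERO_SENTINEL`, and the inline sample record are hypothetical, and returning "" for a missing element is an added assumption (the code above would raise on `float(None)` instead).

```python
import xml.etree.ElementTree as ET

ZERO_SENTINEL = "0.0000E+00"  # the string the code above treats as "no reading"

def read_value(meas, path):
    """Return the text at `path` as a float, or "" for a zero or missing reading."""
    text = meas.findtext(path)
    # Assumption: a missing element is handled like a zero reading.
    if text is None or text == ZERO_SENTINEL:
        return ""
    return float(text)

# Made-up record shaped like the MeasData paths used above (illustration only)
sample = ET.fromstring(
    "<TrendData><MeasData><AnalyzeValues>"
    "<CAX><Value>1.0081E+02</Value></CAX>"
    "<Flatness><Value>0.0000E+00</Value></Flatness>"
    "</AnalyzeValues></MeasData></TrendData>"
)
cax_value = read_value(sample, "MeasData/AnalyzeValues/CAX/Value")
flatness_value = read_value(sample, "MeasData/AnalyzeValues/Flatness/Value")
```

With such a helper, each measured value would reduce to a single line of the form `CAX.append(read_value(meas, "MeasData/AnalyzeValues/CAX/Value"))`.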
This continues for every measured value we need to read out. Finally, once every list has been filled, a DataFrame is built:
    df = pd.DataFrame({'date':date, 'linac':linac, 'modality':modality, 'energy':energy, 'fieldsize':fieldsize, 'CAX':CAX,
                       'flatness':flatness, 'symmetryGT':symmetryGT, 'symmetryAB':symmetryAB, 'bqf':bqf, 'measCAX':measCAX,
                       'G10':G10, 'L10':L10, 'T10':T10, 'R10':R10, 'G20':G20, 'L20':L20, 'T20':T20, 'R20':R20, 'E1':E1,
                       'E2':E2, 'E3':E3, 'E4':E4, 'Temp':Temp, 'Pressure':Pressure})
    df = df.set_index('date')
    return df
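Although this works, it strikes me as very inefficient. It also means that if I want to extract other data from the file, I have to add another empty list, add another findtext line, and extend the line that creates the DataFrame.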
Is there a more efficient way to extract all the data from the file without having to name each item specifically?
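To make the goal concrete, here is a rough sketch of the kind of generic extraction being asked about: walk each TrendData element and collect every leaf value and attribute into one dict per measurement, keyed by its path. It assumes `xml.etree.ElementTree`; the function name `flatten`, the "@" marker for attributes, and the shortened sample document are hypothetical choices for illustration, and repeated sibling tags would overwrite one another in this version.

```python
import xml.etree.ElementTree as ET

def flatten(element, path=""):
    """Collect attributes and leaf text under `element` into a flat dict.

    Keys are slash-separated tag paths; attributes are stored as "path@name".
    Assumption: sibling tags are unique, so later duplicates overwrite earlier ones.
    """
    row = {}
    for name, value in element.attrib.items():
        row[f"{path}@{name}"] = value
    if len(element) == 0:
        # Leaf element: keep its text (None for empty tags such as <Info />)
        if path:
            row[path] = element.text
        return row
    for child in element:
        child_path = f"{path}/{child.tag}" if path else child.tag
        row.update(flatten(child, child_path))
    return row

# Shortened stand-in for the file shown above (illustration only)
sample = ET.fromstring(
    '<Content>'
    '<TrendData id="730329885" date="2009-10-19 06:35:21" device="00001">'
    '<Worklist id="472561931"><Name>new Worklist</Name>'
    '<AdminData id="1295858653"><AdminValues>'
    '<TreatmentUnit>Linac1</TreatmentUnit>'
    '<Protocol Flat="1" Sym="1">Userdefined</Protocol>'
    '<Energy>12</Energy><Info />'
    '</AdminValues></AdminData></Worklist>'
    '</TrendData>'
    '</Content>'
)
records = [flatten(meas) for meas in sample.findall(".//TrendData")]
```

A list of dicts like `records` can be passed straight to `pd.DataFrame(records)`, which creates one column per key, so a new tag in the file would appear as a new column without any code change. All values remain strings here, so numeric columns would still need converting afterwards, for example with `pd.to_numeric`.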