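Is there a more efficient way to get the data from this XML file into a pandas DataFrame?

Time: 2016-05-18 15:19:01

Tags: python xml pandas

We have some data stored in a number of XML files. The start of one file looks like this: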

<?xml version="1.0" encoding="utf-8"?>
<PTW xmlns:ptw="http://www.ptw.de/QUICKCHECK">
  <Version>1.1.0.0</Version>
  <LastModified>2009-10-20 13:45:50</LastModified>
  <Content>
    <TrendData id="730329885" date="2009-10-19 06:35:21" device="00001">
      <Worklist id="472561931">
        <Name>new Worklist</Name>
        <AdminData id="1295858653">
          <AdminValues>
            <TreatmentUnit>Linac1</TreatmentUnit>
            <Protocol Flat="1" Sym="1">Userdefined</Protocol>
            <Modality>Photons</Modality>
            <Energy>12</Energy>
            <Fieldsize>200x200</Fieldsize>
            <SDD>1000</SDD>
            <Gantry>0</Gantry>
            <Wedge>0</Wedge>
            <MU>0</MU>
            <My>1</My>
            <Info />
            <Comment />
          </AdminValues>
          <AnalyzeParams>
            <CAX>
              <Min>9.7000E+01</Min>
              <Max>1.0300E+02</Max>
              <Target>1.0000E+02</Target>
              <Norm>1.0081E+02</Norm>
            </CAX>........ etc etc

At the moment I have a function that converts some of the content into a pandas DataFrame. It starts like this:

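# Assumption: the original imports are not shown. lxml.etree offers the
# same parse/findall/findtext calls used here.
import xml.etree.ElementTree as etree
import dateutil.parser
import pandas as pd


def qcw_to_df(filename):
    """Read data from a qcw file and return a pandas DataFrame.

    Use as dataset = qcw_to_df(filename)"""

    # open the data file
    xmlData = etree.parse(filename)

    # one TrendData element per measurement, anywhere in the tree
    trendData = xmlData.findall(".//TrendData")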

I then create an empty list for each parameter I need from every measurement:

    date = []
    time = []
    linac = []
    modality = []
    energy = []
    fieldsize = []
    CAX = []
    flatness = []
    symmetryGT = []
    symmetryAB = []
    bqf = []
    measCAX = []
    G10 = []
    L10 = []
    T10 = []
    R10 = []
    G20 = []
    L20 = []
    T20 = []
    R20 = []
    E1 = []
    E2 = []
    E3 = []
    E4 = []
    Temp = []
    Pressure = []

...and read through the data file, appending each measurement to the appropriate list:

    for meas in trendData:

        linac_id = meas.findtext("Worklist/AdminData/AdminValues/TreatmentUnit")
        modality_id = meas.findtext("Worklist/AdminData/AdminValues/Modality")
        energy_id = meas.findtext("Worklist/AdminData/AdminValues/Energy")
        fieldsize_id = meas.findtext("Worklist/AdminData/AdminValues/Fieldsize")
        # first, get the data in the "tags" of the record
        # need to split datetime into separate date and time fields
        # names starting read_ are the raw data from the qcw file.
        read_measDate = meas.attrib['date']
        read_date = dateutil.parser.parse(read_measDate)
        date.append(read_date)

        read_TreatmentUnit = meas.findtext("Worklist/AdminData/AdminValues/TreatmentUnit")
        linac.append(read_TreatmentUnit)

        read_Modality = meas.findtext("Worklist/AdminData/AdminValues/Modality")
        modality.append(read_Modality)

        read_Energy = meas.findtext("Worklist/AdminData/AdminValues/Energy")
        energy.append(read_Energy)

        read_Fieldsize = meas.findtext("Worklist/AdminData/AdminValues/Fieldsize")
        fieldsize.append(read_Fieldsize)

        # measured values
        read_CAX = meas.findtext("MeasData/AnalyzeValues/CAX/Value")
        if read_CAX == "0.0000E+00":
            read_CAX = ""
        else:
            read_CAX = float(read_CAX)
        CAX.append(read_CAX)

        read_Flatness = meas.findtext("MeasData/AnalyzeValues/Flatness/Value")
        if read_Flatness == "0.0000E+00":
            read_Flatness = ""
        else:
            read_Flatness = float(read_Flatness)
        flatness.append(read_Flatness)
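
The check-for-zero-then-convert step above is repeated for every measured value, so it could live in one small helper. The following is only a sketch under assumptions: `read_value` is a hypothetical name, the sample XML values are made up for illustration, and treating a missing element as an empty string is an assumption (the code above would raise an error on a missing element instead).

```python
import xml.etree.ElementTree as ET


def read_value(meas, path, missing=""):
    """Return the float stored at `path` below `meas`.

    Returns `missing` when the element is absent (an assumption, see above)
    or when it holds the zero placeholder "0.0000E+00".
    """
    text = meas.findtext(path)
    if text is None or text.strip() == "0.0000E+00":
        return missing
    return float(text)


# Made-up sample record, shaped like the paths used in the loop above.
SAMPLE = """<TrendData date="2009-10-19 06:35:21">
  <MeasData>
    <AnalyzeValues>
      <CAX><Value>1.0081E+02</Value></CAX>
      <Flatness><Value>0.0000E+00</Value></Flatness>
    </AnalyzeValues>
  </MeasData>
</TrendData>"""

meas = ET.fromstring(SAMPLE)
cax = read_value(meas, "MeasData/AnalyzeValues/CAX/Value")
flat = read_value(meas, "MeasData/AnalyzeValues/Flatness/Value")
print(cax, repr(flat))
```

With such a helper, each measured value in the loop would reduce to a single line such as `CAX.append(read_value(meas, "MeasData/AnalyzeValues/CAX/Value"))`.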

The same pattern continues for every measured value we need to read out. Finally, once every list has been filled, a DataFrame is built:

    df = pd.DataFrame({'date': date, 'linac': linac, 'modality': modality, 'energy': energy,
                       'fieldsize': fieldsize, 'CAX': CAX, 'flatness': flatness,
                       'symmetryGT': symmetryGT, 'symmetryAB': symmetryAB, 'bqf': bqf,
                       'measCAX': measCAX, 'G10': G10, 'L10': L10, 'T10': T10, 'R10': R10,
                       'G20': G20, 'L20': L20, 'T20': T20, 'R20': R20, 'E1': E1,
                       'E2': E2, 'E3': E3, 'E4': E4, 'Temp': Temp, 'Pressure': Pressure})
    df = df.set_index('date')
    return df

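Although this works, it strikes me as very inefficient. It also means that if I want to extract further data from the file, I have to add another empty list, add another findtext line, and extend the line that creates the DataFrame.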
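
One way to avoid touching three places per new field might be to keep a single mapping from column name to element path and build one dict per measurement. This is a sketch, not a tested solution for real qcw files: `FIELDS` and `extract_rows` are hypothetical names, only the paths already quoted in the code above are used, and the `MeasData` part of the sample is made up.

```python
import xml.etree.ElementTree as ET

# Column name -> path relative to each TrendData element.
# Adding a column means adding one line here.
FIELDS = {
    "linac": "Worklist/AdminData/AdminValues/TreatmentUnit",
    "modality": "Worklist/AdminData/AdminValues/Modality",
    "energy": "Worklist/AdminData/AdminValues/Energy",
    "fieldsize": "Worklist/AdminData/AdminValues/Fieldsize",
    "CAX": "MeasData/AnalyzeValues/CAX/Value",
    "flatness": "MeasData/AnalyzeValues/Flatness/Value",
}


def extract_rows(root, fields=FIELDS):
    """Return one dict per TrendData element, keyed by column name."""
    rows = []
    for meas in root.iter("TrendData"):
        row = {"date": meas.get("date")}
        for column, path in fields.items():
            # findtext returns None when the element is missing
            row[column] = meas.findtext(path)
        rows.append(row)
    return rows


# Cut-down sample; the MeasData values are invented for illustration.
SAMPLE = """<PTW>
  <Content>
    <TrendData id="730329885" date="2009-10-19 06:35:21" device="00001">
      <Worklist id="472561931">
        <AdminData id="1295858653">
          <AdminValues>
            <TreatmentUnit>Linac1</TreatmentUnit>
            <Modality>Photons</Modality>
            <Energy>12</Energy>
            <Fieldsize>200x200</Fieldsize>
          </AdminValues>
        </AdminData>
      </Worklist>
      <MeasData>
        <AnalyzeValues><CAX><Value>1.0081E+02</Value></CAX></AnalyzeValues>
      </MeasData>
    </TrendData>
  </Content>
</PTW>"""

rows = extract_rows(ET.fromstring(SAMPLE))
print(rows[0])
```

The resulting list of dicts can be passed straight to `pd.DataFrame(rows)`, after which `set_index('date')` and any type conversion would work as before. Values are left as strings here, so numeric columns would still need converting.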

Is there a more efficient way to extract all the data from the file without having to name each item specifically?
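
A possible direction, offered only as a sketch: walk each TrendData element recursively and use the path of every leaf element as the column name, so that nothing has to be named in advance. `flatten` and `all_rows` are hypothetical names, and the sketch assumes that sibling elements under one parent have distinct tags (repeated tags would overwrite each other).

```python
import xml.etree.ElementTree as ET


def flatten(element, prefix=""):
    """Map the path of every leaf below `element` to its text.

    Attributes are stored under "path@name". Assumes sibling tags are
    unique; a repeated tag would overwrite the earlier value.
    """
    flat = {}
    for child in element:
        path = f"{prefix}/{child.tag}" if prefix else child.tag
        for name, value in child.attrib.items():
            flat[f"{path}@{name}"] = value
        if len(child):  # element has children: recurse
            flat.update(flatten(child, path))
        else:  # leaf: keep its text, empty string if none
            flat[path] = (child.text or "").strip()
    return flat


def all_rows(root):
    """Return one flat dict per TrendData element, attributes included."""
    rows = []
    for meas in root.iter("TrendData"):
        row = dict(meas.attrib)  # id, date, device
        row.update(flatten(meas))
        rows.append(row)
    return rows


# Cut-down version of the sample at the top of the question.
SAMPLE = """<PTW>
  <Content>
    <TrendData id="730329885" date="2009-10-19 06:35:21" device="00001">
      <Worklist id="472561931">
        <Name>new Worklist</Name>
        <AdminData id="1295858653">
          <AdminValues>
            <TreatmentUnit>Linac1</TreatmentUnit>
            <Protocol Flat="1" Sym="1">Userdefined</Protocol>
            <Energy>12</Energy>
            <Info />
          </AdminValues>
        </AdminData>
      </Worklist>
    </TrendData>
  </Content>
</PTW>"""

rows = all_rows(ET.fromstring(SAMPLE))
for key, value in rows[0].items():
    print(key, "=", repr(value))
```

Passing `rows` to `pd.DataFrame(rows)` would give one column per path. All values arrive as strings, so dates and numbers would still need converting, for example with `pd.to_datetime` and `pd.to_numeric`.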
