Question

Sample input:

<subj code1="textA" code2="textB" code3="textC">
    <txt count="1">
        <txt id="123">
            This is my text.
        </txt>
    </txt>
</subj>

I'm trying to use BeautifulSoup to extract information from XML into a CSV file. My desired output is:

code1,code2,code3,txt
textA,textB,textC,This is my text.

I've been experimenting with this sample code, which I found here. It works for extracting txt, but not for code1, code2, and code3 in the subj tag.

if __name__ == '__main__':
    with open('sample.csv', 'w') as fhandle:
        writer = csv.writer(fhandle)
        writer.writerow(('code1', 'code2', 'code3', 'text'))
        for subj in soup.find_all('subj'):
            for x in subj:
                writer.writerow((subj.code1.text,
                                subj.code2.text,
                                subj.code3.text,
                                subj.txt.txt))

However, I can't figure out how to extract the attributes of subj. Any suggestions?

Answer 1

code1, code2, and code3 are not text; they are attributes.

To access them, treat an element as a dictionary:

writer.writerow((subj['code1'], subj['code2'], subj['code3'], subj.get_text(strip=True)))

Demo:

In [1]: from bs4 import BeautifulSoup

In [2]: data = """
   ...: <subj code1="textA" code2="textB" code3="textC">
   ...:     <txt count="1">
   ...:         <txt id="123">
   ...:             This is my text.
   ...:         </txt>
   ...:     </txt>
   ...: </subj>
   ...: """

In [3]: soup = BeautifulSoup(data, "xml")
In [4]: for subj in soup('subj'):
    ...:     print([subj['code1'], subj['code2'], subj['code3'], subj.get_text(strip=True)])  
['textA', 'textB', 'textC', 'This is my text.']
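
Combining the dictionary-style attribute access with the CSV loop from the question, a minimal end-to-end sketch might look like the following. The helper names make_soup and subj_rows are hypothetical, and the in-memory buffer stands in for sample.csv so the result can be inspected. The fallback to "html.parser" is an assumption for environments without lxml, which the "xml" parser requires.

```python
import csv
import io

from bs4 import BeautifulSoup, FeatureNotFound

data = """
<subj code1="textA" code2="textB" code3="textC">
    <txt count="1">
        <txt id="123">
            This is my text.
        </txt>
    </txt>
</subj>
"""


def make_soup(markup):
    """Hypothetical helper: prefer the "xml" parser, fall back if lxml is absent."""
    try:
        return BeautifulSoup(markup, "xml")
    except FeatureNotFound:
        return BeautifulSoup(markup, "html.parser")


def subj_rows(markup):
    """Hypothetical helper: yield one (code1, code2, code3, text) tuple per <subj>."""
    for subj in make_soup(markup).find_all("subj"):
        # Attributes are read like dictionary keys; the text comes from get_text().
        yield (subj["code1"], subj["code2"], subj["code3"],
               subj.get_text(strip=True))


# To write a real file, replace the buffer with
# open("sample.csv", "w", newline="") as in the question.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(("code1", "code2", "code3", "txt"))
writer.writerows(subj_rows(data))

print(buffer.getvalue())
```

Note that there is no inner loop over the children of subj: each subj element produces exactly one CSV row.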

If an attribute might be missing, you can also use .get() to supply a default value:

subj.get('code1', 'Default value for code1')
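
As a rough illustration of the difference, the sketch below uses a shortened sample in which code3 is deliberately left out (an assumption made for the example, not part of the original input). Subscript access raises KeyError for the missing attribute, while .get() returns the default, or None when no default is given.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Shortened sample: code3 is intentionally missing.
data = '<subj code1="textA" code2="textB"><txt>This is my text.</txt></subj>'

try:
    soup = BeautifulSoup(data, "xml")  # requires lxml
except FeatureNotFound:
    soup = BeautifulSoup(data, "html.parser")  # stdlib fallback

subj = soup.find("subj")

present = subj.get("code1", "N/A")  # attribute exists, so its value is returned
missing = subj.get("code3", "N/A")  # attribute absent, so the default is returned
no_default = subj.get("code3")      # attribute absent and no default given

try:
    subj["code3"]                   # dictionary-style access on a missing attribute
except KeyError:
    raised = True
else:
    raised = False

print(present, missing, no_default, raised)
```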

Extracting attributes from XML tags with BeautifulSoup4
