Question

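I have an XML page that I want to split into sections, extract the text from each section, and write each one to its own .txt file, with file names running from 001 to 099. For example, I want everything in section 1 in the file named 001, everything in section 2 in the file named 002, and so on. This is what I have so far: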

import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://www.govinfo.gov/bulkdata/CFR/2018/title-49/CFR-2018-title49-vol1.xml/').read()

soup = bs.BeautifulSoup(source,'lxml')

for paragraph in soup.find_all('section'):
    print(paragraph.string)
    print(str(paragraph.text))

I'd like to know how I can create incrementally numbered .txt output files and save each section in its own file.

Answer 1

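To group all the sections together, you can use Python's groupby() function. It takes a key function, here one that extracts the section number from a paragraph. groupby then collects all consecutive paragraphs with the same section number and returns them together: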

from itertools import groupby
import bs4 as bs
import urllib.request

def section(paragraph):
    # Key function for groupby: keep the number before the dot, e.g. '§ 1.1' -> '1'
    return paragraph.sectno.text.strip('§ ').split('.')[0]


source = urllib.request.urlopen('https://www.govinfo.gov/bulkdata/CFR/2018/title-49/CFR-2018-title49-vol1.xml/').read()
soup = bs.BeautifulSoup(source, 'lxml')

for section_number, paragraphs in groupby(soup.find_all('section'), section):
    filename = f'Section {int(section_number):02}.txt'

    with open(filename, 'w', encoding='utf-8') as f_output:
        section_text = '\n-------------\n'.join(p.text for p in paragraphs)
        f_output.write(section_text)

This produces files like the following:

Section 01.txt
Section 03.txt
Section 05.txt
Section 06.txt
Section 07.txt
Section 08.txt
...
Section 10.txt
Section 80.txt
Section 89.txt
Section 91.txt
Section 92.txt
Section 93.txt
Section 98.txt
Section 99.txt
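
The names above use a "Section " prefix and two digits, while the question asked for 001 through 099. A minimal sketch of that variant follows; section_filename is a hypothetical helper, not part of the answer's code, and it assumes the section number is a string of digits:

```python
def section_filename(section_number, width=3):
    """Return a zero-padded file name such as '001.txt' (hypothetical helper)."""
    # int() drops any existing leading zeros; the nested format spec pads to `width` digits.
    return f'{int(section_number):0{width}}.txt'


print(section_filename('1'))   # 001.txt
print(section_filename('99'))  # 099.txt
```

Swapping this into the loop in place of the f'Section {int(section_number):02}.txt' expression should be enough to get the requested names.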

Within each file, paragraphs are separated by a short line of dashes.
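
One thing worth keeping in mind: itertools.groupby only groups consecutive items with the same key. If a section number ever reappears later in the document, it starts a new group, and opening the same file name with 'w' a second time would overwrite the earlier contents. The sketch below illustrates this with made-up (section number, text) pairs standing in for the parsed tags; group_texts and the sample data are assumptions for illustration only:

```python
from itertools import groupby

# Hypothetical (section number, text) pairs standing in for parsed <section> tags.
paragraphs = [('1', 'a'), ('1', 'b'), ('3', 'c'), ('1', 'd')]
SEPARATOR = '\n-------------\n'


def group_texts(items):
    """Join the texts of each run of consecutive items that share a section number."""
    return [
        (number, SEPARATOR.join(text for _, text in run))
        for number, run in groupby(items, key=lambda item: item[0])
    ]


# Unsorted input: section '1' comes back as two separate runs.
print(group_texts(paragraphs))

# Sorting by the numeric key first merges them into a single group.
# sorted() is stable, so the texts keep their original relative order.
print(group_texts(sorted(paragraphs, key=lambda item: int(item[0]))))
```

If the sections in the source XML are already ordered by number, the sort is unnecessary; otherwise sorting first (or opening the files in append mode) avoids losing text.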
