Remove child tags but keep their text in XML?

Asked: 2019-07-09 06:35:59

Tags: python regex xml

I have an XML document that looks like this:

<?xml version='1.0' encoding='utf8'?>
<all>
<articletitle>text1<x> </x></articletitle>
<affiliation><x> </x><label id="aff1">12</label><affnorg>College of Materials Science and Engineering</affnorg><x>, </x><affnorg>Guangdong Research Center for Interfacial Engineering of Functional Materials</affnorg><x>, </x><affnorg>Shenzhen University</affnorg><x>, </x><affnadd>3688 Nanhai Ave</affnadd><x>, </x><affncity>Shenzhen</affncity><x>, </x><affnpost>518060</affnpost><x>, </x><affncountry>PR China</affncountry><x>.</x></affiliation>
<affiliation><x> </x><label id="aff2">2</label><affnorg>Key Laboratory of Optoelectronic Devices and Systems of Ministry of Education and Guangdong Province</affnorg><x>, </x><affnorg>College of Optoelectronic Engineering</affnorg><x>, </x><affnorg>Shenzhen University</affnorg><x>, </x><affnadd>3688 Nanhai Ave</affnadd><x>, </x><affncity>Shenzhen</affncity><x>, </x><affnpost>518060</affnpost><x>, </x><affncountry>PR China</affncountry><x>.</x></affiliation>
</all>

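The task is to remove all <x> tags, but only inside the affiliation tags, while keeping their text. With ElementTree I can remove the tags, but that removes their text as well. I want the text to stay in the parent element, so the new XML should look like this: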

<?xml version='1.0' encoding='utf8'?>
<all>
<articletitle>text1<x> </x></articletitle>
<affiliation> <label id="aff1">12</label><affnorg>College of Materials Science and Engineering</affnorg>, <affnorg>Guangdong Research Center for Interfacial Engineering of Functional Materials</affnorg>, <affnorg>Shenzhen University</affnorg>, <affnadd>3688 Nanhai Ave</affnadd>, <affncity>Shenzhen</affncity>, <affnpost>518060</affnpost>, <affncountry>PR China</affncountry>.</affiliation>
<affiliation> <label id="aff2">2</label><affnorg>Key Laboratory of Optoelectronic Devices and Systems of Ministry of Education and Guangdong Province</affnorg>, <affnorg>College of Optoelectronic Engineering</affnorg>, <affnorg>Shenzhen University</affnorg>, <affnadd>3688 Nanhai Ave</affnadd>, <affncity>Shenzhen</affncity>, <affnpost>518060</affnpost>, <affncountry>PR China</affncountry>.</affiliation>
</all>

1 answer:

Answer 0 (score: 1)

With BeautifulSoup, you can use the unwrap() method, which removes a tag but leaves its contents in place:

data = '''<?xml version='1.0' encoding='utf8'?>
<all>
<articletitle>text1<x> </x></articletitle>
<affiliation><x> </x><label id="aff1">12</label><affnorg>College of Materials Science and Engineering</affnorg><x>, </x><affnorg>Guangdong Research Center for Interfacial Engineering of Functional Materials</affnorg><x>, </x><affnorg>Shenzhen University</affnorg><x>, </x><affnadd>3688 Nanhai Ave</affnadd><x>, </x><affncity>Shenzhen</affncity><x>, </x><affnpost>518060</affnpost><x>, </x><affncountry>PR China</affncountry><x>.</x></affiliation>
<affiliation><x> </x><label id="aff2">2</label><affnorg>Key Laboratory of Optoelectronic Devices and Systems of Ministry of Education and Guangdong Province</affnorg><x>, </x><affnorg>College of Optoelectronic Engineering</affnorg><x>, </x><affnorg>Shenzhen University</affnorg><x>, </x><affnadd>3688 Nanhai Ave</affnadd><x>, </x><affncity>Shenzhen</affncity><x>, </x><affnpost>518060</affnpost><x>, </x><affncountry>PR China</affncountry><x>.</x></affiliation>
</all>'''

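from bs4 import BeautifulSoup

# The 'xml' parser requires the lxml package to be installed.
soup = BeautifulSoup(data, 'xml')

# Select only the <x> elements nested inside <affiliation>;
# the <x> inside <articletitle> is left untouched.
for x in soup.select('affiliation x'):
    # unwrap() removes the tag itself but keeps its contents in place.
    x.unwrap()

print(soup)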

Prints:

<?xml version="1.0" encoding="utf-8"?>
<all>
<articletitle>text1<x> </x></articletitle>
<affiliation> <label id="aff1">12</label><affnorg>College of Materials Science and Engineering</affnorg>, <affnorg>Guangdong Research Center for Interfacial Engineering of Functional Materials</affnorg>, <affnorg>Shenzhen University</affnorg>, <affnadd>3688 Nanhai Ave</affnadd>, <affncity>Shenzhen</affncity>, <affnpost>518060</affnpost>, <affncountry>PR China</affncountry>.</affiliation>
<affiliation> <label id="aff2">2</label><affnorg>Key Laboratory of Optoelectronic Devices and Systems of Ministry of Education and Guangdong Province</affnorg>, <affnorg>College of Optoelectronic Engineering</affnorg>, <affnorg>Shenzhen University</affnorg>, <affnadd>3688 Nanhai Ave</affnadd>, <affncity>Shenzhen</affncity>, <affnpost>518060</affnpost>, <affncountry>PR China</affncountry>.</affiliation>
</all>
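
Since the question mentions ElementTree, here is a rough sketch of how the same unwrapping might be done with the standard library alone. ElementTree stores the text that follows an element in that element's `tail`, so removing an `<x>` means moving its `text` and `tail` onto the previous sibling's `tail` (or onto the parent's `text` if it is the first child). The helper name `unwrap_children` is made up for this example, the sample is a shortened version of the data above, and the code assumes each `<x>` is a direct child of `<affiliation>` and contains only text.

```python
import xml.etree.ElementTree as ET


def unwrap_children(parent, tag):
    """Remove direct children named `tag`, keeping their text and tail.

    Hypothetical helper; assumes the removed elements hold only text
    (no nested elements), as in the question's data.
    """
    previous = None  # last child that was kept
    for child in list(parent):  # iterate over a copy, since we remove items
        if child.tag != tag:
            previous = child
            continue
        # Text to preserve: the element's own text plus whatever followed it.
        kept = (child.text or "") + (child.tail or "")
        if previous is None:
            # No kept sibling before it: the text belongs to the parent.
            parent.text = (parent.text or "") + kept
        else:
            # Otherwise it follows the previous kept sibling.
            previous.tail = (previous.tail or "") + kept
        parent.remove(child)


# Shortened sample based on the question's XML (an assumption for brevity).
sample = (
    "<all>"
    "<articletitle>text1<x> </x></articletitle>"
    '<affiliation><x> </x><label id="aff1">12</label>'
    "<affnorg>Shenzhen University</affnorg><x>, </x>"
    "<affncountry>PR China</affncountry><x>.</x></affiliation>"
    "</all>"
)

root = ET.fromstring(sample)
for affiliation in root.iter("affiliation"):
    unwrap_children(affiliation, "x")

result = ET.tostring(root, encoding="unicode")
print(result)
```

As in the BeautifulSoup version, the `<x>` inside `<articletitle>` is left alone because only `<affiliation>` elements are processed.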