Question

I have the following XML:

<url>
     <loc>https://mystore.com/products-t-shirt.xml</loc>
     <lastmod>2019-04-11T00:01:42-04:00</lastmod>
     <changefreq>daily</changefreq>
     <image:image>
         <image:loc> http://some-imageurl.com
         </image:loc>
         <image:title>GIFTS</image:title>
         <image:caption>quirky caption</image:caption>
     </image:image>
</url>

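and I am trying to extract only the loc tags.

I did this with the following code: products_list = soup.find_all(lambda tag: tag.name == "loc"). I also tried soup.find_all(re.compile("\\bloc\\b")). In both cases the returned list contains both the loc tags and the image:loc tags (along with their text, of course). Does anyone know why Beautiful Soup grabs image:loc even though I specify that I want an exact string?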

Answer 1

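This assumes you are using Beautiful Soup 4.7+.

You can actually target this with selectors. What you show appears to be XML, so I assume the image namespace is defined somewhere in your document. For this example, let's assume the namespace is defined as xmlns:image="http://somenamespace.com", which means the image prefix (the part before the :) stands for the http://somenamespace.com namespace. We will assume loc has no namespace. Finally, we use |loc to specify that we want loc with no namespace: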

from bs4 import BeautifulSoup
xml = """
<?xml version="1.0" encoding="UTF-8"?>
<root xmlns:image="http://somenamespace.com">
<url>
     <loc>https://mystore.com/products-t-shirt.xml</loc>
     <lastmod>2019-04-11T00:01:42-04:00</lastmod>
     <changefreq>daily</changefreq>
     <image:image>
         <image:loc> http://some-imageurl.com
         </image:loc>
         <image:title>GIFTS</image:title>
         <image:caption>quirky caption</image:caption>
     </image:image>
</url>
</root>
"""

soup = BeautifulSoup(xml, 'xml')

print(soup.select('|loc'))

Output

[<loc>https://mystore.com/products-t-shirt.xml</loc>]
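To see why the two loc elements are different names as far as an XML parser is concerned, here is a minimal sketch that uses the standard library's xml.etree.ElementTree instead of Beautiful Soup (a swapped-in tool, not what this answer uses). It assumes the same made-up namespace URI as above and a trimmed copy of the sample document.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample; the namespace URI is the assumed one from above.
XML = """
<root xmlns:image="http://somenamespace.com">
<url>
     <loc>https://mystore.com/products-t-shirt.xml</loc>
     <image:image>
         <image:loc> http://some-imageurl.com
         </image:loc>
     </image:image>
</url>
</root>
"""

root = ET.fromstring(XML)

# ElementTree stores a tag as "{namespace-uri}local-name", or just
# "local-name" when the element has no namespace.
plain_locs = [el.text for el in root.iter("loc")]
image_locs = [el.text.strip() for el in root.iter("{http://somenamespace.com}loc")]

print(plain_locs)
print(image_locs)
```

The unprefixed loc and image:loc share a local name but live in different namespaces, which is the distinction the |loc selector relies on.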

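However, we can still target loc when its namespace has no prefix assigned. Suppose the document has a default namespace, xmlns="http://default.com". The loc we want has no prefix, so in this example it inherits the default namespace.

Prefixes in the document only mean something to the parser, so we can give the target namespace any prefix name we like for the selector to use; we will call it default. We can then target the loc tag with default|loc.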

from bs4 import BeautifulSoup
xml = """
<?xml version="1.0" encoding="UTF-8"?>
<root xmlns="http://default.com" xmlns:image="http://somenamespace.com">
<url>
     <loc>https://mystore.com/products-t-shirt.xml</loc>
     <lastmod>2019-04-11T00:01:42-04:00</lastmod>
     <changefreq>daily</changefreq>
     <image:image>
         <image:loc> http://some-imageurl.com
         </image:loc>
         <image:title>GIFTS</image:title>
         <image:caption>quirky caption</image:caption>
     </image:image>
</url>
</root>
"""

soup = BeautifulSoup(xml, 'xml')

print(soup.select('default|loc', namespaces={'default': 'http://default.com'}))

Output

[<loc>https://mystore.com/products-t-shirt.xml</loc>]
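The point that the prefix name is arbitrary can also be checked outside Beautiful Soup. The sketch below again uses the standard library's xml.etree.ElementTree (not the selector API shown above), with a trimmed copy of the sample and the same assumed namespace URIs; the prefix names default and d are chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample with a default namespace declared on the root.
XML = """
<root xmlns="http://default.com" xmlns:image="http://somenamespace.com">
<url>
     <loc>https://mystore.com/products-t-shirt.xml</loc>
     <image:image>
         <image:loc> http://some-imageurl.com
         </image:loc>
     </image:image>
</url>
</root>
"""

root = ET.fromstring(XML)

# Two different prefix names mapped to the same URI: only the URI matters.
with_default = root.findall(".//default:loc", {"default": "http://default.com"})
with_d = root.findall(".//d:loc", {"d": "http://default.com"})

print([el.text for el in with_default])
print(with_default == with_d)  # both lookups return the same element objects
```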

You can even register it as the default namespace with no prefix, and then target it simply as loc:

from bs4 import BeautifulSoup
xml = """
<?xml version="1.0" encoding="UTF-8"?>
<root xmlns="http://default.com" xmlns:image="http://somenamespace.com">
<url>
     <loc>https://mystore.com/products-t-shirt.xml</loc>
     <lastmod>2019-04-11T00:01:42-04:00</lastmod>
     <changefreq>daily</changefreq>
     <image:image>
         <image:loc> http://some-imageurl.com
         </image:loc>
         <image:title>GIFTS</image:title>
         <image:caption>quirky caption</image:caption>
     </image:image>
</url>
</root>
"""

soup = BeautifulSoup(xml, 'xml')

print(soup.select('loc', namespaces={'': 'http://default.com'}))

Output

[<loc>https://mystore.com/products-t-shirt.xml</loc>]

For those who do not want to use selectors, you can also check the element's prefix. In this case, we want loc without a prefix:

from bs4 import BeautifulSoup
xml = """
<?xml version="1.0" encoding="UTF-8"?>
<root xmlns="http://default.com" xmlns:image="http://somenamespace.com">
<url>
     <loc>https://mystore.com/products-t-shirt.xml</loc>
     <lastmod>2019-04-11T00:01:42-04:00</lastmod>
     <changefreq>daily</changefreq>
     <image:image>
         <image:loc> http://some-imageurl.com
         </image:loc>
         <image:title>GIFTS</image:title>
         <image:caption>quirky caption</image:caption>
     </image:image>
</url>
</root>
"""

soup = BeautifulSoup(xml, 'xml')

print([el for el in soup.find_all('loc') if not el.prefix])
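The same idea of filtering by hand, without any selector or path syntax, can be sketched with the standard library's xml.etree.ElementTree. ElementTree does not expose prefixes, so this version compares namespace URIs instead; split_tag is a hypothetical helper written for this illustration, and the URIs are the assumed ones from the sample.

```python
import xml.etree.ElementTree as ET

def split_tag(tag):
    """Split an ElementTree tag into (namespace_uri, local_name)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return None, tag

# Trimmed copy of the sample with a default namespace declared on the root.
XML = """
<root xmlns="http://default.com" xmlns:image="http://somenamespace.com">
<url>
     <loc>https://mystore.com/products-t-shirt.xml</loc>
     <image:image>
         <image:loc> http://some-imageurl.com
         </image:loc>
     </image:image>
</url>
</root>
"""

IMAGE_NS = "http://somenamespace.com"  # assumed URI bound to the image prefix
root = ET.fromstring(XML)

# Keep every element whose local name is "loc" but which is not in the image namespace.
locs = []
for el in root.iter():
    uri, local = split_tag(el.tag)
    if local == "loc" and uri != IMAGE_NS:
        locs.append(el)

print([el.text for el in locs])
```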

Answer 2

I tried this setup, and my output was: [<loc>https://mystore.com/products-t-shirt.xml</loc>]

First, I loaded a file containing your string. However, I had to make a few corrections. File: test.xml

<?xml version="1.0" encoding="UTF-8"?>
    <url xmlns:image=" ">
        <loc>https://mystore.com/products-t-shirt.xml</loc>
        <lastmod>2019 - 04 - 11
        T00: 01:42 - 04: 00
        </lastmod>
        <changefreq>daily</changefreq>
        <image: image="">
        <image loc="">http://some-imageurl.com
        </image>
        <image: title="">GIFTS</image:>
        <image: caption="">quirky caption</image:>
        </image:>
    </url>

Here is the Python code:

import bs4 as BS

if __name__ == "__main__":

    with open("test.xml", "r") as f:
        xml = f.read()
    soup = BS.BeautifulSoup(xml, "lxml")
    tag_selection = soup.find_all(lambda tag: tag.name == "loc")
    print(tag_selection)

As you can see in the output, the only element retrieved is the loc tag.

I hope this helps.

Beautiful Soup find_all() method returns more tags than the filter specifies