Question

I have some HTML pages to scrape data from, and I need to get the item title: 'Caliper Ball'. I get the data from the tag in which that title appears:

    item_title = base_page.find_all('h1', class_='itemTitle')

It contains this tag structure:

> [<h1 class="itemTitle"> <div class="l1">Caliper</div>
>                                 Ball
>                             </h1>]

To extract 'Caliper Ball' I use:

    collector = []
    for _ in item_title:
        collector.append(_.text)

So I get this ugly output in the collector list:

[u"\nCaliper\r\n                                Ball\r\n                            "]

How can I make the output clean, like "Caliper Ball"?

Answer 1

Don't use regex. You're adding too much overhead for something simple. BeautifulSoup4 already has something for this called stripped_strings. See the code below.

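    from bs4 import BeautifulSoup as bsoup

    html = """[<h1 class="itemTitle"> <div class="l1">Caliper</div>
                                   Ball
                               </h1>]"""
    # Name the parser explicitly so the result does not depend on which parsers are installed.
    soup = bsoup(html, "html.parser")

    item = soup.find("h1", class_="itemTitle")
    # stripped_strings yields each text fragment with its surrounding whitespace removed.
    base = list(item.stripped_strings)
    print(" ".join(base))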

Result:

Caliper Ball
[Finished in 0.5s]

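Explanation: stripped_strings gets all the text inside the specified tag and strips the surrounding whitespace and newlines from each piece. It returns a generator, which we can pass to list to get a list. Once it's a list, just use " ".join.

Let us know if this helps.

PS: Just to correct myself: there is actually no need to call list on the result of stripped_strings, since " ".join accepts the generator directly, but it's better to show it as above so that it's explicit.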
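
As a rough sketch of how this could slot into the question's own loop, the snippet below applies stripped_strings to every match from find_all. The helper name clean_title is hypothetical, and the markup is copied from the question rather than from a real page. The last lines use get_text with a separator and strip=True, which should give the same result in one call.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (an assumption about the real page).
html = """<h1 class="itemTitle"> <div class="l1">Caliper</div>
                                Ball
                            </h1>"""

base_page = BeautifulSoup(html, "html.parser")


def clean_title(tag):
    """Hypothetical helper: join the stripped text fragments of a tag with single spaces."""
    return " ".join(tag.stripped_strings)


# Same shape as the question's loop, but collecting cleaned titles.
collector = [clean_title(tag) for tag in base_page.find_all("h1", class_="itemTitle")]
print(collector)

# One-call alternative: join fragments with a space and strip each one.
first = base_page.find("h1", class_="itemTitle")
print(first.get_text(" ", strip=True))
```

Note that both approaches strip each text fragment separately; a run of spaces in the middle of a single fragment would be left alone.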

Answer 2

This regular expression will help you get the output (Caliper Ball):

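    import re

    # Renamed from `str` so the built-in type is not shadowed.
    html_text = """[<h1 class="itemTitle"> <div class="l1">Caliper</div>
                                     Ball 
                                 </h1>]"""
    # Group 1: the text just before </div>; group 2: the first word on the next line.
    regex = r'.*>([^<]*)<\/div>\s*\n\s*(\w*).*'
    match = re.findall(regex, html_text)
    new_data = (' '.join(w) for w in match)
    print(''.join(new_data))  # => Caliper Ball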
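
If a regex is still preferred, a possibly less brittle variant is to run it on the already extracted text instead of the raw HTML, so the pattern does not depend on the tag layout. This is only a sketch: normalize_whitespace is a hypothetical name, and raw is the string the question shows in its collector list.

```python
import re

# Raw text as returned by .text in the question.
raw = "\nCaliper\r\n                                Ball\r\n                            "


def normalize_whitespace(text):
    """Hypothetical helper: collapse each whitespace run to one space, then trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


print(normalize_whitespace(raw))
```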

Answer 3

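You can use the replace() method to replace \n and \r with an empty string or a space, and then use trim() (strip() in Python) to remove the remaining spaces.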
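
A minimal Python sketch of this idea, assuming the raw string from the question's output. A literal replace followed by strip() removes the outer whitespace but leaves the long run of spaces between the two words, so the sketch also shows str.split() with no arguments, which splits on any whitespace run.

```python
# Raw text as returned by .text in the question.
raw = "\nCaliper\r\n                                Ball\r\n                            "

# Literal replace-then-strip: the ends are clean, but the inner run of spaces remains.
replaced = raw.replace("\r", " ").replace("\n", " ").strip()

# split() with no arguments splits on any whitespace run and drops empty pieces,
# so joining with one space collapses everything in a single step.
collapsed = " ".join(raw.split())

print(repr(replaced))
print(collapsed)
```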

Extracting text from HTML tags with Beautiful Soup
