Question

I have the following HTML inside a larger document:

<br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br />

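I'm currently using BeautifulSoup to get other elements within the HTML, but I haven't been able to find a way to get the important lines of text between the <br /> tags. I can isolate and navigate to each of the <br /> elements, but I can't find a way to get the text in between them. Any help would be greatly appreciated. Thanks.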

Answer 1

If you just want any text that sits between two <br /> tags, you could do something like the following:

from BeautifulSoup import BeautifulSoup, NavigableString, Tag

input = '''<br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br />'''

soup = BeautifulSoup(input)

for br in soup.findAll('br'):
    next_s = br.nextSibling
    if not (next_s and isinstance(next_s,NavigableString)):
        continue
    next2_s = next_s.nextSibling
    if next2_s and isinstance(next2_s,Tag) and next2_s.name == 'br':
        text = str(next_s).strip()
        if text:
            print "Found:", next_s

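But perhaps I misunderstood your question? Your description of the problem doesn't seem to match the "important" / "non important" labels in your example data, so I've gone with the description ;)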
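The snippet above targets Python 2 and the old BeautifulSoup 3 package. For readers on Python 3 with bs4, the same sibling check might look roughly like the sketch below. The function name text_between_brs and the choice of the "html.parser" backend are assumptions made for illustration, and the function returns the stripped strings instead of printing the raw nodes.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

html = """<br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br />"""


def text_between_brs(markup):
    """Return the stripped text nodes that sit directly between two <br> tags."""
    # "html.parser" is an assumed choice; it ships with Python.
    soup = BeautifulSoup(markup, "html.parser")
    found = []
    for br in soup.find_all("br"):
        next_s = br.next_sibling
        # Skip this <br> when it is followed by nothing or by another tag.
        if not isinstance(next_s, NavigableString):
            continue
        next2_s = next_s.next_sibling
        # Keep the text only when another <br> closes it off.
        if isinstance(next2_s, Tag) and next2_s.name == "br":
            text = next_s.strip()
            if text:
                found.append(text)
    return found


if __name__ == "__main__":
    for line in text_between_brs(html):
        print("Found:", line)
```

Like the original, this reports every non-blank line between two <br /> tags, including the "Not Important" ones.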

Answer 2

So, for testing purposes, let's assume that this block of HTML is inside a span tag:

x = """<span><br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br /></span>"""

Now I'm going to parse it and find my span tag:

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(x)
y = soup.find('span')

If you iterate over the generator returned by y.childGenerator(), you will get both the br tags and the text:

In [4]: for a in y.childGenerator(): print type(a), str(a)
   ....: 
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'> 
Important Text 1

<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'> 

<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'> 
Not Important Text

<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'> 
Important Text 2

<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'> 
Important Text 3

<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'> 

<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'> 
Non Important Text

<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'> 
Important Text 4

<type 'instance'> <br />
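The session above is from Python 2 and BeautifulSoup 3. As a rough sketch of the same idea on Python 3 with bs4, where the children property plays the role of childGenerator(), you could keep only the non-blank text nodes. The variable name lines and the "html.parser" backend are assumptions for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

x = """<span><br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br /></span>"""

# "html.parser" is an assumed choice; it ships with Python.
soup = BeautifulSoup(x, "html.parser")
y = soup.find("span")

# Walk the direct children of the span, skip the <br> tags,
# and drop the text nodes that contain only whitespace.
lines = [
    child.strip()
    for child in y.children
    if isinstance(child, NavigableString) and child.strip()
]

print(lines)
```

This only looks at direct children of the span, so text nested inside other tags would be ignored.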

Answer 3

The following worked for me:

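from BeautifulSoup import NavigableString

for br in soup.findAll('br'):
    # Skip empty <br /> tags; only print when the first child is a text node.
    if br.contents and isinstance(br.contents[0], NavigableString):
        print br.contents[0]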

Answer 4

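A slight improvement on Ken Kinder's answer. You can instead access the stripped_strings attribute of a BeautifulSoup element. For example, suppose your particular block of HTML is inside a span tag: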


x = """<span><br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br /></span>"""

First we parse x with BeautifulSoup. Then we look up the element, in this case span, and access its stripped_strings attribute, like so:

from bs4 import BeautifulSoup
soup = BeautifulSoup(x, "html.parser")  # name the parser explicitly to avoid the bs4 warning
span = soup.find("span")
text = list(span.stripped_strings)

Now print(text) gives the following output:

['Important Text 1',
 'Not Important Text',
 'Important Text 2',
 'Important Text 3',
 'Non Important Text',
 'Important Text 4']
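That list still contains the unimportant lines. If the lines you want can be recognised by some rule, you can filter the stripped strings afterwards. The sketch below assumes, purely for illustration, that important lines start with the word "Important"; the is_important helper is hypothetical and should be replaced with whatever rule fits your real data.

```python
from bs4 import BeautifulSoup

x = """<span><br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br /></span>"""


def is_important(line):
    # Hypothetical rule: the sample data marks important lines with this prefix.
    return line.startswith("Important")


soup = BeautifulSoup(x, "html.parser")
span = soup.find("span")

# stripped_strings yields each text node with surrounding whitespace removed
# and skips the nodes that are whitespace only.
important = [line for line in span.stripped_strings if is_important(line)]

print(important)
```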

Extracting text between line breaks (e.g. <br/> tags) using BeautifulSoup

4 answers: