Question

I expect this code:

html = """
<th scope="row">Fruits<br />
<i><a href="#Fruits">Buy</a></i></th>
<td><a href="banana.html" color="yellow">Banana</a><br />
    <a href="kiwi.html" color="green">Kiwi</a><br />
    <a href="Persimmon" color="orange">Persimmon</a><br />
</tr>
"""

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

th_node = soup.find('th', { 'scope' : 'row' }, text = re.compile('^Fruits'))
td_node = th_node.find('td')
fruits = td_node.find_all('a')
for f in fruits:
    print(f['color'], " ", f.text)

to print:

yellow banana
green kiwi
orange Persimmon

What am I doing wrong?

Answer 1

The mistake is in this line:

th_node = soup.find('th', { 'scope' : 'row' }, text = re.compile('^Fruits'))
                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Quoting this answer:

You need to use a hybrid approach, since text= will fail when an element has child elements as well as text.

For example:

>>> a = '<th scope="row">foo</th>'
>>> b = '<th scope="row">foo<td>bar</td></th>'
>>> BeautifulSoup(a, "html.parser").find('th', {'scope': 'row'}, text='foo')
<th scope="row">foo</th>

>>> BeautifulSoup(b, "html.parser").find('th', {'scope': 'row'}, text='foo')
>>> BeautifulSoup(b, "html.parser").find('th', {'scope': 'row'}, text='foobar')

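As you can see, BeautifulSoup fails to match once the th tag contains a child tag (a td here) in addition to its text. So we need the following (the idea also comes from that answer):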

html = """
<th scope="row">Fruits<br />
<i><a href="#Fruits">Buy</a></i></th>
<td><a href="banana.html" color="yellow">Banana</a><br />
    <a href="kiwi.html" color="green">Kiwi</a><br />
    <a href="Persimmon" color="orange">Persimmon</a><br />
</tr>
"""

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

reg = re.compile(r'^Fruits')
th_node = [e for e in soup.find_all(
    'th', {'scope': 'row'}) if reg.search(e.text)][0]

print(th_node)

Output:

<th scope="row">Fruits<br/>
<i><a href="#Fruits">Buy</a></i></th>

Yes, that is not what you want yet, because the td tag is not inside the th tag. So now we can use the tag.find_next() method like this:

html = """
<th scope="row">Fruits<br />
<i><a href="#Fruits">Buy</a></i></th>
<td><a href="banana.html" color="yellow">Banana</a><br />
    <a href="kiwi.html" color="green">Kiwi</a><br />
    <a href="Persimmon" color="orange">Persimmon</a><br />
</tr>
"""

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

reg = re.compile(r'^Fruits')
th_node = [e for e in soup.find_all(
    'th', {'scope': 'row'}) if reg.search(e.text)][0]

td_node = th_node.find_next('td')
fruits = td_node.find_all('a')

for f in fruits:
    print(f['color'], " ", f.text)

Output:

yellow   Banana
green   Kiwi
orange   Persimmon

And we're done!
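The same steps can be wrapped in a small helper. This is only a sketch under a few assumptions: the function name links_after_header is made up for illustration, html.parser is chosen as the parser, and get_text() is used for the prefix check so that the child tags inside the th do not get in the way.

```python
from bs4 import BeautifulSoup

HTML = """
<th scope="row">Fruits<br />
<i><a href="#Fruits">Buy</a></i></th>
<td><a href="banana.html" color="yellow">Banana</a><br />
    <a href="kiwi.html" color="green">Kiwi</a><br />
    <a href="Persimmon" color="orange">Persimmon</a><br />
</tr>
"""


def links_after_header(html, prefix):
    """Return (color, text) pairs for the links in the td that follows
    the first <th scope="row"> whose text starts with `prefix`.

    Hypothetical helper, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    for th in soup.find_all("th", {"scope": "row"}):
        # get_text() joins the text of all descendants, so it still works
        # when the th contains <br/> or <i> children.
        if th.get_text().strip().startswith(prefix):
            td = th.find_next("td")
            if td is None:
                return []
            return [(a.get("color"), a.get_text()) for a in td.find_all("a")]
    return []


for color, name in links_after_header(HTML, "Fruits"):
    print(color, name)
```

Returning an empty list when no header matches avoids the IndexError that the `[...][0]` form raises in that case.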

Answer 2

If you need to check the node's attrs values, you can either use just a lambda (simple) or combine attrs and text:

html = """
<th scope="row">Fruits<br />
<i><a href="#Fruits">Buy</a></i></th>
<td><a href="banana.html" color="yellow">Banana</a><br />
    <a href="kiwi.html" color="green">Kiwi</a><br />
    <a href="Persimmon" color="orange">Persimmon</a><br />
</tr>
"""

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

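# Simple version: take the first <th scope="row">.
th_node = soup.find('th', {'scope': 'row'})
# OR, to check the text as well, pass a function as the filter:
# th_node = soup.find(lambda tag: tag.name == 'th'
#                     and tag.get('scope') == 'row'
#                     and tag.text.startswith('Fruits'))
td_node = th_node.find_next('td')
fruits = td_node.find_all('a')
for f in fruits:
    print(f['color'], " ", f.text)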
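To see why the text check matters, here is a minimal sketch with a second row added. The two-row table and the name is_fruits_header are invented for this illustration. A plain find('th', {'scope': 'row'}) would return the Vegetables header here, whereas a function filter picks the intended one.

```python
from bs4 import BeautifulSoup

# Made-up markup with two header rows, for illustration only.
html = (
    '<table>'
    '<tr><th scope="row">Vegetables</th>'
    '<td><a href="leek.html" color="green">Leek</a></td></tr>'
    '<tr><th scope="row">Fruits<br/><i><a href="#Fruits">Buy</a></i></th>'
    '<td><a href="banana.html" color="yellow">Banana</a>'
    '<a href="kiwi.html" color="green">Kiwi</a></td></tr>'
    '</table>'
)


def is_fruits_header(tag):
    """Filter function: BeautifulSoup calls it once per tag."""
    return (
        tag.name == "th"
        and tag.get("scope") == "row"
        and tag.get_text().startswith("Fruits")
    )


soup = BeautifulSoup(html, "html.parser")
th_node = soup.find(is_fruits_header)   # skips the Vegetables header
td_node = th_node.find_next("td")       # first td after that th
colors = [a["color"] for a in td_node.find_all("a")]
print(colors)
```

The function is passed as the first argument to find(); it receives each Tag and returns True for the ones to keep.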

Answer 3

You need to add a class to the a elements that carry the href; the corrected source code is as follows:

from bs4 import BeautifulSoup

html = ""
html += "<table><th scope='row'>Fruits<br /><i><a href='#Fruits'>Buy</a></i></th>"
html += "<tr><td><a class='fruits' href='banana.html' color='yellow'>Banana</a><br/>"
html += "<a class='fruits' href='kiwi.html' color='green'>Kiwi</a><br/>"
html += "<a class='fruits' href='Persimmon' color='orange'>Persimmon</a><br/>"
html += "</tr></table>"

soup = BeautifulSoup(html,"html.parser")
for link in soup.find_all('a', {'class': 'fruits'}):
    col = link.get('color')
    name = link.string
    print(col + " " + name)
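If the markup cannot be changed, a possible alternative is to select the links by the presence of the color attribute instead of by class. The sketch below assumes the question's original HTML and relies on the fact that an attribute value of True in the attrs filter matches any tag that has that attribute.

```python
from bs4 import BeautifulSoup

html = """
<th scope="row">Fruits<br />
<i><a href="#Fruits">Buy</a></i></th>
<td><a href="banana.html" color="yellow">Banana</a><br />
    <a href="kiwi.html" color="green">Kiwi</a><br />
    <a href="Persimmon" color="orange">Persimmon</a><br />
</tr>
"""

soup = BeautifulSoup(html, "html.parser")

# True means "the attribute exists, whatever its value", so the
# "Buy" link (which has no color attribute) is left out.
pairs = [(a["color"], a.get_text())
         for a in soup.find_all("a", attrs={"color": True})]

for col, name in pairs:
    print(col + " " + name)
```

This picks up every a tag with a color attribute anywhere in the document, so it only fits pages where that attribute is specific to the links you want.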

Answer 4

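The reason it doesn't work is what BeautifulSoup compares your regular expression against. The text= filter is tested against the tag's .string, and .string is None for a tag with more than one child, as this th has:

>>> def f(s):
...     print("comparing", s)
...
>>> print(soup.find("th", text=f))
comparing None
None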
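A short sketch of the difference between .string and get_text() may make this concrete. The two th snippets are made up for the demonstration.

```python
from bs4 import BeautifulSoup

# A th with a single text child, and one with text plus child tags.
simple = BeautifulSoup('<th scope="row">foo</th>', "html.parser").th
nested = BeautifulSoup('<th scope="row">foo<br/><i>bar</i></th>',
                       "html.parser").th

print(repr(simple.string))      # the single string child
print(repr(nested.string))      # None: more than one child
print(repr(nested.get_text()))  # text of all descendants, concatenated
```

This is why matching on get_text() (or .text) inside your own filter works for the Fruits header where text= does not.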

