BeautifulSoup: scanning HTML by tag, attributes, RegEx, and iteration

Date: 2015-11-21 05:12:38

Tags: python beautifulsoup

I expect this code:

html = """
<th scope="row">Fruits<br />
<i><a href="#Fruits">Buy</a></i></th>
<td><a href="banana.html" color="yellow">Banana</a><br />
    <a href="kiwi.html" color="green">Kiwi</a><br />
    <a href="Persimmon" color="orange">Persimmon</a><br />
</tr>
"""

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html)

th_node = soup.find('th', { 'scope' : 'row' }, text = re.compile('^Fruits'))
td_node = th_node.find('td')
fruits = td_node.find_all('a')
for f in fruits:
    print(f['color'], " ", f.text)

to print:

yellow banana
green kiwi
orange Persimmon

What am I doing wrong?

4 answers:

Answer 0 (score: 2)

It goes wrong because of this part:

th_node = soup.find('th', { 'scope' : 'row' }, text = re.compile('^Fruits'))
                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Quoting this answer:

You need to use a hybrid approach, because text= fails when an element has child elements as well as text.

For example:

>>> a = '<th scope="row">foo</th>'
>>> b = '<th scope="row">foo<td>bar</td></th>'
>>> BeautifulSoup(a, "html.parser").find('th', {'scope': 'row'}, text='foo')
<th scope="row">foo</th>

>>> BeautifulSoup(b, "html.parser").find('th', {'scope': 'row'}, text='foo')
>>> BeautifulSoup(b, "html.parser").find('th', {'scope': 'row'}, text='foobar')

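As you can see, BeautifulSoup fails to match when the th tag has another tag (here a td) inside it. So we need the following (the idea also comes from that answer):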

html = """
<th scope="row">Fruits<br />
<i><a href="#Fruits">Buy</a></i></th>
<td><a href="banana.html" color="yellow">Banana</a><br />
    <a href="kiwi.html" color="green">Kiwi</a><br />
    <a href="Persimmon" color="orange">Persimmon</a><br />
</tr>
"""

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

# text= cannot be used here, so filter on the full text of each candidate.
reg = re.compile(r'^Fruits')
th_node = [e for e in soup.find_all(
    'th', {'scope': 'row'}) if reg.search(e.text)][0]

print(th_node)

Output:

<th scope="row">Fruits<br/>
<i><a href="#Fruits">Buy</a></i></th>

Yes, that is still not what you want, because the td tag is not inside the th tag. So now we can use the tag.find_next() method like this:

html = """
<th scope="row">Fruits<br />
<i><a href="#Fruits">Buy</a></i></th>
<td><a href="banana.html" color="yellow">Banana</a><br />
    <a href="kiwi.html" color="green">Kiwi</a><br />
    <a href="Persimmon" color="orange">Persimmon</a><br />
</tr>
"""

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

reg = re.compile(r'^Fruits')
th_node = [e for e in soup.find_all(
    'th', {'scope': 'row'}) if reg.search(e.text)][0]

# The td is a sibling of the th, not a child, so search forward from the th.
td_node = th_node.find_next('td')
fruits = td_node.find_all('a')

for f in fruits:
    print(f['color'], " ", f.text)

Output:

yellow   Banana
green   Kiwi
orange   Persimmon

And we're done!
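The steps above (filter th tags by attribute and regex on their full text, then move forward to the next td) can be wrapped into one function. This is only a sketch: links_after_header is a hypothetical name, not part of BeautifulSoup, and returning an empty list when nothing matches is an assumption about the desired behavior.

```python
import re
from bs4 import BeautifulSoup

HTML = """
<th scope="row">Fruits<br />
<i><a href="#Fruits">Buy</a></i></th>
<td><a href="banana.html" color="yellow">Banana</a><br />
    <a href="kiwi.html" color="green">Kiwi</a><br />
    <a href="Persimmon" color="orange">Persimmon</a><br />
</tr>
"""


def links_after_header(soup, pattern):
    """Return (color, text) pairs for links in the td after a matching th.

    Hypothetical helper: returns an empty list when nothing matches.
    """
    regex = re.compile(pattern)
    for th in soup.find_all("th", {"scope": "row"}):
        # get_text() joins the text of all descendants, so it works even
        # when the th contains child tags such as <br> and <i>.
        if regex.search(th.get_text()):
            td = th.find_next("td")
            if td is None:
                return []
            return [(a.get("color"), a.get_text()) for a in td.find_all("a")]
    return []


soup = BeautifulSoup(HTML, "html.parser")
pairs = links_after_header(soup, r"^Fruits")
for color, name in pairs:
    print(color, name)
```

Unlike indexing the list comprehension with [0], this version does not raise IndexError when no header matches.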

Answer 1 (score: 0)

You can use attrs alone (simple), or, if you also need to check the node's text, combine the attribute check with a lambda on the text -

html = """
    <th scope="row">Fruits<br />
    <i><a href="#Fruits">Buy</a></i></th>
    <td><a href="banana.html" color="yellow">Banana</a><br />
        <a href="kiwi.html" color="green">Kiwi</a><br />
        <a href="Persimmon" color="orange">Persimmon</a><br />
    </tr>
    """

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

# Simple: match on attrs only.
th_node = soup.find('th', {'scope': 'row'})
# OR, to also check the text, pass a function as the filter:
# th_node = soup.find(lambda x: x.name == 'th' and x.get('scope') == 'row'
#                     and x.text.startswith('Fruits'))
td_node = th_node.find_next('td')
fruits = td_node.find_all('a')
for f in fruits:
    print(f['color'], " ", f.text)
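To see why the text check matters, here is a small sketch with two header rows. The extra "Vegetables" row is made-up sample data, and is_fruits_header is a hypothetical name. The function is passed as the first argument to find(); a callable given as the third positional argument would be taken as the recursive flag, not as a filter.

```python
from bs4 import BeautifulSoup

# Sample markup: the "Vegetables" row is invented for illustration.
HTML = """
<th scope="row">Vegetables</th>
<td><a href="kale.html" color="green">Kale</a></td>
<th scope="row">Fruits<br />
<i><a href="#Fruits">Buy</a></i></th>
<td><a href="banana.html" color="yellow">Banana</a></td>
"""


def is_fruits_header(tag):
    # A function passed to find() is called once per tag and should
    # return True for the tags to keep.
    return (
        tag.name == "th"
        and tag.get("scope") == "row"
        and tag.get_text().startswith("Fruits")
    )


soup = BeautifulSoup(HTML, "html.parser")
th_node = soup.find(is_fruits_header)
td_node = th_node.find_next("td")
colors = [a["color"] for a in td_node.find_all("a")]
print(colors)
```

With attrs alone, find() would return the first th with scope="row", which here is the "Vegetables" header.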

Answer 2 (score: 0)

You need to add a class to the a elements (the links); the corrected source code is as follows:

from bs4 import BeautifulSoup

html = ""
html += "<table><th scope='row'>Fruits<br /><i><a href='#Fruits'>Buy</a></i></th>"
html += "<tr><td><a class='fruits' href='banana.html' color='yellow'>Banana</a><br/>"
html += "<a class='fruits' href='kiwi.html' color='green'>Kiwi</a><br/>"
html += "<a class='fruits' href='Persimmon' color='orange'>Persimmon</a><br/>"
html += "</tr></table>"

soup = BeautifulSoup(html,"html.parser")
for link in soup.findAll('a',{'class':'fruits'}):
    col = link.get('color')
    name = link.string
    print(col + " " + name)
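If the HTML cannot be edited (for example, when it comes from a scraped page), a possible alternative is to select the links by the attribute they already have. The sketch below assumes every fruit link carries a color attribute and the "Buy" link does not; the tidied table markup is also an assumption.

```python
from bs4 import BeautifulSoup

HTML = """
<table><tr><th scope="row">Fruits<br />
<i><a href="#Fruits">Buy</a></i></th>
<td><a href="banana.html" color="yellow">Banana</a><br />
<a href="kiwi.html" color="green">Kiwi</a><br />
<a href="Persimmon" color="orange">Persimmon</a><br /></td></tr></table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# attrs={"color": True} keeps only <a> tags that have a color attribute,
# so the "Buy" link (which has none) is skipped.
lines = []
for link in soup.find_all("a", attrs={"color": True}):
    lines.append(link["color"] + " " + link.get_text())

print("\n".join(lines))
```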

Answer 3 (score: 0)

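The reason it does not work is what BeautifulSoup is actually comparing your regular expression against. With text=, the filter is checked against the tag's .string, which is None when the tag has more than one child: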

>>> def f(s):
...     print("comparing", s)
... 
>>> soup.find("th", text=f)
comparing None
None
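The same effect can be checked directly on .string. This is a minimal sketch using two made-up one-line snippets; the variable names simple and mixed are only illustrative.

```python
from bs4 import BeautifulSoup

simple = BeautifulSoup('<th scope="row">foo</th>', "html.parser").th
mixed = BeautifulSoup('<th scope="row">foo<i>bar</i></th>', "html.parser").th

# One child (a single text node): .string is that text.
print(repr(simple.string))
# Two children (text plus <i>): .string is None, so a text= filter has
# nothing to match against.
print(repr(mixed.string))
# get_text() concatenates all descendant text regardless of nesting.
print(repr(mixed.get_text()))
```

This is why the other answers test the regular expression against the tag's full text instead of relying on text=.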