How do I use BeautifulSoup to parse the two strings in a table row?

Asked: 2017-12-09 12:41:55

Tags: python beautifulsoup

html = '''
<div class="container">
 <h2>Countries & Capitals</h2>
  <table class="two-column td-red">
  <thead><tr><th>Country</th><th>Capital city</th></tr></thead><tbody>
   <tr class="grey"><td>Afghanistan</td><td>Kabul</td></tr>
   <tr><td>Albania</td><td>Tirana</td></tr>
</tbody>
</table>
</div>
'''

Given this HTML, I want to parse out just the country names and capital city names and put them in a dictionary, so that I get

dict["Afghanistan"] = 'Kabul'

I have started with

soup = BeautifulSoup(open(filename), 'lxml')
countries = {}
# YOUR CODE HERE
table = soup.find_all('table')
for each in table:
    if each.find('tr'):
        continue
    else:
        print(each.prettify())
return countries

but I find it confusing, since this is my first time using BeautifulSoup.

3 answers:

Answer 0: (score: 0)

This solution works for the given HTML sample:

from bs4 import BeautifulSoup  # assuming you did pip install bs4
soup = BeautifulSoup(html, "html.parser")  # the html you mentioned
table_data = soup.find('table')
data = {}  # {'country': 'capital'} dict
for row in table_data.find_all('tr'):
    row_data = row.find_all('td')
    if row_data:
        data[row_data[0].text] = row_data[1].text

I have left out try/except blocks for error cases. I recommend reading through the BeautifulSoup documentation, which covers all of this.
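If you would rather avoid exceptions than catch them, here is a minimal sketch of a more defensive variant. The helper name `parse_capitals` and the extra one-cell "Broken row" are my own additions for illustration and are not part of the question. It uses the built-in "html.parser" so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question, plus one malformed row (an assumption,
# added only to show that such rows are skipped).
html = '''
<table class="two-column td-red">
  <thead><tr><th>Country</th><th>Capital city</th></tr></thead>
  <tbody>
    <tr class="grey"><td>Afghanistan</td><td>Kabul</td></tr>
    <tr><td>Albania</td><td>Tirana</td></tr>
    <tr><td>Broken row</td></tr>
  </tbody>
</table>
'''


def parse_capitals(markup):
    """Hypothetical helper: map country -> capital, skipping malformed rows."""
    soup = BeautifulSoup(markup, "html.parser")
    table = soup.find("table")
    if table is None:
        # No table in the document, so there is nothing to parse.
        return {}
    result = {}
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) != 2:
            # Header rows (only <th>) and incomplete rows end up here.
            continue
        country = cells[0].get_text(strip=True)
        capital = cells[1].get_text(strip=True)
        result[country] = capital
    return result


capitals = parse_capitals(html)
print(capitals)
```

Checking the cell count up front means `row_data[1]` can never raise an IndexError, so no try/except is needed for that case.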

Answer 1: (score: 0)

You can select the "tr" elements and keep only those with exactly two "td" child elements, since those are the rows that hold your data:

from bs4 import BeautifulSoup

html = """
<div class="container">
 <h2>Countries & Capitals</h2>
  <table class="two-column td-red">
  <thead><tr><th>Country</th><th>Capital city</th></tr></thead><tbody>
   <tr class="grey"><td>Afghanistan</td><td>Kabul</td></tr>
   <tr><td>Albania</td><td>Tirana</td></tr>
</tbody>
</table>
</div>
"""
soup = BeautifulSoup(html, 'lxml')
countries = {}

trs = soup.find_all('tr')
for tr in trs:
    tds = tr.find_all("td")
    if len(tds) == 2:
        countries[tds[0].text] = tds[1].text
print(countries)

Output:

{'Afghanistan': 'Kabul', 'Albania': 'Tirana'}

Answer 2: (score: 0)

How about this:

from bs4 import BeautifulSoup

element ='''
<div class="container">
    <h2>Countries & Capitals</h2>
    <table class="two-column td-red">
        <thead>
            <tr><th>Country</th><th>Capital city</th></tr>
        </thead>
        <tbody>
            <tr class="grey"><td>Afghanistan</td><td>Kabul</td></tr>
            <tr><td>Albania</td><td>Tirana</td></tr>
        </tbody>
    </table>
</div>
'''
soup = BeautifulSoup(element, 'lxml')

countries = {}
for data in soup.select("tr"):
    elem = [item.text for item in data.select("th,td")]
    countries[elem[0]] = elem[1]

print(countries)

Output:

{'Afghanistan': 'Kabul', 'Country': 'Capital city', 'Albania': 'Tirana'}
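
Note that this output also contains the header row ('Country': 'Capital city'), because the selector "th,td" matches the header cells too. If you do not want that entry, one possible adjustment (a sketch, assuming the data rows always sit inside a tbody as they do in this sample) is to restrict the selector to the table body. The built-in "html.parser" is used here so that the snippet does not depend on lxml.

```python
from bs4 import BeautifulSoup

element = '''
<div class="container">
    <table class="two-column td-red">
        <thead>
            <tr><th>Country</th><th>Capital city</th></tr>
        </thead>
        <tbody>
            <tr class="grey"><td>Afghanistan</td><td>Kabul</td></tr>
            <tr><td>Albania</td><td>Tirana</td></tr>
        </tbody>
    </table>
</div>
'''
soup = BeautifulSoup(element, "html.parser")

countries = {}
# "tbody tr" matches only rows inside the table body, so the header row
# in <thead> is never visited.
for row in soup.select("tbody tr"):
    cells = [cell.get_text(strip=True) for cell in row.select("td")]
    if len(cells) == 2:
        countries[cells[0]] = cells[1]

print(countries)
```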