Extracting data from multiple tables with BeautifulSoup

Asked: 2014-03-14 16:14:58

Tags: python html beautifulsoup

I am trying to use BeautifulSoup to extract some data from two HTML tables in an HTML file.

This is actually my first time using it, and although I have searched through plenty of questions and examples, none of them seem to work in my case. The HTML contains two tables: the first holds the headers of the first column (always text), and the second holds the data for the following columns. On top of that, the tables contain text, numbers, and symbols, which makes things even more complicated for a novice like me. Here's the HTML layout copied from the browser. I was able to extract the full HTML content of the rows, but only for the first table, so in practice I get no data at all, just the contents of the first column.

The output I would like to get is a string containing the "joined" information from the tables (Col1 = text, Col2 = number, Col3 = number, Col4 = number, Col5 = number), for example:

Canada, 6, 5, 2, 1

Here is the list of XPaths for each item:

"Canada": /html/body/div/div[1]/table/tbody[2]/tr[2]/td/div/a
"6": /html/body/div/div[2]/div/table/tbody[2]/tr[2]/td[1] 
"5": /html/body/div/div[2]/div/table/tbody[2]/tr[2]/td[3] 
"2": /html/body/div/div[2]/div/table/tbody[2]/tr[2]/td[5]
"1": /html/body/div/div[2]/div/table/tbody[2]/tr[2]/td[7]

I would even be happy with a string in "rough" HTML format, as long as there is one string per row, so that I can parse it further with methods I already know. Here is the code I have so far. Thanks!

from BeautifulSoup import BeautifulSoup
html=""" 
my html code
"""
soup = BeautifulSoup(html)
table=soup.find("table")
for row in table.findAll('tr'):
    col = row.findAll('td')
    print row, col
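(The reason only the first table comes back is that `find("table")` stops at the first match; `find_all("table")` returns every table, so the second one can be indexed. A minimal bs4 sketch against a made-up two-table snippet — the HTML below is illustrative, not the asker's actual file:)

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: one table of labels, one of values.
html = """
<table><tr><td>Canada</td></tr><tr><td>Brazil</td></tr></table>
<table><tr><td>6</td><td>5</td></tr><tr><td>7</td><td>5</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
tables = soup.find_all("table")  # find() would only ever return tables[0]

labels = [tr.td.get_text() for tr in tables[0].find_all("tr")]
values = [[td.get_text() for td in tr.find_all("td")]
          for tr in tables[1].find_all("tr")]

for label, row in zip(labels, values):
    print(", ".join([label] + row))  # e.g. "Canada, 6, 5"
```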

3 Answers:

Answer 0 (score: 3)

It looks like you are scraping data from http://www.appannie.com.

Here is code that gets the data. I am sure some parts of it could be improved or written in a more pythonic way, but it gets what you want. Also, I used Beautiful Soup 4 instead of 3.

from bs4 import BeautifulSoup

html_file = open('test2.html')
soup = BeautifulSoup(html_file)

countries = []
countries_table = soup.find_all('table', attrs={'class':'data-table table-rank'})[1]
countries_body = countries_table.find_all('tbody')[1]
countries_row = countries_body.find_all('tr', attrs={"class": "ranks"})
for row in countries_row:
    countries.append(row.div.a.text)

data = []
data_table = soup.find_all('table', attrs={'class':'data-table table-rank'})[3]
data_body = data_table.find_all('tbody')[1]
data_row = data_body.find_all('tr', attrs={"class": "ranks"})
for row in data_row:
    tds = row.find_all('td')
    sublist = []
    for td in tds[::2]:    
        sublist.append(td.text)
    data.append(sublist)

for element in zip(countries, data):
    print element
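(The `[::2]` slice is doing the real work in the loop above: the page interleaves data cells with spacer cells, so taking every other `<td>` keeps only the values. The same stride on a plain list, with sample values rather than the live page data:)

```python
# Cells as they appear in one row: data cells interleaved with empty spacers.
tds = ["6", "", "5", "", "2", "", "1"]

# A step of 2 keeps indices 0, 2, 4, 6 -- the data cells only.
data_cells = tds[::2]
print(data_cells)  # ['6', '5', '2', '1']
```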

Hope this helps :)

Answer 1 (score: 3)

This uses bs4 as well, but it should work:

from bs4 import BeautifulSoup as bsoup

ofile = open("htmlsample.html")
soup = bsoup(ofile)
soup.prettify()

tables = soup.find_all("tbody")

storeTable = tables[0].find_all("tr")
storeValueRows = tables[2].find_all("tr")

storeRank = []
for row in storeTable:
    storeRank.append(row.get_text().strip())

storeMatrix = []
for row in storeValueRows:
    storeMatrixRow = []
    for cell in row.find_all("td")[::2]:
        storeMatrixRow.append(cell.get_text().strip())
    storeMatrix.append(", ".join(storeMatrixRow))

for record in zip(storeRank, storeMatrix):
    print " ".join(record)

The above will print out:

# of countries - rank 1 reached 0, 0, 1, 9
# of countries - rank 5 reached 0, 8, 49, 29
# of countries - rank 10 reached 25, 31, 49, 32
# of countries - rank 100 reached 49, 49, 49, 32
# of countries - rank 500 reached 49, 49, 49, 32
# of countries - rank 1000 reached 49, 49, 49, 32

Changing storeTable to tables[1] and storeValueRows to tables[3] will print out:

Country 
Canada 6, 5, 2, 1
Brazil 7, 5, 2, 1
Hungary 7, 6, 2, 2
Sweden 9, 5, 1, 1
Malaysia 10, 5, 2, 1
Mexico 10, 5, 2, 2
Greece 10, 6, 2, 1
Israel 10, 6, 2, 1
Bulgaria 10, 6, 2, -
Chile 10, 6, 2, -
Vietnam 10, 6, 2, -
Ireland 10, 6, 2, -
Kuwait 10, 6, 2, -
Finland 10, 7, 2, -
United Arab Emirates 10, 7, 2, -
Argentina 10, 7, 2, -
Slovakia 10, 7, 2, -
Romania 10, 8, 2, -
Belgium 10, 9, 2, 3
New Zealand 10, 13, 2, -
Portugal 10, 14, 2, -
Indonesia 10, 14, 2, -
South Africa 10, 15, 2, -
Ukraine 10, 15, 2, -
Philippines 10, 16, 2, -
United Kingdom 11, 5, 2, 1
Denmark 11, 6, 2, 2
Australia 12, 9, 2, 3
United States 13, 9, 2, 2
Austria 13, 9, 2, 3
Turkey 14, 5, 2, 1
Egypt 14, 5, 2, 1
Netherlands 14, 8, 2, 2
Spain 14, 11, 2, 4
Thailand 15, 10, 2, 3
Singapore 16, 10, 2, 2
Switzerland 16, 10, 2, 3
Taiwan 17, 12, 2, 4
Poland 17, 13, 2, 5
France 18, 8, 2, 3
Czech Republic 18, 13, 2, 6
Germany 19, 11, 2, 3
Norway 20, 14, 2, 5
India 20, 14, 2, 5
Italy 20, 15, 2, 7
Hong Kong 26, 21, 2, -
Japan 33, 16, 4, 5
Russia 33, 17, 2, 7
South Korea 46, 27, 2, 5

Not the best code, and it can be improved further, but the logic works well.

Hope this helps.

Edit:

If you want the format South Korea, 46, 27, 2, 5 instead of South Korea 46, 27, 2, 5 (note the , after the country name), just change this:

storeRank.append(row.get_text().strip())

to this:

storeRank.append(row.get_text().strip() + ",")
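(An alternative to tacking the comma onto each rank string is to join all the fields at once, which avoids trailing-separator bookkeeping. A small sketch with sample values, not the scraped data:)

```python
rank = "South Korea"
values = ["46", "27", "2", "5"]

# Joining label and values together puts a ", " between every field.
record = ", ".join([rank] + values)
print(record)  # South Korea, 46, 27, 2, 5
```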

Answer 2 (score: 0)

Thought I would put my alternate version here. I don't even know why people still use BeautifulSoup for web scraping; working with XPath directly through lxml is much easier. Here is the same problem in what is probably an easier form to read and update:

from lxml import html, etree

tree = html.parse("sample.html").xpath('//body/div/div')

lxml_getValue = lambda x: etree.tostring(x, method="text", encoding='UTF-8').strip()
lxml_getData = lambda x: "{}, {}, {}, {}".format(
    lxml_getValue(x.xpath('.//td')[0]), lxml_getValue(x.xpath('.//td')[2]),
    lxml_getValue(x.xpath('.//td')[4]), lxml_getValue(x.xpath('.//td')[6]))

locations = tree[0].xpath('.//tbody')[1].xpath('./tr')
locations.pop(0) # Don't need first row
data = tree[1].xpath('.//tbody')[1].xpath('./tr')
data.pop(0) # Don't need first row

for f, b in zip(locations, data):
    print(lxml_getValue(f), lxml_getData(b))
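(For completeness: when neither lxml nor bs4 is available, the standard library's html.parser can do a cruder version of the same cell extraction, without any of lxml's XPath convenience. A minimal sketch over a hypothetical two-row snippet:)

```python
from html.parser import HTMLParser

class TdCollector(HTMLParser):
    """Collect the text of every <td>, grouped by table row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_td = [], None, False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag == "td":
            self._in_td = False

    def handle_data(self, data):
        # Keep only non-empty text that appears inside a <td> within a <tr>.
        if self._in_td and self._row is not None and data.strip():
            self._row.append(data.strip())

parser = TdCollector()
parser.feed("<table><tr><td>Canada</td><td>6</td></tr>"
            "<tr><td>Brazil</td><td>7</td></tr></table>")
print(parser.rows)  # [['Canada', '6'], ['Brazil', '7']]
```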