Question

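I scraped some data from the website given below, but I am unable to get this data output to Excel. I also stored the table I scraped as a dictionary, but the key and value pairs are out of sync. Could someone please help?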


Answer 1

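You are iterating over the list and storing each result in the same variable, which gets overwritten on every iteration. Try the code below; I think it will work.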

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

url =requests.get("http://stats.espncricinfo.com/ci/content/records/307847.html" )
soup = bs(url.text, 'lxml')
soup_1 = soup.find(class_ = "recordsTable")
soup_pages = soup_1.find_all('a', href= True)

state_links =[]
state_id =[]
for link in soup_pages:
    state_links.append(link['href'])
    state_id.append(link.getText())

Total_dict = dict()

for a,year in zip(state_links,state_id):
    parse_link = "http://stats.espncricinfo.com"+a
    url_new = requests.get(parse_link)
    soup_new = bs(url_new.text, 'lxml')
    soup_table = soup_new.find(class_="engineTable")
    newdictlist = list()
    col_name =list()
    row_name =list()
    for col in soup_table.findAll('th'):
        col_name.append((col.text).lstrip().rstrip())
    for row in soup_table.findAll("td"):
        row_name.append(row.text.lstrip().rstrip())
    no_of_matches = len(row_name) // len(col_name)
    row_count = 0
    for h in range(no_of_matches):
        newdict = dict()
        for i in col_name:
            newdict[i] = row_name[row_count]
            row_count=row_count+1
        newdictlist.append(newdict)
    print(newdictlist)
    Total_dict[year] = newdictlist
print(Total_dict)

Output: {'1877': [{'Team 1': 'Australia', 'Team 2': 'England', 'Winner': 'Australia', 'Margin': '45 runs', 'Ground': 'Melbourne', 'Match Date': 'Mar 15-19, 1877', 'Scorecard': 'Test # 1'}, {'Team 1': 'Australia', 'Team 2': 'England', 'Winner': 'England', 'Margin': '4 wickets', 'Ground': 'Melbourne', 'Match Date': 'Mar 31-Apr 4, 1877', 'Scorecard': 'Test # 2'}], '1879': [{'Team 1': 'Australia', 'Team 2': 'England', 'Winner': 'Australia', 'Margin': '10 wickets', 'Ground': 'Melbourne', 'Match Date': 'Jan 2-4, 1879', 'Scorecard': 'Test # 3'}], ............}
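Since the question also asks about getting the data into Excel, here is a minimal sketch of one way to flatten a dictionary shaped like Total_dict into a pandas DataFrame. The helper name dict_to_frame, the "Year" column, and the file name matches.xlsx are assumptions made for illustration, and the sample data is a shortened copy of the output above so the snippet runs without a network request.

```python
import pandas as pd

# Shortened sample in the same shape as Total_dict: {year: [row_dict, ...]}
total_dict = {
    "1877": [
        {"Team 1": "Australia", "Team 2": "England", "Winner": "Australia"},
        {"Team 1": "Australia", "Team 2": "England", "Winner": "England"},
    ],
    "1879": [
        {"Team 1": "Australia", "Team 2": "England", "Winner": "Australia"},
    ],
}


def dict_to_frame(data):
    """Flatten {year: [row_dict, ...]} into one DataFrame with a 'Year' column.

    Hypothetical helper, not part of the original answer.
    """
    rows = []
    for year, matches in data.items():
        for match in matches:
            # Copy the row and record which year it came from.
            rows.append({"Year": year, **match})
    return pd.DataFrame(rows)


df = dict_to_frame(total_dict)
print(df)

# Writing to Excel needs an engine such as openpyxl to be installed,
# so the call is left commented out here (file name is an assumption):
# df.to_excel("matches.xlsx", index=False)
```

Keeping the year as an ordinary column rather than as the dictionary key gives one flat table, which tends to be easier to filter and sort once it is opened in Excel.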

Answer 2

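You have two loops, but you are not storing the column names and row names to add to newdict. Here is my solution. Note that val_list is longer than key_list, so the loop below fills in only the first table row.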

# create 2 lists to store key and value
key_list = []
val_list = []
newdict = dict()
for col in soup_table.findAll('th'):
    key_list.append((col.text).lstrip().rstrip())

for row in soup_table.findAll("td"):
    val_list.append(row.text.lstrip().rstrip())

index = 0
# loop over key_list and add each key/value pair to the dict
for key in key_list:
    newdict[key] = val_list[index]  # index with [], not ()
    index += 1
print(newdict)
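Because val_list holds the cells of every row while key_list holds only one set of headers, the values can be split into row-sized chunks and each chunk paired with the headers. The following is a rough sketch of that idea, assuming every row has exactly one cell per header; the function name rows_from_cells and the sample lists are made up for illustration.

```python
def rows_from_cells(keys, values):
    """Split a flat list of cell values into one dict per table row.

    Hypothetical helper: assumes len(values) is a multiple of len(keys).
    """
    width = len(keys)
    if width == 0 or len(values) % width != 0:
        raise ValueError("cell count is not a multiple of the header count")
    return [
        dict(zip(keys, values[start:start + width]))
        for start in range(0, len(values), width)
    ]


# Sample data standing in for key_list and val_list
key_list = ["Team 1", "Team 2", "Winner"]
val_list = [
    "Australia", "England", "Australia",
    "Australia", "England", "England",
]

records = rows_from_cells(key_list, val_list)
print(records)
```

The explicit length check is there because a table with merged or missing cells would otherwise shift every later value under the wrong header without raising an error.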

Storing a scraped table as a dictionary and outputting it as a pandas DataFrame
