Question

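I'm very new to Python, but I really want to learn it. I've been playing around with scraping data from a website and feel like I'm very close to a solution. The problem is that it only ever returns the first page, even though the URL in the code changes the page number on every iteration.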

The site I'm using is http://etfdb.com/etf/SPY/#etf-holdings&sort_name=weight&sort_order=desc&page=1, and the specific data table I'm trying to scrape is SPY Holdings (it says 506 holdings, then lists Apple, Microsoft, etc.).

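As you'll notice, the data table has a bunch of pages (this varies by ticker symbol; for this purpose, note that although SPY has 34 pages, it won't always be 34). It starts by showing 15 companies, and when you click 2 (to see the next 15 holdings), the page number in the URL increases by one.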

#to break up html
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
import csv
import math

#goes to url - determines the number of holdings and the number of pages the data table will need to loop through
my_url = "http://etfdb.com/etf/SPY/#etf-holdings&sort_name=weight&sort_order=desc&page=1"
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html,"html.parser")
#goes to url - scrapes from another section of the page and finds 506 holdings
num_holdings_text = page_soup.find('span',{'class': 'relative-metric-bubble-data'})
num_holdings = num_holdings_text.text
number_of_loops = int(num_holdings)
num_of_loops = number_of_loops/15
#goes to url - because the table shows 15 holdings at a time, this calcs number of pages I'll need to loop through
num_of_loops = math.ceil(num_of_loops)
holdings = []
for loop in range(1,num_of_loops+1):
    my_url = "http://etfdb.com/etf/SPY/#etf-holdings&sort_name=weight&sort_order=desc&page=" + str(loop)
    uClient = uReq(my_url)
    page_html = uClient.read()
    uClient.close()
    page_soup = soup(page_html, "html.parser")
    table = page_soup.find('table', {
    'class': 'table mm-mobile-table table-module2 table-default table-striped table-hover table-pagination'})
    table_body = table.find('tbody')
    table_rows = table_body.find_all('tr')
    for tr in table_rows:
        td = tr.find_all('td')
        row = [i.text.strip() for i in td]
        holdings.append(row)
        print(row)
print(holdings)

# write all collected rows once, after the loop has finished
with open('etfdatapull2.csv','w',newline='') as fp:
    a = csv.writer(fp, delimiter = ',')
    a.writerows(holdings)

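Again, the problem I'm having is that it just keeps returning the first page (e.g., it always returns only Apple through GE), even though the link is updating.

Thanks so much for your help. Again, I'm very new to this, so please dumb it down as much as possible!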

Answer 1

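The problem is that the site you're trying to scrape actually loads its data via JavaScript. If you use something like Chrome Developer Tools, you can see that on page 2 the site requests the following link: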

http://etfdb.com/data_set/?tm=1699&cond={by_etf:325}&no_null_sort=true&count_by_id=&sort=weight&order=desc&limit=15&offset=15

The data you're looking for is there; your logic is sound, you just need to scrape the link above instead.

If you remove the "offset" parameter and change the limit to 1000, you'll actually get all the data at once and can drop the pagination entirely.
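A rough sketch of that idea, assuming the endpoint still accepts the same query parameters as the link above. The names `build_data_url` and `fetch_holdings_raw` are hypothetical helpers, not anything the site provides, and the `tm` and `cond` values are copied from the link as-is (they are presumably specific to SPY). The format of the response is not assumed here, so the fetch helper just returns the raw text for you to inspect.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the link seen in Chrome Developer Tools.
BASE_URL = "http://etfdb.com/data_set/"


def build_data_url(limit=1000, offset=None):
    """Build the data URL; leaving offset out asks for everything in one request."""
    params = {
        "tm": "1699",            # copied from the link; meaning unknown
        "cond": "{by_etf:325}",  # copied from the link; presumably identifies SPY
        "no_null_sort": "true",
        "count_by_id": "",
        "sort": "weight",
        "order": "desc",
        "limit": str(limit),
    }
    if offset is not None:
        params["offset"] = str(offset)
    # urlencode percent-encodes the braces and colon in "cond"
    return BASE_URL + "?" + urlencode(params)


def fetch_holdings_raw(limit=1000):
    """Download the response body as text (hypothetical helper; format not assumed)."""
    with urlopen(build_data_url(limit)) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_holdings_raw()` and printing the first few hundred characters should show what the response looks like, which tells you how to parse it before writing the rows to CSV.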

Hope that helps!

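Edit: I should point out that the page you're loading is always the same (the first set of entries, starting with AAPL); the data is then loaded from the resource above via JavaScript, which replaces the HTML content you're scraping. Since your script looks at the raw HTML (and doesn't run the JavaScript that replaces the content), you get the same table over and over.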
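A related detail that may make this easier to see: in the question's URL, the page number sits after the `#`, which makes it part of the URL fragment. Fragments are handled by the browser and are not sent to the server in an HTTP request, so every iteration of the loop asks for the same resource. The sketch below uses a hypothetical helper, `request_target`, to show what is left of the URL once the fragment is set aside.

```python
from urllib.parse import urlsplit


def request_target(url):
    """Return (host, path-and-query), i.e. the parts that identify the request.

    The fragment (everything after '#') is deliberately left out, because
    HTTP clients do not transmit it to the server.
    """
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return parts.netloc, target


base = "http://etfdb.com/etf/SPY/#etf-holdings&sort_name=weight&sort_order=desc&page="
page1 = request_target(base + "1")
page2 = request_target(base + "2")

print(page1)           # ('etfdb.com', '/etf/SPY/')
print(page2)           # ('etfdb.com', '/etf/SPY/')
print(page1 == page2)  # True
```

Both "pages" resolve to the same host and path, which is consistent with the script receiving the same HTML every time.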

Python - web scraping a data table across multiple URLs
