Question

I want to automatically save city data from this website:

http://www.dataforcities.org/

I am using the BeautifulSoup library to get the data from this page:

http://open.dataforcities.org/details?4[]=2016

import urllib2
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://open.dataforcities.org/details?4[]=2016').read())

If I follow the example from "Web scraping with Python", I get the following error:

soup = BeautifulSoup(urllib2.urlopen('http://example.com').read())

for row in soup('table', {'class': 'metrics'})[0].tbody('tr'):
    tds = row('td')
    print tds[0].string, tds[1].string

IndexError                                Traceback (most recent call last)
<ipython-input-71-d688ff354182> in <module>()
----> 1 for row in soup('table', {'class': 'metrics'})[0].tbody('tr'):
      2     tds = row('td')
      3     print tds[0].string, tds[1].string

IndexError: list index out of range


Answer 1

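From a quick look at the site, a good way to approach this is to inspect the requests that the JavaScript on the page makes. They reveal the internal API used to collect the data that populates the page.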

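For example, for a particular city, the page makes a GET request to http://open.dataforcities.org/city/109/themes/2017, which returns a JSON response with many entries. You can fetch this information yourself with requests: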

>>> import requests
>>> response = requests.get('http://open.dataforcities.org/city/109/themes/2017')
>>> response.json()
[{'theme': 'Economy', 'score': 108, 'date': '2015', 'rank': '2/9'}, {'theme': 'Education', 'score': 97, 'date': '2015', 'rank': '8/9'}, {'theme': 'Energy', 'score': 110, 'date': '2015', 'rank': '1/9'},
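Since the goal is to save the data automatically, here is a minimal Python 3 sketch of writing such entries to CSV. It assumes every entry has the four keys visible in the output above (`theme`, `score`, `date`, `rank`); the function name `entries_to_csv` and the sample list are only illustrative, with the sample values copied from that output.

```python
import csv
import io

# Sample entries copied from the response shown above; a real run would
# pass response.json() instead.
SAMPLE_ENTRIES = [
    {'theme': 'Economy', 'score': 108, 'date': '2015', 'rank': '2/9'},
    {'theme': 'Education', 'score': 97, 'date': '2015', 'rank': '8/9'},
    {'theme': 'Energy', 'score': 110, 'date': '2015', 'rank': '1/9'},
]

# Assumed column set, based only on the keys visible in the sample.
FIELDS = ['theme', 'score', 'date', 'rank']


def entries_to_csv(entries, out):
    """Write a list of theme dictionaries to a file-like object as CSV."""
    # extrasaction='ignore' drops any keys that are not listed in FIELDS.
    writer = csv.DictWriter(out, fieldnames=FIELDS, extrasaction='ignore')
    writer.writeheader()
    for entry in entries:
        writer.writerow(entry)


buffer = io.StringIO()
entries_to_csv(SAMPLE_ENTRIES, buffer)
print(buffer.getvalue())
```

To write to disk instead, open a file with `open('themes.csv', 'w', newline='')` (the file name is a placeholder) and pass it as `out`.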

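So, with a bit of work, you can probably discover all the endpoints you need to get the data you want. That is just one approach. You could also use a browser automation tool like selenium, which can not only automate browser actions such as scrolling and clicking, but also execute arbitrary JavaScript and inspect the data held in JS.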

from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://example.com/page/to/scrape')
value = driver.execute_script('return someThing.value;')

But before going to a lot of trouble trying to scrape a website, you should always check whether it offers a documented public API.
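As for the IndexError in the question: `soup('table', {'class': 'metrics'})` returns an empty list when the downloaded HTML contains no matching table, and indexing `[0]` on an empty list raises exactly that error. This would be consistent with the tables being built by JavaScript after the page loads, though that is an assumption about this site. A minimal Python 3 sketch for checking it, using the standard library's `html.parser` rather than Beautiful Soup; the two HTML strings and the names `TableFinder` and `count_tables` are made up for illustration.

```python
from html.parser import HTMLParser


class TableFinder(HTMLParser):
    """Count <table> tags that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'table' and self.css_class in classes:
            self.count += 1


def count_tables(html, css_class):
    """Return how many tables with the given class appear in the raw HTML."""
    finder = TableFinder(css_class)
    finder.feed(html)
    return finder.count


# Made-up stand-ins: a page whose content is filled in later by JavaScript,
# and a page that ships the table directly in its HTML.
js_rendered = '<html><body><div id="app"></div><script src="app.js"></script></body></html>'
static = '<table class="metrics"><tbody><tr><td>a</td><td>b</td></tr></tbody></table>'

print(count_tables(js_rendered, 'metrics'))
print(count_tables(static, 'metrics'))
```

If the count for the real page source is zero, no parser will find the table in that HTML, and the data has to come from the underlying requests or from a rendered browser session.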

Answer 2

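You can scrape data from websites with Python; the Beautiful Soup library helps clean up the HTML and extract data from it. There are other libraries as well. Even NodeJS can do this.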

The main thing is your own logic. Python and Beautiful Soup will get you the data. You then have to analyze it and save it in a database.
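For the "save it in a database" step, a minimal Python 3 sketch using the standard library's `sqlite3` module. The table name `scores`, its columns, the function `save_rows`, and the sample rows are all hypothetical; real rows would be whatever your scraper extracts.

```python
import sqlite3


def save_rows(conn, rows):
    """Create the table if needed and insert (city, theme, score) tuples."""
    conn.execute(
        'CREATE TABLE IF NOT EXISTS scores (city TEXT, theme TEXT, score INTEGER)'
    )
    # Parameter placeholders keep scraped text from being treated as SQL.
    conn.executemany('INSERT INTO scores VALUES (?, ?, ?)', rows)
    conn.commit()


# Hypothetical rows, for illustration only.
rows = [('Example City', 'Economy', 108), ('Example City', 'Energy', 110)]

conn = sqlite3.connect(':memory:')  # pass a file path instead to keep the data
save_rows(conn, rows)
print(conn.execute('SELECT COUNT(*) FROM scores').fetchone()[0])
```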

Beautiful Soup Documentation

Others: Requests, lxml, Selenium, Scrapy

Example

from bs4 import BeautifulSoup
import requests

# Download the page and parse it with the built-in HTML parser
page = requests.get("http://www.dataforcities.org/")
soup = BeautifulSoup(page.content, 'html.parser')
# Collect every <a> tag on the page
all_links = soup.find_all("a")

In the same way, you can find any element you need. There are many functions available. Tutorials: web scraping tutorial, python and beautifulsoup
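Once you have a list of `<a>` tags from `find_all`, you usually want the URLs themselves. A minimal Python 3 sketch: the helper name `collect_links` is hypothetical, and plain dicts stand in for Beautiful Soup tags so the snippet runs without a network call (both expose `.get('href')`).

```python
from urllib.parse import urljoin


def collect_links(anchors, base_url):
    """Return unique absolute URLs from objects exposing .get('href')."""
    seen = []
    for anchor in anchors:
        href = anchor.get('href')
        if not href:
            continue  # skip <a> tags that have no href attribute
        url = urljoin(base_url, href)  # resolve relative links
        if url not in seen:
            seen.append(url)
    return seen


# Plain dicts stand in for Beautiful Soup tags; the hrefs are made up.
fake_anchors = [
    {'href': '/about'},
    {'href': 'http://example.com/x'},
    {},
    {'href': '/about'},
]
print(collect_links(fake_anchors, 'http://www.dataforcities.org/'))
```

With real tags, the call would be `collect_links(soup.find_all("a"), "http://www.dataforcities.org/")`.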

It is also best to consult the official documentation.

Python: Is it possible to scrape a very particular web page?
