How do I crawl multiple domains with a single crawler?

Date: 2017-03-04 12:33:04

Tags: python-2.7 web-scraping beautifulsoup web-crawler

How can I scrape data from multiple domains with a single scraper? I have managed to crawl a single website with Beautiful Soup, but I can't figure out how to build a generic crawler.

2 Answers:

Answer 0 (score: 0)

The question as posed is flawed: the sites you want to scrape must have something in common. The example below relies on the one thing every HTML page shares, anchor tags, and simply lists the links it finds on each site the user enters.

from bs4 import BeautifulSoup
from urllib import request  # Python 3; on Python 2.7 use urllib2 instead

for counter in range(0, 10):
    # On Python 2.7, use raw_input() instead of input()
    site = input("Type the URL of your website: ")
    # Make a request to the site the user typed and read the raw HTML
    html = request.urlopen(site).read()
    # Parse the response with BeautifulSoup's built-in html.parser
    soup = BeautifulSoup(html, "html.parser")
    # Find and print every link on the page (skip anchors without an href)
    for link in soup.find_all('a', href=True):
        print(link['href'])

Answer 1 (score: 0)

As mentioned above, every site has its own unique set of selectors (element tags, CSS classes, and so on). A single generic crawler can't visit a URL and intuitively understand what to scrape.
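
One practical compromise is to declare each site's selectors up front in a per-domain configuration, so one crawler can still handle several known sites. Here is a minimal sketch using Beautiful Soup; the domains and selectors are made-up placeholders:

from urllib import request
from urllib.parse import urlparse

from bs4 import BeautifulSoup

# Hypothetical configuration: map each domain to the CSS selector
# that locates the content of interest on that site.
SELECTORS = {
    "example.com": "h1.title",
    "example.org": "div.article-body p",
}

def scrape(url):
    domain = urlparse(url).netloc
    selector = SELECTORS.get(domain)
    if selector is None:
        raise ValueError("No selector configured for " + domain)
    html = request.urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    # select() takes a CSS selector and returns all matching elements
    return [el.get_text(strip=True) for el in soup.select(selector)]

Adding support for a new site then only means adding one entry to the dictionary, not writing a new crawler.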

BeautifulSoup may not be the best choice for this kind of task, since it is only an HTML parser. Scrapy is a full web-crawling framework and is considerably more robust than BS4 for crawling multiple sites.
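
For reference, a minimal Scrapy spider that crawls several domains from a single place could look like the following sketch; the spider name and start URLs are placeholders:

import scrapy

class MultiDomainSpider(scrapy.Spider):
    # Hypothetical spider name and start URLs, for illustration only
    name = "multi_domain"
    start_urls = [
        "http://example.com",
        "http://example.org",
    ]

    def parse(self, response):
        # Collect every link on the page; urljoin() resolves relative URLs
        for href in response.css("a::attr(href)").getall():
            yield {"url": response.urljoin(href)}

Saved as multi_domain_spider.py, it can be run with: scrapy runspider multi_domain_spider.py -o links.json. Scrapy then handles request scheduling, retries, and throttling for you.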

A similar question on Stack Overflow: Scrapy approach to scraping multiple URLs

Scrapy documentation: https://doc.scrapy.org/en/latest/intro/tutorial.html