How do I crawl multiple domains with a single crawler?

Date: 2017-03-04 12:33:04

Tags: python-2.7 web-scraping beautifulsoup web-crawler

How can I scrape data from multiple domains with a single scraper? I have managed to crawl a single website with Beautiful Soup, but I can't figure out how to build a generic crawler.

2 Answers:

Answer 0 (score: 0)

The question as posed is flawed: the sites you want to scrape must have something in common. The example below relies on the one thing every HTML page shares, anchor tags, and simply lists the links it finds on each site the user enters.

from bs4 import BeautifulSoup
from urllib import request  # Python 3; on Python 2.7 use urllib2 instead

for counter in range(0, 10):
    # On Python 2.7, use raw_input() instead of input()
    site = input("Type the URL of your website: ")
    # Make a request to the site the user typed and read the raw HTML
    html = request.urlopen(site).read()
    # Parse the response with BeautifulSoup's built-in html.parser
    soup = BeautifulSoup(html, "html.parser")
    # Find and print every link on the page (skip anchors without an href)
    for link in soup.find_all('a', href=True):
        print(link['href'])

Answer 1 (score: 0)

As mentioned above, every site has its own unique set of selectors (element tags, CSS classes, and so on). A single generic crawler can't visit a URL and intuitively understand what to scrape.
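
One practical compromise is to declare each site's selectors up front in a per-domain configuration, so one crawler can still handle several known sites. Here is a minimal sketch using Beautiful Soup; the domains and selectors are made-up placeholders:

from urllib import request
from urllib.parse import urlparse

from bs4 import BeautifulSoup

# Hypothetical configuration: map each domain to the CSS selector
# that locates the content of interest on that site.
SELECTORS = {
    "example.com": "h1.title",
    "example.org": "div.article-body p",
}

def scrape(url):
    domain = urlparse(url).netloc
    selector = SELECTORS.get(domain)
    if selector is None:
        raise ValueError("No selector configured for " + domain)
    html = request.urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    # select() takes a CSS selector and returns all matching elements
    return [el.get_text(strip=True) for el in soup.select(selector)]

Adding support for a new site then only means adding one entry to the dictionary, not writing a new crawler.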

BeautifulSoup may not be the best choice for this kind of task, since it is only an HTML parser. Scrapy is a full web-crawling framework and is considerably more robust than BS4 for crawling multiple sites.
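
For reference, a minimal Scrapy spider that crawls several domains from a single place could look like the following sketch; the spider name and start URLs are placeholders:

import scrapy

class MultiDomainSpider(scrapy.Spider):
    # Hypothetical spider name and start URLs, for illustration only
    name = "multi_domain"
    start_urls = [
        "http://example.com",
        "http://example.org",
    ]

    def parse(self, response):
        # Collect every link on the page; urljoin() resolves relative URLs
        for href in response.css("a::attr(href)").getall():
            yield {"url": response.urljoin(href)}

Saved as multi_domain_spider.py, it can be run with: scrapy runspider multi_domain_spider.py -o links.json. Scrapy then handles request scheduling, retries, and throttling for you.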

A similar question on Stack Overflow: Scrapy approach to scraping multiple URLs

Scrapy documentation: https://doc.scrapy.org/en/latest/intro/tutorial.html