Getting a web bot to correctly crawl all pages of a website

Asked: 2015-02-27 03:53:31

Tags: python web-scraping web-crawler beautifulsoup

I'm trying to crawl through all of a website's pages and extract every instance of a certain tag/class.

It seems to keep pulling the same information from the same page over and over, and I'm not sure why: len(urls) (the stack of URLs being scraped) rises and falls like a bell curve, which suggests I am at least crawling through the links, but I may be pulling/printing the information incorrectly.

import urllib
import urlparse
import re
from bs4 import BeautifulSoup

url = "http://weedmaps.com"

If I try starting with just the base weedmaps.com URL, nothing gets printed, but if I start from a page that contains the type of data I'm looking for, e.g. url = "https://weedmaps.com/dispensaries/shakeandbake", then it does pull information out, but it prints the same information over and over.

urls = [url] # Stack of urls to scrape
visited = [url] # Record of scraped urls
htmltext = urllib.urlopen(urls[0]).read()

# While the stack of URLs is non-empty, keep scraping for links
while len(urls) > 0:
    try:
        htmltext = urllib.urlopen(urls[0]).read()

    # If the fetch fails, report the URL that caused it
    except:
        print urls[0]

    # Get and print information
    soup = BeautifulSoup(htmltext)
    urls.pop(0)
    info = soup.findAll("div", {"class": "story-heading"})

    print info

    # Number of URLs left in the stack
    print len(urls)

    # Resolve relative links and queue unvisited same-site URLs
    for tag in soup.findAll('a', href=True):
        tag['href'] = urlparse.urljoin(url, tag['href'])
        if url in tag['href'] and tag['href'] not in visited:
            urls.append(tag['href'])
            visited.append(tag['href'])

1 Answer:

Answer (score: 3):

The problem with your current code is that the URLs you are putting into the queue (urls) point to the same page, only at different anchors, i.e. URLs that differ only in the #fragment after the path.

In other words, the tag['href'] not in visited condition doesn't filter out distinct URLs that point to the same page; it only filters out exact string duplicates.
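If you want to keep your own crawler, one minimal fix (a sketch in the question's Python 2 style) is to strip the fragment with urlparse.urldefrag before checking visited:

from urlparse import urljoin, urldefrag

for tag in soup.findAll('a', href=True):
    # Resolve relative links against the base URL
    full_url = urljoin(url, tag['href'])
    # Drop any "#anchor" so URLs differing only by fragment collapse to one
    page_url, fragment = urldefrag(full_url)
    if url in page_url and page_url not in visited:
        urls.append(page_url)
        visited.append(page_url)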

That said, from what I can see, you are reinventing a web-scraping framework. One already exists that will save you time, keep your scraping code organized and clean, and make it significantly faster than your current solution: Scrapy.

What you need is a CrawlSpider with rules configured to follow links, for example:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class MachineSpider(CrawlSpider):
    name = 'weedmaps'
    allowed_domains = ['weedmaps.com']
    start_urls = ['https://weedmaps.com/dispensaries/shakeandbake']

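    # Follow links whose URL matches /dispensaries/ and parse each response with parse_hours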
    rules = [
        Rule(LinkExtractor(allow=r'/dispensaries/'), callback='parse_hours')
    ]

    def parse_hours(self, response):
        print response.url

        for hours in response.css('span[itemid="#store"] div.row.hours-row div.col-md-9'):
            print hours.xpath('text()').extract()
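Assuming the spider above is saved in a standalone file such as weedmaps_spider.py (the filename is illustrative), it can be run without creating a full Scrapy project:

scrapy runspider weedmaps_spider.py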

Instead of printing, your callback should return or yield Item instances, which can later be saved to a file or a database, or processed further in a pipeline.
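As a minimal sketch of that pattern (HoursItem and its fields are hypothetical names, not part of Scrapy), the callback above could be rewritten to yield items:

from scrapy.item import Item, Field

class HoursItem(Item):
    url = Field()    # page the hours were extracted from
    hours = Field()  # raw opening-hours text

# inside the spider:
def parse_hours(self, response):
    for hours in response.css('span[itemid="#store"] div.row.hours-row div.col-md-9'):
        item = HoursItem()
        item['url'] = response.url
        item['hours'] = hours.xpath('text()').extract()
        yield item

Each yielded item then flows through any configured item pipelines, where it can be validated, deduplicated, or written to storage.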
