Scraping all URLs from a website using Scrapy and Python

Asked: 2018-03-29 15:49:23

Tags: python list dictionary scrapy

I am writing a web scraper to pull a set of links (located at tree.xpath('//div[@class="work_area_content"]/a/@href')) from a website and to return the Title and URL of every leaf, grouped by the leaf's parent. I have two scrapers: one in plain Python and one in Scrapy for Python. What is the purpose of callbacks in the Scrapy Request method? Should the information be stored as a multidimensional or a single-dimension list (I believe multidimensional, but that adds complexity)? Which of the two pieces of code below is better? And if the Scrapy code is better, how do I migrate the plain Python code to Scrapy?

My understanding of a callback is that it passes a function's arguments on to another function; however, if the callback references itself, the data gets overwritten and is therefore lost, and you can no longer get back to the root data. Is that correct?
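For reference, here is a minimal sketch of how a Scrapy callback chain can carry data forward through response.meta. The spider name, the parse_detail method and the 'parent_url' meta key are made up for this illustration and are not taken from the code below:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ['http://1.1.1.1:1234/TestSuites']

    def parse(self, response):
        # Each link becomes a new Request; Scrapy later calls parse_detail
        # with the response produced by that request.
        for href in response.xpath('//div[@class="work_area_content"]/a/@href').extract():
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_detail,
                                 meta={'parent_url': response.url})

    def parse_detail(self, response):
        # meta belongs to this request/response pair, so nothing is overwritten
        # by the other requests that are in flight at the same time.
        yield {'parent': response.meta['parent_url'],
               'title': response.xpath('//span[@class="page_title"]/text()').extract_first(),
               'url': response.url}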

Python:

import requests
from lxml import html
from scrapy import Request

# Fragment from inside a Scrapy spider class: the loop below lives in its
# parse() method, and end_page_parse_TS / end_page_parse_TC are methods of
# the same spider.

# url_storage mirrors the three levels of the site: url_storage[i][j][k]
# ends up holding the absolute URL of the k-th leaf under the j-th child
# of the i-th top-level link.
url_storage = []

page = requests.get('http://1.1.1.1:1234/TestSuites')
tree = html.fromstring(page.content)
urls = tree.xpath('//div[@class="work_area_content"]/a/@href')

for i, url in enumerate(urls):
    absolute_url = "".join(['http://1.1.1.1:1234/', url])
    url_storage.append([])                       # new slot for this top-level link
    print(url_storage)
    page = requests.get(absolute_url)
    tree2 = html.fromstring(page.content)
    urls2 = tree2.xpath('//div[@class="work_area_content"]/a/@href')
    for j, url2 in enumerate(urls2):
        absolute_url = "".join(['http://1.1.1.1:1234/', url2])
        url_storage[i].append([])                # new slot for this second-level link
        page = requests.get(absolute_url)
        tree3 = html.fromstring(page.content)
        urls3 = tree3.xpath('//div[@class="work_area_content"]/a/@href')
        for k, url3 in enumerate(urls3):
            absolute_url = "".join(['http://1.1.1.1:1234/', url3])
            url_storage[i][j].append(absolute_url)   # leaf URL
            page = requests.get(absolute_url)
            tree4 = html.fromstring(page.content)
            urls4 = tree4.xpath('//div[@class="work_area_content"]/a/@href')
            title = tree4.xpath('//span[@class="page_title"]/text()')
            yield Request(url_storage[i][j][k], callback=self.end_page_parse_TS,
                          meta={"Title": title, "URL": urls4})
            #yield Request(absolute_url, callback=self.end_page_parse_TC, meta={"Title": title, "URL": urls4})

def end_page_parse_TS(self, response):
    print(response.body)
    url = response.meta.get('URL')
    title = response.meta.get('Title')

    yield{'URL': url, 'Title': title}

def end_page_parse_TC(self, response):
    url = response.meta.get('URL')
    title = response.meta.get('Title')
    description = response.meta.get('Description')

    description = response.xpath('//table[@class="wiki_table"]//td[contains(text(), "description")]/..').extract()
    yield{'URL': url, 'Title': title, 'Description':description}
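To make the comparison concrete, the nested loops above could in principle be collapsed into a single callback that schedules itself via scrapy.Request and tracks how deep it is in meta, so that each response knows whether it is a leaf. This is only a sketch under that assumption; the spider name, the 'level' meta key and the depth limit of 3 are illustrative, not taken from the original code:

import scrapy

class TestSuiteSpider(scrapy.Spider):
    name = "testsuites"
    start_urls = ['http://1.1.1.1:1234/TestSuites']

    def parse(self, response):
        level = response.meta.get('level', 0)
        links = response.xpath('//div[@class="work_area_content"]/a/@href').extract()
        if links and level < 3:
            # Not a leaf yet: schedule each child with its own Request and meta,
            # so the data of one branch never overwrites another.
            for href in links:
                yield scrapy.Request(response.urljoin(href),
                                     callback=self.parse,
                                     meta={'level': level + 1})
        else:
            # Leaf page: emit its Title and URL.
            yield {'Title': response.xpath('//span[@class="page_title"]/text()').extract_first(),
                   'URL': response.url}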

Scrapy:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider
from datablogger_scraper.items import DatabloggerScraperItem


class DatabloggerSpider(CrawlSpider):
    # The name of the spider
    name = "datablogger"

    # The domains that are allowed (links to other domains are skipped)
    allowed_domains = ['1.1.1.1']

    # The URLs to start with
    start_urls = ['http://1.1.1.1:1234/TestSuites']

    # This spider has one rule: extract all (unique and canonicalized) links, follow them and parse them using the parse_items method
    rules = [
        Rule(
            LinkExtractor(
                canonicalize=True,
                unique=True
            ),
            follow=True,
            callback="parse_items"
        )
    ]

    # Method which starts the requests by visiting all URLs specified in start_urls
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, dont_filter=True)

    # Method for parsing items
    def parse_items(self, response):
        # The list of items that are found on the particular page
        items = []
        # Only extract canonicalized and unique links (with respect to the current page)
        links = LinkExtractor(canonicalize=True, unique=True).extract_links(response)
        # Now go through all the found links
        for link in links:
            # A fresh item per link; reusing a single item object would make
            # every entry in items point at the same (last) url_to.
            item = DatabloggerScraperItem()
            item['url_from'] = response.url
            item['url_to'] = link.url
            items.append(item)
        # Return all the found items
        return items
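Assuming the datablogger_scraper project that the imports refer to is set up in the usual way, the CrawlSpider would be run from the project directory with Scrapy's standard feed export, for example:

scrapy crawl datablogger -o links.csv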
