How do I parse two different items?

Date: 2020-06-07 13:42:27

Tags: scrapy

I'm using Scrapy 2.1 to parse a category results page.

I want to scrape two different things from that site:

  1. Category information, such as the title and URL
  2. The product items on that category page

The second part works, but I'm struggling with how to store the category information. My first attempt was to create another Item class, CatItem:

import scrapy

class CatItem(scrapy.Item):
    title       = scrapy.Field() # char - category title
    url         = scrapy.Field() # char - category URL
    level       = scrapy.Field() # int  - category depth

class ProductItem(scrapy.Item):
    title       = scrapy.Field() # char - product title

Let's parse the page:

def parse_item(self, response):

    # save category info
    category = CatItem()
    category['url']     = response.url
    category['title']   = response.url  # note: currently a placeholder, reusing the URL as the title
    category['level']   = 1
    yield category

    # now let's parse all products within that category
    for selector in response.xpath("//article//ul/div[@data-qa-id='result-list-entry']"):

        product = ProductItem()
        product['title']          = selector.xpath(".//a/h2/text()").extract_first()
        yield product

My pipeline:

from scrapy.utils.project import get_project_settings

class mysql_pipeline(object):
    def __init__(self):
        self.create_connection()

    def create_connection(self):
        settings = get_project_settings()
        # ... connection setup elided ...

    def process_item(self, item, spider):
        self.store_db(item, spider)
        return item

Now I don't know how to proceed. The process_item definition only receives a single "item".

How can I also pass the category information to the store_db method?

1 Answer:

Answer 0 (score: 0)

You can check the item's type in the pipeline:

from your_project.items import CatItem, ProductItem

class YourPipeline(object):
    ...
    def process_item(self, item, spider):
        if isinstance(item, CatItem):
            save_category(item)  # save_category: your own storage helper
        return item
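
Applied to the mysql_pipeline from the question, a minimal sketch could look like the following; it assumes store_db is extended to take a table name, and the 'categories' and 'products' table names are hypothetical:

from your_project.items import CatItem, ProductItem

class mysql_pipeline(object):
    # __init__ / create_connection as in the question

    def process_item(self, item, spider):
        # Dispatch on the item class so each type is stored separately
        if isinstance(item, CatItem):
            self.store_db(item, spider, table='categories')   # hypothetical table name
        elif isinstance(item, ProductItem):
            self.store_db(item, spider, table='products')     # hypothetical table name
        return item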

Update with simple PoC code:

import scrapy
import csv
from scrapy.crawler import CrawlerProcess


class BooksPipeline(object):

    def process_item(self, item, spider):
        # Route each item type to its own CSV file
        filename = None
        if isinstance(item, CategoryItem):
            filename = 'Categories.csv'
        elif isinstance(item, BookItem):
            filename = 'Books.csv'
        with open(filename, 'a', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=['Title', 'URL'], lineterminator="\n")
            writer.writerow(item)
        return item

class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        for book_url in response.xpath('//ol/li//h3/a/@href').getall():
            yield scrapy.Request(
                url=response.urljoin(book_url),
                callback=self.parse_book,
            )

    def parse_book(self, response):
        # Yield the category info first, then the book itself
        category = CategoryItem()
        category['Title'] = response.xpath('//ul[@class="breadcrumb"]/li[last() - 1]/a/text()').get()
        category['URL'] = response.xpath('//ul[@class="breadcrumb"]/li[last() - 1]/a/@href').get()
        yield category

        book = BookItem()
        book['Title'] = response.xpath('//h1/text()').get()
        book['URL'] = response.url
        yield book

class BookItem(scrapy.Item):
    Title = scrapy.Field()
    URL = scrapy.Field()

class CategoryItem(scrapy.Item):
    Title = scrapy.Field()
    URL = scrapy.Field()

if __name__ == "__main__":
    process = CrawlerProcess(
        {
            'USER_AGENT': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36",
            'DOWNLOAD_TIMEOUT':100,
            'ITEM_PIPELINES': {
                '__main__.BooksPipeline': 300,
            }
        }
    )

    process.crawl(BooksSpider)
    process.start()
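
One caveat with the PoC: it appends rows without ever writing a CSV header, so Categories.csv and Books.csv contain bare data rows. A small variant, using a hypothetical append_row helper that writes the header only when the file is first created, could be:

import csv
import os

def append_row(filename, fieldnames, row):
    # Write the header only when the file does not exist yet
    write_header = not os.path.isfile(filename)
    with open(filename, 'a', encoding='utf-8', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, lineterminator="\n")
        if write_header:
            writer.writeheader()
        writer.writerow(row)

process_item could then call append_row(filename, ['Title', 'URL'], item) instead of opening the file inline.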