Scrapy SitemapSpider dupefilters a single request and finishes

Date: 2017-06-02 11:55:47

Tags: python scrapy

I am running a scraper (through the docker-compose setup shown in the log below) that has so far downloaded 14,550 items. At some point, however, it appeared to get stuck; the output mentioned a 'loss' during a download. Since the scraper specifies a FilesPipeline in its settings, I tried stopping it and restarting it.

Strangely, though, upon restarting it the spider hits a single request in the dupefilter and finishes (see the log below). I don't understand why the spider does this; can anyone point me in the right direction for debugging it?


Here is the log from the restarted run:

scraper_1 | Tor appears to be working. Proceeding with command...
scraper_1 | 2017-06-02 11:38:20 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: apkmirror_scraper)
scraper_1 | 2017-06-02 11:38:20 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'apkmirror_scraper', 'NEWSPIDER_MODULE': 'apkmirror_scraper.spiders', 'SPIDER_MODULES': ['apkmirror_scraper.spiders']}
scraper_1 | 2017-06-02 11:38:20 [apkmirror_scraper.extensions] INFO: The crawler will scrape the following (randomized) number of items before changing identity: 32
scraper_1 | 2017-06-02 11:38:20 [scrapy.middleware] INFO: Enabled extensions:
scraper_1 | ['scrapy.extensions.corestats.CoreStats',
scraper_1 |  'scrapy.extensions.telnet.TelnetConsole',
scraper_1 |  'scrapy.extensions.memusage.MemoryUsage',
scraper_1 |  'scrapy.extensions.closespider.CloseSpider',
scraper_1 |  'scrapy.extensions.feedexport.FeedExporter',
scraper_1 |  'scrapy.extensions.logstats.LogStats',
scraper_1 |  'scrapy.extensions.spiderstate.SpiderState',
scraper_1 |  'apkmirror_scraper.extensions.TorRenewIdentity']
scraper_1 | 2017-06-02 11:38:20 [scrapy.middleware] INFO: Enabled downloader middlewares:
scraper_1 | ['scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
scraper_1 |  'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
scraper_1 |  'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
scraper_1 |  'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
scraper_1 |  'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware',
scraper_1 |  'scrapy.downloadermiddlewares.retry.RetryMiddleware',
scraper_1 |  'apkmirror_scraper.downloadermiddlewares.TorRetryMiddleware',
scraper_1 |  'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
scraper_1 |  'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
scraper_1 |  'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
scraper_1 |  'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
scraper_1 |  'scrapy.downloadermiddlewares.stats.DownloaderStats']
scraper_1 | 2017-06-02 11:38:20 [scrapy.middleware] INFO: Enabled spider middlewares:
scraper_1 | ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
scraper_1 |  'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
scraper_1 |  'scrapy.spidermiddlewares.referer.RefererMiddleware',
scraper_1 |  'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
scraper_1 |  'scrapy.spidermiddlewares.depth.DepthMiddleware']
scraper_1 | 2017-06-02 11:38:21 [botocore.credentials] DEBUG: Looking for credentials via: env
scraper_1 | 2017-06-02 11:38:21 [botocore.credentials] DEBUG: Looking for credentials via: assume-role
scraper_1 | 2017-06-02 11:38:21 [botocore.credentials] DEBUG: Looking for credentials via: shared-credentials-file
scraper_1 | 2017-06-02 11:38:21 [botocore.credentials] INFO: Found credentials in shared credentials file: ~/.aws/credentials
scraper_1 | 2017-06-02 11:38:21 [botocore.loaders] DEBUG: Loading JSON file: /usr/local/lib/python3.6/site-packages/botocore/data/endpoints.json
scraper_1 | 2017-06-02 11:38:21 [botocore.loaders] DEBUG: Loading JSON file: /usr/local/lib/python3.6/site-packages/botocore/data/s3/2006-03-01/service-2.json
scraper_1 | 2017-06-02 11:38:21 [botocore.loaders] DEBUG: Loading JSON file: /usr/local/lib/python3.6/site-packages/botocore/data/_retry.json
scraper_1 | 2017-06-02 11:38:21 [botocore.client] DEBUG: Registering retry handlers for service: s3
scraper_1 | 2017-06-02 11:38:21 [botocore.hooks] DEBUG: Event creating-client-class.s3: calling handler <function add_generate_presigned_post at 0x7f9739657a60>
scraper_1 | 2017-06-02 11:38:21 [botocore.hooks] DEBUG: Event creating-client-class.s3: calling handler <function add_generate_presigned_url at 0x7f9739657840>
scraper_1 | 2017-06-02 11:38:21 [botocore.client] DEBUG: Switching signature version for service s3 to version s3v4 based on config file override.
scraper_1 | 2017-06-02 11:38:21 [botocore.endpoint] DEBUG: Setting s3 timeout as (60, 60)
scraper_1 | 2017-06-02 11:38:21 [botocore.client] DEBUG: Defaulting to S3 virtual host style addressing with path style addressing fallback.
scraper_1 | 2017-06-02 11:38:21 [botocore.credentials] DEBUG: Looking for credentials via: env
scraper_1 | 2017-06-02 11:38:21 [botocore.credentials] DEBUG: Looking for credentials via: assume-role
scraper_1 | 2017-06-02 11:38:21 [botocore.credentials] DEBUG: Looking for credentials via: shared-credentials-file
scraper_1 | 2017-06-02 11:38:21 [botocore.credentials] INFO: Found credentials in shared credentials file: ~/.aws/credentials
scraper_1 | 2017-06-02 11:38:21 [botocore.loaders] DEBUG: Loading JSON file: /usr/local/lib/python3.6/site-packages/botocore/data/endpoints.json
scraper_1 | 2017-06-02 11:38:21 [botocore.loaders] DEBUG: Loading JSON file: /usr/local/lib/python3.6/site-packages/botocore/data/s3/2006-03-01/service-2.json
scraper_1 | 2017-06-02 11:38:21 [botocore.loaders] DEBUG: Loading JSON file: /usr/local/lib/python3.6/site-packages/botocore/data/_retry.json
scraper_1 | 2017-06-02 11:38:21 [botocore.client] DEBUG: Registering retry handlers for service: s3
scraper_1 | 2017-06-02 11:38:21 [botocore.hooks] DEBUG: Event creating-client-class.s3: calling handler <function add_generate_presigned_post at 0x7f9739657a60>
scraper_1 | 2017-06-02 11:38:21 [botocore.hooks] DEBUG: Event creating-client-class.s3: calling handler <function add_generate_presigned_url at 0x7f9739657840>
scraper_1 | 2017-06-02 11:38:21 [botocore.client] DEBUG: Switching signature version for service s3 to version s3v4 based on config file override.
scraper_1 | 2017-06-02 11:38:21 [botocore.endpoint] DEBUG: Setting s3 timeout as (60, 60)
scraper_1 | 2017-06-02 11:38:21 [botocore.client] DEBUG: Defaulting to S3 virtual host style addressing with path style addressing fallback.
scraper_1 | 2017-06-02 11:38:21 [scrapy.middleware] INFO: Enabled item pipelines:
scraper_1 | ['scrapy.pipelines.images.ImagesPipeline',
scraper_1 |  'scrapy.pipelines.files.FilesPipeline']
scraper_1 | 2017-06-02 11:38:21 [scrapy.core.engine] INFO: Spider opened
scraper_1 | 2017-06-02 11:38:21 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
scraper_1 | 2017-06-02 11:38:21 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
scraper_1 | 2017-06-02 11:38:21 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://www.apkmirror.com/sitemap_index.xml> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
scraper_1 | 2017-06-02 11:38:21 [scrapy.core.engine] INFO: Closing spider (finished)
scraper_1 | 2017-06-02 11:38:21 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
scraper_1 | {'dupefilter/filtered': 1,
scraper_1 |  'finish_reason': 'finished',
scraper_1 |  'finish_time': datetime.datetime(2017, 6, 2, 11, 38, 21, 946421),
scraper_1 |  'log_count/DEBUG': 26,
scraper_1 |  'log_count/INFO': 10,
scraper_1 |  'memusage/max': 73805824,
scraper_1 |  'memusage/startup': 73805824,
scraper_1 |  'start_time': datetime.datetime(2017, 6, 2, 11, 38, 21, 890151)}
scraper_1 | 2017-06-02 11:38:21 [scrapy.core.engine] INFO: Spider closed (finished)
apkmirrorscrapercompose_scraper_1 exited with code 0

Here are some details about the spider: the scraper crawls apkmirror.com and is based on SitemapSpider.

I have overridden the dupefilter class (the URLDupefilter referenced in the custom_settings below), and the spider itself is defined as follows:

from scrapy.spiders import SitemapSpider
from apkmirror_scraper.spiders.base_spider import BaseSpider


class ApkmirrorSitemapSpider(SitemapSpider, BaseSpider):
    name = 'apkmirror'
    sitemap_urls = ['http://www.apkmirror.com/sitemap_index.xml']
    sitemap_rules = [(r'.*-android-apk-download/$', 'parse')]

    custom_settings = {
        'CLOSESPIDER_PAGECOUNT': 0,              # 0 disables the page-count limit
        'CLOSESPIDER_ERRORCOUNT': 1,             # close the spider on the first error
        'CONCURRENT_REQUESTS': 32,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 16,
        'TOR_RENEW_IDENTITY_ENABLED': True,      # settings for the custom
        'TOR_ITEMS_TO_SCRAPE_PER_IDENTITY': 50,  # TorRenewIdentity extension
        'FEED_URI': '/scraper/apkmirror_scraper/data/apkmirror.json',
        'FEED_FORMAT': 'json',
        'DUPEFILTER_CLASS': 'apkmirror_scraper.dupefilters.URLDupefilter',  # custom dupefilter
    }

    download_timeout = 60 * 15.0        # Allow 15 minutes for downloading APKs
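
The URLDupefilter itself is not reproduced here. For reference, a minimal URL-keyed dupefilter of this kind could look like the following sketch (illustrative only, not necessarily the exact class used):

from scrapy.dupefilters import RFPDupeFilter


class URLDupefilter(RFPDupeFilter):
    # Sketch only: the real apkmirror_scraper.dupefilters.URLDupefilter
    # is not shown in this question.

    def request_fingerprint(self, request):
        # Fingerprint requests by their plain URL instead of Scrapy's
        # default hash of method, canonicalized URL, and body.
        return request.url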

1 Answer:

Answer 0 (score: 2):

It looks like SitemapSpider's start_requests() does NOT set dont_filter=True on its requests, contrary to the default Spider class.
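
For comparison, here is the relevant logic, paraphrased from the Scrapy 1.x source (approximate, not verbatim):

# scrapy/spiders/__init__.py -- the default Spider bypasses the
# dupefilter for its start requests:
def start_requests(self):
    for url in self.start_urls:
        yield Request(url, dont_filter=True)

# scrapy/spiders/sitemap.py -- SitemapSpider does not, so its sitemap
# request is subject to duplicate filtering:
def start_requests(self):
    for url in self.sitemap_urls:
        yield Request(url, self._parse_sitemap)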

So, in effect, when you restart the crawl, the sitemap request is probably already marked as "seen" in your working directory, and is therefore filtered out.
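
This happens because the dupefilter persists its request fingerprints when the crawl runs with a job directory, e.g. (hypothetical path):

    scrapy crawl apkmirror -s JOBDIR=crawls/apkmirror-1

On the next run with the same JOBDIR, the fingerprints are reloaded from crawls/apkmirror-1/requests.seen, which already contains the sitemap URL, so the request is dropped immediately.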

You can override your spider's start_requests() to set dont_filter=True on the request for http://www.apkmirror.com/sitemap_index.xml. You could also open a bug report against Scrapy.
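
A minimal sketch of such an override (untested; note that _parse_sitemap is an internal SitemapSpider callback):

from scrapy import Request
from scrapy.spiders import SitemapSpider
from apkmirror_scraper.spiders.base_spider import BaseSpider


class ApkmirrorSitemapSpider(SitemapSpider, BaseSpider):
    # ... name, sitemap_urls, sitemap_rules, custom_settings as above ...

    def start_requests(self):
        for url in self.sitemap_urls:
            # dont_filter=True lets the sitemap request through the
            # (persisted) dupefilter on every restart.
            yield Request(url, self._parse_sitemap, dont_filter=True)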