scrapy-splash does not return the HTML rendered by Splash

Asked: 2019-04-02 08:07:34

Tags: python scrapy splash scrapy-splash

I have installed Splash and scrapy-splash in a Python virtual environment (Ubuntu 16.04), following the instructions in the README (middleware settings, etc.). Although I get no errors in the log file (apparently), the HTML returned by scrapy-splash does not contain the HTML rendered by Splash, only the HTML downloaded by Scrapy itself (without Splash).

In some cases I can get the correct HTML. These are:

However, scrapy-splash does not return the correct HTML when using SplashRequest:

yield SplashRequest(url, self.parse, endpoint='render.html', args={'wait': 0.5})
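One way to rule Splash itself out is to reproduce the same render.html request outside Scrapy. A minimal stdlib sketch of the POST that scrapy-splash sends to the endpoint (the local Splash URL is taken from the settings below; sending the request of course requires the Docker container to be running):

```python
import json
from urllib import request

SPLASH_RENDER_URL = "http://127.0.0.1:8050/render.html"  # local Splash instance

def build_render_request(url, wait=0.5):
    """Build the JSON POST that render.html expects ({"url": ..., "wait": ...})."""
    payload = json.dumps({"url": url, "wait": wait}).encode("utf-8")
    return request.Request(
        SPLASH_RENDER_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_render_request("https://www.tampabay.com/events/")
# request.urlopen(req).read() would return the rendered HTML if Splash is up
```

If the body returned by `urlopen` contains the JavaScript-rendered content while the spider's response does not, the problem is on the Scrapy side rather than in Splash.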

This is my configuration in the settings.py file:

SPIDER_MIDDLEWARES = {
  'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DOWNLOADER_MIDDLEWARES = {
  'scrapy_splash.SplashCookiesMiddleware': 723,
  'scrapy_splash.SplashMiddleware': 725,
  'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPLASH_URL = 'http://127.0.0.1:8050/'
SPLASH_COOKIES_DEBUG = True

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

I expected the output to be the HTML processed by Splash, but it only returns the HTML without processing.
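In case the page injects its content only after an event fires, a fixed `wait` of 0.5 s may simply be too short. A sketch of an alternative using Splash's /execute endpoint with a Lua script that polls for a CSS selector before returning the HTML (the selector name is a placeholder; `splash:select`, `splash:wait`, and `splash:html` are part of the Splash scripting API):

```python
# Lua script for SplashRequest(url, self.parse, endpoint='execute',
# args={'lua_source': LUA_WAIT_FOR, 'css': 'div.event'}).
# 'div.event' is a hypothetical selector -- replace with one from the real page.
LUA_WAIT_FOR = """
function main(splash, args)
  assert(splash:go(args.url))
  while not splash:select(args.css) do
    splash:wait(0.2)
  end
  return splash:html()
end
"""
```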

Splash docker messages:

process 1: D-Bus library appears to be incorrectly set up; failed to read machine uuid: UUID file '/etc/machine-id' should contain a hex string of length 32, not length 0, with no other text
See the manual page for dbus-uuidgen to correct this issue.
qt.network.ssl: QSslSocket: cannot resolve SSLv2_client_method
qt.network.ssl: QSslSocket: cannot resolve SSLv2_server_method
2019-04-17 14:35:28.198194 [events] {"timestamp": 1555511728, "status_code": 200, "user-agent": "Scrapy/1.3.3 (+http://scrapy.org)", "client_ip": "172.17.0.1", "load": [0.15, 0.38, 0.35], "rendertime": 5.785578966140747, "active": 0, "fds": 68, "qsize": 0, "method": "POST", "_id": 140284272664528, "path": "/render.html", "args": {"headers": {"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "User-Agent": "Scrapy/1.3.3 (+http://scrapy.org)", "Accept-Language": "en", "Cookie": "__cfduid=d035cc38f38ee9f555aec777db4b1b8f81555511718"}, "uid": 140284272664528, "wait": 0.5, "url": "https://www.tampabay.com/events/"}, "maxrss": 159672}
2019-04-17 14:35:28.198893 [-] "172.17.0.1" - - [17/Apr/2019:14:35:27 +0000] "POST /render.html HTTP/1.1" 200 34075 "-" "Scrapy/1.3.3 (+http://scrapy.org)

Scrapy log messages:

2019-04-17 16:35:18 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: tampabay)
2019-04-17 16:35:18 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tampabay.spiders', 'ROBOTSTXT_OBEY': True, 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter', 'SPIDER_MODULES': ['tampabay.spiders'], 'BOT_NAME': 'tampabay', 'LOG_FILE': 'tampabay.log', 'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage', 'DOWNLOAD_DELAY': 3}
2019-04-17 16:35:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2019-04-17 16:35:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy_splash.SplashCookiesMiddleware',
 'scrapy_splash.SplashMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-17 16:35:18 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy_splash.SplashDeduplicateArgsMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-17 16:35:18 [scrapy.middleware] INFO: Enabled item pipelines:
['tampabay.pipelines.TampabayPipeline']
2019-04-17 16:35:18 [scrapy.core.engine] INFO: Spider opened
2019-04-17 16:35:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-17 16:35:18 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-17 16:35:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tampabay.com/robots.txt> (referer: None)
2019-04-17 16:35:18 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://127.0.0.1:8050/robots.txt> (referer: None)
2019-04-17 16:35:28 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tampabay.com/events/ via http://127.0.0.1:8050/render.html> (referer: None)
2019-04-17 16:35:28 [tampabay] DEBUG: ############## INSIDE FUNCTION -> parse ############### 
2019-04-17 16:35:28 [tampabay] DEBUG: EVENTS: 0
2019-04-17 16:35:28 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-17 16:35:28 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1037,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 2,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 35911,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 2,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 4, 17, 14, 35, 28, 333825),
 'log_count/DEBUG': 6,
 'log_count/INFO': 7,
 'response_received_count': 3,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'splash/render.html/request_count': 1,
 'splash/render.html/response_count/200': 1,
 'start_time': datetime.datetime(2019, 4, 17, 14, 35, 18, 83737)}
2019-04-17 16:35:28 [scrapy.core.engine] INFO: Spider closed (finished)
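Note that the stats show the request did go through Splash ('splash/render.html/response_count/200': 1) and 34075 bytes came back, yet the spider logs "EVENTS: 0". A quick sketch to check, inside parse(), whether the body actually contains JavaScript-rendered markup (the marker string "event-card" is a hypothetical class name; substitute one that only appears after rendering on the real page):

```python
def looks_rendered(body: bytes, marker: str) -> bool:
    """Return True if a JS-injected marker string appears in the page body.

    Intended to be called as looks_rendered(response.body, "event-card")
    inside the spider's parse() callback.
    """
    return marker.encode("utf-8") in body
```

If the marker is absent from `response.body` but present when the same URL is rendered in the Splash web UI, the unrendered response is being served from elsewhere (e.g. the HTTP cache configured above).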

0 answers