HTTP Error 524 when crawling a large website

Date: 2018-03-02 17:58:54

Tags: python beautifulsoup web-crawler python-3.5 urllib

I want to crawl a website that has 786 pages. My code extracts the data and saves it to an Excel file. When I run the program on 10 pages it works fine, but when I try to crawl all 786 pages in one run it gives me this error:

Traceback (most recent call last):
  File "app.py", line 32, in <module>
    crawl_names(i)
  File "app.py", line 16, in crawl_names
    html = urlopen(rq)
  File "/usr/lib/python3.5/urllib/request.py", line 163, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.5/urllib/request.py", line 472, in open
    response = meth(req, response)
  File "/usr/lib/python3.5/urllib/request.py", line 582, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.5/urllib/request.py", line 510, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.5/urllib/request.py", line 444, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.5/urllib/request.py", line 590, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 524: Origin Time-out
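
HTTP 524 ("Origin Time-out") is a Cloudflare status code: Cloudflare reached the origin server, but the origin took too long to send back a response, which is typical when a site is hit with hundreds of requests back to back. One common workaround is to retry the failed request after a pause. A minimal sketch, assuming a hypothetical fetch_with_retry helper whose retry count and backoff values are illustrative guesses, not part of the question's code:

import time
from urllib.error import HTTPError
from urllib.request import Request, urlopen

def fetch_with_retry(url, headers, retries=3, backoff=5):
    # Retry only on Cloudflare's 524; the counts and delays are guesses to tune.
    for attempt in range(retries):
        try:
            return urlopen(Request(url, headers=headers))
        except HTTPError as e:
            if e.code != 524 or attempt == retries - 1:
                raise  # a different error, or retries exhausted: re-raise
            time.sleep(backoff * (attempt + 1))  # back off before trying again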

My code:

from urllib.request import urlopen
from urllib.request import Request
from bs4 import BeautifulSoup
import re
import xlwt


headers = {'User-Agent': 'Mozilla/5.0'}

book = xlwt.Workbook(encoding="utf-8")
sheet1 = book.add_sheet("Sheet 1")

def crawl_names(page):
    site = 'http://nex1music.ir/pages/' + str(page) + '/'
    rq = Request(site, headers=headers)
    html = urlopen(rq)
    bsObj = BeautifulSoup(html, "lxml")
    musics = bsObj.find_all("div", {"class": "pctn"})

    # The last tag link inside each entry holds the name we want.
    # A raw string avoids invalid escape sequences; ':' and '/' need no escaping.
    names = []
    for m in musics:
        names.append(m.findChild().find_all(
            "a", {"href": re.compile(r"http://nex1music\.ir/tag/.+/")})[-1].string)

    # One column per page: page 1 goes to column 0, page 2 to column 1, ...
    for i in range(len(names)):
        sheet1.write(i, page - 1, names[i])


# range(1, 787) covers pages 1..786; the original range(1, 786) stopped at 785.
for i in range(1, 787):
    crawl_names(i)

book.save("data.xls")
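
As written, the driver loop fires requests as fast as the server answers and only saves at the very end, so a single 524 at, say, page 700 loses everything. A hedged variant that throttles between pages and checkpoints the workbook (the one-second pause and 50-page interval are arbitrary choices; the fetch_with_retry sketch above would slot into crawl_names in place of the bare urlopen call):

import time

for i in range(1, 787):
    crawl_names(i)
    time.sleep(1)              # pause between pages so the origin isn't hammered
    if i % 50 == 0:
        book.save("data.xls")  # checkpoint: earlier pages survive a mid-run crash

book.save("data.xls")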

How can I change my code so it crawls all 786 pages? Thanks.

0 Answers
