While parsing

Time: 2016-10-18 02:28:44

Tags: python-3.x

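I'm trying to download the worksheets for this workout program. The exercises are split across separate days, and all that's needed to reach each day is to append a new number to the end of the link. Here is my code.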



import urllib
import urllib.request
from bs4 import BeautifulSoup
import re
import os
theurl = "http://www.muscleandfitness.com/workouts/workout-routines/gain-10-pounds-muscle-4-weeks-1?day="
urls = []
count = 1
while count <29:
   urls.append(theurl + str(count))
   count +=1
print(urls)
for url in urls:
    thepage = urllib
    thepage = urllib.request.urlopen(urls)
    soup = BeautifulSoup(thepage,"html.parser")
    init_data = open('/Users/paribaker/Desktop/scrapping/workout/4weekdata.txt', 'a')
    workout = []

    for data_all in soup.findAll('div',{'class':"b-workout-program-day-exercises"}):
        try:
            for item in data_all.findAll('div',{'class':"b-workout-part--item"}):
                for desc in item.findAll('div', {'class':"b-workout-part--description"}):
                    workout.append(desc.find('h4',{'class':"b-workout-part--exercise-count"}).text.strip("\n") +",\t")
                    workout.append(desc.find('strong',{'class':"b-workout-part--promo-title"}).text +",\t")
                    workout.append(desc.find('span',{'class':"b-workout-part--equipment"}).text +",\t")
                for instr in item.findAll('div', {'class':"b-workout-part--instructions"}):
                    workout.append(instr.find('div',{'class':"b-workout-part--instructions--item workouts-sets"}).text.strip("\n") +",\t")
                    workout.append(instr.find('div',{'class':"b-workout-part--instructions--item workouts-reps"}).text.strip("\n") +",\t")
                    workout.append(instr.find('div',{'class':"b-workout-part--instructions--item workouts-rest"}).text.strip("\n"))
                    workout.append("\n*3")
        except  AttributeError:
            pass

init_data.write("".join(map(lambda x:str(x), workout)))
init_data.close()

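The problem is that the server times out. I assume the script isn't iterating over the list correctly, or is adding characters I don't need, and that this crashes the server's parser. I also tried writing another script that grabs all the links and puts them in a text document, then reopening that text file in this script and iterating over it, but that gave me the same error. Any ideas?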

1 Answer:

Answer 0: (score: 0)

There is a typo here:

thepage = urllib.request.urlopen(urls)

You probably want:

thepage = urllib.request.urlopen(url)

Otherwise you are trying to open a list of URLs instead of a single URL.
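
For illustration, here is a minimal sketch of the loop with that fix applied. The helper names `build_urls` and `fetch_pages`, the `opener` parameter, and the 30-second timeout are my own assumptions and not part of the original script; the parsing step is left out on purpose.

```python
import urllib.request

BASE_URL = ("http://www.muscleandfitness.com/workouts/workout-routines/"
            "gain-10-pounds-muscle-4-weeks-1?day=")


def build_urls(base, first_day=1, last_day=28):
    """Return one URL per day by appending the day number to `base`."""
    return [base + str(day) for day in range(first_day, last_day + 1)]


def fetch_pages(urls, opener=urllib.request.urlopen, timeout=30):
    """Yield (url, raw_bytes) pairs, opening one URL at a time.

    `opener` is a hypothetical hook (not in the original code) so the
    function can be exercised without network access.
    """
    for url in urls:
        # Pass the single `url` string here, not the whole `urls` list.
        with opener(url, timeout=timeout) as response:
            yield url, response.read()
```

Each `body` from `for url, body in fetch_pages(build_urls(BASE_URL)):` could then be handed to `BeautifulSoup(body, "html.parser")` as in the question. The `with` block also closes each response once it has been read.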