I'm trying to scrape data (title, date, and text) from several articles on the same website. Each article has a unique id, so I'm using a while loop to iterate over the ids and fetch each article. The articles all share the same structure.
This is how I'm trying to extract the data:
# import libraries
import csv
import urllib2
from bs4 import BeautifulSoup

# integer for first article id
articleid = 4449
articles = 4459

while articleid < articles:
    # specify the url and article id
    url = 'http://www.bkfrem.dk/default.asp?vis=nyheder&id=' + str(articleid)
    articleid += 1

    # query the website and return the html to the variable
    page = urllib2.urlopen(url)

    # parse the html using beautiful soup and store in variable soup
    soup = BeautifulSoup(page, 'html.parser')

    # create CSV file
    csvfile = csv.writer(open('news.csv', 'w'))
    csvfile.writerow(["Title", "Date", "Text"])

    # take out the <div> of name and get its value and text
    title_box = soup.find('h1', attrs={'style': 'margin-bottom:0px'})
    title = title_box.text.encode('iso8859-15').strip()
    date_box = soup.find('div', attrs={'style': 'font-style:italic; padding-bottom:10px'})
    date = date_box.text.encode('iso8859-15').strip()
    articleText_box = soup.find('div', attrs={'class': 'news'})
    articleText = articleText_box.text.encode('iso8859-15').strip()

    # print the data (encoded) to the CSV file
    csvfile.writerow((title, date, articleText))
    print title
    print date
    print articleText
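(A side issue, separate from the traceback below: the code reopens news.csv in 'w' mode on every iteration, so each article overwrites the previous one and only the last row survives. A minimal sketch of opening the writer once, before the loop; the `scrape_article` helper is a made-up stand-in for the BeautifulSoup extraction, and Python 3 syntax is used:)

```python
import csv

def scrape_article(articleid):
    # Hypothetical stand-in for the real title/date/text extraction
    return ('title %d' % articleid, '2018-01-01', 'text')

# Open the file ONCE so rows accumulate instead of being overwritten
with open('news.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Title', 'Date', 'Text'])
    for articleid in range(4449, 4459):
        writer.writerow(scrape_article(articleid))

# The file now holds the header plus one row per article
with open('news.csv', newline='') as f:
    rows = list(csv.reader(f))
print(len(rows))  # → 11 (header + 10 articles)
```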
So what I'm trying to do, as described above, is iterate over all the article ids and write the output to a CSV file. The first article has id 521 and the last one 4458. When I run this script, I get the error:
Traceback (most recent call last):
File "C:/Users/User/Desktop/articleScript.py", line 16, in <module>
page = urllib2.urlopen(url)
File "C:\Python27\lib\urllib2.py", line 154, in urlopen
return opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 421, in open
protocol = req.get_type()
File "C:\Python27\lib\urllib2.py", line 283, in get_type
raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: www.bkfrem.dk/default.asp?vis=nyheder&id=4458
What am I doing wrong? Why can't it open the URL, and how can the URL type be unknown when the last article id is 4458?
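(For context: a `ValueError: unknown url type` with a scheme-less URL like `www.bkfrem.dk/...` typically means the server answered one of the requests with a redirect whose Location header omits the `http://` prefix, which urllib2 then cannot open. A minimal sketch of guarding against that by normalizing a URL before opening it; the `ensure_scheme` helper name is made up, and Python 3's `urllib.parse` is used for illustration:)

```python
from urllib.parse import urlparse

def ensure_scheme(url, default='http'):
    """Prepend a scheme if the URL lacks one (hypothetical helper)."""
    if urlparse(url).scheme:
        return url
    return default + '://' + url

print(ensure_scheme('www.bkfrem.dk/default.asp?vis=nyheder&id=4458'))
# → http://www.bkfrem.dk/default.asp?vis=nyheder&id=4458
```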