Python 2.7: How to correctly extract article data

Time: 2018-02-11 20:46:48

Tags: python python-2.7 csv beautifulsoup

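I am trying to extract data (title, date, and text) from several articles on the same website. Each article has a unique id, so I use a while loop to iterate over the ids and fetch each article. All the articles have the same structure.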

This is how I am trying to extract the data:

# import libraries
import csv
import urllib2
from bs4 import BeautifulSoup

# first article id and exclusive upper bound for the loop
articleid = 4449
articles = 4459

# create the CSV file once, before the loop, so each pass adds a row
# instead of truncating the file ('wb' is the csv mode for Python 2)
csvfile = csv.writer(open('news.csv', 'wb'))
csvfile.writerow(["Title", "Date", "Text"])

while articleid < articles:
    # specify the url and article id
    url = 'http://www.bkfrem.dk/default.asp?vis=nyheder&id='+str(articleid)
    articleid += 1
    # query the website and return the html to the variable
    page = urllib2.urlopen(url)

    # parse the html using beautiful soup and store in variable soup
    soup = BeautifulSoup(page, 'html.parser')

    # find the title, date and body elements and take their stripped text
    title_box = soup.find('h1', attrs={'style': 'margin-bottom:0px'})
    title = title_box.text.encode('iso8859-15').strip()
    date_box = soup.find('div', attrs={'style': 'font-style:italic; padding-bottom:10px'})
    date = date_box.text.encode('iso8859-15').strip()
    articleText_box = soup.find('div', attrs={'class': 'news'})
    articleText = articleText_box.text.encode('iso8859-15').strip()

    # print the data (encoded) to the CSV file
    csvfile.writerow((title, date, articleText))
    print title
    print date
    print articleText
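
The id loop and the CSV output can also be sketched with `range` and a writer that is created once before iterating, so every article contributes one row. This is only a sketch under assumptions: `article_urls` and `write_rows` are hypothetical helper names that do not appear in the script, the rows are passed in already extracted so nothing here touches the network, and the demo row is made-up placeholder data.

```python
import csv
import io

BASE_URL = 'http://www.bkfrem.dk/default.asp?vis=nyheder&id='


def article_urls(first_id, last_id):
    """Return the article URLs for first_id..last_id, inclusive (hypothetical helper)."""
    return [BASE_URL + str(article_id)
            for article_id in range(first_id, last_id + 1)]


def write_rows(out, rows):
    """Write the header and one row per article to an already-open file (hypothetical helper)."""
    writer = csv.writer(out)
    writer.writerow(["Title", "Date", "Text"])
    for row in rows:
        writer.writerow(row)


# Build the ten URLs the loop would visit and show the first and last.
urls = article_urls(4449, 4458)
print(urls[0])
print(urls[-1])

# In-memory demo: csv wants a byte buffer on Python 2 and a text buffer on Python 3.
demo = io.BytesIO() if str is bytes else io.StringIO()
write_rows(demo, [("Example title", "2018-02-11", "Example text")])  # placeholder row
print(demo.getvalue())
```

With a real file on Python 2.7, the buffer would be replaced by something like `open('news.csv', 'wb')`, opened before the loop and closed after it.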

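So what I want to do, as described above, is loop over all the article ids and write the output to a CSV file. The first article has id 521 and the last one has id 4458. When I run this script, I get this error: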

Traceback (most recent call last):
  File "C:/Users/User/Desktop/articleScript.py", line 16, in <module>
    page = urllib2.urlopen(url)
  File "C:\Python27\lib\urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python27\lib\urllib2.py", line 421, in open
    protocol = req.get_type()
  File "C:\Python27\lib\urllib2.py", line 283, in get_type
    raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: www.bkfrem.dk/default.asp?vis=nyheder&id=4458

What am I doing wrong? Why can't it open the URL, and why is the URL type unknown when the last article id is 4458?
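
The string in the traceback message starts with `www.bkfrem.dk/...` and has no `http://` in front of it, and a URL without a scheme is what `urllib2` reports as "unknown url type". The listing above does include the prefix, so the script that was actually run may have differed from it. A minimal sketch of a guard follows, assuming a hypothetical helper named `ensure_scheme` (not part of the script or of any library) that is applied to the URL before it is handed to `urlopen`.

```python
try:
    from urlparse import urlparse        # Python 2.7
except ImportError:
    from urllib.parse import urlparse    # Python 3


def ensure_scheme(url, default='http'):
    """Return url unchanged if it already has a scheme, else prefix default://.

    Hypothetical helper for illustration only.
    """
    if urlparse(url).scheme:
        return url
    return default + '://' + url


# The URL from the traceback has no scheme, so the prefix is added.
print(ensure_scheme('www.bkfrem.dk/default.asp?vis=nyheder&id=4458'))
# A URL that already starts with http:// is passed through untouched.
print(ensure_scheme('http://www.bkfrem.dk/default.asp?vis=nyheder&id=4458'))
```

Printing `url` right before the `urlopen` call is another quick way to confirm which string the script really sends.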

This is the HTML code from the developer tools: [screenshot]

0 answers:

No answers