Scraping text with Beautiful Soup

Date: 2014-03-14 17:29:46

Tags: python-2.7 beautifulsoup

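I'm trying to grab all of the p tags in the body of an article. Could someone explain why my code is wrong and how I can improve it? The article URL and the relevant code are below. Thanks for any insight you can offer.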

URL: http://www.france24.com/en/20140310-libya-seize-north-korea-crude-oil-tanker-rebels-port-rebels/

import urllib2
from bs4 import BeautifulSoup

# Ask user to enter URL
url = raw_input("Please enter a valid URL: ")

soup = BeautifulSoup(urllib2.urlopen(url).read())

# retrieve all of the paragraph tags
body = soup.find("div", {'class':'bd'}).get_text()
for tag in body:
    p = soup.find_all('p')
    print str(p) + '\n' + '\n'

1 answer:

Answer 0 (score: 3)

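The problem is that there are multiple div tags with class="bd" on the page. Also, get_text() returns a string, so your for loop iterates over its characters and prints every p tag on the page on each pass. It looks like you need the div that contains the actual article, which sits inside the article tag: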

import urllib2
from bs4 import BeautifulSoup

# Ask user to enter URL
url = raw_input("Please enter a valid URL: ")

soup = BeautifulSoup(urllib2.urlopen(url))

# retrieve all of the paragraph tags
paragraphs = soup.find('article').find("div", {'class': 'bd'}).find_all('p')
for paragraph in paragraphs:
    print paragraph.text

This prints:

Libyan government forces on Monday seized a North Korea-flagged tanker after...
...
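
To see the effect without fetching the live page, here is a minimal sketch on an inline HTML string. The markup in SAMPLE_HTML is a hypothetical stand-in, not the real france24.com page, and the "html.parser" argument is an assumption chosen so the sketch needs no extra parser library. It shows that find() stops at the first div with class "bd", that scoping the search to the article tag reaches the intended one, and that looping over the result of get_text() yields single characters.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page: two div.bd blocks,
# only one of which sits inside <article>.
SAMPLE_HTML = """
<html><body>
<div class="bd"><p>Sidebar teaser</p></div>
<article>
  <div class="bd"><p>First paragraph.</p><p>Second paragraph.</p></div>
</article>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns only the first match in document order: the sidebar div here.
first_bd = soup.find("div", {"class": "bd"})
sidebar_texts = [p.get_text() for p in first_bd.find_all("p")]

# Scoping the search to <article> first reaches the div that holds the article.
article_bd = soup.find("article").find("div", {"class": "bd"})
article_texts = [p.get_text() for p in article_bd.find_all("p")]

# get_text() returns a string, so iterating over it yields characters, not tags.
first_chars = [ch for ch in first_bd.get_text()][:3]

print(sidebar_texts)
print(article_texts)
print(first_chars)
```

On a real page the order and number of matching divs may differ, so it is worth checking the page source before relying on a particular parent tag.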

Hope that helps.