Question

I am trying to download text from a news website. The HTML is:

<div class="pane-content">
<div class="field field-type-text field-field-noticia-bajada">
<div class="field-items">
        <div class="field-item odd">
                 <p>"My Text" target="_blank">www.injuv.cl</a></strong></p>         </div>

The output should be: My Text. I am using the following Python code:

try: 
    from BeautifulSoup import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup
html = "My URL"
parsed_html = BeautifulSoup(html)
p = parsed_html.find("div", attrs={'class':'pane-content'})
print(p)

But the output of the code is "None". Do you know what is wrong with my code?

Answer 1

The problem is that you are not parsing the HTML; you are parsing the URL string itself:

html = "My URL"
parsed_html = BeautifulSoup(html)
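
A minimal sketch of why this prints None, assuming bs4 is installed. The sample markup is a hypothetical stand-in for the real page, and the parser name "html.parser" is a choice made here, not something from the question:

```python
from bs4 import BeautifulSoup

# "My URL" is plain text, not markup, so the parsed tree contains no <div>.
parsed_url = BeautifulSoup("My URL", "html.parser")
result_from_url = parsed_url.find("div", attrs={"class": "pane-content"})
print(result_from_url)  # find() returns None when nothing matches

# With actual markup, the same lookup returns the matching tag.
sample_html = '<div class="pane-content"><p>My Text</p></div>'
parsed_page = BeautifulSoup(sample_html, "html.parser")
result_from_html = parsed_page.find("div", attrs={"class": "pane-content"})
print(result_from_html)
```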

Instead, you need to fetch (retrieve, download) the page source first. For example, in Python 2:

from urllib2 import urlopen

html = urlopen("My URL")
parsed_html = BeautifulSoup(html)

In Python 3, it would be:

from urllib.request import urlopen

html = urlopen("My URL")
parsed_html = BeautifulSoup(html)

Alternatively, you could use the third-party "for humans"-style requests library:

import requests

html = requests.get("My URL").content
parsed_html = BeautifulSoup(html)

Also note that you should not use BeautifulSoup version 3 at all, since it is no longer maintained. Replace:

try: 
    from BeautifulSoup import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup

with just:

from bs4 import BeautifulSoup
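
Putting the pieces of this answer together for Python 3, one possible sketch looks like the following. The helper names fetch_soup and find_pane, the timeout value, and the explicit "html.parser" argument are assumptions added for illustration; they are not part of the original answer:

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download the page at `url` and return it parsed (hypothetical helper)."""
    # urlopen returns a file-like response; read() gives the raw bytes.
    with urlopen(url, timeout=timeout) as response:
        raw_html = response.read()
    # Naming the parser explicitly avoids bs4 choosing one based on what is installed.
    return BeautifulSoup(raw_html, "html.parser")


def find_pane(soup):
    """Return the div with class 'pane-content', or None if it is absent."""
    return soup.find("div", attrs={"class": "pane-content"})
```

With a real address in place of "My URL", the lookup from the question would then be find_pane(fetch_soup("My URL")).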

Answer 2

BeautifulSoup accepts a string of HTML. You need to use the URL to retrieve the HTML from the page first.

Look at urllib for making HTTP requests (or requests for a simpler way). Retrieve the HTML and pass it to BeautifulSoup, like so:

from urllib.request import urlopen  # Python 3; in Python 2 this was urllib.urlopen
from bs4 import BeautifulSoup

# Get the HTML
conn = urlopen("http://www.example.com")
html = conn.read()

# Give BeautifulSoup the HTML:
soup = BeautifulSoup(html)

From here, just parse it the way you were already trying to:

p = soup.find("div", attrs={'class':'pane-content'})
print(p)
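
Printing p shows the whole div, markup included. To get only the text the question asks for, one option is get_text(). The sketch below assumes the paragraph holds just the wanted text; the HTML string is a hypothetical, cleaned-up stand-in modelled on the question's markup, not the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page.
html = """
<div class="pane-content">
  <div class="field field-type-text field-field-noticia-bajada">
    <div class="field-items">
      <div class="field-item odd">
        <p>My Text</p>
      </div>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
pane = soup.find("div", attrs={"class": "pane-content"})

# find() returns None when nothing matches, so guard before using the result.
if pane is not None:
    paragraph = pane.find("p")
    text = paragraph.get_text(strip=True) if paragraph is not None else ""
else:
    text = ""

print(text)
```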
