Question

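I am trying to scrape articles from a Chinese newspaper database. Here is some of the source code (pasting an excerpt because the site is keyed):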

<base href="http://huylpd.twinbridge.com.ezp-prod1.hul.harvard.edu/web\" /><html>
<! -- <%@ page contentType="text/html;charset=GBK" %>
<head>
<meta http-equiv="Content-Language" content="zh-cn">
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
<meta name="GENERATOR" content="Microsoft FrontPage 4.0">
<meta name="ProgId" content="FrontPage.Editor.Document">
<title>概览页面</title>
...
</head>
...
</html>  
</html>

When I try a straightforward scrape of the links in the table, like so:

import urllib, urllib2, re, mechanize
from BeautifulSoup import BeautifulSoup
br = mechanize.Browser(factory=mechanize.RobustFactory())
br.addheaders = [('User-agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6')]
br.set_handle_robots(False)

url = 'http://huylpd.twinbridge.com.ezp-prod1.hul.harvard.edu/search?%C8%D5%C6%DA=&%B1%EA%CC%E2=&%B0%E6%B4%CE=&%B0%E6%C3%FB=&%D7%F7%D5%DF=&%D7%A8%C0%B8=&%D5%FD%CE%C4=%B9%FA%BC%CA%B9%D8%CF%B5&Relation=AND&sortfield=RELEVANCE&image1.x=27&image1.y=16&searchword=%D5%FD%CE%C4%3D%28%B9%FA%BC%CA%B9%D8%CF%B5%29&presearchword=%B9%FA%BC%CA%B9%D8%CF%B5&channelid=16380'
page = br.open(url)
soup = BeautifulSoup(page)
links = soup.findAll('a') # links is empty =(

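Python doesn't even find anything in the HTML, i.e. it returns an empty list. I think this is because the source code starts with the base href tag, and Python only recognizes two tags in the document: base href and html.

Any idea how to scrape the links in this case? Thank you so much!!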

Answer 1

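Removing the second line of the source (the malformed `<! -- <%@ page ... %>` line) lets BS find all the tags. I didn't find a better way to parse this.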

page = br.open(url)
page = page.read().replace('<! -- <%@ page contentType="text/html;charset=GBK" %>', '')
soup = BeautifulSoup(page)
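
The exact-string replace above only works if the stray line matches character for character. A slightly more tolerant variant is sketched below in Python 3, assuming the offending line always has the form `<! -- <%@ ... %>`. The name `strip_broken_directive` is made up for illustration, and decoding as GBK is an assumption based on the charset the page declares.

```python
import re

# Matches the unterminated comment opener followed by a JSP page directive,
# e.g. '<! -- <%@ page contentType="text/html;charset=GBK" %>'.
BROKEN_DIRECTIVE = re.compile(r'<!\s*--\s*<%@.*?%>', re.DOTALL)


def strip_broken_directive(raw, encoding='gbk'):
    """Decode the raw bytes and drop the malformed directive (hypothetical helper)."""
    text = raw.decode(encoding, errors='replace')
    return BROKEN_DIRECTIVE.sub('', text)


# Small stand-in for the real page, which is not publicly reachable.
sample = (b'<html>\n'
          b'<! -- <%@ page contentType="text/html;charset=GBK" %>\n'
          b'<head><title>t</title></head>\n'
          b'<body><a href="/a">x</a></body></html>')
cleaned = strip_broken_directive(sample)
```

The cleaned string can then be handed to BeautifulSoup in place of the raw response.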

Answer 2

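When your HTML is very messed up, it's better to clean it up a little first. For example, in this case, remove everything before <html> and everything after (the first) </html>. Download one page, mold it manually to see what BeautifulSoup will accept, and then write some regexes to preprocess.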
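
A minimal sketch of that kind of preprocessing in Python 3, assuming the page has already been decoded to a string. The function name `trim_to_html` and the sample markup are made up for illustration. It keeps only the text from the first `<html` tag through the first `</html>`, so it handles leading and trailing junk but not a broken line inside the document.

```python
import re

HTML_OPEN = re.compile(r'<html\b', re.IGNORECASE)
HTML_CLOSE = re.compile(r'</html\s*>', re.IGNORECASE)


def trim_to_html(text):
    """Return the slice from the first <html ...> through the first </html>.

    If there is no <html> tag the text is returned unchanged; if there is
    no closing tag, everything from <html> onward is kept.
    """
    start = HTML_OPEN.search(text)
    if start is None:
        return text
    end = HTML_CLOSE.search(text, start.end())
    if end is None:
        return text[start.start():]
    return text[start.start():end.end()]


# Stand-in for the real page: a stray <base> tag up front, two closing tags.
messy = ('<base href="http://example.invalid/web\\" /><html>\n'
         '<body><a href="/a">x</a></body>\n'
         '</html>  \n'
         '</html>')
trimmed = trim_to_html(messy)
```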

Answer 3

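BS (the version 3 line imported in the question) is no longer being developed; I'd suggest you have a look at lxml.

I don't have access to that specific URL, but I was able to get this to work using the HTML fragment (to which I added an a tag):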

>>> import lxml.html
>>> # u is the HTML fragment as a string
>>> soup = lxml.html.document_fromstring(u)
>>> soup.cssselect('a')
>>> soup.cssselect('a')[0].text_content()  # for example

Extract links from HTML table using BeautifulSoup with unclean source code
