Question

import urllib2

from BeautifulSoup import *

resp = urllib2.urlopen("file:///D:/sample.html")

rawhtml = resp.read()

resp.close()
print rawhtml

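I am using this code to get the text from an HTML document, but it also gives me the HTML markup. What should I do to get only the text from the HTML document?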

Answer 1

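Note that your example does not use BeautifulSoup at all. See the documentation and follow the examples there.

The following example, taken from that documentation, searches the soup for <td> elements.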

import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://www.icc-ccs.org/prc/piracyreport.php")
soup = BeautifulSoup(page)
for incident in soup('td', width="90%"):
    where, linebreak, what = incident.contents[:3]
    print where.strip()
    print what.strip()
    print
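
For readers without BeautifulSoup, the same idea (pick out the text of <td> cells with a given width attribute) can be sketched with the Python 3 standard library. This is only an illustration, not the answer's method: the class name CellTextParser and the sample markup are made up for the example.

```python
from html.parser import HTMLParser

class CellTextParser(HTMLParser):
    """Collect the text of every <td> whose width attribute equals a given value."""

    def __init__(self, width):
        super().__init__()
        self.width = width
        self.depth = 0    # greater than 0 while inside a matching <td>
        self.cells = []   # one list of text fragments per matching cell

    def handle_starttag(self, tag, attrs):
        if tag != "td":
            return
        if self.depth:
            self.depth += 1   # a <td> nested inside a matching cell
        elif dict(attrs).get("width") == self.width:
            self.depth = 1
            self.cells.append([])

    def handle_endtag(self, tag):
        if tag == "td" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth and text:
            self.cells[-1].append(text)

# Hypothetical sample markup, shaped like the table the answer scrapes.
sample = ('<table><tr><td width="10%">1</td>'
          '<td width="90%">Where it happened<br>What happened</td></tr></table>')
parser = CellTextParser("90%")
parser.feed(sample)
parser.close()
cells = parser.cells
print(cells)
```

Each matching cell becomes a list of its text fragments, so the "where" and "what" parts stay separate, much like incident.contents in the BeautifulSoup version.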

Answer 2

The module's own documentation shows a way to extract all the strings from a document: http://www.crummy.com/software/BeautifulSoup/

from BeautifulSoup import BeautifulSoup
import urllib2

resp = urllib2.urlopen("http://www.google.com")
rawhtml = resp.read()
soup = BeautifulSoup(rawhtml)

# Keep only the text nodes; tags are not unicode instances
all_strings = [e for e in soup.recursiveChildGenerator()
               if isinstance(e, unicode)]
print all_strings
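
The code above targets Python 2 and BeautifulSoup 3. As a rough equivalent, here is a minimal sketch that collects all text nodes using only the Python 3 standard library. The names TextCollector and html_to_strings are hypothetical, and skipping <script> and <style> content is my own assumption about what "text" should mean here.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Gather every non-empty text node, ignoring <script> and <style> bodies."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skipping = 0    # greater than 0 while inside a skipped element
        self.strings = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skipping += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skipping:
            self.skipping -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skipping:
            self.strings.append(text)

def html_to_strings(html):
    """Return the list of text fragments found in an HTML string."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.strings

strings = html_to_strings(
    "<html><head><title>Page title</title><style>p {color: red}</style></head>"
    "<body><p>This is paragraph <b>one</b>.</p></body></html>"
)
print(strings)
```

Joining the result with spaces or newlines gives a single plain-text string if that is what you need.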

Answer 3

Adapted from Toby Segaran's Programming Collective Intelligence (page 60):

def gettextonly(soup):
    v=soup.string
    if v is None:
        c=soup.contents
        resulttext=''
        for t in c:
            subtext=gettextonly(t)
            resulttext+=subtext+'\n'
        return resulttext
    else:
        return v.strip()

Example usage:

>>> from BeautifulSoup import BeautifulSoup

>>> doc = ['<html><head><title>Page title</title></head>',
...        '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
...        '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
...        '</html>']
>>> ''.join(doc)
'<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is paragraph <b>one</b>.<p id="secondpara" align="blah">This is paragraph <b>two</b>.</html>'

>>> soup = BeautifulSoup(''.join(doc))
>>> gettextonly(soup)
u'Page title\n\nThis is paragraph\none\n.\n\nThis is paragraph\ntwo\n.\n\n\n\n'

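Note that the result is a single string, in which the text from different tags is separated by newlines (\n).

If you want to extract all the words of the text as a list of words, you can use the following function, also adapted from Toby Segaran's Programming Collective Intelligence (page 61):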

import re
def separatewords(text):
    # Split on runs of one or more non-word characters
    splitter=re.compile(r'\W+')
    return [s.lower() for s in splitter.split(text) if s!='']

Example usage:

>>> separatewords(gettextonly(soup))
[u'page', u'title', u'this', u'is', u'paragraph', u'one', u'this', u'is', 
u'paragraph', u'two']
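
A note for anyone running this on Python 3: since Python 3.7, re.split also splits on empty matches, so a pattern that can match the empty string (such as \W*) breaks the text between every character. A minimal Python 3 sketch of the same word splitter, using \W+ instead, might look like this. The name separate_words and the sample string are just for illustration.

```python
import re

def separate_words(text):
    """Split text on runs of non-word characters and lowercase each word."""
    splitter = re.compile(r"\W+")
    # The filter drops the empty strings left at the start or end of the split.
    return [s.lower() for s in splitter.split(text) if s]

words = separate_words("Page title\n\nThis is paragraph\none\n.\n")
print(words)
```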

Answer 4

There is also html2text.

Another option is to pipe it to "lynx -dump".
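
If you want to drive lynx from Python, something along these lines should work. It is a sketch under a few assumptions: lynx is installed and on the PATH, the HTML is fed through standard input with the -stdin option, and the helper name lynx_dump is made up. It returns None when lynx is not available.

```python
import shutil
import subprocess

def lynx_dump(html):
    """Render an HTML string to plain text with lynx, or return None if lynx is missing."""
    if shutil.which("lynx") is None:
        return None
    # Equivalent to:  some_command | lynx -stdin -dump
    result = subprocess.run(
        ["lynx", "-stdin", "-dump"],
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
    )
    return result.stdout.decode("utf-8", errors="replace")

text = lynx_dump("<html><body><p>Hello</p></body></html>")
print(text)
```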

Answer 5

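I have been using the html2text package together with BeautifulSoup to work around some of the package's problems. For example, html2text does not understand the auml or ouml entities; it only recognizes Auml and Ouml, with a capital first letter.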

unicode_coded_entities_html = unicode(BeautifulStoneSoup(html,convertEntities=BeautifulStoneSoup.HTML_ENTITIES))
text = html2text.html2text(unicode_coded_entities_html)

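html2text converts to Markdown text syntax, so the converted text can also be rendered back to HTML (although some information is, of course, lost in the conversion).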
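
On Python 3, the entity-decoding step can probably be done with the standard library instead of BeautifulStoneSoup. A small sketch, assuming you only need named and numeric entities turned into Unicode characters before handing the markup to a converter (the wrapper name decode_entities is hypothetical):

```python
import html

def decode_entities(markup):
    """Replace HTML entities such as &auml; with the characters they stand for."""
    return html.unescape(markup)

decoded = decode_entities("<p>K&auml;se, &ouml;l, &Auml;rger</p>")
print(decoded)
```

Note that this also decodes entities like &lt; and &amp;, so it is best suited to markup where those do not appear as escaped text.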

HTML to text conversion using Python
