How do I extract information from an HTML document with Python?

Time: 2014-09-28 02:00:34

Tags: python html

I need Python to extract some data from an HTML file.

The code I am using right now is:

import urllib

# Python 2 API; on Python 3 the equivalent call is urllib.request.urlopen
recent = urllib.urlopen("http://gamebattles.majorleaguegaming.com/ps4/call-of-duty-ghosts/team/TeamCrYpToNGamingEU/match?id=46057240")
recentsource = recent.read()

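I now need to take this source and print the list of gamertags for the other team from the table on that web page.

How would I do this?

Thanks

3 Answers:

Answer 0: (score: 2)

Look into the Beautiful Soup module, which is a great HTML parser.

If you don't want to or can't install it, you can download the source code and put the .py files in the same directory as your program.

To do so, download and extract the code from the website, then copy the "bs4" directory into the same folder as your Python script.

Then put this at the beginning of your code: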

from bs4 import BeautifulSoup
# or
from bs4 import BeautifulSoup as bs 
# To type bs instead of BeautifulSoup every single time you use it

You can learn how to use it from other Stack Overflow questions, or look at the documentation.
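As a rough sketch of how Beautiful Soup could be applied to a page like the one in the question: the markup below is invented for illustration. It assumes each team has an `<h3>` heading ending in "Match Players" followed by a table whose rows hold five cells, with username, role and network ID in the last three. The real page may be laid out differently, and the `extract_teams` helper and the sample names are hypothetical. It is written for Python 3.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page may use a different structure.
SAMPLE_HTML = """
<h3>Team A Match Players</h3>
<table>
  <tr><th></th><th></th><th>Username</th><th>Role</th><th>Network ID</th></tr>
  <tr><td></td><td></td><td>alice</td><td><b>Leader</b></td><td>alice_psn</td></tr>
  <tr><td></td><td></td><td>bob</td><td>Member</td><td>bob_psn</td></tr>
</table>
<h3>Team B Match Players</h3>
<table>
  <tr><th></th><th></th><th>Username</th><th>Role</th><th>Network ID</th></tr>
  <tr><td></td><td></td><td>carol</td><td>Leader</td><td>carol_psn</td></tr>
</table>
"""


def extract_teams(html):
    """Return {team name: [(role, username, network_id), ...]}."""
    # "html.parser" is the builder from the standard library, so no extra
    # parser package is needed.
    soup = BeautifulSoup(html, "html.parser")
    teams = {}
    for heading in soup.find_all("h3"):
        title = heading.get_text(strip=True)
        if not title.endswith("Match Players"):
            continue
        team_name = title[: -len("Match Players")].strip()
        # Assumption: the player table is the first <table> after the heading.
        table = heading.find_next("table")
        if table is None:
            continue
        players = []
        for row in table.find_all("tr"):
            cells = [td.get_text(strip=True) for td in row.find_all("td")]
            # Header rows contain <th> cells only, so they yield no <td> text.
            if len(cells) == 5:
                players.append((cells[3], cells[2], cells[4]))
        teams[team_name] = players
    return teams


if __name__ == "__main__":
    for name, players in extract_teams(SAMPLE_HTML).items():
        print("Team:", name)
        for role, username, network_id in players:
            print("- %s: %s (%s)" % (role, username, network_id))
```

To run this against the live page you would pass the downloaded source (`recentsource` in the question) to `extract_teams` instead of `SAMPLE_HTML`, and adjust the heading test and cell positions to whatever the page actually contains.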

Answer 1: (score: 0)

You can use html2text for this, or you can use NLTK.

Sample code

import nltk   
from urllib import urlopen
url = "http://any-url"    
html = urlopen(url).read()   
raw = nltk.clean_html(html)

print(raw)
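Newer NLTK releases no longer implement `clean_html`, so the snippet above only works with old versions. If you just want the visible text and would rather not depend on a third-party package, here is a minimal sketch using Python 3's standard `html.parser` module. The `TextExtractor` class and `strip_html` function are names made up for this example, and the whitespace handling is deliberately crude.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def strip_html(html):
    """Return the text content of an HTML string with the tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


if __name__ == "__main__":
    print(strip_html("<p>Hello <b>world</b> &amp; all</p><script>var x = 1;</script>"))
```

This only flattens the page to text. For pulling specific cells out of a table, a structure-aware parser is likely a better fit.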

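Answer 2: (score: 0)

pyparsing has some helpful constructs for extracting data from HTML pages, and the results tend to be self-structuring and self-naming (if you set up the parser/scanner correctly). Here is a pyparsing solution for this particular web page: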

from pyparsing import *

# for stripping HTML tags
anyTag,anyClose = makeHTMLTags(Word(alphas,alphanums+":_"))
commonHTMLEntity.setParseAction(replaceHTMLEntity)
stripHTML = lambda tokens: (commonHTMLEntity | Suppress(anyTag | anyClose) ).transformString(''.join(tokens))             

# make pyparsing expressions for HTML opening and closing tags
# (suppress all from results, as there is no interesting content in the tags or their attributes)
h3,h3End = map(Suppress,makeHTMLTags("h3"))
table,tableEnd = map(Suppress,makeHTMLTags("table"))
tr,trEnd = map(Suppress,makeHTMLTags("tr"))
th,thEnd = map(Suppress,makeHTMLTags("th"))
td,tdEnd = map(Suppress,makeHTMLTags("td"))

# nothing interesting in column headings - parse them, but suppress the results
colHeading = Suppress(th + SkipTo(thEnd) + thEnd)

# simple routine for defining data cells, with optional results name
colData = lambda name='' : td + SkipTo(tdEnd)(name) + tdEnd

playerListing = Group(tr + colData() + colData() + 
                        colData("username") + 
                        colData().setParseAction(stripHTML)("role") + 
                        colData("networkID") + 
                        trEnd)

teamListing = (h3 + ungroup(SkipTo("Match Players" + h3End, failOn=h3))("name") + "Match Players" + h3End +
                table + tr + colHeading*5 + trEnd +
                Group(OneOrMore(playerListing))("players"))



for team in teamListing.searchString(recentsource):
    # use this to print out names and structures of results
    # print(team.dump())
    # print calls written so they behave the same on Python 2 and Python 3
    print("Team: %s" % team.name)
    for player in team.players:
        print("- %s: %s (%s)" % (player.role, player.username, player.networkID))
        # or like this
        # print("- %(role)s: %(username)s (%(networkID)s)" % player)
    print("")

Prints:

Team: Team CrYpToN Gaming EU
- Leader: CrYpToN_Crossy (CrYpToN_Crossy)
- Captain: Juddanorty (CrYpToN_Judd)
- Member: BLaZe_Elfy (CrYpToN_Elfy)

Team: eXCeL™
- Leader: Caaahil (Caaahil)
- Member: eSportsmanship (eSportsmanship)
- Member: KillBoy-NL (iClown-x)