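How do I extract data from the infobox of a Wikipedia page?

Asked: 2018-10-20 09:36:36

Tags: python web-scraping extract wikipedia

My goal is to extract the "Founded" and "Products" information from the infobox on the Wikipedia page of Microsoft. I am using Python 3 and tried the following code, which I found online, but it does not work: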

# importing modules 
import requests 
from lxml import etree 
# manually storing desired URL 
url='https://en.wikipedia.org/wiki/Microsoft'

# fetching its url through requests module   
req = requests.get(url)  

store = etree.fromstring(req.text) 

# trying to get the 'Founded' portion of above  
# URL's info box of Wikipedia's page 
output = store.xpath('//table[@class="infoboxvcard"]/tr[th/text()="Founded"]/td/i')  

# printing the text portion 
print output[0].text   

#Expected result:
 Founded:April 4, 1975; 43 years ago in Albuquerque, New Mexico, U.S.

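2 Answers:

Answer 0 (score: 2)

An incorrect XPath was being used. I retrieved the correct XPath for the element from the Wikipedia page given in the question. I also added parentheses to the print statement for Python 3 compatibility.

Try: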

# importing modules
import requests
from lxml import etree
# manually storing desired URL
url='https://en.wikipedia.org/wiki/Microsoft'

# fetching its url through requests module
req = requests.get(url)

store = etree.fromstring(req.text)

# the XPath in the question was incorrect; this one points at the 'Founded' cell
output = store.xpath('//*[@id="mw-content-text"]/div/table[2]/tbody/tr[7]/td')

# added parentheses to print for Python 3
print(output[0].text)

I get:

April 4, 1975
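
The positional XPath above (table[2], tr[7]) depends on the page layout at the time it was copied and may stop matching when the article changes. A minimal sketch of a label-based lookup follows, assuming the infobox table carries an `infobox` class token (for example class="infobox vcard") and that each row pairs a th label with a td value. The helper name `infobox_value` and the sample markup are made up for illustration; they are not part of the answer.

```python
from lxml import html


def infobox_value(page_html, label):
    """Return the text of the infobox cell whose row header equals `label`, or None."""
    tree = html.fromstring(page_html)
    # Match "infobox" as a single class token, so class="infobox vcard" also matches.
    cells = tree.xpath(
        '//table[contains(concat(" ", normalize-space(@class), " "), " infobox ")]'
        '//tr[normalize-space(th)=$label]/td',
        label=label,
    )
    if not cells:
        return None
    # itertext() also yields text nested in <a>, <li>, <span>, ...;
    # split/join collapses runs of whitespace.
    return " ".join(" ".join(cells[0].itertext()).split())


# Tiny hand-written stand-in for the real page, so the sketch runs offline.
sample = """<html><body>
<table class="infobox vcard"><tbody>
<tr><th>Founded</th><td>April 4, 1975</td></tr>
<tr><th>Products</th><td><ul><li><a href="#">Windows</a></li><li><a href="#">Office</a></li></ul></td></tr>
</tbody></table>
</body></html>"""

print(infobox_value(sample, "Founded"))
print(infobox_value(sample, "Products"))
```

For the live page, the same function could be called with requests.get(url).text. The real cells contain more nested markup than this sample, so the returned text will likely need further cleanup.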

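Answer 1 (score: 0)

You should probably use mwparserfromhell rather than trying to parse the MediaWiki markup yourself. With mwparserfromhell, you can pick out the templates and then extract their individual parameters.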

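import mwparserfromhell
# `text` must hold the page's raw wikitext, not the rendered HTML
code = mwparserfromhell.parse(text)
for template in code.filter_templates():
    # infobox templates are typically named e.g. "Infobox company"
    if template.name.strip().lower().startswith("infobox"):
        for p in template.params:
            print(p.name.strip(), "=", p.value.strip())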

https://github.com/earwig/mwparserfromhell