Parsing the website from Wikipedia Infobox data

Date: 2017-08-07 13:39:11

Tags: python parsing wikipedia wikipedia-api

I am using the Wikipedia API to get infobox data. I want to parse the website URL from this infobox data. I tried using mwparserfromhell to parse the website URL, but different keywords come in different formats.

Below are a few of the patterns the website field appears in -

url                  = <!-- {{URL|www.example.com}} -->
| url = [https://www.TheGuardian.com/ TheGuardian.com]
| url = <span class="plainlinks">[https://www.naver.com/ www.naver.com]</span>
|url             = [https://www.tmall.com/ tmall.com]
|url            = [http://www.ustream.tv/ ustream.tv]

I need help parsing the official website link for all of the patterns that Wikipedia supports.

Edit

Code -

# get infobox data
import requests
# keyword
keyword = 'stackoverflow.com'
# wikipedia api url
api_url = (
    'https://en.wikipedia.org/w/api.php?action=query&prop=revisions&'
    'rvprop=content&titles=%s&rvsection=0&format=json' % keyword)
# api request
resp = requests.get(api_url).json()
page_one = next(iter(resp['query']['pages'].values()))
revisions = page_one.get('revisions', [])
# infobox data
infobox_data = next(iter(revisions[0].values()))

# parse website url
import mwparserfromhell
wikicode = mwparserfromhell.parse(infobox_data)
templates = wikicode.filter_templates()
website_url_1 = ''
website_url_2 = ''
for template in templates:
    # Pattern - `URL|http://x.com`
    if template.name == "URL":
        website_url_1 = str(template.get(1).value)
        break
    if not website_url_1:
        # Pattern - `website = http://x.com`
        try:
            website_url_2 = str(template.get("website").value)
        except ValueError:
            pass
    if not website_url_1:
        # Pattern - `homepage = http://x.com`
        try:
            website_url_2 = str(template.get("homepage").value)
        except ValueError:
            pass
if website_url_1:
    website_url = website_url_1
elif website_url_2:
    website_url = website_url_2

3 answers:

Answer 0 (score: 0)

The patterns you mention can be parsed with regular expressions and BeautifulSoup. Conceivably, this approach could be extended to cover other patterns as well.

I strip the leading 'url =' part from each line and then process the remainder with BeautifulSoup. Since BeautifulSoup wraps whatever it is given into a complete page, the original content can be retrieved as the text of the body element.

>>> import re
>>> patterns = '''\
... url                  = <!-- {{URL|www.example.com}} -->
... | url = [https://www.TheGuardian.com/ TheGuardian.com]
... | url = <span class="plainlinks">[https://www.naver.com/ www.naver.com]</span>
... |url             = [https://www.tmall.com/ tmall.com]
... |url            = [http://www.ustream.tv/ ustream.tv]'''
>>> import bs4
>>> regex = re.compile(r'\s*\|?\s*url\s*=\s*', re.I)
>>> for pattern in patterns.split('\n'):
...     soup = bs4.BeautifulSoup(re.sub(regex, '', pattern), 'lxml')
...     if str(soup).startswith('<!--'):
...         'just a comment'
...     else:
...         soup.find('body').getText()
... 
'just a comment'
'[https://www.TheGuardian.com/ TheGuardian.com]'
'[https://www.naver.com/ www.naver.com]'
'[https://www.tmall.com/ tmall.com]'
'[http://www.ustream.tv/ ustream.tv]'
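
If only the bare URL is wanted, the bracketed external-link markup that remains can be reduced with one more regular expression. A minimal sketch, assuming the [https://... label] form shown above (this regex is my addition, not part of the original answer):

>>> link_regex = re.compile(r'\[(\S+)[^\]]*\]')
>>> # keep only the link target, drop the display label and brackets
>>> link_regex.sub(r'\1', '[https://www.TheGuardian.com/ TheGuardian.com]')
'https://www.TheGuardian.com/'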

Answer 1 (score: 0)

mwparserfromhell is a good tool for this:

import mwclient
import mwparserfromhell

pagename = 'Stack Overflow'  # example page title to look up
site = mwclient.Site('en.wikipedia.org')
text = site.pages[pagename].text()
wikicode = mwparserfromhell.parse(text)
templates = wikicode.filter_templates(matches='infobox .*')
url = templates[0].get('url').value

url_template = url.filter_templates(matches='url')
url_link = url.filter_external_links()
if url_template:
    print(url_template[0].get(1))
elif url_link:
    print(url_link[0].url)
else:
    print(url)
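
Note that templates[0].get('url') raises ValueError when the infobox has no url parameter. As a rough sketch of how the lookup and fallbacks could be packaged (my own wrapping of the same idea; the function name get_infobox_url is an invention here):

def get_infobox_url(pagename):
    """Best-effort website value from a page's infobox, or None."""
    site = mwclient.Site('en.wikipedia.org')
    wikicode = mwparserfromhell.parse(site.pages[pagename].text())
    infoboxes = wikicode.filter_templates(matches='infobox .*')
    if not infoboxes:
        return None
    try:
        url = infoboxes[0].get('url').value
    except ValueError:  # infobox has no |url= parameter
        return None
    url_template = url.filter_templates(matches='url')
    url_link = url.filter_external_links()
    if url_template:
        # {{URL|www.example.com}} style: take the first positional parameter
        return str(url_template[0].get(1).value)
    if url_link:
        # [https://example.com Example] style: take the link target
        return str(url_link[0].url)
    # plain text value
    return str(url).strip()

This covers the {{URL|...}} template form and the bracketed external-link forms from the question; the HTML-comment variant (url = <!-- {{URL|...}} -->) would still need separate handling.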

Answer 2 (score: 0)

I wrote this snippet, which might help:

import collections
import wikipedia
from bs4 import BeautifulSoup

def infobox(wiki_page):
    """Returns the infobox of a given wikipedia page"""
    if isinstance(wiki_page, str):
        wiki_page = wikipedia.page(wiki_page)
    try:
        soup = BeautifulSoup(wiki_page.html(), "html.parser").find_all("table", {"class": "infobox"})[0]
    except:
        return None
    ret = collections.defaultdict(dict)
    section = ""
    for tr in soup.find_all("tr"):
        th = tr.find_all("th")
        if not any(th):
            continue
        th = th[0]
        if str(th.get("colspan"))=='2':
            section = th.text.translate({160:' '}).strip()
            continue
        k = th.text.translate({160:' '}).strip()
        try:
            v = tr.find_all("td")[0].text.translate({160:' '}).strip()
            ret[section][k] = v
        except IndexError:
            continue
    return dict(ret)
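
A possible usage example (my addition; the exact row label, section, and value depend on how the page's infobox is rendered, so the "Website" key here is an assumption):

# Look for a "Website" row in any section of the parsed infobox.
data = infobox('Stack Overflow') or {}
website = next(
    (value for section in data.values()
     for key, value in section.items()
     if key.lower() == 'website'),
    None)
print(website)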