Parsing a Wikipedia infobox with Go?

Time: 2016-04-20 01:20:25

Tags: regex go wikipedia

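I'm trying to parse the infobox of some Wikipedia articles and can't seem to figure it out. I have the files downloaded, and for Albert Einstein my attempt to parse the infobox looks like this: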

package main

import (
    "log"
    "regexp"
)

func main() {
    st := `{{redirect|Einstein|other uses|Albert Einstein (disambiguation)|and|Einstein (disambiguation)}}
        {{pp-semi-indef}}
        {{pp-move-indef}}
        {{Good article}}
        {{Infobox scientist
        | name       = Albert Einstein
        | image       = Einstein 1921 by F Schmutzer - restoration.jpg
        | caption     = Albert Einstein in 1921
        | birth_date  = {{Birth date|df=yes|1879|3|14}}
        | birth_place = [[Ulm]], [[Kingdom of Württemberg]], [[German Empire]]
        | death_date  = {{Death date and age|df=yes|1955|4|18|1879|3|14}}
        | death_place = {{nowrap|[[Princeton, New Jersey]], U.S.}}
        | children    = [[Lieserl Einstein|"Lieserl"]] (1902–1903?)<br />[[Hans Albert Einstein|Hans Albert]] (1904–1973)<br />[[Eduard Einstein|Eduard "Tete"]] (1910–1965)
        | spouse      = [[Mileva Marić]]&nbsp;(1903–1919)<br />{{nowrap|[[Elsa Löwenthal]]&nbsp;(1919–1936)}}
        | residence   = Germany, Italy, Switzerland, Austria (today: [[Czech Republic]]), Belgium, United States
        | citizenship = {{Plainlist|
        * [[Kingdom of Württemberg]] (1879–1896)
        * [[Statelessness|Stateless]] (1896–1901)
        * [[Switzerland]] (1901–1955)
        * Austria of the [[Austro-Hungarian Empire]] (1911–1912)
        * Germany (1914–1933)
        * United States (1940–1955)
        }}
        | ethnicity  = Jewish
        | fields    = [[Physics]], [[philosophy]]
        | workplaces = {{Plainlist|
        * [[Swiss Patent Office]] ([[Bern]]) (1902–1909)
        * [[University of Bern]] (1908–1909)
        * [[University of Zurich]] (1909–1911)
        * [[Karl-Ferdinands-Universität|Charles University in Prague]] (1911–1912)
        * [[ETH Zurich]] (1912–1914)
        * [[Prussian Academy of Sciences]] (1914–1933)
        * [[Humboldt University of Berlin]] (1914–1917)
        * [[Kaiser Wilhelm Institute]] (director, 1917–1933)
        * [[German Physical Society]] (president, 1916–1918)
        * [[Leiden University]] (visits, 1920–)
        * [[Institute for Advanced Study]] (1933–1955)
        * [[Caltech]] (visits, 1931–1933)
        }}
        | alma_mater = {{Plainlist|
        * [[ETH Zurich|Swiss Federal Polytechnic]] (1896–1900; B.A., 1900)
        * [[University of Zurich]] (Ph.D., 1905)
        }}
        | doctoral_advisor  = [[Alfred Kleiner]]
        | thesis_title      = Eine neue Bestimmung der Moleküldimensionen (A New Determination of Molecular Dimensions)
        | thesis_url        = http://e-collection.library.ethz.ch/eserv/eth:30378/eth-30378-01.pdf
        | thesis_year       = 1905
        | academic_advisors = [[Heinrich Friedrich Weber]]
        | influenced  = {{Plainlist|
        * [[Ernst G. Straus]]
        * [[Nathan Rosen]]
        * [[Leó Szilárd]]
        }}
        | known_for = {{Plainlist|
        * [[General relativity]] and [[special relativity]]
        * [[Photoelectric effect]]
        * ''[[Mass–energy equivalence|E=mc<sup>2</sup>]]''
        * Theory of [[Brownian motion]]
        * [[Einstein field equations]]
        * [[Bose–Einstein statistics]]
        * [[Bose–Einstein condensate]]
        * [[Gravitational wave]]
        * [[Cosmological constant]]
        * [[Classical unified field theories|Unified field theory]]
        * [[EPR paradox]]
        }}
        | awards = {{Plainlist|
        * [[Barnard Medal for Meritorious Service to Science|Barnard Medal]] (1920)
        * [[Nobel Prize in Physics]] (1921)
        * [[Matteucci Medal]] (1921)
        * [[ForMemRS]] (1921)<ref name="frs" />
        * [[Copley Medal]] (1925)<ref name="frs" />
        * [[Max Planck Medal]] (1929)
        * [[Time 100: The Most Important People of the Century|''Time'' Person of the Century]] (1999)
        }}
        | signature = Albert Einstein signature 1934.svg
    }}
    '''Albert Einstein''' ({{IPAc-en|ˈ|aɪ|n|s|t|aɪ|n}};<ref>{{cite book|last=Wells|first=John|authorlink=John C. Wells|title=Longman Pronunciation Dictionary|publisher=Pearson Longman|edition=3rd|date=April 3, 2008|isbn=1-4058-8118-6}}</ref> {{IPA-de|ˈalbɛɐ̯t ˈaɪnʃtaɪn|lang|Albert Einstein german.ogg}}; 14 March 1879&nbsp;– 18 April 1955) was a German-born<!-- Please do not change this—see talk page and its many archives.-->
     [[theoretical physicist]]. He developed the [[general theory of relativity]], one of the two pillars of [[modern physics]] (alongside [[quantum mechanics]]).<ref name=frs>{{cite journal | last1 = Whittaker | first1 = E. | authorlink = E. T. Whittaker| doi = 10.1098/rsbm.1955.0005 | title = Albert Einstein. 1879–1955 | journal = [[Biographical Memoirs of Fellows of the Royal Society]] | volume = 1 | pages = 37–67 | date = 1 November 1955| jstor = 769242}}</ref><ref name="YangHamilton2010">{{cite book|author1=Fujia Yang|author2=Joseph H. Hamilton|title=Modern Atomic and Nuclear Physics|date=2010|publisher=World Scientific|isbn=978-981-4277-16-7}}</ref>{{rp|274}} Einstein's work is also known for its influence on the [[philosophy of science]].<ref>{{Citation |title=Einstein's Philosophy of Science |url=http://plato.stanford.edu/entries/einstein-philscience/#IntWasEinEpiOpp |we......
    `

    re := regexp.MustCompile(`{{Infobox(?s:.*?)}}`)
    log.Println(re.FindAllStringSubmatch(st, -1))

}

I'm trying to put each item of the infobox into a struct or a map:

m["name"] = "Albert Einstein"
m["image"] = "Einstein...."
...
...
m["death_date"] = "{{Death date and age|df=yes|1955|4|18|1879|3|14}}"
...
...

I can't even isolate the infobox. This is what I get:

[[{{Infobox scientist
        | name       = Albert Einstein
        | image       = Einstein 1921 by F Schmutzer - restoration.jpg
        | caption     = Albert Einstein in 1921
        | birth_date  = {{Birth date|df=yes|1879|3|14}}]]

The Albert Einstein entry from the API can be found at:

https://en.wikipedia.org/w/api.php?action=query&titles=Albert%20Einstein&prop=revisions&rvprop=content&format=json

Edit:

Based on the accepted answer to this question, I tried the following regex:

(?=\{Infobox)(\{([^{}]|(?1))*\})

But got:

panic: regexp: Compile(`(?=\{Infobox)(\{([^{}]|(?1))*\})`): error parsing regexp: invalid or unsupported Perl syntax: `(?=`
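
The panic is expected: Go's regexp package uses RE2 syntax, which supports neither lookahead (`(?=`) nor recursive groups (`(?1)`). The first attempt stops short for a different reason: the non-greedy `.*?` ends at the first `}}`, which closes the nested `{{Birth date ...}}` template rather than the infobox. One way around both problems is to count braces by hand. The following is a minimal sketch, not a full wikitext parser; `extractTemplate` is a hypothetical helper name, and it assumes templates only nest through plain `{{` and `}}` pairs (no `{{{param}}}` syntax, `<nowiki>` sections, or braces inside comments).

```go
package main

import (
	"fmt"
	"strings"
)

// extractTemplate returns the first template in src that starts with prefix
// (for example "{{Infobox"), up to and including its matching "}}".
// Hypothetical helper for illustration; it only balances "{{" and "}}".
func extractTemplate(src, prefix string) (string, bool) {
	start := strings.Index(src, prefix)
	if start < 0 {
		return "", false
	}
	depth := 0
	for i := start; i < len(src)-1; i++ {
		switch src[i : i+2] {
		case "{{":
			depth++
			i++ // skip the second brace of the pair
		case "}}":
			depth--
			i++ // i now points at the second closing brace
			if depth == 0 {
				return src[start : i+1], true
			}
		}
	}
	return "", false // ran out of input before the braces balanced
}

func main() {
	src := "{{Good article}}\n{{Infobox scientist\n" +
		"| name = Albert Einstein\n" +
		"| birth_date = {{Birth date|df=yes|1879|3|14}}\n" +
		"}}\n'''Albert Einstein''' was a physicist."
	box, ok := extractTemplate(src, "{{Infobox")
	fmt.Println(ok)
	fmt.Println(box)
}
```

Scanning bytes rather than runes is safe here because `{` and `}` are ASCII and never appear inside a multi-byte UTF-8 sequence.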
Edit #2: If there is a way to extract the infobox through their API, I would go that route... I have been reading the documentation and cannot find it.

1 answer:

Answer 0 (score: 0)

I made a regex that might work for you:

^\s*\|\s*([^\s]+)\s*=\s*(\{\{Plainlist\|(?:\n\s*\*.*)*|.*)

Explanation

  • This part: ^\s*\|\s*([^\s]+)\s*=\s* matches the start of a line such as:

        | <the_label> = 
    
  • Continuing on the same line, this part: (\{\{Plainlist\|(?:\n\s*\*.*)*|.*) will match a list:

                         {{Plainlist|
    * [[Ernst G. Straus]]
    * [[Nathan Rosen]]
    * [[Leó Szilárd]]
    

(Note that it may leave out the closing }}. Oh well.)

  • If there is no list, it matches to the end of the line.