How to extract "Infobox company" data from a wiki dump

Time: 2017-09-12 15:13:37

Tags: java xml parsing wikipedia

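I downloaded a large wiki dump XML file from https://dumps.wikimedia.org/enwiki/20170520/

I want to extract two pieces of metadata from this dump: the company name and the parent company. All of the company data sits in templates inside the XML, like this: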

{{Infobox company
| name =
| logo = 
| type = 
| industry = 
| fate = 
| predecessor = <!-- or: | predecessors = -->
| successor = <!-- or: | successors = -->
| founded = <!-- if known: {{Start date and age|YYYY|MM|DD}} in [[city]], [[state]], [[country]] -->
| founder = <!-- or: | founders = -->
| defunct = <!-- {{End date|YYYY|MM|DD}} -->
| hq_location_city = 
| hq_location_country = 
| area_served = <!-- or: | areas_served = -->
| key_people = 
| products = 
| owner = <!-- or: | owners = -->
| num_employees = 
| num_employees_year = <!-- Year of num_employees data (if known) -->
| parent = 
| website = <!-- {{URL|example.com}} -->
}}

I did some research and found the MediaWiki Parser. Reference: https://github.com/dkpro/dkpro-jwpl/blob/master/de.tudarmstadt.ukp.wikipedia.parser/src/main/java/de/tudarmstadt/ukp/wikipedia/parser/tutorial/T1_SimpleParserDemo.java

https://dkpro.github.io/dkpro-jwpl/JWPLParser/

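I tried using this parser, but it requires the file to be converted to a String. My wiki dump XML file is 60 GB, so I cannot convert a file that large to a String and keep it in memory. In addition, the MediaWiki parser has no instructions on how to find a specific element such as Infobox company, go into it, and extract the name and other fields. Here is the MediaWiki parser sample code: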

public static void main(String[] args) throws IOException {

    File file = new File("C:/Users/njaiswal/Downloads/accenture_data_from_wikidumps.xml");
    String str = FileUtils.readFileToString(file);

    // get a ParsedPage object
    MediaWikiParserFactory pf = new MediaWikiParserFactory();
    MediaWikiParser parser = pf.createParser();
    ParsedPage pp = parser.parse(str);

    // get the sections
    for (Section section : pp.getSections()) {
        System.out.println("section : " + section.getTitle());
        System.out.println(" nr of paragraphs      : " + section.nrOfParagraphs());
        System.out.println(" nr of tables          : " + section.nrOfTables());
        System.out.println(" nr of nested lists    : " + section.nrOfNestedLists());
        System.out.println(" nr of definition lists: " + section.nrOfDefinitionLists());

        for (Link link : section.getLinks(Link.type.INTERNAL)) {
            System.out.println("  " + link.getTarget());
        }
    }
}
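
Since the 60 GB dump cannot be held in memory as a single String, one option is to stream it page by page. The following is a minimal sketch using the JDK's StAX API (`javax.xml.stream`), not part of the MediaWiki parser. The class name `DumpStreamer` and the method `companyPageTitles` are hypothetical, and the sketch assumes the usual dump layout where each `<page>` holds a `<title>` and a `<text>` element.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: streams the dump so only one page is in memory at a time.
public class DumpStreamer {

    /** Returns the titles of pages whose wikitext contains "{{Infobox company". */
    public static List<String> companyPageTitles(InputStream in) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        List<String> titles = new ArrayList<>();
        String title = null;
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                String tag = reader.getLocalName();
                if (tag.equals("title")) {
                    // Remember the title of the page currently being read.
                    title = reader.getElementText();
                } else if (tag.equals("text")) {
                    // Only this one page's wikitext is materialized as a String.
                    String wikitext = reader.getElementText();
                    if (wikitext.contains("{{Infobox company")) {
                        titles.add(title);
                    }
                }
            }
        }
        reader.close();
        return titles;
    }
}
```

It could be called with `new FileInputStream(...)` on the uncompressed dump. For a full dump you would probably hand each matching page to a callback instead of collecting a list, and the JDK's XML entity size limits may need to be raised for input this large.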

Is there another parser that can solve my problem? Or can I use the same MediaWiki Parser to get to "Infobox company" and extract the fields? Any help is appreciated. Thanks.

Update: I tried the wikiXMLj parser suggested by Khalil. I am able to get all of the "Infobox" data, but I want to limit this to "Infobox company" data. Here are my code and output:

import edu.jhu.nlp.wikipedia.*;

public class Test {

    public static void main(String[] args) throws Exception {
        WikiXMLParser parser = WikiXMLParserFactory.getSAXParser("C:/Users/njaiswal/Downloads/enwiki-20170520-pages-articles-multistream.xml/enwiki-20170520-pages-articles-multistream.xml");
        parser.setPageCallback(new PageCallbackHandler() {
            public void process(WikiPage page) {
                try {
                    InfoBox infobox = page.getInfoBox();
                    System.out.println(infobox.dumpRaw());
                } catch (WikiTextParserException e) {
                    // TODO Auto-generated catch block
                    e.printStackTrace();
                }
                // do something with info box
            }
        });
        parser.parse();
    }
}

Output:

{{Infobox Monarch
| name            = Attila
| title           = [[List of Hunnic rulers|Ruler]] of the [[Hunnic Empire]]
| place of burial = 
}}
{{Infobox sea
| name = Aegean Sea
| image = Aegean Sea map.png
| caption = Map of the Aegean Sea
| pushpin_map = World
| pushpin_map_alt = World
| pushpin_label_position = right
}}
{{Infobox company
| name             = Audi AG 
| logo             = Audi-Logo 2016.svg
| logo_size = 235
| image            = Audi Ingolstadt.jpg
| image_size = 265
}}
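
The output above shows that every raw infobox starts with its template name, so restricting the results to companies comes down to checking that prefix. A minimal sketch follows. The class `InfoboxFilter` and its method are hypothetical names, and the tolerance for letter case and for an underscore in place of the space is an assumption about how the template may be written in the dump.

```java
import java.util.regex.Pattern;

// Hypothetical helper: decides whether a raw infobox string is an "Infobox company".
public class InfoboxFilter {

    // Matches "{{Infobox company" at the start, ignoring case and extra whitespace.
    private static final Pattern COMPANY =
            Pattern.compile("\\{\\{\\s*Infobox[ _]+company\\b", Pattern.CASE_INSENSITIVE);

    public static boolean isCompanyInfobox(String rawInfobox) {
        // lookingAt() anchors the match at the beginning of the trimmed text.
        return rawInfobox != null && COMPANY.matcher(rawInfobox.trim()).lookingAt();
    }
}
```

Inside `process(WikiPage page)` this could guard the print statement, for example `if (InfoboxFilter.isCompanyInfobox(infobox.dumpRaw())) { ... }`.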

1 answer:

Answer 0: (score: 0)

I have used wikixmlj before; it is a very simple, no-frills parser. This should parse it:

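// dumpPath should look like "C:/your/path/articles.xml.bz2"
// Needs java.util.regex.Pattern, java.util.regex.Matcher and edu.jhu.nlp.wikipedia.*
WikiXMLParser wxsp = WikiXMLParserFactory.getSAXParser(dumpPath);
// Both opening braces are escaped; an unescaped '{' is rejected by Java's regex compiler.
// [\s\S]+ matches the rest of the text, newlines included.
final Pattern pattern = Pattern.compile("\\{\\{Infobox company[\\s\\S]+");
wxsp.setPageCallback(new PageCallbackHandler() {
    @Override
    public void process(WikiPage page) {
        try {
            InfoBox infobox = page.getInfoBox();
            if (infobox == null) {
                return; // assumption: pages without an infobox yield null
            }
            // getInfoBox() returns an InfoBox, so match against its raw text
            Matcher matcher = pattern.matcher(infobox.dumpRaw());
            while (matcher.find()) {
                System.out.println(matcher.group(0));
            }
        } catch (WikiTextParserException e) {
            e.printStackTrace();
        }
    }
});
wxsp.parse();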
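
Once a company infobox has been matched, the name and parent fields still have to be pulled out of the raw text. Below is a minimal sketch, assuming each field sits on its own line in the `| key = value` form shown in the question's template. `InfoboxFields` and `parseFields` are hypothetical names, and values that span several lines or contain nested templates on separate lines are not handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: turns "| key = value" lines of a raw infobox into a map.
public class InfoboxFields {

    public static Map<String, String> parseFields(String rawInfobox) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : rawInfobox.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (!trimmed.startsWith("|")) {
                continue; // skips "{{Infobox company", "}}" and continuation lines
            }
            int eq = trimmed.indexOf('=');
            if (eq < 0) {
                continue; // a line without '=' carries no key/value pair
            }
            String key = trimmed.substring(1, eq).trim();
            // Drop HTML comments such as the placeholder hints in the blank template.
            String value = trimmed.substring(eq + 1).replaceAll("<!--.*?-->", "").trim();
            fields.put(key, value);
        }
        return fields;
    }
}
```

With this, `parseFields(matcher.group(0)).get("name")` and `.get("parent")` would give the two values asked for. The values are still wikitext, so links such as `[[...]]` would need a further cleanup step.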
