Question

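I am using CoreNLP for named-entity extraction and have run into a problem. Whenever a named entity consists of more than one token, for example "Han Solo", the annotator does not return "Han Solo" as a single named entity but as two separate entities, "Han" and "Solo".

Is it possible to get the named entity as a single token? I know I can use CRFClassifier with classifyWithInlineXML for this, but my solution requires CoreNLP because I also need the word numbers.

Here is the code I have so far: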

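    import java.util.List;
    import java.util.Properties;
    import edu.stanford.nlp.ling.CoreAnnotations.*;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.CoreMap;

    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse");
    props.setProperty("ner.model", "edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation document = new Annotation(text);
    pipeline.annotate(document);
    List<CoreMap> sentences = document.get(SentencesAnnotation.class);
    for (CoreMap sentence : sentences) {
        for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
            // prints one NER tag per token, so "Han" and "Solo" appear separately
            System.out.println(token.get(NamedEntityTagAnnotation.class));
        }
    }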

Help me, Obi-Wan Kenobi. You're my only hope.

Answer 1

    import java.io.File;
    import java.io.FileNotFoundException;
    import java.io.PrintWriter;
    import edu.stanford.nlp.ie.AbstractSequenceClassifier;
    import edu.stanford.nlp.ie.crf.CRFClassifier;
    import edu.stanford.nlp.ling.CoreLabel;

    String inputLine = "Several possible plans emerged from the talks, held at the Federal Reserve Bank of New York"
            + " and led by Timothy R. Geithner, the president of the New York Fed, and Treasury Secretary Henry M. Paulson Jr.";

    String serializedClassifier = "english.all.3class.distsim.crf.ser.gz";
    AbstractSequenceClassifier<CoreLabel> classifier = CRFClassifier.getClassifierNoExceptions(serializedClassifier);

    // try-with-resources closes the writer automatically; the original finally
    // block would throw a NullPointerException if the file could not be opened
    try (PrintWriter writer = new PrintWriter(new File("output.xml"))) {
        writer.println("<Sentences>");
        writer.println("<Sentence>" + classifier.classifyToString(inputLine, "xml", true) + "</Sentence>");
        writer.println("</Sentences>");
    } catch (FileNotFoundException ex) {
        ex.printStackTrace();
    }

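I was able to come up with this solution. I write the output to an XML file, "output.xml". In that output, consecutive nodes whose entity attribute is "PERSON", "ORGANIZATION", or "LOCATION" can be merged into a single entity. This format also includes the word numbers by default.

Here is a snapshot of the XML output: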

    <wi num="11" entity="ORGANIZATION">Federal</wi>
    <wi num="12" entity="ORGANIZATION">Reserve</wi>
    <wi num="13" entity="ORGANIZATION">Bank</wi>
    <wi num="14" entity="ORGANIZATION">of</wi>
    <wi num="15" entity="ORGANIZATION">New</wi>
    <wi num="16" entity="ORGANIZATION">Yorkand</wi>

As the output above shows, consecutive words are tagged "ORGANIZATION", so they can be merged into a single entity.
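The merging step is not shown in the answer, so here is a minimal sketch of one way it could be done with the JDK's DOM parser. The class name `WiMerger`, the method `mergeEntities`, and the sample XML are all hypothetical and hand-written for illustration; the sketch also assumes that non-entity words carry `entity="O"`.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper (not part of CoreNLP): merges consecutive <wi> nodes
// that share the same entity attribute into one entry.
class WiMerger {

    // Returns entries of the form "ENTITY firstNum-lastNum: merged text".
    static List<String> mergeEntities(String xml) {
        try {
            NodeList nodes = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getElementsByTagName("wi");
            List<String> result = new ArrayList<>();
            String prevEntity = "O";   // assumption: "O" marks a non-entity word
            String firstNum = null;
            String lastNum = null;
            StringBuilder text = new StringBuilder();
            // One extra pass with a sentinel "O" flushes an entity that ends the input.
            for (int i = 0; i <= nodes.getLength(); i++) {
                Element wi = i < nodes.getLength() ? (Element) nodes.item(i) : null;
                String entity = (wi == null) ? "O" : wi.getAttribute("entity");
                if (!entity.equals(prevEntity)) {
                    if (!prevEntity.equals("O")) {
                        result.add(prevEntity + " " + firstNum + "-" + lastNum + ": " + text);
                    }
                    text.setLength(0);
                    firstNum = (wi == null) ? null : wi.getAttribute("num");
                }
                if (!entity.equals("O")) {
                    if (text.length() > 0) text.append(' ');
                    text.append(wi.getTextContent().trim());
                    lastNum = wi.getAttribute("num");
                }
                prevEntity = entity;
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written sample modeled on the snapshot above, not real classifier output.
        String xml = "<Sentence>"
                + "<wi num=\"10\" entity=\"O\">the</wi>"
                + "<wi num=\"11\" entity=\"ORGANIZATION\">Federal</wi>"
                + "<wi num=\"12\" entity=\"ORGANIZATION\">Reserve</wi>"
                + "<wi num=\"13\" entity=\"ORGANIZATION\">Bank</wi>"
                + "<wi num=\"14\" entity=\"O\">and</wi>"
                + "</Sentence>";
        System.out.println(mergeEntities(xml));
    }
}
```

Note that two distinct entities of the same type that happen to be adjacent would be merged into one by this approach.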

Answer 2

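I use a temporary variable to hold the previous NER tag and check whether the current NER tag equals it; if so, the two words are combined. The iteration then continues by assigning the current NER tag to the temporary variable.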
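This answer gives no code, so the following is only a sketch of the idea as described. It works on plain arrays so that it runs without CoreNLP; in the question's loop the words and tags would presumably come from each `CoreLabel` (for example `token.word()` and `token.get(NamedEntityTagAnnotation.class)`). The class `EntityMerger`, the method `merge`, and the output format are hypothetical, and "O" is assumed to be the tag for non-entity tokens.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of CoreNLP): merges runs of adjacent tokens
// that share the same NER tag into a single entity.
class EntityMerger {

    // Returns entries of the form "TAG[first-last]: text" with 1-based token positions.
    static List<String> merge(String[] words, String[] tags) {
        List<String> entities = new ArrayList<>();
        String prevTag = "O";                 // the temporary variable from the answer
        StringBuilder current = new StringBuilder();
        int start = -1;
        for (int i = 0; i < words.length; i++) {
            String tag = tags[i];
            if (!tag.equals(prevTag)) {
                // The tag changed: close the entity that was open, if any.
                if (!prevTag.equals("O")) {
                    entities.add(prevTag + "[" + start + "-" + i + "]: " + current);
                }
                current.setLength(0);
                start = i + 1;
            }
            if (!tag.equals("O")) {
                if (current.length() > 0) current.append(' ');
                current.append(words[i]);
            }
            prevTag = tag;                    // assign the current tag to the temp variable
        }
        // Flush an entity that runs to the end of the input.
        if (!prevTag.equals("O")) {
            entities.add(prevTag + "[" + start + "-" + words.length + "]: " + current);
        }
        return entities;
    }

    public static void main(String[] args) {
        String[] words = {"Help", "me", "Obi-Wan", "Kenobi", "."};
        String[] tags  = {"O", "O", "PERSON", "PERSON", "O"};
        System.out.println(merge(words, tags)); // prints [PERSON[3-4]: Obi-Wan Kenobi]
    }
}
```

Because the token positions are kept alongside the merged text, the word numbers the question asks for are still available. One limitation: two different entities of the same type that sit next to each other are merged into one.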

Extracting multi-word named entities with CoreNLP
