Developing a rule-based NER using TokensRegex and classifying entities based on context words

Time: 2017-04-05 10:32:09

Tags: java stanford-nlp named-entity-recognition

I am using the simple rules file given below to detect named entities in text. Here is an example sentence for this rule:

  

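Bill Gates, President and Chairman of Microsoft.

Here the first NNP POS tag refers to the PERSON Bill Gates, and the second NNP POS tag refers to the ORGANIZATION Microsoft.

I get an empty output.

I am not sure how to capture the PERSON and ORGANIZATION entities. What changes should I make to my rules file so that these groups are captured, or at least one of them, such as ORGANIZATION?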

$TITLES_CORPORATE = "/chief administrative officer|cao|chief marketing officer|cmo|chief operating officer|coo|chief privacy officer|cpo|chief process officer|chief product officer|chief reputation officer|cro|chief research officer|chief restructuring officer|chief risk officer|chief science officer|cso|chief scientific Officer|chief security officer|chief services officer|chief strategy officer|chief sustainability officer|chief technology officer|vice chairman|general manager|gm|manager/";
$TITLE_PREFIXES = "/senior|executive|assistant|deputy|chief|general|staff/";


{
  ruleType: "tokens",
  pattern: ( [ { pos:NNP } ]+ ($TITLE_PREFIXES)? $TITLES_CORPORATE /,/? /of/? [ { pos:NNP } ]+ ),
  result: "ORGANIZATION"
}

Here is my code:

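import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.tokensregex.CoreMapExpressionExtractor;
import edu.stanford.nlp.ling.tokensregex.MatchedExpression;
import edu.stanford.nlp.ling.tokensregex.TokenSequencePattern;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class NERRulesDemo { // class name is a placeholder
    public static void main(String[] args) throws IOException {
        String rulesFile = "D:\\Workspace\\resource\\NERRulesFile.txt";
        String dataFile = "D:\\Workspace\\resource\\GoldSetSentences.txt";

        // POS tags are needed because the rule matches on { pos:NNP }.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // Annotation(String) expects the text itself, not a path, so read the file first.
        String text = new String(Files.readAllBytes(Paths.get(dataFile)), StandardCharsets.UTF_8);
        Annotation document = new Annotation(text);
        pipeline.annotate(document);
        List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);

        // Load the TokensRegex rules and apply them sentence by sentence.
        CoreMapExpressionExtractor extractor =
            CoreMapExpressionExtractor.createExtractorFromFiles(TokenSequencePattern.getNewEnv(), rulesFile);

        for (CoreMap sentence : sentences) {
            List<MatchedExpression> matched = extractor.extractExpressions(sentence);
            System.out.println(matched);
        }
    }
}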

1 Answer:

Answer 0: (score: 0)

Here is the example sentence as tokens:

[Bill, Gates, President, and, Chairman, of, Microsoft, Corp, .]

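TokensRegex rules operate over TOKENS, so each regular expression needs to match a single token. Your rule therefore won't work at all, because its regular expressions contain multi-token phrases such as "chief technology officer".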
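
To make the token-level point concrete, here is a small plain-Java sketch. It does not use TokensRegex itself; it only imitates the idea, on the assumption that a /regex/ is tested against one whole token at a time. The class and method names (TokenRegexDemo, anyTokenMatches) are made up for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class TokenRegexDemo {
    // Returns true if at least one single token fully matches the regex.
    // This imitates applying a token-level regex to each token in turn.
    public static boolean anyTokenMatches(String regex, List<String> tokens) {
        Pattern p = Pattern.compile(regex);
        for (String token : tokens) {
            if (p.matcher(token).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
            "Bill", "Gates", "President", "and", "Chairman", "of", "Microsoft", "Corp", ".");

        // A regex containing spaces can never equal a single token.
        System.out.println(anyTokenMatches("President and Chairman", tokens)); // prints false

        // A regex for one word matches the corresponding token.
        System.out.println(anyTokenMatches("President|Chairman", tokens));     // prints true
    }
}
```

A multi-word title therefore has to be written as a sequence of per-token regexes rather than as one regex with spaces in it.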

Here is a pattern that matches "President and Chairman of Microsoft Corp" in the example above:

pattern: (/President/ /and/? /Chairman/ /of/? [{pos: NNP}]+)
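
Applied to the rules file from the question, the same idea might look roughly like the fragment below. This is an untested sketch, not a verified rules file: it assumes a macro may be defined as a parenthesized alternation of token sequences, it lists only a few of the original titles, and it spells them with the capitalization of the example sentence (case-insensitive matching would have to be configured separately).

```
# Sketch only: each multi-word title is a sequence of per-token regexes.
$TITLES_CORPORATE = (
  /Chief/ /Technology/ /Officer/ |
  /Vice/ /Chairman/ |
  /General/ /Manager/ |
  /President/ | /Chairman/ | /Manager/
)

{
  ruleType: "tokens",
  pattern: ( $TITLES_CORPORATE ( /and/ $TITLES_CORPORATE )? /of/? ( [ { pos:NNP } ]+ ) ),
  result: "ORGANIZATION"
}
```

The inner parentheses around the NNP tokens are there so that the organization name forms its own capture group, separate from the title words.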