Question

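We are trying to complement existing

tokenization
sentence splitting
and named entity tagging

while we would like to use Stanford CoreNLP to additionally provide us with

part-of-speech tagging
lemmatization
and parsing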

Currently, we are trying it the following way:

1) Make an annotator for "pos, lemma, parse":

Properties pipelineProps = new Properties();
pipelineProps.put("annotators", "pos, lemma, parse");
pipelineProps.setProperty("parse.maxlen", "80");
pipelineProps.setProperty("pos.maxlen", "80");
StanfordCoreNLP pipeline = new StanfordCoreNLP(pipelineProps);

2) Read in the sentences with a custom method:

List<CoreMap> sentences = getSentencesForTaggedFile(idToDoc.get(docId));

Within that method, the tokens are constructed as follows:

CoreLabel clToken = new CoreLabel();
clToken.setValue(stringToken);
clToken.setWord(stringToken);
clToken.setOriginalText(stringToken);
clToken.set(CoreAnnotations.NamedEntityTagAnnotation.class, neTag);
sentenceTokens.add(clToken);

and they are assembled into sentences like this:

Annotation sentence = new Annotation(sb.toString());
sentence.set(CoreAnnotations.TokensAnnotation.class, sentenceTokens);
sentence.set(CoreAnnotations.TokenBeginAnnotation.class, tokenOffset);
tokenOffset += sentenceTokens.size();
sentence.set(CoreAnnotations.TokenEndAnnotation.class, tokenOffset);
sentence.set(CoreAnnotations.SentenceIndexAnnotation.class, sentences.size());

3) The list of sentences is passed to the pipeline:

  Annotation document = new Annotation(sentences);
  pipeline.annotate(document);

However, when running this, we get the following error:

null: InvocationTargetException: annotator "pos" requires annotator "tokenize"

Any pointers on how we can achieve what we want to do?

Answer 1

The exception is thrown because the requirements expected by the "pos" annotator (an instance of the POSTaggerAnnotator class) are not satisfied.

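The requirements for the annotators that StanfordCoreNLP knows how to create are defined in the Annotator interface. For the "pos" annotator, two requirements are defined:

tokenize
ssplit

Both requirements need to be satisfied, which means that both the "tokenize" annotator and the "ssplit" annotator must be specified in the annotators list before the "pos" annotator.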
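
To make the ordering rule concrete, here is a toy model of such a check. It is a sketch for illustration only and is not CoreNLP's actual implementation: the class name `RequirementCheckSketch`, the method `firstViolation`, and the lookup table are hypothetical, and only the "pos" entry (tokenize, ssplit) is taken from the description above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of an ordered requirements check; NOT CoreNLP's real code.
class RequirementCheckSketch {

    // Hypothetical requirements table. Only the "pos" entry is taken from the answer;
    // annotators without an entry are treated as having no requirements.
    static final Map<String, List<String>> REQUIRES = new HashMap<>();
    static {
        REQUIRES.put("pos", Arrays.asList("tokenize", "ssplit"));
    }

    // Walks the comma-separated annotator list in order and returns a message for
    // the first requirement that was not listed earlier, or null if none is missing.
    static String firstViolation(String annotators) {
        Set<String> seen = new HashSet<>();
        for (String raw : annotators.split(",")) {
            String name = raw.trim();
            if (name.isEmpty()) {
                continue;
            }
            for (String needed : REQUIRES.getOrDefault(name, Collections.emptyList())) {
                if (!seen.contains(needed)) {
                    return "annotator \"" + name + "\" requires annotator \"" + needed + "\"";
                }
            }
            seen.add(name);
        }
        return null;
    }

    public static void main(String[] args) {
        // The list from the question: "pos" comes first, so "tokenize" is reported.
        System.out.println(firstViolation("pos, lemma, parse"));
        // Both requirements listed before "pos": nothing is reported.
        System.out.println(firstViolation("tokenize, ssplit, pos, lemma, parse"));
    }
}
```

Under this model, a list such as "tokenize, pos" would still be flagged, because "ssplit" is missing before "pos".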

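Now back to the question... If you want to skip the "tokenize" and "ssplit" annotators in your pipeline, you need to disable the requirements check that is performed during initialization of the pipeline. I found two ways to do this:

Disable requirements enforcement in the properties object passed to the StanfordCoreNLP constructor:

props.setProperty("enforceRequirements", "false");

Set the enforceRequirements parameter of the StanfordCoreNLP constructor to false: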

StanfordCoreNLP pipeline = new StanfordCoreNLP(props, false);
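
Combining the property-based option with the configuration from the question might look roughly like the sketch below. It uses only `java.util.Properties` so that it runs without CoreNLP on the classpath; the class name `PartialPipelineProps` and the method `build` are hypothetical, and the pipeline construction itself is shown only as a comment.

```java
import java.util.Properties;

// Hypothetical helper; only the property names and values come from the posts above.
class PartialPipelineProps {

    static Properties build() {
        Properties props = new Properties();
        // Only the annotators CoreNLP should add; tokens, sentences and NE tags already exist.
        props.setProperty("annotators", "pos, lemma, parse");
        props.setProperty("parse.maxlen", "80");
        props.setProperty("pos.maxlen", "80");
        // Skip the start-up requirements check, as suggested in this answer.
        props.setProperty("enforceRequirements", "false");
        return props;
    }

    public static void main(String[] args) {
        Properties props = build();
        // With CoreNLP on the classpath, the pipeline would then be created as in the question:
        //   StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        System.out.println(props.getProperty("enforceRequirements"));
    }
}
```

Note that this only turns off the check at initialization. The annotators presumably still read the token and sentence annotations from the document passed to annotate(), so the hand-built Annotation has to supply them.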

Answer 2

You should add the "tokenize" annotator:

pipelineProps.put("annotators", "tokenize, pos, lemma, parse");

Stanford CoreNLP: Use partial existing annotation
