Question

I am using Stanford NLP for string tokenization in my classification tool. I want to get only meaningful words, but I also get non-word tokens (such as ---, >, and .) and stop words (such as am). Does anyone know a way to solve this problem?

Answer 1

For Stanford CoreNLP, there is a stopword removal annotator that removes standard stop words. You can also define custom stop words as needed (for example ---, <, and so on).

You can see an example here:

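   import java.util.List;
   import java.util.Properties;
   import edu.stanford.nlp.ling.CoreAnnotations;
   import edu.stanford.nlp.ling.CoreLabel;
   import edu.stanford.nlp.pipeline.Annotation;
   import edu.stanford.nlp.pipeline.StanfordCoreNLP;

   // `example` is assumed to be a String holding the text to process.
   Properties props = new Properties();
   props.setProperty("annotators", "tokenize, ssplit, stopword");
   // Register the custom annotator class under the name "stopword".
   props.setProperty("customAnnotatorClass.stopword", "intoxicant.analytics.coreNlp.StopwordAnnotator");

   StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
   Annotation document = new Annotation(example);
   pipeline.annotate(document);
   List<CoreLabel> tokens = document.get(CoreAnnotations.TokensAnnotation.class);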

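In the example above, the annotators property is set to "tokenize, ssplit, stopword", so the custom stopword annotator runs after tokenization and sentence splitting.

Hope this helps!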

Answer 2

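This is a very domain-specific task, and we don't do it for you in CoreNLP. You should be able to make this work with a regular expression filter and a stopword filter on top of the CoreNLP tokenizer.

Here is an example list of English stopwords.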
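
As a rough illustration of the regex-plus-stopword filtering suggested above, here is a minimal sketch in plain Java. It works on token strings only, so it assumes you have already pulled the text out of each CoreNLP token (for example with `CoreLabel.word()`). The class name `TokenFilter`, the regular expression, and the tiny stopword set are all hypothetical choices made for this example, not part of CoreNLP; a real stopword list would be much longer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of CoreNLP.
class TokenFilter {
    // A "word" here is one or more letters, optionally joined by inner
    // apostrophes or hyphens (e.g. "don't", "well-known").
    private static final Pattern WORD =
            Pattern.compile("\\p{L}+(?:['-]\\p{L}+)*");

    // Tiny illustrative stopword set (an assumption for this sketch).
    private static final Set<String> STOPWORDS = new HashSet<>(
            Arrays.asList("i", "am", "is", "are", "the", "a", "an", "to"));

    // Returns the lowercased tokens that look like words and are not stopwords.
    static List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (WORD.matcher(lower).matches() && !STOPWORDS.contains(lower)) {
                kept.add(lower);
            }
        }
        return kept;
    }
}
```

Calling `TokenFilter.filter(...)` on the tokens "I", "am", "---", ">", "using", "Stanford", "NLP", "." would keep only "using", "stanford", and "nlp". Whether to lowercase, or to allow digits in tokens, depends on your domain, so adjust the pattern accordingly.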

Text tokenization with Stanford NLP: filtering unwanted words and characters
