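OpenNLP: training a named entity model

Date: 2017-03-28 09:58:45

Tags: opennlp named-entity-recognition

I am training a named entity recognition model, but it does not recognize person names correctly.

My training data looks like this: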

<START:person> Pierre Vinken <END>  , 61 years old , will join the board as a nonexecutive director Nov. 29 . A nonexecutive  director has many similar responsibilities as an executive director.However, there are no voting rights with this position.
Mr . <START:person> Vinken <END> is chairman of Elsevier N.V., the Dutch publishing group. 
The former chairman of the society  <START:person> Rudolph Agnew <END> will be assisting <START:person> Vinken <End> in his activities. 
Mr . <START:person> Vinken <END> is the most right person in the industry.
His competitior <START:person> Steve <END> is vice chairman of Himbeldon N.V., the Ericson publishing group.
<START:person> Vinken <END> will also be assisted by <START:person> Angelina Tucci <END>  who has been recognized many times For Her Good Work. 
<START:person> Juilie <END>  vp of Weterwood A.B., THE ZS publishing group also supported him.
Mr . <START:person> Stewart <END> is a recruiter of Metric C.D., the Drishti publishing.
He recruited <START:person> Adam <END>  who will work on nlp  for <START:person> Vinken <END> .
The lead conference  for appointing him as a director was held by <START:person> Daniel Smith <END> at Boston.
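
The name finder training format marks each name with whitespace-separated `<START:person>` and `<END>` tags, so a quick sanity check of the annotations can catch problems before training. The following is only a minimal sketch: `NameTagChecker` and `check` are hypothetical names (not part of OpenNLP), and it assumes the tags must be spelled exactly as shown, including upper case.

```java
import java.util.ArrayList;
import java.util.List;

class NameTagChecker {
    /** Returns human-readable problems found in one training line (empty list if none). */
    static List<String> check(String line) {
        List<String> problems = new ArrayList<>();
        boolean open = false; // true while inside a <START:...> ... <END> span
        for (String tok : line.trim().split("\\s+")) {
            if (tok.startsWith("<START:") && tok.endsWith(">")) {
                if (open) problems.add("start tag inside an unclosed name: " + tok);
                open = true;
            } else if (tok.equals("<END>")) {
                if (!open) problems.add("<END> without a matching start tag");
                open = false;
            } else if (tok.startsWith("<") && tok.endsWith(">")) {
                // Assumption: anything else that looks like a tag is a misspelled one.
                problems.add("unrecognized tag (check spelling and case): " + tok);
            }
        }
        if (open) problems.add("start tag never closed");
        return problems;
    }

    public static void main(String[] args) {
        String good = "Mr . <START:person> Vinken <END> is chairman .";
        String bad = "<START:person> Rudolph Agnew <END> will assist <START:person> Vinken <End> today .";
        System.out.println(check(good)); // no problems expected
        System.out.println(check(bad));  // flags the lower-case tag and the unclosed name
    }
}
```

Running the check over every line of the training file before calling the trainer is cheap and points directly at the line that needs fixing.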

The Java class used to train the model is:

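import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class NamedEntityModel {
    public static void train(String inputfile, String modelfile) throws IOException {
        // Read the annotated training file line by line as UTF-8
        MarkableFileInputStreamFactory factory = new MarkableFileInputStreamFactory(new File(inputfile));
        ObjectStream<String> lineStream = new PlainTextByLineStream(factory, StandardCharsets.UTF_8);
        ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);
        TokenNameFinderModel model = null;

        // Train a "person" name finder with the default training parameters
        try {
            model = NameFinderME.train("en", "person", sampleStream,
                    TrainingParameters.defaultParams(), new TokenNameFinderFactory());
        } finally {
            sampleStream.close();
        }

        // Write the trained model to disk
        BufferedOutputStream modelOut = null;
        try {
            modelOut = new BufferedOutputStream(new FileOutputStream(modelfile));
            model.serialize(modelOut);
        } finally {
            if (modelOut != null)
                modelOut.close();
        }
    }
}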

This is what the main class looks like:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.Span;

public class NameFinder {
    public static void main(String[] args) throws IOException {
        String inputfile = "C:/setup/apache-opennlp-1.7.2/bin/ner_training_data.txt";
        String modelfile = "C:/setup/apache-opennlp-1.7.2/bin/en-tr-ner-person.bin";

        NamedEntityModel.train(inputfile, modelfile);

        String sentence = "Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 . Mr . Vinken is chairman of Elsevier N.V. , the Dutch publishing group. Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields PLC , was named a director of this British industrial conglomerate . Peter is on leave today . "
                + "Steve is his competitor . Daniel Smith lead the ceremony. Kristen is svery happpy to know about it. Thomas will u please look into the matter as Ruby is busy";

        // Tokenize the given paragraph on whitespace
        WhitespaceTokenizer whitespaceTokenizer = WhitespaceTokenizer.INSTANCE;
        String[] tokens = whitespaceTokenizer.tokenize(sentence);
        for (String str : tokens)
            System.out.println(str);

        // Load the trained model; the stream can be closed once the model is in memory
        TokenNameFinderModel model;
        try (InputStream inputStreamNameFinder = new FileInputStream(modelfile)) {
            model = new TokenNameFinderModel(inputStreamNameFinder);
        }

        NameFinderME nameFinder = new NameFinderME(model);

        // Find the name spans and print them, followed by each span with its first token
        Span[] nameSpans = nameFinder.find(tokens);
        System.out.println(Arrays.toString(Span.spansToStrings(nameSpans, tokens)));

        for (Span s : nameSpans)
            System.out.println(s.toString() + "  " + tokens[s.getStart()]);
    }
}

The output is:

[Pierre Vinken, Vinken, Peter, Steve, Daniel Smith, Kristen, Thomas]

The trained model fails to recognize names such as Rudolph Agnew and Ruby. How can I train it so that it recognizes names more reliably?

2 answers:

Answer 0: (score: 1)

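+1 to @caffeinator13's answer. In addition, there are some parameters (https://opennlp.apache.org/documentation/1.5.3/apidocs/opennlp-tools/opennlp/tools/util/TrainingParameters.html) (the link is for an older version, but I assume the params still exist in newer versions) that control the number of iterations and (probably more relevant to you) the cutoff, i.e. the number of times an entity must appear in the training data to be considered. This setting more or less controls precision vs. recall, so perhaps you should set it more leniently (not sure what the default is). So instead of using the default parameters, you could try: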

TrainingParameters tp = new TrainingParameters();
tp.put(TrainingParameters.CUTOFF_PARAM, "1");
tp.put(TrainingParameters.ITERATIONS_PARAM, "100");
TokenNameFinderFactory tnff = new TokenNameFinderFactory();
model = NameFinderME.train(language, modelName, sampleStream, tp, tnff);
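
To make the effect of the cutoff more concrete, here is a small sketch of count-based filtering. It only illustrates the general idea and is not OpenNLP's internal implementation; `CutoffDemo`, `applyCutoff`, and the `w=...` feature strings are hypothetical names made up for the example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CutoffDemo {
    /** Counts each feature and keeps only those seen at least `cutoff` times. */
    static Map<String, Integer> applyCutoff(List<String> features, int cutoff) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String f : features) {
            counts.merge(f, 1, Integer::sum);
        }
        // Drop everything that was observed fewer than `cutoff` times.
        counts.values().removeIf(c -> c < cutoff);
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical features extracted from a tiny training set.
        List<String> seen = Arrays.asList(
                "w=vinken", "w=vinken", "w=vinken", "w=agnew",
                "w=steve", "w=vinken", "w=vinken");

        // With a strict cutoff only the frequent feature survives.
        System.out.println(applyCutoff(seen, 5));
        // With a cutoff of 1 the rare features are kept as well.
        System.out.println(applyCutoff(seen, 1));
    }
}
```

On a training file of only ten lines, most names occur once or twice, so a strict cutoff leaves the model with very little to learn from.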

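Answer 1: (score: 0)

According to the OpenNLP documentation, the training data should contain at least 15,000 sentences to create a model that performs well. So train it with more data, and test it with different names rather than keeping the test data the same as the training data!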
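
One way to keep test sentences separate from training sentences is to shuffle the annotated lines once and set a fraction aside. This is a minimal sketch using only the Java standard library; `HoldOutSplit`, `split`, the 20% test fraction, and the seed are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class HoldOutSplit {
    /**
     * Shuffles a copy of the sentences with a fixed seed and splits it in two.
     * Index 0 of the result is the training part, index 1 the held-out test part.
     */
    static List<List<String>> split(List<String> sentences, double testFraction, long seed) {
        List<String> shuffled = new ArrayList<>(sentences);
        Collections.shuffle(shuffled, new Random(seed));

        int testSize = (int) Math.round(shuffled.size() * testFraction);
        int cut = shuffled.size() - testSize;

        List<List<String>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, cut)));
        parts.add(new ArrayList<>(shuffled.subList(cut, shuffled.size())));
        return parts;
    }

    public static void main(String[] args) {
        List<String> sentences = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            sentences.add("sentence " + i); // placeholder for annotated training lines
        }
        List<List<String>> parts = split(sentences, 0.2, 42L);
        System.out.println("train: " + parts.get(0).size() + ", test: " + parts.get(1).size());
    }
}
```

The training part would be written to the file passed to the trainer, and the test part is only used afterwards to check which names the model finds.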