NER: preparing data for training

Time: 2019-02-02 21:34:07

Tags: python nlp nltk ner

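I want to try an NER task on the CoNLL 2003 data. I have seen a lot of information on how to prepare the dataset for training, but each source does it differently and none of it was comprehensive.

First, I convert the data into sentences: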

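def read_file(path):
    """Read a CoNLL-2003 style file into a list of sentences.

    Each sentence is a list of (token, POS tag, chunk tag, NER tag) tuples.
    """
    sentences = []
    sentence = []
    with open(path, "r", encoding="utf-8") as f:
        # This splits on a literal backslash-n sequence, which would match a
        # file saved as the text form of a bytes object (see the
        # "b'-DOCSTART-" check below). For a plain-text file, split on "\n".
        lines = f.read().split("\\n")
    for line in lines:
        line = line.strip()
        # Skip the document marker (as it appears with the b' prefix).
        if line.startswith("b'-DOCSTART-"):
            continue
        # A blank line ends the current sentence.
        if len(line) == 0:
            if len(sentence) > 0:
                sentences.append(sentence)
                sentence = []
            continue
        fields = line.split(" ")
        try:
            # The last three fields are the tags; anything before is the token.
            sentence.append((" ".join(fields[:-3]),
                             fields[-3], fields[-2], fields[-1]))
        except IndexError as e:
            print(e, "line: ", line)
    # Flush the last sentence if the file does not end with a blank line.
    if len(sentence) > 0:
        sentences.append(sentence)

    return sentences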

Part of the output looks like this:

[('EU', 'NNP', 'I-NP', 'I-ORG'),
 ('rejects', 'VBZ', 'I-VP', 'O'),
 ('German', 'JJ', 'I-NP', 'I-MISC'),
 ('call', 'NN', 'I-NP', 'O'),
 ('to', 'TO', 'I-VP', 'O'),
 ('boycott', 'VB', 'I-VP', 'O'),
 ('British', 'JJ', 'I-NP', 'I-MISC'),
 ('lamb', 'NN', 'I-NP', 'O'),
 ('.', '.', 'O', 'O')]

What should the next step in the NER pipeline be to prepare this data for training?
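
To make the question concrete, here is a minimal sketch of the kind of step I have in mind: split each sentence into a token sequence and an NER tag sequence, then map both to integer ids. This is only one possible approach, not a confirmed pipeline. The helper names `split_tokens_and_tags`, `build_index` and `encode`, and the `<PAD>` / `<UNK>` symbols, are my own assumptions and do not come from any library.

```python
from typing import Dict, Iterable, List, Optional, Sequence, Tuple

# One sentence in the shape shown above: (token, POS tag, chunk tag, NER tag).
Sentence = List[Tuple[str, str, str, str]]

# Hypothetical special symbols: padding and out-of-vocabulary words.
PAD, UNK = "<PAD>", "<UNK>"


def split_tokens_and_tags(
    sentences: Sequence[Sentence],
) -> Tuple[List[List[str]], List[List[str]]]:
    """Separate each sentence into its tokens and its NER tags."""
    tokens = [[tok for tok, _pos, _chunk, _ner in s] for s in sentences]
    tags = [[ner for _tok, _pos, _chunk, ner in s] for s in sentences]
    return tokens, tags


def build_index(
    sequences: Iterable[Sequence[str]], specials: Sequence[str] = ()
) -> Dict[str, int]:
    """Assign an integer id to every distinct item, in order of first use."""
    index = {sym: i for i, sym in enumerate(specials)}
    for seq in sequences:
        for item in seq:
            index.setdefault(item, len(index))
    return index


def encode(
    seq: Sequence[str], index: Dict[str, int], unk: Optional[str] = None
) -> List[int]:
    """Map items to ids; unknown items fall back to `unk` when it is given."""
    if unk is None:
        return [index[item] for item in seq]
    return [index.get(item, index[unk]) for item in seq]


# The sample sentence from the output above.
sentences = [[
    ('EU', 'NNP', 'I-NP', 'I-ORG'),
    ('rejects', 'VBZ', 'I-VP', 'O'),
    ('German', 'JJ', 'I-NP', 'I-MISC'),
    ('call', 'NN', 'I-NP', 'O'),
    ('to', 'TO', 'I-VP', 'O'),
    ('boycott', 'VB', 'I-VP', 'O'),
    ('British', 'JJ', 'I-NP', 'I-MISC'),
    ('lamb', 'NN', 'I-NP', 'O'),
    ('.', '.', 'O', 'O'),
]]

tokens, tags = split_tokens_and_tags(sentences)
word2idx = build_index(tokens, specials=(PAD, UNK))
tag2idx = build_index(tags)

X = [encode(s, word2idx, unk=UNK) for s in tokens]  # model inputs
y = [encode(s, tag2idx) for s in tags]              # training labels
```

The indices would be built from the training split only, so that words seen only in the validation or test data map to `<UNK>`. The POS and chunk columns are dropped here; a feature-based model could keep them instead.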

0 answers:

No answers yet