Manually tagging words for NLP

Time: 2019-03-19 13:50:22

Tags: machine-learning nlp lstm ner

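I'm new to machine learning and to named entity recognition, and I've been assigned the task of manually labeling several hundred paragraphs of my data in order to retrain a bidirectional LSTM model. Is there a better way, or do I have to read through everything and manually tag every organization and person?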

2 answers:

Answer 0: (score: 0)

I'm not sure I fully understand the question, but you don't have to read the entire corpus. Collapse the corpus into a set of unique words, look through that set, and mark anything that is an entity. Be careful with how you preprocess the text: for example, you can't lowercase everything, because then Apple becomes apple and you miss that entity. Some packages already recognize common entities out of the box (spaCy, for instance, already recognizes NATO), but your corpus will probably contain domain-specific entities that you still have to add yourself (how many depends on the corpus and the software you use).
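A minimal sketch of that idea, using only the Python standard library. The toy corpus, the `gazetteer` dictionary, and the `tokenize` and `tag` helpers are all hypothetical names made up for illustration; the point is just that you review the vocabulary once and then project your decisions back onto every paragraph as BIO labels.

```python
import re
from collections import Counter

# Hypothetical toy corpus; in practice this would be your paragraphs.
corpus = [
    "Apple hired Tim Cook before NATO met in Brussels.",
    "Tim Cook said Apple will open an office in Brussels.",
]

TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Split into word and punctuation tokens, preserving case."""
    return TOKEN_RE.findall(text)

# 1. Collapse the corpus into a case-preserving vocabulary with counts.
vocab = Counter(tok for doc in corpus for tok in tokenize(doc))

# 2. Review the vocabulary once and write down the entities by hand.
#    This gazetteer is an assumption; multi-word names are tuples.
gazetteer = {
    ("Apple",): "ORG",
    ("NATO",): "ORG",
    ("Tim", "Cook"): "PER",
    ("Brussels",): "LOC",
}

def tag(tokens, gazetteer):
    """Project the gazetteer onto tokens as BIO labels, longest match first."""
    labels = ["O"] * len(tokens)
    max_len = max(len(key) for key in gazetteer)
    i = 0
    while i < len(tokens):
        for n in range(max_len, 0, -1):
            if i + n > len(tokens):
                continue  # candidate span would run past the end
            entity = gazetteer.get(tuple(tokens[i:i + n]))
            if entity is not None:
                labels[i] = "B-" + entity
                for j in range(i + 1, i + n):
                    labels[j] = "I-" + entity
                i += n
                break
        else:
            i += 1  # no entity starts at this token
    return labels

# 3. Pre-annotate every paragraph; a human then only corrects mistakes.
for doc in corpus:
    tokens = tokenize(doc)
    print(list(zip(tokens, tag(tokens, gazetteer))))
```

Because the match ignores context, an ambiguous word such as Apple gets the same label everywhere, so the output is best treated as a first pass to correct rather than as finished training data.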

Answer 1: (score: 0)

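There is no straightforward answer to your question. I think you will need to use some kind of unsupervised method to prepare your supervised dataset.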

TextRank may be useful to you.

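Otherwise, I would suggest (after the usual preprocessing, e.g. lowercasing, punctuation removal, etc.) applying word2vec (or any other kind of word vectorization) and then some kind of clustering, such as K-means or, even better, DBSCAN.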

This way you will be able to visually separate the "topics/themes" in your dataset and then label them with a simple script.
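A rough sketch of the cluster-then-label step, under some loud assumptions: the 2-D `vectors` below are hand-made stand-ins for real word2vec embeddings (which would come from a trained model), `dbscan` is a deliberately minimal pure-Python version of the algorithm rather than a library implementation, and `cluster_names` is the mapping a person would write down after looking at the clusters.

```python
import math

# Hypothetical 2-D "embeddings"; real ones would come from word2vec.
vectors = {
    "google": (0.9, 0.1), "microsoft": (1.0, 0.2), "ibm": (0.95, 0.0),
    "alice": (0.1, 0.9), "bob": (0.0, 1.0), "carol": (0.15, 0.95),
    "the": (5.0, 5.0),
}

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN. Returns {word: cluster_id}; -1 means noise."""
    words = list(points)

    def neighbors(word):
        # Every word within eps of `word`, including the word itself.
        return [o for o in words if math.dist(points[word], points[o]) <= eps]

    labels = {}
    cluster = -1
    for word in words:
        if word in labels:
            continue
        nbrs = neighbors(word)
        if len(nbrs) < min_pts:
            labels[word] = -1  # provisionally noise
            continue
        cluster += 1
        labels[word] = cluster
        queue = [n for n in nbrs if n != word]
        while queue:
            q = queue.pop()
            if labels.get(q) == -1:
                labels[q] = cluster  # noise point becomes a border point
            if q in labels:
                continue
            labels[q] = cluster
            q_nbrs = neighbors(q)
            if len(q_nbrs) >= min_pts:
                queue.extend(q_nbrs)  # q is a core point: keep expanding
    return labels

clusters = dbscan(vectors, eps=0.3, min_pts=2)

# After inspecting the clusters, a human names them (an assumption here).
cluster_names = {0: "ORG", 1: "PER"}
word_labels = {w: cluster_names.get(c, "O") for w, c in clusters.items()}

# The "simple script": label a lowercased sentence token by token.
sentence = "alice joined google".split()
print([(tok, word_labels.get(tok, "O")) for tok in sentence])
```

Clusters of word vectors group words that occur in similar contexts, which does not always line up with entity types, so the cluster names and the resulting labels still need a manual check before training on them.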

Hope this makes sense and helps.