Pointwise Mutual Information with spaCy

Date: 2019-04-20 19:11:15

Tags: spacy

So I've recently been experimenting with NLP and decided to take on a project involving sentiment analysis. I've been following this particular paper: http://www.cse.yorku.ca/~aan/research/paper/Emo_WI10.pdf

However, I can't figure out how to implement Section III, Part E (PMI). I don't know how to build my corpus, what the window sizes should be, or how to decide what should go into them. I'm using spaCy, so extracting the information needed for the earlier sections isn't hard. Any explanation or help would be greatly appreciated.

1 Answer:

Answer 0 (score: 0)

A lot of NLP methods for "meaning" or "semantic" similarity rely on the distributional hypothesis: words that show up in similar contexts have similar meanings. For example, given "I pet the dog" and "I pet the cat", we might assume that "dog" and "cat" have related meanings.

spaCy uses word embeddings, which are trained on thousands of documents (usually news articles or Wikipedia pages) using that same idea. word2vec models remove a word from a sentence, look at the words before and after the resulting gap, and train a model to correctly predict the missing word. The output of such a model is a set of word embeddings.
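You don't need spaCy to experiment with that training process. As an aside, here is a minimal sketch of the same idea using gensim's word2vec implementation (a separate library, not spaCy; the toy corpus and parameter values below are illustrative assumptions):

from gensim.models import Word2Vec

# Hypothetical toy corpus: each sentence is a list of tokens.
# A real corpus would be thousands of documents.
corpus = [
    ['i', 'pet', 'the', 'dog'],
    ['i', 'pet', 'the', 'cat'],
    ['the', 'dog', 'chased', 'the', 'cat'],
]

# window=2 means up to 2 words on each side of the target word
# are used as context when predicting it (CBOW mode, sg=0).
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)

print(model.wv['dog'])                    # the learned embedding vector
print(model.wv.similarity('dog', 'cat'))  # cosine similarity of the two vectors

With a corpus this small the similarities are essentially noise; the sketch only shows the mechanics.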

Embeddings are numerical representations of words. Using these representations, we can calculate the distance or similarity between two words or sentences. The most common method is to compute the cosine similarity of the two embedding vectors.
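For illustration, cosine similarity is just the dot product of two vectors divided by the product of their magnitudes. A minimal sketch with NumPy, using made-up toy vectors:

import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|); 1.0 means identical direction
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical 3-dimensional "embeddings" (real ones have hundreds of dimensions)
dog = np.array([0.8, 0.3, 0.1])
cat = np.array([0.7, 0.4, 0.2])
print(cosine_similarity(dog, cat))  # close to 1.0 for similar vectors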

I'm not certain how to compute PMI with spaCy directly, but you can calculate semantic similarity in spaCy using the method I described above.

import spacy

# en_core_web_lg ships with pre-trained word vectors (the _sm model does not)
nlp = spacy.load('en_core_web_lg')

doc1 = nlp('assisted living communities near me')
doc2 = nlp('list of assisted living facilities')
doc3 = nlp('free puppy and kitty adoption')

# Doc.similarity compares the averaged word vectors of the two documents
print(doc1.similarity(doc2))    # 0.8091
print(doc1.similarity(doc3))    # 0.4659
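As for PMI itself, spaCy won't compute it for you, but you can count co-occurrences over sliding windows of tokens yourself and apply the standard formula PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ). Below is a generic sketch of window-based PMI; the pmi_table helper, the toy corpus, and the window size are my own illustrative assumptions, not the exact setup from the paper, and you would tune the window size against your corpus:

import math
from collections import Counter
from itertools import combinations

def pmi_table(tokenized_docs, window=5):
    # Slide a fixed-size window over each document, counting how many
    # windows contain each word and each unordered word pair.
    word_counts = Counter()
    pair_counts = Counter()
    total_windows = 0
    for tokens in tokenized_docs:
        for i in range(len(tokens)):
            win = set(tokens[i:i + window])
            total_windows += 1
            word_counts.update(win)
            # count each unordered pair at most once per window
            pair_counts.update(frozenset(p) for p in combinations(win, 2))
    # PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), with probabilities
    # estimated as "fraction of windows containing the word(s)".
    pmi = {}
    for pair, c in pair_counts.items():
        x, y = tuple(pair)
        p_xy = c / total_windows
        p_x = word_counts[x] / total_windows
        p_y = word_counts[y] / total_windows
        pmi[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return pmi

# spaCy can supply the tokenization, e.g. (hypothetical texts):
# docs = [[t.text.lower() for t in nlp(text)] for text in texts]
docs = [['i', 'pet', 'the', 'dog'], ['i', 'pet', 'the', 'cat']]
print(pmi_table(docs, window=3))

Positive PMI means the pair co-occurs more often than chance would predict; in the paper's setting you would compute this between candidate words and known sentiment seed words over your own corpus.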