Extracting the main features of a paragraph with word2vec

Date: 2018-05-16 11:53:33

Tags: python word2vec feature-extraction

I've just gotten started with Google's word2vec model and am new to the concept. I'm trying to extract the main features of a paragraph using the following approach.

from gensim.models.keyedvectors import KeyedVectors
model = KeyedVectors.load_word2vec_format('../../usr/myProject/word2vec/GoogleNews-vectors-negative300.bin', binary=True)

...

for para in paragraph_array:
    para_name = "para_" + file_name + '{0}'
    sentence_array = d[para_name.format(number_of_paragraphs)] = []

    # Split paragraph into sentences on '.', '?' or '!'.
    for l in re.split(r"\.|\?|\!", para):
        # Append the sentence as-is (splitting on spaces is commented out).
        sentence_array.append(l)
        #sentence_array.append(l.split(" "))

    print(model.wv.most_similar(positive=para, topn=1))

But I get the following error, which says that the paragraph being looked up is not a word in the vocabulary.

KeyError: "word 'The Republic of Ghana is a country in West Africa. It borders Côte d'Ivoire (also known as Ivory Coast) to the west, Burkina Faso to the north, Togo to the east, and the Gulf of Guinea to the south. The word "Ghana" means "Warrior King", Jackson, John G. Introduction to African Civilizations, 2001. Page 201. and was the source of the name "Guinea" (via French Guinoye) used to refer to the West African coast (as in Gulf of Guinea).' not in vocabulary"

Now I understand that the most_similar() function requires a list. But I'd like to know how to use the word2vec model to reduce a paragraph to the main features, or words, that convey its central concept.

Edit

I modified the code above to pass word_array to the most_similar() method, and I received the following error.

Traceback (most recent call last):
  File "/home/manuelanayantarajeyaraj/PycharmProjects/ChatbotWord2Vec/new_approach.py", line 108, in <module>
    print(model.wv.most_similar(positive=word_array, topn=1))
  File "/home/manuelanayantarajeyaraj/usr/myProject/my_project/lib/python3.5/site-packages/gensim/models/keyedvectors.py", line 361, in most_similar
    for word, weight in positive + negative:
ValueError: too many values to unpack (expected 2)

Modified implementation

for sentence in sentence_array:
    if sentence:
        for w in re.split(r"\.|\?|\!|\@|\#|\$|\%|\^|\&|\*|\(|\)|\-", sentence):
            split_word = w.split(" ")
            if split_word:
                word_array.append(split_word)
print(model.wv.most_similar(positive=word_array, topn=1))
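
Looking at the traceback, each split_word here is itself a list, so word_array ends up as a list of lists, which gensim then tries to unpack as (word, weight) pairs. A minimal sketch of a flattened version, reusing the same variables (out-of-vocabulary tokens would still need filtering):

# Sketch: extend() flattens the per-sentence token lists, so that
# most_similar() receives a flat list of word strings.
word_array = []
for sentence in sentence_array:
    if sentence:
        for w in re.split(r"\.|\?|\!|\@|\#|\$|\%|\^|\&|\*|\(|\)|\-", sentence):
            word_array.extend(token for token in w.split(" ") if token)

print(model.wv.most_similar(positive=word_array, topn=1))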

Any suggestions in this regard would be much appreciated.

2 answers:

Answer 0 (score: 1)

Your error indicates that you are looking up the entire string ('The Republic of Ghana is a country in West Africa. It borders Côte d\'Ivoire (also known as Ivory Coast) to the west, Burkina Faso to the north, Togo to the east, and the Gulf of Guinea to the south. The word "Ghana" means "Warrior King", Jackson, John G. Introduction to African Civilizations, 2001. Page 201. and was the source of the name "Guinea" (via French Guinoye) used to refer to the West African coast (as in Gulf of Guinea).') as if it were a single word, and no such word exists.
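
To make that concrete, here is a quick membership check (a minimal sketch; model is the KeyedVectors object loaded in the question, and gensim 3.x exposes the vocabulary as model.vocab, renamed to model.key_to_index in gensim 4.x):

whole_paragraph = 'The Republic of Ghana is a country in West Africa. ...'

# The entire paragraph is treated as a single key, which can never be found.
print(whole_paragraph in model.vocab)   # False

# Individual tokens from it usually can be found.
print('Ghana' in model.vocab)           # True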

The most_similar() method can take a list of positive examples, but you must tokenize that string into the words that are likely to be inside the word-vector set. (That might involve breaking it on whitespace and punctuation, to match whatever Google did when preparing that word-vector set.)

In that case, most_similar() will average all the given words' vectors together, and return other words close to that average.
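
A minimal sketch of that flow, assuming para holds the paragraph text and model is the loaded KeyedVectors (the tokenizer below is a crude stand-in for whatever preprocessing Google applied):

import re

# Crude tokenization: pull out word-like runs, keeping internal apostrophes and hyphens.
tokens = re.findall(r"[\w'-]+", para)

# Keep only tokens the model knows, so most_similar() cannot raise a KeyError.
known = [t for t in tokens if t in model.vocab]

if known:
    # most_similar() averages the positive words' vectors and returns
    # the words nearest to that average.
    print(model.most_similar(positive=known, topn=5))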

Whether that truly captures the 'main concepts' of a text is unclear. While word vectors may be useful in identifying a text's concepts, that is not their primary or only function, and it is not automatic. You may want to filter the set of words down to those that are distinctive in other ways, for example less common overall, or influential in some corpus-dependent measure like TF-IDF.
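
One way to do that filtering, sketched with scikit-learn's TfidfVectorizer (an assumption on my part; any TF-IDF implementation would do, and paragraph_array is the question's list of paragraphs):

from sklearn.feature_extraction.text import TfidfVectorizer

# Keep case: the GoogleNews vectors are case-sensitive ('Ghana' vs. 'ghana').
vectorizer = TfidfVectorizer(lowercase=False)
tfidf = vectorizer.fit_transform(paragraph_array)   # one row per paragraph
terms = vectorizer.get_feature_names_out()          # get_feature_names() before scikit-learn 1.0

# Rank the first paragraph's terms by TF-IDF weight and keep the top ten.
row = tfidf[0].toarray().ravel()
top_terms = [terms[i] for i in row.argsort()[::-1][:10]]

# Query word2vec only with distinctive, in-vocabulary words.
keywords = [t for t in top_terms if t in model.vocab]
if keywords:
    print(model.most_similar(positive=keywords, topn=5))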

Answer 1 (score: 1)

I rewrote the entire code, adding checks at every level to avoid storing empty strings, going from paragraphs through sentences down to individual words.

Working version

import re

# Assumed setup (not shown in the answer): the containers the loops below
# fill in; `files`, `directory_path`, and the file counter `c` come from
# earlier code.
file_dictionary = {}
paragraph_dictionary = {}
sentence_dictionary = {}
word_dictionary = {}

for file_name in files:
    file_identifier = file_name
    file_array = file_dictionary[file_identifier] = []
    #file_array = file_dictionary[file_name.format((file_count))] = []
    file_path = directory_path + '/' + file_name

    with open(file_path) as f:
        # Level 2 intents: each file's main intent (one per file)
        first_line = f.readline()
        print()
        print("Level 2 Intent for ", c, " : ", first_line)

        # Level 3 intents: each paragraph's main intent (one per paragraph)

        paragraph_count = 0

        data = f.read()
        splat = data.split("\n")
        paragraph_array = []

        for number, paragraph in enumerate(splat, 1):
            paragraph_identifier = file_name + "_paragraph_" + str(paragraph_count)
            #print(paragraph_identifier)
            paragraph_array = paragraph_dictionary[paragraph_identifier] = []
            if paragraph:
                paragraph_array.append(paragraph)
            paragraph_count += 1
            if len(paragraph_array) > 0:
                file_array.append(paragraph_array)

            # Level 4 intents: each sentence's main intent (one per sentence)

            sentence_count = 0
            sentence_array = []

            for sentence in paragraph_array:
                for line in re.split(r"\.|\?|\!", sentence):
                    sentence_identifier = paragraph_identifier + "_sentence_" + str(sentence_count)
                    sentence_array = sentence_dictionary[sentence_identifier] = []
                    if line:
                        sentence_array.append(line)
                        sentence_count += 1

                    # Level 5 intents: each word with a certain level of prominence (one per prominent word)

                    word_count = 0
                    word_array = []

                    for words in sentence_array:
                        for word in re.split(r" ", words):
                            word_identifier = sentence_identifier + "_word_" + str(word_count)
                            word_array = word_dictionary[word_identifier] = []

                            if word:
                                word_array.append(word)
                                word_count += 1

Code to access the dictionary items

#Accessing any paragraph array can be done as follows
print(paragraph_dictionary['S08_set4_a5.txt.clean_paragraph_4'])

#Accessing any sentence corresponding to a paragraph
print(sentence_dictionary['S08_set4_a5.txt.clean_paragraph_4_sentence_1'])

#Accessing any word corresponding to a sentence
print(word_dictionary['S08_set4_a5.txt.clean_paragraph_4_sentence_1_word_3'])

Output

['Celsius was born in Uppsala in Sweden. He was professor of astronomy at Uppsala University from 1730 to 1744, but traveled from 1732 to 1735 visiting notable observatories in Germany, Italy and France.']
[' He was professor of astronomy at Uppsala University from 1730 to 1744, but traveled from 1732 to 1735 visiting notable observatories in Germany, Italy and France']
['of']
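
As a possible next step (a sketch, not part of the original answer; it assumes the dictionaries built above and the model from the question), one paragraph's in-vocabulary words can be gathered and fed to most_similar():

# Collect all words recorded for one paragraph, keeping only those the model knows.
prefix = 'S08_set4_a5.txt.clean_paragraph_4'
tokens = [w for key, arr in word_dictionary.items()
          if key.startswith(prefix)
          for w in arr if w in model.vocab]

if tokens:
    print(model.most_similar(positive=tokens, topn=3))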