python - Using yield incorrectly

Time: 2016-08-09 13:23:48

Tags: python parsing yield gensim

I am pretty sure I am using yield incorrectly:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import logging
from gensim import corpora, models, similarities
from collections import defaultdict
from pprint import pprint  # pretty-printer
from six import iteritems
import openpyxl
import string
from operator import itemgetter

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

#Creating a stoplist from file
with open('stop-word-list.txt') as f:
    stoplist = [x.strip('\n') for x in f.readlines()]

corpusFileName = 'content_sample_en.xlsx'
corpusSheetName = 'content_sample_en'

class MyCorpus(object):
    def __iter__(self):
        wb = openpyxl.load_workbook(corpusFileName)
        sheet = wb.get_sheet_by_name(corpusSheetName)
        for i in range(1, (sheet.max_row+1)/2):
            title = str(sheet.cell(row = i, column = 4).value.encode('utf-8'))
            summary = str(sheet.cell(row = i, column = 5).value.encode('utf-8'))
            content = str(sheet.cell(row = i, column = 10).value.encode('utf-8'))
            yield reBuildDoc("{} {} {}".format(title, summary, content))


def removeUnwantedPunctuations(doc):
    "change all (/, \, <, >) into ' ' "
    newDoc = ""
    for l in doc:
        if  l == "<" or l == ">" or l == "/" or l == "\\":
            newDoc += " "
        else:
            newDoc += l
    return newDoc

def reBuildDoc(doc):
    """
    :param doc:
    :return: document after being dissected to our needs.
    """
    doc = removeUnwantedPunctuations(doc).lower().translate(None, string.punctuation)
    newDoc = [word for word in doc.split() if word not in stoplist]
    return newDoc

corpus = MyCorpus()

tfidf = models.TfidfModel(corpus, normalize=True)

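In the example above you can see that I am trying to create a corpus from an xlsx file. For each row I read three cells (the title, summary, and content) and join them into one big string. My reBuildDoc() and removeUnwantedPunctuations() functions then adjust the text to my needs and finally return one big list of words (for example: [hello, piano, computer, etc... ]). At the end I yield the result, but I get the following error: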

Traceback (most recent call last):
  File "C:/Users/Eran/PycharmProjects/tfidf/docproc.py", line 101, in <module>
    tfidf = models.TfidfModel(corpus, normalize=True)
  File "C:\Anaconda2\lib\site-packages\gensim-0.13.1-py2.7-win-amd64.egg\gensim\models\tfidfmodel.py", line 96, in __init__
    self.initialize(corpus)
  File "C:\Anaconda2\lib\site-packages\gensim-0.13.1-py2.7-win-amd64.egg\gensim\models\tfidfmodel.py", line 119, in initialize
    for termid, _ in bow:
ValueError: too many values to unpack

I know the error comes from the yield line, because I previously had a different yield line that worked. It looked like this:

 yield [word for word in dictionary.doc2bow("{} {} {}".format(title, summary, content).lower().translate(None, string.punctuation).split()) if word not in stoplist]

It was a simple one-liner that was hard to extend with more functionality, so I changed it to what you see in the first example.

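2 Answers:

Answer 0: (score: 1)

The problem is not yield itself but what is being yielded. The error comes from the line for termid, _ in bow, which expects bow to contain a list of tuples, or any other objects with exactly 2 elements, such as (1,2), [1,2], "12", ... What it ends up receiving from MyCorpus is strings, which obviously have more than 2 elements, hence the error. To fix this, either use for termid in bow, or have MyCorpus yield a tuple of 2 objects. For example: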

yield reBuildDoc("{} {} {}".format(title, summary, content)), None
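
The unpacking behaviour behind the traceback can be reproduced without gensim. The following is a minimal sketch, assuming Python 3; collect_term_ids is a hypothetical helper that only mimics the loop shown in the traceback and is not gensim code.

```python
def collect_term_ids(bow):
    """Mimic the loop from the traceback: every item in bow must unpack into 2 values."""
    ids = []
    for termid, _ in bow:
        ids.append(termid)
    return ids


# A document given as a plain list of words, like the output of reBuildDoc.
doc_as_words = ["hello", "piano", "computer"]

try:
    collect_term_ids(doc_as_words)
    error_message = None
except ValueError as exc:
    # "hello" has five characters, so it cannot be unpacked into two names.
    error_message = str(exc)

print(error_message)

# A document given as (id, count) pairs unpacks cleanly.
doc_as_pairs = [(0, 1), (1, 1), (2, 1)]
print(collect_term_ids(doc_as_pairs))
```

Note that a two-character string such as "12" would unpack without an error, which is why the failure depends on the words in the document rather than on yield.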

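Answer 1: (score: 1)

Your problem seems to be that TfidfModel expects corpus to be a list of doc2bow outputs (themselves lists of 2-tuples). Your original working code correctly used doc2bow to convert plain strings into the corpus format; your new code passes in raw strings, not the "vectors" that TfidfModel expects.

Go back to using doc2bow, and read the tutorial on converting strings to vectors, which makes it clear that raw strings are nonsensical as input.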
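
To make the expected corpus format concrete, here is a minimal sketch, assuming Python 3, of a streaming corpus whose __iter__ yields lists of (token_id, count) pairs. TokenIds and BowCorpus are hypothetical names introduced for illustration. TokenIds is a standard-library stand-in for a gensim Dictionary and is not gensim's implementation; in particular it always assigns ids to unseen tokens, whereas gensim's Dictionary.doc2bow only does so when allow_update=True.

```python
from collections import Counter


class TokenIds(object):
    """Hypothetical stand-in for a gensim Dictionary: maps each token to an integer id."""

    def __init__(self):
        self.token2id = {}

    def doc2bow(self, tokens):
        """Return a sorted list of (token_id, count) pairs for one tokenized document."""
        for token in tokens:
            if token not in self.token2id:
                # Assign ids in order of first appearance.
                self.token2id[token] = len(self.token2id)
        counts = Counter(tokens)
        return sorted((self.token2id[token], n) for token, n in counts.items())


class BowCorpus(object):
    """Streams one bag-of-words vector per document instead of a raw word list."""

    def __init__(self, documents, dictionary):
        self.documents = documents
        self.dictionary = dictionary

    def __iter__(self):
        for tokens in self.documents:
            # Yield (id, count) pairs, not the words themselves.
            yield self.dictionary.doc2bow(tokens)


docs = [["hello", "piano", "hello"], ["piano", "computer"]]
corpus = BowCorpus(docs, TokenIds())
vectors = list(corpus)
print(vectors)
```

With gensim itself, the same shape would come from building a corpora.Dictionary over the tokenized documents and yielding dictionary.doc2bow(reBuildDoc(...)) inside MyCorpus.__iter__, so that every item seen by for termid, _ in bow is a 2-tuple.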