TF-IDF per line of strings rather than for the whole text document

Time: 2015-04-08 10:32:34

Tags: python scikit-learn tf-idf

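I have implemented TF-IDF in a simple program, but I want to compute TF-IDF for each line rather than for the whole file.

I used from sklearn.feature_extraction.text import TfidfVectorizer and looked at the following link as an example: tf-idf feature weights using sklearn.feature_extraction.text.TfidfVectorizer

Here is my code: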

from sklearn.feature_extraction.text import TfidfVectorizer

f1 = open('testDB.txt','r')
a = []  
for line in f1:
    a.append(line.strip())
f1.close()

f2 = open('testDB1.txt','r')
b = []
for line in f2:
    b.append(line.strip())
f2.close()

for i in range(min(len(a), len(b))):
    vectorizer = TfidfVectorizer(min_df=1)
    X = vectorizer.fit_transform(a, b)
    idf = vectorizer.idf_
    print dict(zip(vectorizer.get_feature_names(), idf))

The text files contain:

testDB.txt =
hello my name is tom
epping is based just outside of london football
epping football club is really bad

testDB1.txt = 
hello my name is tom
i live in chelmsford and i play football
chelmsford is a lovely city

Output:

{u'based': 1.6931471805599454, u'name': 1.6931471805599454, u'just': 1.6931471805599454, u'outside': 1.6931471805599454, u'club': 1.6931471805599454, u'of': 1.6931471805599454, u'is': 1.0, u'football': 1.2876820724517808, u'epping': 1.2876820724517808, u'bad': 1.6931471805599454, u'london': 1.6931471805599454, u'tom': 1.6931471805599454, u'my': 1.6931471805599454, u'hello': 1.6931471805599454, u'really': 1.6931471805599454}
{u'based': 1.6931471805599454, u'name': 1.6931471805599454, u'just': 1.6931471805599454, u'outside': 1.6931471805599454, u'club': 1.6931471805599454, u'of': 1.6931471805599454, u'is': 1.0, u'football': 1.2876820724517808, u'epping': 1.2876820724517808, u'bad': 1.6931471805599454, u'london': 1.6931471805599454, u'zain': 1.6931471805599454, u'my': 1.6931471805599454, u'hello': 1.6931471805599454, u'really': 1.6931471805599454}
{u'based': 1.6931471805599454, u'name': 1.6931471805599454, u'just': 1.6931471805599454, u'outside': 1.6931471805599454, u'club': 1.6931471805599454, u'of': 1.6931471805599454, u'is': 1.0, u'football': 1.2876820724517808, u'epping': 1.2876820724517808, u'bad': 1.6931471805599454, u'london': 1.6931471805599454, u'tom': 1.6931471805599454, u'my': 1.6931471805599454, u'hello': 1.6931471805599454, u'really': 1.6931471805599454}

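As you can see, it computes TF-IDF over the whole document for both text files rather than for each line. I tried to do it per line with a for loop, but I can't figure out what is wrong.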
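The values in the printed dictionaries can be checked by hand. The sketch below assumes scikit-learn's default smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and a simplified version of its default tokenizer (lowercased tokens of two or more word characters). The helper name `smoothed_idf` is hypothetical and the lines of testDB.txt are hard-coded. It is written for Python 3, unlike the Python 2 code above.

```python
import math
import re

# Lines of testDB.txt, hard-coded for illustration.
lines = [
    "hello my name is tom",
    "epping is based just outside of london football",
    "epping football club is really bad",
]

def smoothed_idf(docs):
    """Hypothetical helper: smoothed IDF, treating each string as one document."""
    n = len(docs)
    df = {}
    for doc in docs:
        # Count each term at most once per document (document frequency).
        for term in set(re.findall(r"\b\w\w+\b", doc.lower())):
            df[term] = df.get(term, 0) + 1
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

idf = smoothed_idf(lines)
print(idf["is"], idf["football"], idf["tom"])
```

These agree with the 1.0, 1.2876... and 1.6931... entries in the output, which suggests the dictionaries hold IDF values computed once over the three lines of testDB.txt, not TF-IDF weights per line. In fit_transform(a, b) the second argument is the y parameter, which the vectorizer ignores, and every pass of the loop refits on the same list a.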

Ideally, the output would print the TF-IDF for each line. For example:

u'hello': 0.23123, u'my': 0.3123123, u'name': 0.2313213, u'is': 0.3213132, u'tom': 0.3214344

It would be great if anyone could help me or offer any suggestions.

1 answer:

Answer 0: (score: 1)

Well... here you are passing a and b:

for i in range(min(len(a), len(b))):
    vectorizer = TfidfVectorizer(min_df=1)
    X = vectorizer.fit_transform(a, b)
    idf = vectorizer.idf_
    print dict(zip(vectorizer.get_feature_names(), idf))

where a and b are both arrays (lists of strings). What you can do instead is:

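for i in range(min(len(a), len(b))):
    vectorizer = TfidfVectorizer(min_df=1)
    # fit_transform expects an iterable of documents, so the two i-th
    # lines are wrapped in a list; a bare string raises a ValueError, and
    # a second positional argument would be taken as y and ignored.
    X = vectorizer.fit_transform([a[i], b[i]])
    idf = vectorizer.idf_
    print dict(zip(vectorizer.get_feature_names(), idf))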

But as mentioned in the comments, it is not clear what you are trying to do...
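If the goal is one set of TF-IDF weights per line, one possible reading is to treat every line as its own document, fit the vectorizer once on the whole list, and then read the weights row by row from the matrix that fit_transform returns (idf_ only holds the IDF part). The following is a minimal sketch under that assumption, written for Python 3 with the lines of testDB.txt hard-coded; the `per_line` list is introduced only for illustration, and the feature-name lookup falls back to the older method name for older scikit-learn releases.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Lines of testDB.txt, hard-coded; each line is treated as one document.
lines = [
    "hello my name is tom",
    "epping is based just outside of london football",
    "epping football club is really bad",
]

vectorizer = TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(lines)  # one row per line, one column per term

# Newer releases expose get_feature_names_out; older ones get_feature_names.
if hasattr(vectorizer, "get_feature_names_out"):
    terms = list(vectorizer.get_feature_names_out())
else:
    terms = list(vectorizer.get_feature_names())

per_line = []
# Converting to a dense array is fine for a handful of short lines.
for line, row in zip(lines, X.toarray()):
    # Keep only the terms that actually occur in this line.
    weights = {term: float(w) for term, w in zip(terms, row) if w > 0}
    per_line.append(weights)
    print(line, "->", weights)
```

With the default settings each row is L2-normalised, so the weights are comparable across lines, and a word that appears in every line (such as "is") gets a lower weight than a word unique to one line. Whether the lines of testDB1.txt should be part of the same fit is a separate decision that the question leaves open.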
