What is the simplest way to get tf-idf with a pandas DataFrame?

Time: 2016-06-02 13:28:34

Tags: python pandas scikit-learn tf-idf gensim

I want to calculate tf-idf from the documents below. I'm using Python and pandas.

import pandas as pd
df = pd.DataFrame({'docId': [1,2,3], 
               'sent': ['This is the first sentence','This is the second sentence', 'This is the third sentence']})

First, I figured I would need the word count for each row, so I wrote a simple function:

def word_count(sent):
    # map each whitespace-separated word to the number of times it occurs
    word2cnt = dict()
    for word in sent.split():
        if word in word2cnt: word2cnt[word] += 1
        else: word2cnt[word] = 1
    return word2cnt

Then I applied it to each row.

df['word_count'] = df['sent'].apply(word_count)

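But now I'm lost. I know that calculating tf-idf is easy with Graphlab, but I want to stick with an open-source option. Both sklearn and gensim look overwhelming. What is the simplest way to get tf-idf?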

3 answers:

Answer 0 (score: 24)

The scikit-learn implementation is really easy:

from sklearn.feature_extraction.text import TfidfVectorizer
v = TfidfVectorizer()
x = v.fit_transform(df['sent'])

There are plenty of parameters you can specify. See the TfidfVectorizer documentation for the full list.
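
As a rough illustration of what a couple of those parameters do, here is a minimal sketch on the question's data. The choice of ngram_range and norm is mine, not the answer author's; they are just two options that are easy to verify by eye.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Same toy data as in the question.
df = pd.DataFrame({'docId': [1, 2, 3],
                   'sent': ['This is the first sentence',
                            'This is the second sentence',
                            'This is the third sentence']})

# ngram_range=(1, 2) adds two-word phrases to the vocabulary.
# norm=None keeps the raw tf * idf products instead of L2-normalising each row.
# (Both settings are illustrative assumptions, not recommendations.)
v = TfidfVectorizer(ngram_range=(1, 2), norm=None)
x = v.fit_transform(df['sent'])

print(x.shape)                # (documents, unigrams + bigrams)
print(sorted(v.vocabulary_))  # all terms the vectorizer learned
```

With these settings the vocabulary holds the seven single words plus the eight two-word phrases that occur in the three sentences.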

The output of fit_transform is a sparse matrix. If you want to visualize it, you can call x.toarray():

In [44]: x.toarray()
Out[44]: 
array([[ 0.64612892,  0.38161415,  0.        ,  0.38161415,  0.38161415,
         0.        ,  0.38161415],
       [ 0.        ,  0.38161415,  0.64612892,  0.38161415,  0.38161415,
         0.        ,  0.38161415],
       [ 0.        ,  0.38161415,  0.        ,  0.38161415,  0.38161415,
         0.64612892,  0.38161415]])

Answer 1 (score: 4)

A simple solution is to use texthero:

import texthero as hero
df['tfidf'] = hero.tfidf(df['sent'])
In [5]: df.head()
Out[5]:
   docId                         sent                                              tfidf
0      1   This is the first sentence  [0.3816141458138271, 0.6461289150464732, 0.381...
1      2  This is the second sentence  [0.3816141458138271, 0.0, 0.3816141458138271, ...
2      3   This is the third sentence  [0.3816141458138271, 0.0, 0.3816141458138271, ...

Answer 2 (score: 0)

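I found a slightly different approach using sklearn's CountVectorizer.
- count vectorizer: Ultraviolet Analysis word frequency
- preprocessing/cleaning text: Usman Malik scraping tweets preprocessing
I won't cover preprocessing in this answer. Basically, you import CountVectorizer and fit your data to a CountVectorizer object. That gives you access to the .vocabulary_.items() attribute, which returns the vocabulary of your dataset: the unique words present, each mapped to its column index in the count matrix, subject to any limiting parameters you passed to CountVectorizer (such as max_features).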

You then use TfidfTransformer in a similar way to generate the tf-idf weights for those terms.
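
As a quick sanity check on this two-step route, the sketch below compares it with TfidfVectorizer on the sentences from the question. It assumes default settings for all three classes; with defaults, the two routes should produce the same matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)

docs = ['This is the first sentence',
        'This is the second sentence',
        'This is the third sentence']

# Two steps: raw token counts first, then tf-idf weighting of those counts.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One step: TfidfVectorizer does the counting and the weighting together.
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

The two-step form is mainly useful when you also want the raw counts, as in the snippet that follows.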

I'm coding in a Jupyter notebook file, using pandas and the PyCharm IDE.

Here is a code snippet:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import numpy as np
import pandas as pd
#https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
countVec = CountVectorizer(max_features= 5000, stop_words='english', min_df=.01, max_df=.90)

#%%
#use CountVectorizer.fit(raw_documents[, y]) to learn the vocabulary dictionary of all tokens in the raw documents
#the raw documents in this case are tweetsFrameWords["Text"] (processed text)
countVec.fit(tweetsFrameWords["Text"])
#useful debug, get an idea of the item list you generated
list(countVec.vocabulary_.items())

#%%
#convert to bag of words
#transform returns a sparse document-term matrix of token counts (one row per document, one column per term)
countVec_count = countVec.transform(tweetsFrameWords["Text"])

#%%
#make array from number of occurrences
occ = np.asarray(countVec_count.sum(axis=0)).ravel().tolist()

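#make a new data frame with columns term and occurrences, meaning word and number of occurrences
#note: get_feature_names() was removed in scikit-learn 1.2; on releases older than 1.0 use get_feature_names() instead
bowListFrame = pd.DataFrame({'term': countVec.get_feature_names_out(), 'occurrences': occ})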
print(bowListFrame)

#sort by number of word occurrences, most to least; sort_values defaults to ascending=True if the flag is omitted
bowListFrame.sort_values(by='occurrences', ascending=False).head(60)

#%%
#now, convert to a more useful ranking system, tf-idf weights
#TfidfTransformer: scale raw word counts to a weighted ranking using the
#https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
tweetTransformer = TfidfTransformer()

#initial fit representation using transformer object
tweetWeights = tweetTransformer.fit_transform(countVec_count)

#follow similar process to making new data frame with word occurrences, but with term weights
tweetWeightsFin = np.asarray(tweetWeights.mean(axis=0)).ravel().tolist()

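#now that we've computed the tf-idf weights, make a dataframe with weights and names
#note: get_feature_names() was removed in scikit-learn 1.2; on releases older than 1.0 use get_feature_names() instead
tweetWeightFrame = pd.DataFrame({'term': countVec.get_feature_names_out(), 'weight': tweetWeightsFin})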
print(tweetWeightFrame)
tweetWeightFrame.sort_values(by='weight', ascending=False).head(20)
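
The snippet above depends on tweetsFrameWords, which is not defined in this answer. As a rough, self-contained sketch, the same steps can be run on the three sentences from the question. It uses default CountVectorizer settings rather than the limits above (with only three documents, max_df=.90 would drop every word shared by all of them), and the variable names are my own.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

df = pd.DataFrame({'docId': [1, 2, 3],
                   'sent': ['This is the first sentence',
                            'This is the second sentence',
                            'This is the third sentence']})

# Step 1: bag of words (sparse matrix of raw counts).
count_vec = CountVectorizer()
counts = count_vec.fit_transform(df['sent'])
occurrences = np.asarray(counts.sum(axis=0)).ravel()

# Step 2: tf-idf weights, averaged over all documents for each term.
weights = TfidfTransformer().fit_transform(counts)
mean_weights = np.asarray(weights.mean(axis=0)).ravel()

# vocabulary_ maps each term to its column index, so sorting the terms
# by that index recovers the column order of the matrices.
terms = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)

summary = pd.DataFrame({'term': terms,
                        'occurrences': occurrences,
                        'weight': mean_weights})
print(summary.sort_values(by='weight', ascending=False))
```

Note that averaging over documents ranks words shared by every sentence above the distinctive ones; for per-document weights, keep the full matrix instead of taking the mean.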