潜在语义分析结果

时间:2018-09-06 07:29:09

标签: python scikit-learn svd sklearn-pandas lsa

我正在跟踪LSA的教程,并将示例切换到其他字符串列表,但是我不确定代码是否按预期工作。

当我使用本教程中给出的example-input时,它会产生明智的答案。但是,当我使用自己的输入时,会得到非常奇怪的结果。

为了进行比较,这是example-input的结果:

enter image description here

当我使用自己的示例时,就是这样的结果。同样值得注意的是,我似乎并没有获得一致的结果:

enter image description here

enter image description here

在弄清楚为什么我得到这些结果的任何帮助将不胜感激:)

代码如下:

import sklearn
# Import all of the scikit learn stuff
from __future__ import print_function
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import Normalizer
from sklearn import metrics
from sklearn.cluster import KMeans, MiniBatchKMeans
import pandas as pd
import warnings
# Suppress warnings from pandas library
warnings.filterwarnings("ignore", category=DeprecationWarning,
module="pandas", lineno=570)
import numpy


example = ["Coffee brewed by expressing or forcing a small amount of 
nearly boiling water under pressure through finely ground coffee 
beans.", 
"An espresso-based coffee drink consisting of espresso with 
microfoam (steamed milk with small, fine bubbles with a glossy or 
velvety consistency)", 
"American fast-food dish, consisting of french fries covered in 
cheese with the possible addition of various other toppings", 
"Pounded and breaded chicken is topped with sweet honey, salty 
dill pickles, and vinegar-y iceberg slaw, then served upon crispy 
challah toast.", 
"A layered, flaky texture, similar to a puff pastry."]

''''
example = ["Machine learning is super fun",
"Python is super, super cool",
"Statistics is cool, too",
"Data science is fun",
"Python is great for machine learning",
"I like football",
"Football is great to watch"]
'''

vectorizer = CountVectorizer(min_df = 1, stop_words = 'english')
dtm = vectorizer.fit_transform(example)
pd.DataFrame(dtm.toarray(),index=example,columns=vectorizer.get_feature_names()).head(10)

# Get words that correspond to each column
vectorizer.get_feature_names()

# Fit LSA. Use algorithm = “randomized” for large datasets
lsa = TruncatedSVD(2, algorithm = 'arpack')
dtm_lsa = lsa.fit_transform(dtm.astype(float))
dtm_lsa = Normalizer(copy=False).fit_transform(dtm_lsa)

pd.DataFrame(lsa.components_,index = ["component_1","component_2"],columns = vectorizer.get_feature_names())

pd.DataFrame(dtm_lsa, index = example, columns = "component_1","component_2"])

xs = [w[0] for w in dtm_lsa]
ys = [w[1] for w in dtm_lsa]
xs, ys

# Plot scatter plot of points
%pylab inline
import matplotlib.pyplot as plt
figure()
plt.scatter(xs,ys)
xlabel('First principal component')
ylabel('Second principal component')
title('Plot of points against LSA principal components')
show()

#Plot scatter plot of points with vectors
%pylab inline
import matplotlib.pyplot as plt
plt.figure()
ax = plt.gca()
ax.quiver(0,0,xs,ys,angles='xy',scale_units='xy',scale=1, linewidth = .01)
ax.set_xlim([-1,1])
ax.set_ylim([-1,1])
xlabel('First principal component')
ylabel('Second principal component')
title('Plot of points against LSA principal components')
plt.draw()
plt.show()

# Compute document similarity using LSA components
similarity = np.asarray(numpy.asmatrix(dtm_lsa) * 
numpy.asmatrix(dtm_lsa).T)
pd.DataFrame(similarity,index=example, columns=example).head(10)

1 个答案:

答案 0 :(得分:1)

该问题似乎是由于您正在使用的少量示例以及标准化步骤的结合。因为TrucatedSVD将您的计数向量映射到许多非常小的数字和一个相对较大的数字,所以当您对它们进行归一化时,您会看到一些奇怪的行为。您可以通过查看数据的散点图来看到这一点。

dtm_lsa = lsa.fit_transform(dtm.astype(float))
fig, ax = plt.subplots()
for i in range(dtm_lsa.shape[0]):
    ax.scatter(dtm_lsa[i, 0], dtm_lsa[i, 1], label=f'{i+1}')
ax.legend()

not normalised

我要说的是,这幅图代表了您的数据,因为这两个咖啡示例不在右侧(很难说出很多例子,而其他例子很少)。但是,当您对数据进行规范化

dtm_lsa = lsa.fit_transform(dtm.astype(float))
dtm_lsa = Normalizer(copy=False).fit_transform(dtm_lsa)
fig, ax = plt.subplots()
for i in range(dtm_lsa.shape[0]):
    ax.scatter(dtm_lsa[i, 0], dtm_lsa[i, 1], label=f'{i+1}')
ax.legend()

normalised

这使彼此之间的某些点彼此重叠,这将为您提供1的相似之处。几乎可以肯定,随着差异的增加,问题就会消失,也就是说,您添加的新样本越多。