Creating an n-gram word cloud with Python

Date: 2017-07-19 12:31:51

Tags: python scikit-learn n-gram word-cloud

I am trying to generate a word cloud from bi-grams. I am able to extract the top 30 discriminative words for each category, but when plotting, the two words of each bigram are not displayed together; my word cloud image still looks like a cloud of single words. I used the following script with the scikit-learn package.

import numpy
import matplotlib.pyplot as plt
from wordcloud import WordCloud

def create_wordcloud(pipeline):
    """
    Create a word cloud with the top 30 discriminative words for each category.
    """
    class_labels = numpy.array(['Arts', 'Music', 'News', 'Politics',
                                'Science', 'Sports', 'Technology'])

    feature_names = pipeline.named_steps['vectorizer'].get_feature_names()
    word_text = []

    for i, class_label in enumerate(class_labels):
        # Indices of the 30 features with the largest coefficients for this class
        top30 = numpy.argsort(pipeline.named_steps['clf'].coef_[i])[-30:]

        print("%s: %s" % (class_label, " ".join(feature_names[j] + "," for j in top30)))

        for j in top30:
            word_text.append(feature_names[j])

        wordcloud1 = WordCloud(width=800, height=500, margin=10,
                               random_state=3, collocations=True).generate(' '.join(word_text))

        # Save the word cloud as a .png file
        # Image files are saved to the folder "classification_model"
        wordcloud1.to_file(class_label + "_wordcloud.png")

        # Plot the word cloud on the console
        plt.figure(figsize=(15, 8))
        plt.imshow(wordcloud1, interpolation="bilinear")
        plt.axis("off")
        plt.show()
        word_text = []

Here is my pipeline code:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# SVM over a bigram TfidfVectorizer (stop_words1 is defined elsewhere)
pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer(max_features=25000, ngram_range=(2, 2),
                                   sublinear_tf=True, max_df=0.95, min_df=2,
                                   stop_words=stop_words1)),
    ('clf', LinearSVC(loss='squared_hinge', penalty='l2', dual=False, tol=1e-3))
])
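For context: with ngram_range=(2, 2), every feature name the vectorizer returns is already a space-separated bigram string, which is exactly what the word cloud later splits apart. Below is a minimal sketch demonstrating this on a purely hypothetical toy corpus (note that newer scikit-learn versions replace get_feature_names() with get_feature_names_out()):

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, purely for illustration
docs = ["television personality hosts reality television show"]

vec = TfidfVectorizer(ngram_range=(2, 2))
vec.fit(docs)

# Every feature is a single string containing a space,
# e.g. 'reality television', 'television personality', ...
print(vec.get_feature_names())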

These are some of the features I obtained for the category "Arts":
Arts: cosmetics businesspeople, television personality, reality television, television presenters, actors london, film producers, actresses television, indian film, set index, actresses actresses, television actors, century actors, births actors, television series, century actresses, actors television, stand comedian, television personalities, television actresses, comedian actor, stand comedians, film actresses, film actors, film directors

1 Answer:

Answer 0 (score: 1)

I think you need to join the n-grams in feature_names with some symbol other than a space; for example, I would suggest an underscore. Right now, I believe this part splits your n-grams back into separate words:

' '.join(word_text)

So I think you have to replace the spaces with underscores in the line below:

word_text.append(feature_names[j])

changing it to:

word_text.append(feature_names[j].replace(' ', '_'))
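Putting the fix together, here is a minimal, self-contained sketch of the idea. The bigram list is hypothetical, and collocations=False is set here (my addition, not part of the original code) so that WordCloud does not pair up the already-joined tokens on its own:

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Hypothetical bigram features, as a (2, 2) TfidfVectorizer would produce
bigrams = ['television personality', 'reality television',
           'film producers', 'indian film', 'television series']

# Join each bigram with an underscore so WordCloud treats it as one token;
# the default token pattern is based on \w, which matches underscores,
# so the joined bigrams survive tokenization intact.
word_text = [b.replace(' ', '_') for b in bigrams]

wordcloud1 = WordCloud(width=800, height=500, margin=10, random_state=3,
                       collocations=False).generate(' '.join(word_text))

plt.figure(figsize=(15, 8))
plt.imshow(wordcloud1, interpolation="bilinear")
plt.axis("off")
plt.show()

Each bigram then renders as a single token such as television_personality. If the underscores are visually undesirable, another option is WordCloud's generate_from_frequencies(), which takes a dict mapping phrases to weights and draws the keys verbatim, spaces included.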