Best way to extract keywords from an NLP input sentence

Date: 2014-12-10 16:22:05

Tags: python machine-learning nlp

I am working on a project where I need to extract important keywords from a sentence. I have been using a rule-based system based on POS tags, but I keep running into ambiguous terms that it cannot parse. Is there a machine learning classifier I could use that extracts the relevant keywords based on a training set of different sentences?
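
For reference, here is a minimal sketch of the kind of POS-tag rules I have been using (NLTK chunking; the grammar below is just an illustrative example, not my actual rule set):

import nltk

# assumes the required NLTK resources have been downloaded, e.g.:
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

def extract_candidates(sentence):
    """Extract noun-phrase keyword candidates with a simple POS-tag grammar."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    # illustrative rule: optional adjectives followed by one or more nouns
    chunker = nltk.RegexpParser('NP: {<JJ>*<NN.*>+}')
    tree = chunker.parse(tagged)
    return [' '.join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees()
            if subtree.label() == 'NP']

print(extract_candidates('The system solves linear Diophantine equations.'))
# e.g. ['system', 'linear Diophantine equations'] (tags can vary by tagger model)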

5 Answers:

Answer 0 (score: 5)

Take a look at RAKE: it's a pretty nice little Python library.

Edit: I also found a tutorial on how to get started with it.
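
To give a flavor of usage, here is a minimal sketch with the rake-nltk packaging of the algorithm (pip install rake-nltk); the exact API differs between the various RAKE ports:

from rake_nltk import Rake  # needs NLTK's stopwords and punkt resources downloaded

r = Rake()  # defaults to NLTK's English stopword list
r.extract_keywords_from_text(
    'Criteria of compatibility of a system of linear Diophantine equations, '
    'strict inequations, and nonstrict inequations are considered.'
)
print(r.get_ranked_phrases_with_scores())  # [(score, phrase), ...], best first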

Answer 1 (score: 1)

You can also try this multilingual RAKE implementation, which works with any language. It can be installed with pip install multi-rake:

from multi_rake import Rake

text_en = (
    'Compatibility of systems of linear constraints over the set of '
    'natural numbers. Criteria of compatibility of a system of linear '
    'Diophantine equations, strict inequations, and nonstrict inequations '
    'are considered. Upper bounds for components of a minimal set of '
    'solutions and algorithms of construction of minimal generating sets '
    'of solutions for all types of systems are given. These criteria and '
    'the corresponding algorithms for constructing a minimal supporting '
    'set of solutions can be used in solving all the considered types of '
    'systems and systems of mixed types.'
)

rake = Rake()

keywords = rake.apply(text_en)

print(keywords[:10])

# [('minimal generating sets', 8.666666666666666),
#  ('linear diophantine equations', 8.5),
#  ('minimal supporting set', 7.666666666666666),
#  ('minimal set', 4.666666666666666),
#  ('linear constraints', 4.5),
#  ('natural numbers', 4.0),
#  ('strict inequations', 4.0),
#  ('nonstrict inequations', 4.0),
#  ('upper bounds', 4.0),
#  ('mixed types', 3.666666666666667)]

Answer 2 (score: 1)

If it is important to extract keywords from an entire corpus, this snippet may help by extracting words based on their idf values. Here we extract keywords from the alt.atheism category of the 20 newsgroups dataset. It may not be what you are after :)

## THE CODE IS SELF EXPLANATORY AND COMMENTED 

## loading some dependencies
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
import nltk
nltk.download('wordnet')
from sklearn.feature_extraction.text import TfidfVectorizer

## our dataset
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train' , shuffle = True , categories =  [ "alt.atheism" ])
## defining a stemmer to use
stemmer = SnowballStemmer("english")

## this dictionary will come in handy later on ...
stemmed_to_original = {}

## Basic Preprocessings Functions ##
def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    result=[]
    for token in gensim.utils.simple_preprocess(text) :

        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            stemmed_token = lemmatize_stemming(token)
            stemmed_to_original[stemmed_token] = token
            result.append(stemmed_token)
            
    return result


news_data = [ preprocess(i) for i in newsgroups_train.data  ]
## notice, min_df and max_df parameters are really important in getting the most important keywords out of your corpus
vectorizer = TfidfVectorizer(stop_words=gensim.parsing.preprocessing.STOPWORDS,
                             min_df=20, max_df=0.72,
                             tokenizer=lambda x: x, lowercase=False)
vectorizer.fit_transform( news_data  )

## get the idf values of all tokens used by the vectorizer and sort them in ascending order
## In most text corpora, once stopwords and the (really frequent / really rare) words
## have been filtered out by the vectorizer parameters above, this kind of sorting
## surfaces the important keywords

## make a dictionary of words and their corresponding idf weights
## (use get_feature_names() instead on scikit-learn < 1.0)
word_to_idf = {i: j for i, j in zip(vectorizer.get_feature_names_out(), vectorizer.idf_)}
## sort the dictionary in ascending order of idf weight
word_to_idf = sorted(word_to_idf.items(), key=lambda x: x[1], reverse=False)
print(word_to_idf)

Print the top N results:

for k, v in word_to_idf[:5]:
    print('{} ---> {} ----> {}'.format(k, stemmed_to_original[k], v))

Let's look at the top results.

If we had been more careful about stripping the news headers and salutations, we could have avoided words like post, article, and host. But never mind; a sketch of one way to strip them follows the output below.

post ---> posting ----> 1.4392949726265691
articl ---> article ----> 1.4754236967150747
host ---> host ----> 1.7035965964342865
nntp ---> nntp ----> 1.7248288165400607
think ---> think ----> 1.8287597393882924
peopl ---> people ----> 1.887600239411226
know ---> know ----> 1.994083719813676
univers ---> universe ----> 1.994083719813676
atheist ---> atheists ----> 2.011081296182247
like ---> like ----> 2.016811970891232
thing ---> things ----> 2.094462905121298
time ---> time ----> 2.199133527685187
mean ---> means ----> 2.2271073797275927
believ ---> believe ----> 2.2705924916673315
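
As an aside on the header problem: fetch_20newsgroups can strip the headers, footers, and quoted replies for you via its documented remove parameter, which should push words like post, article, and host out of the top results:

from sklearn.datasets import fetch_20newsgroups

# the remove argument strips message metadata before the text reaches the vectorizer
newsgroups_train = fetch_20newsgroups(
    subset='train',
    shuffle=True,
    categories=['alt.atheism'],
    remove=('headers', 'footers', 'quotes'),
)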

Answer 3 (score: 0)

We can also use gensim to extract keywords from a given text:

from gensim.summarization import keywords  # note: removed in gensim 4.x, so this requires gensim < 4.0


text_en = (
    'Compatibility of systems of linear constraints over the set of '
    'natural numbers. Criteria of compatibility of a system of linear '
    'Diophantine equations, strict inequations, and nonstrict inequations '
    'are considered. Upper bounds for components of a minimal set of '
    'solutions and algorithms of construction of minimal generating sets '
    'of solutions for all types of systems are given. These criteria and '
    'the corresponding algorithms for constructing a minimal supporting '
    'set of solutions can be used in solving all the considered types of '
    'systems and systems of mixed types.')

print(keywords(text_en, words=10, scores=True, lemmatize=True))

The output will be:

[('numbers', 0.31009020729627595),
('types', 0.2612797117033426),
('upper', 0.26127971170334247),
('considered', 0.2539581373644024),
('minimal', 0.25089449987505835),
('sets', 0.2508944998750583),
('inequations', 0.25051980840329924),
('linear', 0.2505198084032991),
('strict', 0.23778663563992564),
('diophantine', 0.23778663563992555)]

Answer 4 (score: 0)

Try TfidfVectorizer from sklearn:

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(list(vectorizer.get_feature_names_out()))  # get_feature_names() on scikit-learn < 1.0

This gives you the keywords in the corpus. You can also get the keywords' scores, fetch the top n keywords, and so on; a sketch of one way to do that follows the output below.


Output:

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

In the output above, stop words such as 'is' and 'the' appear because the corpus is small. With a large corpus, you would get the most important keywords ranked by priority. Check the TfidfVectorizer documentation for further details.
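
As mentioned above, one way to get scored, top-n keywords out of the fitted vectorizer: a sketch that ranks terms by their summed tf-idf weight across the corpus (other aggregations, such as the mean, are equally reasonable):

import numpy as np

# rank terms by their total tf-idf weight across all documents
scores = np.asarray(X.sum(axis=0)).ravel()
terms = vectorizer.get_feature_names_out()
for idx in scores.argsort()[::-1][:5]:
    print(terms[idx], round(float(scores[idx]), 3))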