Finding the total count of "stop words" in a file

Asked: 2015-10-09 14:59:33

Tags: python for-loop readfile readlines

I am trying to create a Python program that reads two text files: one containing an article and the other containing a list of "stop words" (one word per line). I want to determine how many of these "stop words" appear in the article file, i.e. the cumulative sum of the frequencies of every "stop word" that occurs in the article.

I tried to create nested for loops: the outer loop iterates over each line of the article file, and for each line an inner for loop iterates over the list of "stop words", checking whether the current line contains each stop word and, if so, how often. Finally, I add the frequency of the word in the current line to an accumulator that keeps a running total of the stop words found in the article file.

Currently, when I run it, it says there are 0 stop words in the file, which is incorrect.

import string

def main():

    analyzed_file  = open('LearnToCode_LearnToThink.txt', 'r')
    stop_word_file = open('stopwords.txt', 'r')

    stop_word_accumulator = 0

    for analyzed_line in analyzed_file.readlines():

        formatted_line = remove_punctuation(analyzed_line)

        for stop_word_line in stop_word_file.readlines():
            stop_formatted_line = create_stopword_list(stop_word_line)
            if stop_formatted_line in formatted_line:
                stop_word_frequency = formatted_line.count(stop_formatted_line)
                stop_word_accumulator += stop_word_frequency

        print("there are ",stop_word_accumulator, " words")


        stop_word_file.close()
        analyzed_file.close()


def create_stopword_list(stop_word_text):

    clean_words = [] # create an empty list
    stop_word_text = stop_word_text.rstrip() # remove trailing whitespace characters
    new_words = stop_word_text.split() # create a list of words from the text
    for word in new_words: # normalize and add to list
        clean_words.append(word.strip(string.punctuation).lower())
    return clean_words



def remove_punctuation(text):
    clean_words = [] # create an empty list
    text = text.rstrip() # remove trailing whitespace characters
    words = text.split() # create a list of words from the text
    for word in words: # normalize and add to list
        clean_words.append(word.strip(string.punctuation).lower())
    return clean_words


main()

2 Answers:

Answer 0 (score: 0)

You have a number of problems:

  1. readlines will only work once: after that, you're at the end of the file and it will return an empty list.
  2. Recreating the stop-word list for every line of the other file is very inefficient anyway.
  3. one_list in another_list and one_list.count(another_list) don't do what you seem to think they do (see the short demo after this list).
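
A quick illustration of point 3, using two hypothetical word lists: in and count operate on whole elements, so testing one list against another never matches the individual words.

    words = ['the', 'cat', 'sat', 'on', 'the', 'mat']
    stops = ['the', 'on']

    print(stops in words)        # False: checks whether the whole list stops is an element of words
    print(words.count(stops))    # 0: counts occurrences of the whole list, never of a word
    print(sum(words.count(s) for s in stops))  # 3: compare word by word instead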
Instead, try something like:

    stop_words = get_stop_word_list(stop_words_file_name)
    
    stop_word_count = 0
    
    with open(other_file_name) as other_file:  # note 'context manager' file handling
        for line in other_file:
            cleaned_line = clean(line)
            for stop_word in stop_words:
                if stop_word in cleaned_line:
                    stop_word_count += cleaned_line.count(stop_word)
    

    There are more efficient approaches (using e.g. set and collections.Counter), but this should get you started.
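
    For instance, a minimal sketch of the Counter variant, assuming the same hypothetical get_stop_word_list and clean helpers as above:

    from collections import Counter

    stop_words = set(get_stop_word_list(stop_words_file_name))  # set membership tests are O(1)

    word_counts = Counter()
    with open(other_file_name) as other_file:
        for line in other_file:
            word_counts.update(clean(line))  # clean(line) is assumed to return a list of normalized words

    # total occurrences of all stop words in the file
    stop_word_count = sum(count for word, count in word_counts.items() if word in stop_words)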

Answer 1 (score: 0)

You can use NLTK to check for stop words and count them:

import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

x = r"['Nel mezzo del cammin di nostra vita mi ritrovai per una selva oscura, ché la 
diritta via era smarrita.Ahi quanto a dir qual era è cosa dura esta selva selvaggia 
e aspra e forte che nel pensier rinova la paura! Tant' è amara che poco è più morte; 
ma per trattar del ben ch'i' vi trovai, dirò de l altre cose chi v ho scorte.']"

word_tokens = word_tokenize(x)  # split the text into tokens

stopWords = set(stopwords.words('italian'))  # the sample text is Italian, so assume the Italian list
stopwords_x = [w for w in word_tokens if w in stopWords]
len(stopwords_x) / len(word_tokens) * 100  # percentage of tokens that are stop words
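
Note that the last line gives the percentage of tokens that are stop words; if you want the cumulative total the question asks for, a small addition using the same variables:

from collections import Counter

stop_word_total = len(stopwords_x)      # cumulative count of stop-word tokens
stop_word_freqs = Counter(stopwords_x)  # frequency of each individual stop word
print(stop_word_total, stop_word_freqs.most_common(5))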