Reading words from a txt file

Time: 2017-11-03 13:54:54

Tags: python dictionary

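I have written code that reads the words of a txt file, in my case "elquijote.txt", and then uses a dictionary {key: value} to show each word that appears and the number of times it occurs.

For example, for the file "test1.txt" with the following words: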

hello hello hello good bye bye 


 hello 3
 good  1
 bye   2


If we run the command "python readingwords.py text.txt 2" in the shell, it displays the words contained in the file "test1.txt" that occur more often than the number we entered, in this case 2:


hello 3


My code works correctly; the problem is that with large files such as "elquijote.txt" it takes a very long time to finish the whole process.




import sys

def contar(aux):
  # Count how many times each lowercased word occurs
  counts = {}
  for palabra in aux:
    palabra = palabra.lower()
    if palabra not in counts:
      counts[palabra] = 0
    counts[palabra] += 1
  return counts

def main():

  characters = '!?¿-.:;-,><=*»¡'
  aux = []
  counts = {}

  # Read the whole file and drop the symbol characters
  with open(sys.argv[1],'r') as f:
    aux = ''.join(c for c in f.read() if c not in characters)
    aux = aux.split()

  if (len(sys.argv)>3):
    with open(sys.argv[3], 'r') as f:
      remove = "".join(c for c in f.read())
      remove = remove.split()

    # Remove the forbidden words from the list
    aux = [word for word in aux if word not in remove]

  counts = contar(aux)

  # Print only the words that occur more than sys.argv[2] times
  for word, count in counts.items():
    if count > int(sys.argv[2]):
      print(word, count)

if __name__ == '__main__':
  main()

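The main function builds the list "aux" of words with the symbol characters stripped out, and then removes from that same list the "forbidden" words loaded from another .txt file.


You can test my code online: https://repl.it/Nf3S/54 Thanks.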

3 answers:

Answer 0: (score: 2)


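  • Use collections.Counter() to count the items in contar()
  • Use str.translate() to remove the unwanted characters
  • Pop the items in the ignore-word list after counting, instead of removing them from the original text.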


# -*- coding: utf-8 -*-
import sys
import collections

def contar(aux):
    return collections.Counter(aux)

def main():

  characters = '!?¿-.:;-,><=*»¡'
  aux = []
  counts = {}

  with open(sys.argv[1],'r') as f:
    # Python 3: build a translation table that deletes every unwanted character
    text = f.read().lower().translate(str.maketrans('', '', characters))
    aux = text.split()

  if (len(sys.argv)>3):
    with open(sys.argv[3], 'r') as f:
      remove = set(f.read().strip().split())
  else:
    remove = []

  counts = contar(aux)
  # Drop the ignored words after counting
  for r in remove:
    counts.pop(r, None)

  for word, count in counts.items():
    if count > int(sys.argv[2]):
      print(word, count)

if __name__ == '__main__':
  main()
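
As a quick check of the three points above, here is a minimal sketch that applies them to the sample line from the question. The helper name `count_words` and its parameters are illustrative assumptions and not part of the original script. It targets Python 3, where characters are deleted through a table built with `str.maketrans('', '', chars)`.

```python
from collections import Counter

def count_words(text, chars='', ignore=()):
    """Count lowercased words in text (hypothetical helper).

    chars  -- characters to delete before splitting
    ignore -- words to drop from the result after counting
    """
    table = str.maketrans('', '', chars)  # maps every char in chars to None
    counts = Counter(text.lower().translate(table).split())
    for word in ignore:
        counts.pop(word, None)  # no error if the word never occurred
    return counts

counts = count_words("hello hello hello good bye bye", ignore=["good"])
for word, count in counts.items():
    if count > 2:
        print(word, count)  # prints: hello 3
```

Popping after counting touches each ignored word once, instead of checking every word of the text against the ignore list.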

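Answer 1: (score: 1)

There are a few inefficiencies here. I have rewritten your code to take advantage of some optimizations. The reason for each change is in the comments / docstrings: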

# -*- coding: utf-8 -*-
import sys
from collections import Counter

def contar(aux):
    """Here I replaced your hand made solution with the
    built-in Counter which is quite a bit faster.
    There's no real reason to keep this function, I left it to keep your code
    interface intact.
    """
    return Counter(aux)

def replace_special_chars(string, chars, replace_char=" "):
    """Replaces a set of characters by another character, a space by default
    """
    for c in chars:
        string = string.replace(c, replace_char)
    return string

def main():
    characters = '!?¿-.:;-,><=*»¡'
    aux = []
    counts = {}

    with open(sys.argv[1], "r") as f:
        # You were calling lower() once for every `word`. Now we only
        # call it once for the whole file:
        contents = f.read().strip().lower()
        contents = replace_special_chars(contents, characters)
        aux = contents.split()

    # Remove the ignored words from the list
    if len(sys.argv) > 3:
        with open(sys.argv[3], "r") as f:
            # what you had here was very inefficient:
            # remove = "".join(c for c in f.read())
            # that would create an array of characters then join them together as a string.
            # this is a bit silly because it's identical to f.read():
            # "".join(c for c in f.read()) === f.read()
            ignore_words = set(f.read().strip().split())
            # ignore_words is a `set` to allow for very fast inclusion/exclusion checks
            aux = (word for word in aux if word not in ignore_words)

    counts = contar(aux)

    for word, count in counts.items():
        if count > int(sys.argv[2]):
            print(word, count)

if __name__ == '__main__':
    main()
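
The comment about `set` can be illustrated with a small sketch. The word lists are made-up data and `filter_words` is a hypothetical helper, not part of the answer's script. The point is only that `in` on a list scans the elements one by one, while `in` on a set is a hash lookup.

```python
def filter_words(words, ignore_words):
    """Return the words that are not in ignore_words (hypothetical helper)."""
    ignore = set(ignore_words)  # built once; each lookup is O(1) on average
    return [word for word in words if word not in ignore]

words = ["hello", "good", "bye", "hello", "bye", "hello"]
print(filter_words(words, ["good", "bye"]))  # ['hello', 'hello', 'hello']
```

With a list of m ignored words and n words of text, the list version does on the order of n * m comparisons, which is where a book-sized file starts to hurt.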

Answer 2: (score: 1)


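  1. Parse command-line arguments under __name__ == '__main__': doing this enforces modularity in your code, because the command-line arguments are only requested when you run this script itself, not when you import the function from another script.
  2. Use regex to filter out the words you don't want: a regex lets you state either the characters you want or the characters you don't want, whichever is shorter. In this case, hard-coding every special character you don't want is fairly tedious compared with declaring the characters you do want in a simple regex pattern. In the script below, I use the pattern [a-zA-Z0-9]+ to filter out words that are not alphanumeric.
  3. Ask forgiveness, not permission: since the minimum-count command-line argument is optional, it will not always be present. So we can use a try/except block, the Pythonic way, to try to set the minimum count from sys.argv[2] and catch the IndexError exception to default the minimum count to 0.
  4. Python script:



    # sys
    import sys
    # regex
    import re
    def main(text_file, min_count):
        word_count = {}
        with open(text_file, 'r') as words:
            # Clean words of linebreaks and split
            # on whitespace to get list of words
            words = words.read().strip().split()
            # Filter words that are not alphanum
            pattern = re.compile(r'^[a-zA-Z0-9]+$')
            words = filter(pattern.search, words)
            # Iterate through words and collect
            # count
            for word in words:
                if word in word_count:
                    word_count[word] = word_count[word] + 1
                else:
                    word_count[word] = 1
        # Iterate for output
        for word, count in word_count.items():
            if count > min_count:
                print('%s %s' % (word, count))
    if __name__ == '__main__':
        # Get text file name
        text_file = sys.argv[1]
        # Attempt to get minimum count
        # from command line.
        # Default to 0
        try:
            min_count = int(sys.argv[2])
        except IndexError:
            min_count = 0
        main(text_file, min_count)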
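
The optional-argument handling at the bottom of the script can be isolated as below. This is only a sketch: `parse_min_count` is a hypothetical helper name, and the argument lists stand in for `sys.argv`.

```python
def parse_min_count(argv, default=0):
    """Return int(argv[2]) if it is present, otherwise default (hypothetical helper)."""
    try:
        return int(argv[2])
    except IndexError:
        # No third element: the optional argument was not given
        return default

print(parse_min_count(["script.py", "text.txt"]))       # 0
print(parse_min_count(["script.py", "text.txt", "2"]))  # 2
```

Only IndexError is caught, so a non-numeric argument still raises ValueError instead of silently falling back to the default.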


    hello hello hello good bye goodbye !bye bye¶ b?e goodbye


    python script.py text.txt


    bye 1
    good 1
    hello 3
    goodbye 2


    python script.py text.txt 2
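
The filtering step can also be checked on its own against the sample text above. This is a sketch, with `alnum_words` as an assumed name for illustration. It uses the explicit ranges `[a-zA-Z0-9]`; a range written as `A-z` would also match the punctuation that sits between `Z` and `a` in ASCII, such as `_` and `^`.

```python
import re
from collections import Counter

ALNUM = re.compile(r'^[a-zA-Z0-9]+$')

def alnum_words(text):
    """Split on whitespace and keep only purely alphanumeric tokens."""
    return [word for word in text.split() if ALNUM.search(word)]

sample = "hello hello hello good bye goodbye !bye bye¶ b?e goodbye"
counts = Counter(alnum_words(sample))
print(sorted(counts.items()))
# [('bye', 1), ('good', 1), ('goodbye', 2), ('hello', 3)]
```

Tokens such as "!bye" are dropped entirely by this approach, not cleaned, which is why "bye" is counted only once here.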