Question

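I'm learning Python from an introductory Python textbook, and I'm stuck on the following problem:

You will implement the function index(), which takes as input the name of a text file and a list of words. For every word in the list, your function will find the lines in the text file where the word occurs and print the corresponding line numbers.

Example: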

 >>> index('raven.txt', ['raven', 'mortal', 'dying', 'ghost', 'ghastly', 'evil', 'demon'])

 ghost     9 
 dying     9 
 demon     122
 evil      99, 106
 ghastly   82
 mortal    30 
 raven     44, 53, 55, 64, 78, 97, 104, 111, 118, 120

Here is my attempt at the problem:

def index(filename, lst):
    infile = open(filename, 'r')
    lines =  infile.readlines()
    lst = []
    dic = {}
    for line in lines:
        words = line.split()
        lst. append(words)
    for i in range(len(lst)):
        for j in range(len(lst[i])):
            if lst[i][j] in lst:
                dic[lst[i][j]] = i 
    return dic

When I run the function, I get back an empty dictionary, and I don't understand why. What is wrong with my function? Thanks.

Answer 1

Try this:

def index(filename, lst):
    dic = {w:[] for w in lst}
    for n,line in enumerate( open(filename,'r') ):
        for word in lst:
            if word in line.split(' '):
                dic[word].append(n+1)
    return dic

A few language features used here are worth knowing, because they will make life easier in the long run.

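The first is a dictionary comprehension. It initializes a dictionary that uses the words in lst as keys and an empty list [] as the value for each key.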
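To make the dictionary comprehension concrete, here is a minimal sketch. The word list and line number are made up for illustration. It also shows one reason to prefer the comprehension over dict.fromkeys when the values are mutable: the comprehension creates a fresh list per key.

```python
# Hypothetical word list, for illustration only.
lst = ['raven', 'ghost', 'evil']

# The comprehension evaluates [] once per word, so every key gets its own list.
dic = {w: [] for w in lst}
dic['raven'].append(44)
print(dic)     # {'raven': [44], 'ghost': [], 'evil': []}

# dict.fromkeys reuses the single list object passed in for every key.
shared = dict.fromkeys(lst, [])
shared['raven'].append(44)
print(shared)  # {'raven': [44], 'ghost': [44], 'evil': [44]}
```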

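Next is the enumerate function. It lets us iterate over the items in a sequence while also giving us the index of each item. In this case, because we pass a file object to enumerate, it loops over the lines of the file. On each iteration, n is the 0-based index of the line and line is the line itself. Next, we iterate over the words in lst.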
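As a small illustration of enumerate over lines, the sketch below uses io.StringIO as a stand-in for an open file (the content is invented). It also uses the optional start argument, which is one alternative to writing n+1 when 1-based line numbers are wanted.

```python
import io

# A stand-in for an open text file; the content is hypothetical.
infile = io.StringIO("first line\nsecond line\nthird line\n")

numbered = []
# start=1 makes the counter begin at 1 instead of 0.
for n, line in enumerate(infile, start=1):
    numbered.append((n, line.rstrip('\n')))

print(numbered)
# [(1, 'first line'), (2, 'second line'), (3, 'third line')]
```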

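Note that we don't need any indices here. Python encourages looping over the objects in a sequence rather than looping over indices and then accessing the objects by index (for example, for i in range(len(lst)): do something with lst[i] is discouraged).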

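Finally, the in operator is a simple way to test membership for many types of objects, and the syntax is very intuitive. In this case, we are asking whether the current word from lst is in the current line.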

Note that we use line.split(' ') to get a list of the words in the line. If we didn't do this, 'the' in 'there was a ghost' would return True, because the is a substring of one of the words.

On the other hand, 'the' in ['there', 'was', 'a', 'ghost'] returns False. If the condition is True, we append the line number to the list associated with that key in the dictionary.
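The difference between substring and whole-word membership can be checked directly. The sketch below also points out a pitfall that may matter with real files: a line read from a file usually ends in '\n', and split(' ') leaves that newline attached to the last word, whereas split() with no argument splits on any whitespace. The sample line is made up.

```python
line = 'there was a ghost\n'   # hypothetical line, as read from a file

by_space = line.split(' ')     # splits on single spaces only
by_whitespace = line.split()   # splits on any run of whitespace

print('the' in line)              # True: substring match
print('the' in by_space)          # False: whole-word match
print(by_space)                   # ['there', 'was', 'a', 'ghost\n']
print('ghost' in by_space)        # False: the last word still carries '\n'
print('ghost' in by_whitespace)   # True
```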

This may be a lot to chew on, but these concepts make problems like this one much more straightforward.

Answer 2

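You are overwriting the value of lst. You use it both as the function's parameter (in which case it is a list of strings) and as the list of words in the file (in which case it is a list of lists of strings). When you do: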

if lst[i][j] in lst

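the comparison always returns False, because lst[i][j] is a str, but lst contains only lists of strings, not the strings themselves. This means the assignment to dic is never executed, and you end up with an empty dict.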
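The failing comparison can be reproduced in isolation. In this sketch the nested list stands in for what lst holds after the first loop of the question's code; the words themselves are invented.

```python
# What lst looks like after the first loop: one list of words per line.
lst = [['once', 'upon', 'a'], ['midnight', 'dreary']]

word_found = 'once' in lst                 # compares a str against whole lists
row_found = ['once', 'upon', 'a'] in lst   # a whole line does match
nested_found = any('once' in line_words for line_words in lst)

print(word_found)    # False
print(row_found)     # True
print(nested_found)  # True
```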

To avoid this, use a different name for the list in which you store the words, for example:

In [4]: !echo 'a b c\nd e f' > test.txt

In [5]: def index(filename, lst):
   ...:     infile = open(filename, 'r')
   ...:     lines =  infile.readlines()
   ...:     words = []
   ...:     dic = {}
   ...:     for line in lines:
   ...:         line_words = line.split()
   ...:         words.append(line_words)
   ...:     for i in range(len(words)):
   ...:         for j in range(len(words[i])):
   ...:             if words[i][j] in lst:
   ...:                 dic[words[i][j]] = i 
   ...:     return dic
   ...: 

In [6]: index('test.txt', ['a', 'b', 'c'])
Out[6]: {'a': 0, 'c': 0, 'b': 0}

There are also quite a few other things you could change.

When you want to iterate over a list, you don't have to use an index explicitly. If you need the index, you can use enumerate:

    for i, line_words in enumerate(words):
        for word in line_words:
            if word in lst: dic[word] = i

You can also iterate directly over the file (see the Reading and Writing Files section of the Python tutorial for more information):

# use the with statement to make sure that the file gets closed
with open('test.txt') as infile:
    for i, line in enumerate(infile):
        print('Line {}: {}'.format(i, line))

In fact, I don't see why you would build the words list of lists first. Just iterate over the file directly while building the dictionary:

def index(filename, lst):
    with open(filename, 'r') as infile:
        dic = {}
        for i, line in enumerate(infile):
            for word in line.split():
                if word in lst:
                    dic[word] = i 
    return dic

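The values of your dic should be lists, since more than one line can contain the same word. As it stands, dic stores only the last line on which a word was found. A version that collects every line: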

from collections import defaultdict

def index(filename, words):
    # a frozenset makes the `in` checks below faster
    words = frozenset(words)  
    with open(filename) as infile:
        dic = defaultdict(list)
        for i, line in enumerate(infile):
            for word in line.split():
                if word in words:
                    dic[word].append(i)
    return dic

If you don't want to use collections.defaultdict, you can replace dic = defaultdict(list) with dic = {} and then change:

dic[word].append(i)

to:

if word in dic:
    dic[word].append(i)
else:
    dic[word] = [i]

Alternatively, you can use dict.setdefault:

dic.setdefault(word, []).append(i)

This last approach is a bit slower than the original code, though.
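The three ways of growing the per-word lists can be compared side by side. The sketch below feeds the same invented (word, line number) pairs through each of them and checks that the results agree; it makes no claim about their relative speed, which could be measured separately with the timeit module.

```python
from collections import defaultdict

# Hypothetical (word, line number) pairs, in the order they would be found.
pairs = [('raven', 44), ('evil', 99), ('raven', 53)]

with_defaultdict = defaultdict(list)
with_if = {}
with_setdefault = {}

for word, i in pairs:
    # 1. defaultdict creates the empty list on first access.
    with_defaultdict[word].append(i)
    # 2. plain dict with an explicit membership test.
    if word in with_if:
        with_if[word].append(i)
    else:
        with_if[word] = [i]
    # 3. setdefault inserts [] only if the key is missing, then returns the list.
    with_setdefault.setdefault(word, []).append(i)

print(dict(with_defaultdict) == with_if == with_setdefault)  # True
print(with_if)  # {'raven': [44, 53], 'evil': [99]}
```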

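Note that all of these solutions share one property: if a word is not found in the file, it does not appear in the result at all. However, you may want such a word to appear in the result with an empty list as its value. In that case it is simpler to start with a dict of empty lists before the loop, for example: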

dic = {word : [] for word in words}
for i, line in enumerate(infile):
    for word in line.split():
        if word in words:
            dic[word].append(i)

See the documentation on List Comprehensions and Dictionaries to understand the first line.

You can also iterate over words instead of over the words in the line, like this:

dic = {word : [] for word in words}
for i, line in enumerate(infile):
    for word in words:
        if word in line.split():
            dic[word].append(i)

But note that this will be slower, because:

line.split() returns a list, so word in line.split() has to scan the whole list.
You are recomputing line.split() for every word.

You can try to solve both of these problems:

dic = {word : [] for word in words}
for i, line in enumerate(infile):
    line_words = frozenset(line.split())
    for word in words:
        if word in line_words:
            dic[word].append(i)

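Note that we iterate over line.split() once to build the set, and also over words. Depending on the sizes of the two sets, this could be slower or faster than the original version (iterating over line.split()).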

At this point, however, it may be faster to intersect the sets:

dic = {word : [] for word in words}
for i, line in enumerate(infile):
    line_words = frozenset(line.split())
    for word in words & line_words:  # & stands for set intersection
        dic[word].append(i)
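Putting the pieces of this answer together with what the exercise asks for (line numbers starting at 1, printed next to each word), a complete version might look like the sketch below. The column width, the alphabetical ordering, and the temporary file with its contents are assumptions made for illustration; they are not taken from the textbook.

```python
import os
import tempfile

def index(filename, words):
    """Print each word with the 1-based numbers of the lines that contain it."""
    words = frozenset(words)
    dic = {word: [] for word in words}
    with open(filename) as infile:
        for i, line in enumerate(infile, start=1):
            # Set intersection keeps only the wanted words present on this line.
            for word in words & frozenset(line.split()):
                dic[word].append(i)
    for word in sorted(dic):
        print('{:<10}{}'.format(word, ', '.join(map(str, dic[word]))))
    return dic

# Hypothetical three-line file, written to a temporary location.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('the raven sat\nno ghost here\nraven and ghost\n')
try:
    result = index(tmp.name, ['raven', 'ghost', 'demon'])
finally:
    os.remove(tmp.name)

# result == {'raven': [1, 3], 'ghost': [2, 3], 'demon': []}
```

This sketch matches words exactly: it is case-sensitive, and a word followed by punctuation (such as "raven,") would not be found unless the punctuation is stripped first.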

Answer 3

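First, the function parameter that holds the words is named lst, and the list in which you put all the words from the file is also named lst. So you lose the words that were passed to your function, because on line 4 you redeclare the list.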

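Second, you iterate over each line in the file (the first for) and get the words in that line. After that, lst contains all the words in the whole file. So in for i ... you are already iterating over all the words taken from the file, and there is no need for a third for j that iterates over every character in each word.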

In summary, in the if you are saying "if this single character is in the list of words...", which it isn't, so the dict will never be filled.

for i in range(len(lst)):
  if words[i] in lst:
    dic[words[i]] = dic[words[i]] + i  # To count repetitions

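You need to rethink the problem. Even my answer will fail with an error, because the word won't exist in the dict yet, but you get the idea. Good luck!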
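The missing-key error mentioned above can be avoided with dict.get, which returns a default when the key is absent. The sketch below assumes the intent is to count how many times each wanted word occurs (the "To count repetitions" comment suggests this, but it is an assumption), and it uses an invented flat list of words.

```python
words = ['raven', 'ghost', 'raven']   # hypothetical flat list of words from the file
lst = ['raven', 'demon']              # the words we are asked about

dic = {}
for word in words:
    if word in lst:
        # get() returns 0 the first time, so no KeyError is raised.
        dic[word] = dic.get(word, 0) + 1

print(dic)  # {'raven': 2}
```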

Why am I getting an empty dictionary?
