Question

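I have to create a function with a single parameter, word, that returns the average length (in characters) of the words that precede word in the text. If word happens to be the first word in the text, the length of its preceding word should be counted as zero. For example: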

>>> average_length("the")
4.4
>>> average_length('whale')
False
>>> average_length('ship.')
3.0

Here is what I have written so far:

def average_length(word):
    text = "Call me Ishmael. Some years ago - never mind how long..........."
    words = text.split()
    wordCount = len(words)

    Sum = 0
    for word in words:
        ch = len(word)
        Sum = Sum + ch
    avg = Sum/wordCount
    return avg

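I know this isn't right at all, but I can't work out how to approach the problem correctly. The question asks me to find every instance of word in the text and, for each one, to calculate the length of the word immediately before it. Not every word from the beginning up to that word, just the one.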

I should also mention that all of the tests will only run my code against the first paragraph of "Moby Dick":

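"Call me Ishmael. Some years ago - never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people's hats off - then, I account it high time to get to sea as soon as I can. This is my substitute for pistol and ball. With a philosophical flourish Cato throws himself upon his sword; I quietly take to the ship. There is nothing surprising in this. If they but knew it, almost all men in their degree, some time or other, cherish very nearly the same feelings towards the ocean with me."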

Answer 1

It looks like you can save a lot of computation time by making only a single pass over the data:

from collections import defaultdict
prec = defaultdict(list)
text = "Call me Ishmael. Some years ago..".split()

Create two iterators over the list. We call next on the second one, so from now on, whenever we take an element from each iterator, we get a word and its successor.

first, second = iter(text), iter(text)
next(second)

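Zip the two iterators (zipping "abc" with "def" gives "ad", "be", "cf") and append the length of the first word to the second word's list of predecessor lengths. This works because we are using a defaultdict(list), which returns an empty list for any key that does not exist yet.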

for one, two in zip(first, second):  # pairwise
    prec[two].append(len(one))
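
To see what the pairing produces, here is a minimal sketch on a made-up four-word list (the sample words are an assumption for illustration, not the full test text):

```python
# Hypothetical sample; the real input would be the split Moby Dick paragraph.
sample = ["Call", "me", "Ishmael.", "Some"]

first, second = iter(sample), iter(sample)
next(second)  # advance the second iterator by one word

# Each pair is (predecessor, word); the first word never appears on the right.
pairs = list(zip(first, second))
print(pairs)
```

Note that the first word of the text never shows up as the second element of a pair, so it gets no entry unless it is handled separately.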

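Finally, we can build a new dictionary mapping each word to the mean of its predecessors' lengths: the sum divided by the count. Instead of the dictionary comprehension (shown commented out), you can also use a plain for loop.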

# avg_prec_len = {key: sum(prec[key]) / len(prec[key]) for key in prec}
avg_prec_len = {}
for key in prec:
    # prec[key] is a list of lengths
    avg_prec_len[key] = sum(prec[key]) / len(prec[key])

Then you can simply look words up in that dictionary.

(If you are using Python 2, use izip from itertools instead of zip, and do from __future__ import division.)
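
Putting these steps together, a single-pass version that also covers the two special cases from the question (first word counts as zero, unknown word returns False) might look like the sketch below. The names SAMPLE_TEXT and build_preceding_lengths are hypothetical, and the short sample sentence is an assumption standing in for the real paragraph.

```python
from collections import defaultdict

# Assumption: a short stand-in sentence; the real tests use the Moby Dick paragraph.
SAMPLE_TEXT = "the cat saw the bird and the cat ran"


def build_preceding_lengths(text):
    """Map each word to the list of lengths of the words right before it."""
    words = text.split()
    prec = defaultdict(list)
    if words:
        # The first word has no predecessor, so record a length of zero.
        prec[words[0]].append(0)
    first, second = iter(words), iter(words)
    next(second, None)  # the default avoids StopIteration on empty input
    for one, two in zip(first, second):
        prec[two].append(len(one))
    return prec


def average_length(word, text=SAMPLE_TEXT):
    prec = build_preceding_lengths(text)
    # A membership test does not create a new key in a defaultdict.
    if word not in prec:
        return False
    lengths = prec[word]
    return sum(lengths) / len(lengths)


print(average_length("the"))   # first occurrence contributes 0
print(average_length("dog"))   # word not in the text
```

Rebuilding the dictionary on every call keeps the sketch short; if many words are looked up, building it once and reusing it is what makes the single pass pay off.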

Answer 2

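Based on your requirement of no imports and a simple approach, the following function does it without any; the comments and variable names should make the function's logic pretty clear: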

def match_previous(lst, word):
    # keep matches_count of how many times we find a match and total lengths
    matches_count = total_length_sum = 0.0
    # pull first element from list to use as preceding word
    previous_word = lst[0]
    # slice rest of words from the list 
    # so we always compare two consecutive words
    rest_of_words = lst[1:]
    # catch where first word is "word" and add 1 to matches_count
    if previous_word == word:
        matches_count += 1
    for current_word in rest_of_words:
        # if the current word matches our "word"
        # add length of previous word to total_length_sum
        # and increase matches_count.
        if word == current_word:
            total_length_sum += len(previous_word)
            matches_count += 1
        # always update to keep track of word just seen
        previous_word = current_word
    # if  matches_count is 0 we found no word in the text that matched "word"
    return total_length_sum / matches_count if matches_count else False

It takes two arguments, the list of split words and the word to search for:

In [41]: text = "Call me Ishmael. Some years ago - never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people's hats off - then, I account it high time to get to sea as soon as I can. This is my substitute for pistol and ball. With a philosophical flourish Cato throws himself upon his sword; I quietly take to the ship. There is nothing surprising in this. If they but knew it, almost all men in their degree, some time or other, cherish very nearly the same feelings towards the ocean with me."

In [42]: match_previous(text.split(),"the")
Out[42]: 4.4

In [43]: match_previous(text.split(),"ship.")
Out[43]: 3.0

In [44]: match_previous(text.split(),"whale")
Out[44]: False

In [45]: match_previous(text.split(),"Call")
Out[45]: 0.0

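You can obviously do the same with your own function: take a single argument and split the text inside the function. The only way False is returned is if we find no match for the word; you can see that "Call" returns 0.0 because it is the first word in the text.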

If we add some prints to the code and use enumerate:

def match_previous(lst, word):
    matches_count = total_length_sum = 0.0
    previous_word = lst[0]
    rest_of_words = lst[1:]
    if previous_word == word:
        print("First word matches.")
        matches_count += 1
    for ind, current_word in enumerate(rest_of_words, 1):
        print("On iteration {}.\nprevious_word = {} and current_word = {}.".format(ind, previous_word, current_word))
        if word == current_word:
            total_length_sum += len(previous_word)
            matches_count += 1
            print("We found a match at index {} in our list of words.".format(ind-1))
        print("Updating previous_word from {} to {}.".format(previous_word, current_word))
        previous_word = current_word
    return total_length_sum / matches_count if matches_count else False

Running it on a small sample list, we can see what happens:

In [59]: match_previous(["bar","foo","foobar","hello", "world","bar"],"bar")
First word matches.
On iteration 1.
previous_word = bar and current_word = foo.
Updating previous_word from bar to foo.
On iteration 2.
previous_word = foo and current_word = foobar.
Updating previous_word from foo to foobar.
On iteration 3.
previous_word = foobar and current_word = hello.
Updating previous_word from foobar to hello.
On iteration 4.
previous_word = hello and current_word = world.
Updating previous_word from hello to world.
On iteration 5.
previous_word = world and current_word = bar.
We found a match at index 4 in our list of words.
Updating previous_word from world to bar.
Out[59]: 2.5

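The advantage of using iter is that we don't need to create a new list by slicing off the remainder. To use it, you only need to change the start of the function to: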

def match_previous(lst, word):
    matches_count = total_length_sum = 0.0
    # create an iterator
    _iterator = iter(lst)
    # pull first word from iterator
    previous_word = next(_iterator)
    if previous_word == word:
        matches_count += 1
    # _iterator will give us all bar the first word we consumed with  next(_iterator)
    for current_word in _iterator:

Each time we consume an element from an iterator, we move on to the next element:

In [61]: l = [1,2,3,4]

In [62]: it = iter(l)

In [63]: next(it)
Out[63]: 1

In [64]: next(it)
Out[64]: 2
# consumed two of four so we are left with two
In [65]: list(it)
Out[65]: [3, 4]
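
For completeness, here is a sketch of the whole function with the iterator-based start, where the loop body is carried over unchanged from the slicing version. The guard for an empty list is an added assumption; the original indexes lst[0] and would raise on empty input.

```python
def match_previous(lst, word):
    matches_count = total_length_sum = 0.0
    _iterator = iter(lst)
    # Added assumption: a default of None guards against an empty list.
    previous_word = next(_iterator, None)
    if previous_word is None:
        return False
    # A match on the very first word counts with a preceding length of zero.
    if previous_word == word:
        matches_count += 1
    # _iterator now yields everything except the first word.
    for current_word in _iterator:
        if current_word == word:
            total_length_sum += len(previous_word)
            matches_count += 1
        previous_word = current_word
    return total_length_sum / matches_count if matches_count else False


print(match_previous(["bar", "foo", "foobar", "hello", "world", "bar"], "bar"))
```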

The only way a dict really makes sense is if you pass multiple words to your function using *args:

def sum_previous(text):
    _iterator = iter(text.split())
    previous_word = next(_iterator)
    # set first k/v pairing with the first word
    # if  "total_lengths" is 0 at the end we know there
    # was only one match at the very start
    avg_dict = {previous_word: {"count": 1.0, "total_lengths": 0.0}}
    for current_word in _iterator:
        # if key does not exist, it creates a new key/value pairing
        avg_dict.setdefault(current_word, {"count": 0.0, "total_lengths": 0.0})
        # update value adding word length and increasing the count
        avg_dict[current_word]["total_lengths"] += len(previous_word)
        avg_dict[current_word]["count"] += 1
        previous_word = current_word
    # return the dict so we can use it outside the function.
    return avg_dict


def match_previous_generator(*args):
    # create our dict mapping words to sum of all lengths of their preceding words.
    d = sum_previous(text)
    # for every word we pass to the function.
    for word in args:
        # use dict.get with a default of an empty dict
        # to catch when a word is not in our text.
        count = d.get(word, {}).get("count")
        # yield each word and its avg, or False for non-existing words.
        yield (word, d[word]["total_lengths"] / count if count else False)

Then just pass in all the words you want to search for (the function reads the module-level text); you can call list on the generator:

In [69]: list(match_previous_generator("the","Call", "whale", "ship."))
Out[69]: [('the', 4.4), ('Call', 0.0), ('whale', False), ('ship.', 3.0)]

Or iterate over it:

In [70]: for tup in match_previous_generator("the","Call", "whale", "ship."):
   ....:     print(tup)
   ....:     
('the', 4.4)
('Call', 0.0)
('whale', False)
('ship.', 3.0)

Answer 3

I suggest splitting this task into a few atomic parts:

from __future__ import division  # int / int should result in float

# Input data:
text = "Lorem ipsum dolor sit amet dolor ..."
word = "dolor"

# First of all, let's extract words from string
words = text.split()

# Find indices of picked word in words
indices = [i for i, some_word in enumerate(words) if some_word == word]

# Find indices of preceding words
preceding_indices = [i-1 for i in indices]

# Find preceding words, handle first word case
preceding_words = [words[i] if i != -1 else "" for i in preceding_indices]

# Calculate mean of words length
mean = sum(len(w) for w in preceding_words) / len(preceding_words)

# Check if result is correct
# (len('ipsum') + len('amet')) / 2 = 9 / 2 = 4.5
assert mean == 4.5

Obviously we can wrap this up in a function. I have left the comments out here:

def mean_length_of_preceding_words(word, text):
    words = text.split()
    indices = [i for i, some_word in enumerate(words) if some_word == word]
    preceding_indices = [i-1 for i in indices]
    preceding_words = [words[i] if i != -1 else "" for i in preceding_indices]
    mean = sum(len(w) for w in preceding_words) / len(preceding_words)
    return mean

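Obviously performance is not the point here. I tried to use only built-ins (from __future__... counts as built-in in my view) and to keep the intermediate steps clear and self-explanatory.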

Some test cases:

assert mean_length_of_preceding_words("Lorem", "Lorem ipsum dolor sit amet dolor ...") == 0.0
assert mean_length_of_preceding_words("dolor", "Lorem ipsum dolor sit amet dolor ...") == 4.5
mean_length_of_preceding_words("E", "A B C D")  # ZeroDivisionError - average length of zero words does not exist

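If you want to handle punctuation in some way, the splitting step (words = ...) should be adjusted. The specification doesn't mention it, so I kept things plain and simple.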
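
As one possible adjustment, the splitting step could strip leading and trailing punctuation from each token. This is only a sketch under assumptions: the helper name split_words is hypothetical, and treating every character in string.punctuation as strippable is a choice the specification does not make.

```python
import string


def split_words(text):
    """Split on whitespace, strip surrounding punctuation, drop empty tokens."""
    stripped = (token.strip(string.punctuation) for token in text.split())
    # Tokens made only of punctuation (such as a lone "-") become "" and are dropped.
    return [token for token in stripped if token]


print(split_words("Call me Ishmael. Some years ago - never mind"))
```

Because only the ends of each token are stripped, an interior apostrophe such as the one in people's is preserved. Note that with this change a query like "ship." would no longer match; the caller would search for "ship" instead.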

I don't like changing the return type for special cases, but if you insist, you can exit early.

def mean_length_of_preceding_words(word, text):
    words = text.split()
    if word not in words:
        return False
    indices = [i for i, some_word in enumerate(words) if some_word == word]
    preceding_indices = [i-1 for i in indices]
    preceding_words = [words[i] if i != -1 else "" for i in preceding_indices]
    mean = sum(len(w) for w in preceding_words) / len(preceding_words)
    return mean

The last test case then changes to:

assert mean_length_of_preceding_words("E", "A B C D") is False

Answer 4

This answer is based on the assumption that you want to strip all punctuation and keep just the words...

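I play a dirty trick and prepend an empty string to the list of words, so that your requirement about the predecessor of the text's first word is satisfied.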

The result is computed using some smart indexing that numpy makes possible.

class Preceding_Word_Length():
    def __init__(self, text):
        import numpy as np
        self.words = np.array(
            ['']+[w.strip(''',.?!'":''') for w in text.split() if w != '-'])
        self.indices = np.arange(len(self.words))
        self.lengths = np.fromiter((len(w) for w in self.words), float)
    def mean(self, word):
        import numpy as np
        if word not in self.words:
            return 0.0
        return np.average(self.lengths[self.indices[word==self.words]-1])

text = '''Call me Ishmael. Some years ago - never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people's hats off - then, I account it high time to get to sea as soon as I can. This is my substitute for pistol and ball. With a philosophical flourish Cato throws himself upon his sword; I quietly take to the ship. There is nothing surprising in this. If they but knew it, almost all men in their degree, some time or other, cherish very nearly the same feelings towards the ocean with me.'''

ishmael = Preceding_Word_Length(text)

print(ishmael.mean('and'))   # -> 6.28571428571
print(ishmael.mean('Call'))  # -> 0.0
print(ishmael.mean('xyz'))   # -> 0.0

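I'd like to stress that implementing this behaviour in a class gives a simple way to cache some repeated computations when the same text is analysed several times in a row.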
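
A rough, numpy-free sketch of that caching idea follows. The class name CachedPrecedingLengths and its attribute names are hypothetical, and unlike the class above it leaves punctuation attached to the words.

```python
class CachedPrecedingLengths:
    def __init__(self, text):
        # The leading "" stands in for the predecessor of the first word.
        self.words = [""] + text.split()
        self._cache = {}  # word -> previously computed mean

    def mean(self, word):
        if word not in self._cache:
            lengths = [len(prev)
                       for prev, curr in zip(self.words, self.words[1:])
                       if curr == word]
            # Returning 0.0 for unknown words mirrors the numpy class above.
            self._cache[word] = sum(lengths) / len(lengths) if lengths else 0.0
        return self._cache[word]


sample = CachedPrecedingLengths("a bb a ccc a")
print(sample.mean("bb"))  # computed by scanning the word list
print(sample.mean("bb"))  # second call is served from the cache
```

The text is split once in the constructor, and each distinct word is scanned for at most once; repeated queries become dictionary lookups.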

Answer 5

Very similar to my previous answer, but without importing numpy:

def average_length(text, word):
    words = ['']+[w.strip(''',.?!'":''') for w in text.split() if w != '-']
    if word not in words: return False
    match = [len(prev) for prev, curr in zip(words[:-1],words[1:]) if curr==word]
    return 1.0*sum(match)/len(match)

Length of the preceding word
