Question: Why does any() run slower than using a loop?

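I have been working on a project that manages large lists of words and runs them through a lot of tests to validate each word. The funny thing is that every time I use "faster" tools such as the itertools module, they seem to be slower.

Finally I decided to ask, because I may be doing something wrong. The following code tests the performance of the any() function against a plain loop.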

#!/usr/bin/python3
#

import time
from unicodedata import normalize


file_path='./tests'


start=time.time()
with open(file_path, encoding='utf-8', mode='rt') as f:
    tests_list=f.read()
print('File reading done in {} seconds'.format(time.time() - start))

start=time.time()
tests_list=[line.strip() for line in normalize('NFC',tests_list).splitlines()]
print('String formalization, and list strip done in {} seconds'.format(time.time()-start))
print('{} strings'.format(len(tests_list)))


unallowed_combinations=['ab','ac','ad','ae','af','ag','ah','ai','af','ax',
                        'ae','rt','rz','bt','du','iz','ip','uy','io','ik',
                        'il','iw','ww','wp']


def combination_is_valid(string):
    if any(combination in string for combination in unallowed_combinations):
        return False

    return True


def combination_is_valid2(string):
    for combination in unallowed_combinations:
        if combination in string:
            return False

    return True


print('Testing the performance of any()')

start=time.time()
for string in tests_list:
    combination_is_valid(string)
print('combination_is_valid ended in {} seconds'.format(time.time()-start))


start=time.time()
for string in tests_list:
    combination_is_valid2(string)
print('combination_is_valid2 ended in {} seconds'.format(time.time()-start))

The code above is fairly representative of the kind of tests I am running. Here are the results:

File reading done in 0.22988605499267578 seconds
String formalization, and list strip done in 6.803032875061035 seconds
38709922 strings
Testing the performance of any()
combination_is_valid ended in 80.74802565574646 seconds
combination_is_valid2 ended in 41.69514226913452 seconds


File reading done in 0.24268722534179688 seconds
String formalization, and list strip done in 6.720442771911621 seconds
38709922 strings
Testing the performance of any()
combination_is_valid ended in 79.05265760421753 seconds
combination_is_valid2 ended in 42.24800777435303 seconds

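I find it rather surprising that the loop takes about half as long as any(). What is the explanation? Am I doing something wrong?

(I used Python 3.4 under GNU/Linux.)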

Answer 1

In fact, the any() function is equivalent to the following function:

def any(iterable):
    for element in iterable:
        if element:
            return True
    return False

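This is just like your second function, but since any() returns a boolean by itself, you don't need to check the result and then return a new value. So part of the performance difference comes from the redundant if condition and return, as well as from calling any inside another function.

So the advantage of any is that you don't need to wrap it in another function, because it does everything for you.

Also, as @interjay pointed out in the comments, the most important reason, which I missed, seems to be that you are passing a generator expression to any(). It does not provide its results up front; it has to produce them on demand, which is extra work.

Based on PEP 0289 -- Generator Expressions:

The semantics of a generator expression are equivalent to creating an anonymous generator function and calling it. For example: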
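As a minimal sketch of that point, the wrapper can return the negated result of any() directly instead of branching on it. The name combination_is_valid_direct and the shortened list are illustrative assumptions, not part of the question's code. Note that the generator expression is still there, so this only removes the redundant branch.

```python
# Shortened, illustrative list; the question uses a longer one.
unallowed_combinations = ['ab', 'rt', 'ww']


def combination_is_valid_direct(string):
    # any() already returns a bool, so negate it and return it directly.
    return not any(combination in string
                   for combination in unallowed_combinations)


print(combination_is_valid_direct('hello'))  # contains no forbidden pair
print(combination_is_valid_direct('cart'))   # contains 'rt'
```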

g = (x**2 for x in range(10))
print g.next()

is equivalent to:

def __gen(exp):
    for x in exp:
        yield x**2
g = __gen(iter(range(10)))
print g.next()

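As you can see, the generator is set up through iter(), and then every item that any() consumes costs a call to the generator's next method. The end result is that using any() with a generator expression is overkill in this case.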
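To see the generator overhead in isolation, a small timeit comparison along the following lines may help. This is only a sketch: the shortened list, the sample string, and the function names with_any and with_loop are assumptions made for illustration, and the timings will vary by machine and Python version.

```python
import timeit

unallowed_combinations = ['ab', 'rt', 'ww']  # illustrative subset


def with_any(string):
    # any() pulls each result out of a generator object.
    return not any(c in string for c in unallowed_combinations)


def with_loop(string):
    # Plain loop: no generator object is created or resumed.
    for c in unallowed_combinations:
        if c in string:
            return False
    return True


sample = 'hello world'  # hypothetical string with no forbidden pair
assert with_any(sample) == with_loop(sample)

for func in (with_any, with_loop):
    elapsed = timeit.timeit(lambda: func(sample), number=100000)
    print('{} took {:.3f} seconds'.format(func.__name__, elapsed))
```

Both functions do the same substring checks, so any gap between the two timings comes from how the checks are driven rather than from the checks themselves.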

Answer 2

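Since your actual question has been answered, I'll take a shot at the implied question:

You can get a free speed boost just by doing unallowed_combinations = sorted(set(unallowed_combinations)), since the list contains duplicates.

Given that, the fastest way I know of doing this is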
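As a quick check of that claim, the list from the question can be deduplicated and the lengths compared. The variable name deduplicated is introduced here only for illustration.

```python
# The list exactly as it appears in the question.
unallowed_combinations = ['ab', 'ac', 'ad', 'ae', 'af', 'ag', 'ah', 'ai', 'af', 'ax',
                          'ae', 'rt', 'rz', 'bt', 'du', 'iz', 'ip', 'uy', 'io', 'ik',
                          'il', 'iw', 'ww', 'wp']

# set() drops the repeated entries; sorted() gives a stable, ordered list back.
deduplicated = sorted(set(unallowed_combinations))

print(len(unallowed_combinations), len(deduplicated))  # 24 22
```

The two repeated entries are 'af' and 'ae', so every string is tested against 22 combinations instead of 24.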

import re

valid3_re = re.compile("|".join(map(re.escape, unallowed_combinations)))

def combination_is_valid3(string):
    return not valid3_re.search(string)
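A self-contained sketch of the same idea, using a shortened list and a few made-up sample strings (both are assumptions for illustration), shows the pattern that gets built and checks that the regex version agrees with the loop version:

```python
import re

unallowed_combinations = ['ab', 'rt', 'ww']  # illustrative subset

# re.escape guards against regex metacharacters in the combinations.
pattern = "|".join(map(re.escape, unallowed_combinations))
valid3_re = re.compile(pattern)


def combination_is_valid2(string):
    for combination in unallowed_combinations:
        if combination in string:
            return False
    return True


def combination_is_valid3(string):
    # search() returns None when no forbidden combination occurs.
    return not valid3_re.search(string)


# Hypothetical sample strings; both versions should agree on each one.
for s in ['hello', 'cart', 'cabin', 'swwim', '']:
    assert combination_is_valid2(s) == combination_is_valid3(s)

print(pattern)  # ab|rt|ww
```

The alternation lets the regex engine scan each string once instead of once per combination, which is plausibly where the gain comes from.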

With CPython 3.5, for some test data with a line length of 60 characters, I get

combination_is_valid ended in 3.3051061630249023 seconds
combination_is_valid2 ended in 2.216959238052368 seconds
combination_is_valid3 ended in 1.4767844676971436 seconds

where the third is the regex version, whereas on PyPy3 I get

combination_is_valid ended in 2.2926249504089355 seconds
combination_is_valid2 ended in 2.0935239791870117 seconds
combination_is_valid3 ended in 0.14300894737243652 seconds

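FWIW, this is competitive with Rust (a low-level language, like C++), and actually wins out noticeably on the regex side. Shorter strings favour PyPy over CPython even more (e.g. 4x faster than CPython for a line length of 10), since per-call overhead matters more there.

Since only about a third of CPython's regex runtime is loop overhead, we can conclude that PyPy's regex implementation is better optimized for this use case. I'd recommend checking whether there is a CPython regex implementation that makes it competitive with PyPy.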
