Question

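I'm just grepping some Xliff files for the pattern approved="no". I have a shell script and a Python script, and the difference in performance is huge (for a set of 393 files totalling 3,686,329 lines, the user time is 0.1s for the shell script and 6.6s for the Python script).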

Shell: grep 'approved="no"' FILE
Python:

import codecs
import re

def grep(pattern, file_path):
    ret = False

    with codecs.open(file_path, "r", encoding="utf-8") as f:
        while 1 and not ret:
            lines = f.readlines(100000)
            if not lines:
                break
            for line in lines:
                if re.search(pattern, line):
                    ret = True
                    break
    return ret

Any ideas for improving performance with a multi-platform solution?

Results

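After applying some of the proposed solutions, here are a few results. Tests were run on a RHEL6 Linux machine with Python 2.6.6. Working set: 393 Xliff files, 3,686,329 lines in total. Numbers are user time in seconds.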

grep_1 (io, joining 100,000 file lines): 50s
grep_3 (mmap): 0.7s
Shell version (Linux grep): 0.130s
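User-time figures like the ones above can also be collected from inside Python rather than with the shell's time. The following is only a rough sketch: user_time_for and contains_literal are names made up for illustration, and contains_literal is a stand-in for whichever grep function is being measured.

```python
import os


def user_time_for(func, paths):
    """Run func(path) on every path; return (matching paths, user CPU seconds)."""
    start = os.times()[0]  # index 0 is this process's user CPU time
    hits = [p for p in paths if func(p)]
    return hits, os.times()[0] - start


def contains_literal(path, needle=b'approved="no"'):
    # Hypothetical stand-in for any of the grep functions under test.
    with open(path, "rb") as f:
        return needle in f.read()
```

It could be called as user_time_for(contains_literal, glob.glob("*.xlf")), where the .xlf extension is an assumption about how the working set is named.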

Answer 1

Python, being an interpreted language, will always be slower than the compiled C version of grep.

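Apart from that, your Python implementation is not the same as your grep example. It does not return the matching lines; it merely tests whether the pattern matches the characters on any one line. A closer comparison would be: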

grep -q 'approved="no"' FILE

which returns as soon as a match is found and produces no output.
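To make the difference concrete, here is a small Python sketch of the two behaviours. The names matching_lines and has_match are made up for this illustration, and the files are assumed to be UTF-8 text.

```python
import io
import re


def matching_lines(pattern, file_path):
    """Like plain grep: yield every line that matches the pattern."""
    regex = re.compile(pattern)
    with io.open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            if regex.search(line):
                yield line


def has_match(pattern, file_path):
    """Like grep -q: stop at the first matching line and report True/False."""
    return any(True for _ in matching_lines(pattern, file_path))
```

The function in the question behaves like has_match, so grep -q is the fairer thing to time it against.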

You can substantially speed up your code by writing your grep() function more efficiently:

import io
import re

def grep_1(pattern, file_path):
    with io.open(file_path, "r", encoding="utf-8") as f:
        while True:
            lines = f.readlines(100000)
            if not lines:
                return False
            if re.search(pattern, ''.join(lines)):
                return True

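This uses io instead of codecs, which I found to be a little faster. The while loop condition does not need to check ret; you can simply return from the function as soon as the result is known. There is also no need to run re.search() on each individual line: just join the lines and perform a single search.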

At the cost of memory usage, you could try this:

import io
import re

def grep_2(pattern, file_path):
    with io.open(file_path, "r", encoding="utf-8") as f:
        return re.search(pattern, f.read())

If memory is an issue, you could mmap the file and run the regex search on the mmap:

import io
import mmap
import re

def grep_3(pattern, file_path):
    with io.open(file_path, "r", encoding="utf-8") as f:
        return re.search(pattern, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ))

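mmap will efficiently read the data from the file in pages, without consuming a lot of memory. Also, you'll probably find that mmap runs faster than the other solutions.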
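The timings in this thread were taken on Python 2. On Python 3, a regex searched against an mmap has to be a bytes pattern, so the same idea needs a small adaptation. A minimal sketch, assuming Python 3; grep_mmap is a name made up for this example:

```python
import mmap
import os
import re


def grep_mmap(pattern, file_path):
    """Return True if the bytes regex `pattern` matches anywhere in the file."""
    regex = re.compile(pattern)  # pattern must be bytes, e.g. b'approved="no"'
    with open(file_path, "rb") as f:
        # mmap cannot map a zero-length file, so treat it as "no match".
        if os.fstat(f.fileno()).st_size == 0:
            return False
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return regex.search(mm) is not None
```

Because the search runs over raw bytes, no UTF-8 decoding takes place. That is fine for an ASCII pattern such as approved="no", but a non-ASCII pattern would have to be encoded the same way as the file.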

Running timeit on each of these functions shows that this is the case:

10 loops, best of 3: 639 msec per loop       # grep()
10 loops, best of 3: 78.7 msec per loop      # grep_1()
10 loops, best of 3: 19.4 msec per loop      # grep_2()
100 loops, best of 3: 5.32 msec per loop     # grep_3()

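The file was /usr/share/dict/words, which contains approximately 480,000 lines, and the search pattern was zymurgies, which occurs near the end of the file. For comparison, when the pattern is near the start of the file, e.g. abaciscus, the times are: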

10 loops, best of 3: 62.6 msec per loop       # grep()
1000 loops, best of 3: 1.6 msec per loop      # grep_1()
100 loops, best of 3: 14.2 msec per loop      # grep_2()
10000 loops, best of 3: 37.2 usec per loop    # grep_3()

which again shows that the mmap version is fastest.
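Measurements in the "N loops, best of 3" style can be reproduced with the timeit module. The helper below is only a sketch: best_time is a made-up name, and the loop count of 10 is fixed here, whereas the timeit command line picks it automatically.

```python
import timeit


def best_time(func, *args, **kwargs):
    """Return the best per-call time in seconds over 3 runs of 10 calls each."""
    number = 10
    timer = timeit.Timer(lambda: func(*args, **kwargs))
    return min(timer.repeat(repeat=3, number=number)) / number
```

For example, best_time(grep_3, "zymurgies", "/usr/share/dict/words") would time the mmap version on the word list, assuming that file exists on the machine.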

Now comparing the grep command with the Python mmap version:

$ time grep -q zymurgies /usr/share/dict/words

real    0m0.010s
user    0m0.007s
sys     0m0.003s

$ time python x.py grep_3    # uses mmap

real    0m0.023s
user    0m0.019s
sys     0m0.004s

Which is not too bad considering the advantages that grep has.

Answer 2

Grep is actually a pretty clever piece of software; it doesn't just run a regex search on each line. It uses the Boyer-Moore algorithm. See here for more information.

For more pointers, see here for a Python implementation of grep.
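To give a feel for the kind of skipping Boyer-Moore does, here is a minimal sketch of the simpler Boyer-Moore-Horspool variant in Python. It is an illustration only, not grep's actual implementation, and horspool_find is a name made up for this example.

```python
def horspool_find(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # For each character of the pattern except the last, record how far the
    # search window may slide when that character sits under the window's end.
    # Later occurrences overwrite earlier ones, which gives the smallest shift.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            return pos
        # Characters that do not occur in the pattern allow a jump of m.
        pos += shift.get(text[pos + m - 1], m)
    return -1
```

The point is that most positions in the text are never examined: whenever the character under the end of the window does not occur in the pattern, the window jumps ahead by the full pattern length.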

Answer 3

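Another reason for the slowness here is calling re.search inside the loop. Each call has to resolve the pattern string again. The re module caches compiled patterns, so this is a cache lookup rather than a full recompile, but it still adds overhead on every line.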

Try this instead:

pattern = re.compile(pattern)
while True:
    ...
    if pattern.search(line):
    ...
