Searching for multiple strings in large files with Python

Time: 2014-01-29 02:41:41

Tags: python regex

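I am writing a script in Python 2.6 (I am new to Python). What I am after is the most efficient way to:

  • Scan roughly 300,000 .bin files
  • Each file is between 500 MB and 900 MB
  • Pull out 2 strings from each file (both are located at the beginning of the file)
  • Put the output for every file into a single .txt file

I wrote the following script, which works, but it processes each file very slowly. It has gotten through about 118 files in the past 50 minutes or so: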

import re, os, codecs

path = "./" #will search current directory
dir_lib = os.listdir(path)

for book in dir_lib:
    if not book.endswith('.bin'): #only looks for files that have .bin extension
        continue
    file = os.path.join(path, book)
    text = codecs.open(file, "r", "utf-8", errors="ignore")

    #had to use "ignore" because I kept getting error with binary files:
    #UnicodeDecodeError: 'utf8' codec can't decode byte 0x9a in position 10:
    #unexpected code byte

    for lineout in text:
        w = re.search("(Keyword1\:)\s(\[(.+?)\])", lineout)
        d = re.search("Keyword2\s(\[(.+?)\])", lineout)

        outputfile = open('output.txt', 'w')

        if w:
            lineout = w.group(3) #first keyword that is between the [ ]
            outputfile.write(lineout + ",")
        elif d:
            lineout = d.group(2) #second keyword that is between the [ ]
            outputfile.write(lineout + ";")

        outputfile.close()
    text.close()

My output is fine, exactly what I want:

 keyword1,keyword2;keyword1,keyword2;etc,...; 

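But at this speed it would take about a month or so of continuous running. Is there anything else I could try, perhaps an alternative to regular expressions? Is there a way to avoid scanning the whole file and simply move on to the next file once the keywords have been found?

Thanks for any suggestions.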

3 answers:

Answer 0 (score: 2)

One approach is to cheat and imitate grep from Unix systems; try http://nedbatchelder.com/code/utilities/pygrep.py

import os

# Get the pygrep script.
if not os.path.exists('pygrep.py'):
    os.system("wget http://nedbatchelder.com/code/utilities/pygrep.py")
from pygrep import grep, Options

# Writes a test file.
text="""This is a text
somehow there are many foo bar in the world.
sometimes they are black sheep, 
sometimes they bar bar black sheep.
most times they foo foo here
and a foo foo there"""
with open('test.txt','w') as fout:
    fout.write(text)

# Here comes the query
queries = ['foo','bar']

opt = Options() # set options for grep.
with open('test.txt','r') as fin:
    for i in queries:
        grep(i, fin, opt)
print
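
If downloading an external script is not an option, the same grep-like idea can be sketched with the standard library alone. This is a minimal illustration written for Python 3, not a reimplementation of pygrep: the helper name `grep_lines` is hypothetical, and the sample lines are the ones from the test file above.

```python
import re

def grep_lines(patterns, lines):
    """Yield (pattern, line_number, line) for each line matching a pattern.

    All patterns are checked in a single pass over `lines`, so a file
    handle passed in does not need to be rewound between queries.
    """
    compiled = [(source, re.compile(source)) for source in patterns]
    for number, line in enumerate(lines, 1):
        for source, regex in compiled:
            if regex.search(line):
                yield source, number, line.rstrip("\n")

sample = [
    "This is a text\n",
    "somehow there are many foo bar in the world.\n",
    "sometimes they are black sheep, \n",
    "sometimes they bar bar black sheep.\n",
    "most times they foo foo here\n",
    "and a foo foo there\n",
]

hits = list(grep_lines(["foo", "bar"], sample))
for source, number, line in hits:
    print("%s:%d:%s" % (source, number, line))
```

Because `grep_lines` is a generator, a caller can stop consuming it as soon as enough hits have been seen, and the rest of the input is never read.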

Answer 1 (score: 1)

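You can improve your code in at least three ways (in descending order of importance):

  • You never break out of the inner for loop once both lines have been found. This means the script walks through the entire file even though both lines were found somewhere near the beginning.
  • If the regex patterns are the same for all files, compile them outside the outer for loop. If they change from file to file, compile them outside the inner for loop. As it stands, a new regexp object is created on every iteration.

Note: this may not actually be the case, since the most recently used patterns are cached. (Still, there is no good reason to rely on that.)

  • In addition, you should not open and close the output file on every iteration.

The following code addresses these problems: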

import re, os, codecs

path = "./"
dir_lib = os.listdir(path)
# Raw strings keep the regex backslashes from being read as string escapes.
w_pattern = re.compile(r"(Keyword1\:)\s(\[(.+?)\])")
d_pattern = re.compile(r"Keyword2\s(\[(.+?)\])")

with open('output.txt', 'w') as outputfile:
    for book in dir_lib:
        if not book.endswith('.bin'):
            continue
        filename = os.path.join(path, book)
        with codecs.open(filename, "r", "utf-8", errors="ignore") as text:
            w_found, d_found = False, False
            for lineout in text:
                w = w_pattern.search(lineout)
                d = d_pattern.search(lineout)
                if w:
                    lineout = w.group(3)
                    outputfile.write(lineout + ",")
                    w_found = True
                elif d:
                    lineout = d.group(2)
                    outputfile.write(lineout + ";")
                    d_found = True
                if w_found and d_found:
                    break
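
Since the question states that both strings sit at the beginning of each file, another option is to read only a fixed-size prefix in binary mode and search it with bytes patterns. That avoids decoding and line iteration entirely. The sketch below is written for Python 3 and rests on assumptions that are not in the original post: `HEAD_BYTES` (64 KiB here) is a guessed upper bound on where the keywords can appear, and the helper names `read_head` and `extract_from_head` are hypothetical.

```python
import re

# Assumption: both fields appear within the first HEAD_BYTES of each file.
HEAD_BYTES = 64 * 1024

# Bytes patterns, so the binary data never has to be decoded as a whole.
W_PATTERN = re.compile(br"Keyword1:\s\[(.+?)\]")
D_PATTERN = re.compile(br"Keyword2\s\[(.+?)\]")

def read_head(filename, size=HEAD_BYTES):
    """Read at most `size` bytes from the start of the file."""
    with open(filename, "rb") as handle:
        return handle.read(size)

def extract_from_head(head):
    """Return (keyword1, keyword2) as text, or None if either is missing."""
    w = W_PATTERN.search(head)
    d = D_PATTERN.search(head)
    if not (w and d):
        return None
    # Only the two short captured values are decoded.
    return (w.group(1).decode("utf-8", "ignore"),
            d.group(1).decode("utf-8", "ignore"))

# Small in-memory stand-in for the start of a .bin file.
demo_head = b"\x9a\x00junk Keyword1: [alpha]\nmore\x00 Keyword2 [beta]\n" + b"\x00" * 100
demo_result = extract_from_head(demo_head)
```

If `extract_from_head` returns None for some files, the prefix size was too small for them; those files could be retried with a larger `size` or with the line-by-line loop above.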

Answer 2 (score: -1)

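Some simplifications that may or may not apply:

  • I assume Keyword1 and Keyword2 both appear at the start of a line (so I can use re.match instead of re.search)
  • I assume Keyword1 always appears before Keyword2 (so I can search for one and then the other, which halves the number of calls):

So: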

import codecs
import glob
import re

# Raw strings; Keyword2 has no colon, matching the pattern in the question.
START = re.compile(r"Keyword1\:\s\[(.+?)\]").match
END   = re.compile(r"Keyword2\s\[(.+?)\]").match

def main():
    with open('output.txt', 'w') as outf:
        for fname in glob.glob('*.bin'):
            with codecs.open(fname, 'rb', 'utf-8', errors='ignore') as inf:
                w = None
                for line in inf:
                    w = START(line)
                    if w:
                        break

                d = None
                for line in inf:
                    d = END(line)
                    if d:
                        break

                if w and d:
                    # Each pattern has a single capture group, so use group(1).
                    outf.write('{0},{1};'.format(w.group(1), d.group(1)))

if __name__=="__main__":
    main()
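
The two assumptions above can be seen in isolation in a small sketch. It is written for Python 3; the helper `first_match` and the sample lines are hypothetical, and an in-memory iterator stands in for the open file. It shows that `match` only succeeds when the keyword starts the line, and that the second scan resumes where the first one stopped because both loops share one iterator.

```python
import re

START = re.compile(r"Keyword1:\s\[(.+?)\]").match
END = re.compile(r"Keyword2\s\[(.+?)\]").match

def first_match(matcher, lines):
    """Advance `lines` until `matcher` succeeds; return the match or None."""
    for line in lines:
        found = matcher(line)
        if found:
            return found
    return None

sample = iter([
    "header Keyword1: [ignored]\n",  # keyword not at column 0, so match() skips it
    "Keyword1: [alpha]\n",
    "filler\n",
    "Keyword2 [beta]\n",
    "never read\n",
])

w = first_match(START, sample)   # consumes lines up to and including Keyword1
d = first_match(END, sample)     # continues from the line after Keyword1
remaining = list(sample)         # whatever the two scans did not touch
```

If the keywords can appear mid-line, swapping `.match` for `.search` in the two compiled patterns would be needed, at the cost of scanning each line in full.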