Fastest way to search a very large file

Time: 2018-01-18 07:20:31

Tags: python python-3.x python-2.7 performance

I have some Python code that reads ids from one file, looks them up in another, very large file, and writes a new file containing both the matched and the unmatched values.

For example:

file 1: 
ab
bc
cd
gh

file 2:
ab t1 catch1
ab t1 catch2
bc t1 catch1
bc t2 catch3
bc t1 catch4
ef t7 catch1

output:
ab catch1 
   catch2
bc catch1
   catch3
   catch4
cd
gh

My Code:
    # ids: the ids to look up, read from the first file
    for id in ids:
        with open("list_with-detail.ids") as f:
            for line in f:
                if id in line:
                    print(line)

I am dealing with very large files, ~10 GB, and each id takes a few minutes to fetch its relevant data. The list of ids to fetch is also very large, ~20 MB.

I would like to know a better/faster way to handle this.

1 Answer:

Answer 0 (score: 1)

Maybe not the most efficient, but here is a simple pure-Python example. It first indexes the contents of the data file with a Python dict; the index can then be used to quickly seek to and read the records for each key listed in the first file.

Note that a more robust solution would probably be to load the data into a proper database, e.g. sqlite3.
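For what it's worth, here is a minimal sketch of that sqlite3 route, assuming every line of the second file has exactly three whitespace-separated fields (the database file and table names are made up for illustration):

import sqlite3

conn = sqlite3.connect('records.db')  # database file name is illustrative
conn.execute('CREATE TABLE IF NOT EXISTS records (key TEXT, tag TEXT, value TEXT)')
conn.execute('CREATE INDEX IF NOT EXISTS records_key ON records (key)')

# Load the second file once; sqlite maintains the key index on disk
with open('/file2/path') as f:
    conn.executemany('INSERT INTO records VALUES (?, ?, ?)',
                     (line.split() for line in f))
conn.commit()

# Look up each id from the first file; the index makes each query fast
with open('/file1/path') as f:
    for line in f:
        k = line.split()[0]
        print(k)
        for _, _, value in conn.execute(
                'SELECT * FROM records WHERE key = ?', (k,)):
            print('\t', value)
conn.close()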

The dict-based indexing example:

from collections import defaultdict

# Map each key to the list of file offsets where its records start
idx = defaultdict(list)

# Index the contents of the second file.
# readline() is used rather than iterating over the file because
# tell() cannot be used while a file object is being iterated.
file2 = open('/file2/path')
while True:
    # get the current file position
    loc = file2.tell()
    l = file2.readline()
    if not l:
        break
    k = l.split()[0]
    # Store the file position for this key
    idx[k].append(loc)

# The idx object could now be serialized to disk for later access.

# For each key in the first file, seek to and print its records
file1 = open('/file1/path')
for l in file1:
    k = l.split()[0]
    print(k)
    for loc in idx.get(k, []):
        # Jump to the indexed file position and read the line
        file2.seek(loc)
        print('\t', file2.readline().strip())

file1.close()
file2.close()
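
As the comment in the middle of the example notes, the index could be serialized to disk so later runs skip the indexing pass over the ~10 GB file entirely; a minimal sketch using pickle (the index file name is made up):

import pickle

# Save the index once after building it (file name is illustrative)
with open('file2.idx', 'wb') as f:
    pickle.dump(dict(idx), f)

# A later run can reload it instead of re-indexing the data file
with open('file2.idx', 'rb') as f:
    idx = pickle.load(f)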