Does anyone know how to speed up this piece of Python code? It was designed to process small files (only a few lines, where it is very fast), but I want to run it on large files (around 50 GB and millions of lines).
The main goal of this code is to take sample names from a file (.txt), search for them in the input file, and print the number of occurrences to an output file.
Here is the code. infile, seqList and out are options determined by optparse at the beginning of the code (not shown):
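For context, the omitted option-parsing section presumably looks something like the sketch below; the flag names (-i, -s, -o) are my guesses, not the asker's actual options:

```python
from optparse import OptionParser

# Hedged sketch of the omitted optparse section; flag names are assumptions.
parser = OptionParser()
parser.add_option("-i", "--infile", dest="infile", help="input .txt file to count over")
parser.add_option("-s", "--seqlist", dest="seqList", help="file with one sample name per line")
parser.add_option("-o", "--out", dest="out", help="output file for the counts table")

# Parsing an explicit argument list here just to show the resulting attributes.
options, args = parser.parse_args(["-i", "data.txt", "-s", "samples.txt", "-o", "counts.txt"])
```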
def novo (infile, seqList, out) :
    uDic = dict()
    rDic = dict()
    nmDic = dict()

    with open(infile, 'r') as infile, open(seqList, 'r') as RADlist :
        samples = [line.strip() for line in RADlist]
        lines = [line.strip() for line in infile]

    #Create dictionaries with all the samples
    for i in samples:
        uDic[i.replace(" ","")] = 0
        rDic[i.replace(" ","")] = 0
        nmDic[i.replace(" ","")] = 0

    for k in lines:
        l1 = k.split("\t")
        l2 = l1[0].split(";")
        l3 = l2[0].replace(">","")
        if len(l1)<2:
            continue
        if l1[4] == "U":
            for k in uDic.keys():
                if k == l3:
                    uDic[k] += 1
        if l1[4] == "R":
            for j in rDic.keys():
                if j == l3:
                    rDic[j] += 1
        if l1[4] == "NM":
            for h in nmDic.keys():
                if h == l3:
                    nmDic[h] += 1

    f = open(out, "w")
    f.write("Sample"+"\t"+"R"+"\t"+"U"+"\t"+"NM"+"\t"+"TOTAL"+"\t"+"%R"+"\t"+"%U"+"\t"+"%NM"+"\n")
    for i in samples:
        U = int()
        R = int()
        NM = int()
        for k, j in uDic.items():
            if k == i:
                U = j
        for o, p in rDic.items():
            if o == i:
                R = p
        for y, u in nmDic.items():
            if y == i:
                NM = u
        TOTAL = int(U + R + NM)
        try:
            f.write(i+"\t"+str(R)+"\t"+str(U)+"\t"+str(NM)+"\t"+str(TOTAL)+"\t"+str(float(R) / TOTAL)+"\t"+str(float(U) / TOTAL)+"\t"+str(float(NM) / TOTAL)+"\n")
        except:
            continue
    f.close()
Answer 0 (score: 1)
When processing a 50 GB file, the question is not how to make it faster, but how to make it runnable at all.
The main problem is that you will run out of memory. The code must be modified so that it processes the file without holding the whole file in memory, keeping only one line in memory at a time.
The following part of the code reads all lines from both files:
with open(infile, 'r') as infile, open(seqList, 'r') as RADlist :
    samples = [line.strip() for line in RADlist]
    lines = [line.strip() for line in infile]
    # at this moment you are likely to run out of memory already

#Create dictionaries with all the samples
for i in samples:
    uDic[i.replace(" ","")] = 0
    rDic[i.replace(" ","")] = 0
    nmDic[i.replace(" ","")] = 0

#similar loop over `lines` comes later on
You should postpone reading the lines until the latest possible moment:
#Create dictionaries with all the samples
with open(seqList, 'r') as RADlist:
    for sampleline in RADlist:
        sample = sampleline.strip().replace(" ", "")
        uDic[sample] = 0
        rDic[sample] = 0
        nmDic[sample] = 0
Note: did you mean to use line.strip() or line.split()?
This way, you do not have to keep everything in memory.
There are more optimization options, but this one will get you up and running.
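The same streaming treatment applies to the big input file itself: iterate over it line by line instead of materializing `lines`. A minimal sketch, using io.StringIO to stand in for open(infile) and a made-up two-line input, since the exact file format is only inferred from the question:

```python
import io

# One pre-initialized sample, as the dictionary-building loop would produce.
uDic = {"sample1": 0}

# io.StringIO stands in for open(infile, 'r'); the line layout is assumed
# from the question's parsing code (tab-separated, status in field 5).
infile = io.StringIO(">sample1;x\ta\tb\tc\tU\n>sample1;x\ta\tb\tc\tR\n")
for line in infile:
    l1 = line.strip().split("\t")
    if len(l1) < 2:
        continue
    l3 = l1[0].split(";")[0].replace(">", "")
    if l1[4] == "U" and l3 in uDic:
        uDic[l3] += 1
```

Only the current line lives in memory at any point, so this scales to arbitrarily large files.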
Answer 1 (score: 1)
This would be easier if you provided some sample input. I have not tested this, but the idea is simple: iterate over each file only once using iterators, rather than reading the whole file into memory. Use the efficient collections.Counter object to handle the counting and minimize the inner loops:
def novo(infile, seqList, out):
    from collections import Counter
    import csv

    # Count
    counts = Counter()
    with open(infile, 'r') as infile:
        for line in infile:
            l1 = line.strip().split("\t")
            l2 = l1[0].split(";")
            l3 = l2[0].replace(">", "")
            if len(l1) < 2:
                continue
            counts[(l1[4], l3)] += 1

    # Produce output
    types = ['R', 'U', 'NM']
    with open(seqList, 'r') as RADlist, open(out, 'w') as outfile:
        f = csv.writer(outfile, delimiter='\t')
        f.writerow(['Sample'] + types + ['TOTAL'] + ['%' + t for t in types])
        for sample in RADlist:
            sample = sample.strip()
            countrow = [counts[(t, sample)] for t in types]
            total = sum(countrow)
            f.writerow([sample] + countrow + [total]
                       + [float(c) / total if total else 0 for c in countrow])
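The key trick above is that a Counter happily takes (type, sample) tuples as keys and returns 0 for anything it has never seen, so no pre-initialization loop over the samples is needed. A small self-contained illustration:

```python
from collections import Counter

# Count (status, sample) pairs as they stream by, no pre-initialization.
counts = Counter()
for status, sample in [("U", "s1"), ("U", "s1"), ("R", "s1"), ("NM", "s2")]:
    counts[(status, sample)] += 1

# Missing keys simply count as 0 instead of raising KeyError.
row_s1 = [counts[(t, "s1")] for t in ["R", "U", "NM"]]
```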
Answer 2 (score: 0)
Turn your script into functions (it makes profiling easier), then profile it to see where the time actually goes. I suggest using runsnake: runsnakerun
Answer 3 (score: 0)
I would try replacing your loops with list and dict comprehensions.
For example, instead of:
for i in samples:
    uDict[i.replace(" ","")] = 0
try:
udict = {i.replace(" ",""):0 for i in samples}
and similarly for the other dicts.
I don't really follow everything inside your "for k in lines" loop, but you only need l3 (and l2) when l1[4] has certain values. Why not check those values before doing the splitting and replacing?
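That reordering can be sketched as follows, assuming the same tab-separated format the question's code parses (the sample line here is made up):

```python
# Hypothetical input line in the question's assumed format.
line = ">s1;extra\ta\tb\tc\tU\n"

l1 = line.strip().split("\t")
l3 = None
# Only do the extra splitting and replacing when the line is long enough
# and the status field is one we actually care about.
if len(l1) >= 5 and l1[4] in ("U", "R", "NM"):
    l3 = l1[0].split(";")[0].replace(">", "")
```

Lines with other status values skip the string work entirely, which matters when there are millions of them.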
Finally, instead of looping over all the keys of a dict to see whether a given element is in it, try:
if x in myDict:
    myDict[x] = ....
For example:
for k in uDic.keys():
    if k == l3:
        uDic[k] += 1
can be replaced with:
if l3 in uDic:
    uDic[l3] += 1
Other than that, try profiling.
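A quick way to profile without any third-party tool is the standard library's cProfile and pstats, shown here on a toy function standing in for the real script:

```python
import cProfile
import io
import pstats

def work():
    # Toy workload standing in for the real processing function.
    return sum(i * i for i in range(10000))

pr = cProfile.Profile()
pr.enable()
result = work()
pr.disable()

# Render the five most expensive entries into a string report.
buf = io.StringIO()
pstats.Stats(pr, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()
```

The report shows cumulative time per function, which tells you where optimization effort is worth spending.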
Answer 4 (score: 0)
1) Look at a profiler and tune the code that takes the most time.
2) You could try optimizing some methods with Cython - use the data from the profiler to modify the right things.
3) It looks like you could use a Counter instead of a dict for the output file, and a set for the input file - look into them.
my_set = set()  # don't name it `set`, that would shadow the built-in
from collections import Counter
counter = Counter()  # Essentially a modified dict that is optimized for counting...
                     # like counting occurrences of strings in a text file
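For instance, Counter can count string occurrences directly from any iterable in one pass:

```python
from collections import Counter

# Count occurrences of each status string in one pass.
words = ["U", "R", "U", "NM", "U"]
counter = Counter(words)

# most_common(n) returns the n highest-count entries.
most = counter.most_common(1)  # [('U', 3)]
```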
4) If you are reading 50 GB into memory, you won't be able to store it all in RAM (I assume; who knows what kind of computer you have), so generators should save you memory and time.
#change list comprehension to generators
samples = (line.strip() for line in RADlist)
lines = (line.strip() for line in infile)
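Unlike a list comprehension, a generator expression yields one stripped line at a time, so memory use stays constant regardless of file size. A small sketch, with io.StringIO standing in for the real file handle:

```python
import io

# io.StringIO stands in for open(seqList, 'r') in this sketch.
RADlist = io.StringIO(" s1 \n s2 \n")

# Generator expression: no line is read or stripped until requested.
samples = (line.strip() for line in RADlist)

first = next(samples)   # lines are consumed lazily, one at a time
rest = list(samples)    # draining the generator reads the remainder
```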