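I'm hoping to get some help making my code run more efficiently. Its purpose is to take the first ID (RUID) and replace it with a de-identified ID (RESPID), based on a key file of IDs. The input data file is a large tab-delimited text file, about 2.5GB. The data is very wide, with thousands of columns per line. I have a function that works, but on the real data it is extremely slow: my first file has been running for 4 days and is only at 1.4GB. I don't know which part of my code is the most problematic, but I suspect it is where I stitch the line back together and write each line separately. Any suggestions on how to improve this would be greatly appreciated; 4 days of processing is far too long! Thanks!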
import datetime

def swap():
    #input files
    infile1 = open(r"Z:\ped_test.txt", 'rb')
    keyfile = open(r"Z:\ruid_respid_test.txt", 'rb')
    #output file
    outfile=open(r"Z:\ped_testRESPID.txt", 'wb')
    # create dictionary of RUID-RESPID
    COLUMN = 1 #Column containing RUID
    RESPID={}
    for k in keyfile:
        kList = k.rstrip('\r\n').split('\t')
        if kList[0] not in RESPID and kList[0] != "":
            RESPID[kList[0]]=kList[1]
    #print RESPID
    print "creating RESPID-RUID xwalk dictionary is done"
    print "Start creating new file"
    print str(datetime.datetime.now())
    count=0
    for line in infile1:
        #if not re.match('#', line): #if there is a header
        sline = line.split()
        #slen = len(sline)
        RUID = sline[COLUMN]
        #print RUID
        C0 = sline[0]
        #print C0
        DAT=sline[2:]
        for key in RESPID:
            if key==RUID:
                NewID=RESPID[key]
        row=str(C0+'\t'+NewID)
        for a in DAT:
            row=row+'\t'+a
        #print row
        outfile.write(row)
        outfile.write('\n')
    infile1.close()
    keyfile.close()
    outfile.close()
    print "All Done: RESPID replacement is complete"
    print str(datetime.datetime.now())
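Answer 0 (score: 0)
You don't need to iterate over RESPID. Replace:
    for key in RESPID:
        if key==RUID:
            NewID=RESPID[key]
with
    NewID = RESPID[RUID]
It does the same thing, because the matching key is always RUID. The one difference is that a direct lookup raises KeyError when RUID is not in the key file, whereas the loop leaves NewID at whatever value it had from an earlier line. I'm fairly sure this will greatly reduce the program's run time, because RESPID is large and the loop scans every key once for each line in "ped_test.txt".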
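To get a feel for the size of the difference, here is a small timing sketch. It is written for Python 3 (the code in the question is Python 2), and the dictionary contents, its size of 100,000 entries, and the function names `lookup_by_scan` and `lookup_direct` are made up for illustration; the real key file may be larger or smaller.

```python
import timeit

# Hypothetical stand-in for the real key file: 100,000 made-up RUID -> RESPID pairs.
RESPID = {"RUID%06d" % i: "RESP%06d" % i for i in range(100000)}


def lookup_by_scan(ruid):
    """The approach from the question: walk every key and compare."""
    new_id = None
    for key in RESPID:
        if key == ruid:
            new_id = RESPID[key]
    return new_id


def lookup_direct(ruid):
    """Direct hash lookup; returns None when the ID is not in the dictionary."""
    return RESPID.get(ruid)


if __name__ == "__main__":
    ruid = "RUID099999"
    # Each call is repeated 20 times; the scan visits all 100,000 keys per call.
    print("scan:  ", timeit.timeit(lambda: lookup_by_scan(ruid), number=20))
    print("direct:", timeit.timeit(lambda: lookup_direct(ruid), number=20))
```

The scan does work proportional to the number of keys for every input line, while the direct lookup takes roughly constant time on average, so the gap grows with the size of the key file.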
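Answer 1 (score: 0)
There are several places where you can speed this up. The main problem is enumerating every key in RESPID when you can use the dictionary's 'get' method to read the value directly. But since your lines are so wide, there are a few other tweaks that could make a difference.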
import datetime

def swap():
    #input files
    infile1 = open(r"Z:\ped_test.txt", 'rb')
    keyfile = open(r"Z:\ruid_respid_test.txt", 'rb')
    #output file
    outfile=open(r"Z:\ped_testRESPID.txt", 'wb')
    # create dictionary of RUID-RESPID
    COLUMN = 1 #Column containing RUID
    RESPID={}
    for k in keyfile:
        # minor: just grab what you need; strip the line ending first so it
        # cannot end up inside the RESPID value
        kList = k.rstrip('\r\n').split('\t', 2)
        if kList[0] and kList[0] not in RESPID: # minor: do the cheap test first
            RESPID[kList[0]]=kList[1]
    #print RESPID
    print "creating RESPID-RUID xwalk dictionary is done"
    print "Start creating new file"
    print str(datetime.datetime.now())
    for line in infile1:
        #if not re.match('#', line): #if there is a header
        # minor: just grab what you need; sline[2] is the whole rest of the
        # line, still ending in its original newline
        sline = line.split('\t', 2)
        RUID = sline[COLUMN]
        # the biggie, just use a lookup instead of looping over RESPID
        # (an RUID missing from the key file is written out unchanged)
        row = '\t'.join([sline[0], RESPID.get(RUID, RUID), sline[2]])
        # no extra '\n' needed: the line ending is already part of sline[2]
        outfile.write(row)
    infile1.close()
    keyfile.close()
    outfile.close()
    print "All Done: RESPID replacement is complete"
    print str(datetime.datetime.now())
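For anyone on Python 3, here is a rough sketch of the same idea. It is an illustration under stated assumptions, not a drop-in replacement: the function names `load_key` and `swap_ids` and their path parameters are hypothetical, the RUID is assumed to be the second tab-separated field, and the files are read as bytes so that no decoding is needed. It also counts IDs that were not found, since an unmapped RUID passed through unchanged would defeat the de-identification.

```python
def load_key(key_path):
    """Read a tab-delimited key file into a dict of RUID -> RESPID (as bytes).

    The first occurrence of an RUID wins; blank or one-column lines are skipped.
    """
    mapping = {}
    with open(key_path, "rb") as keyfile:
        for line in keyfile:
            fields = line.rstrip(b"\r\n").split(b"\t", 2)
            if len(fields) >= 2 and fields[0] and fields[0] not in mapping:
                mapping[fields[0]] = fields[1]
    return mapping


def swap_ids(in_path, key_path, out_path):
    """Copy in_path to out_path, replacing the second field via the key file.

    Returns the number of lines whose ID was not found in the key file.
    Lines with fewer than three fields are copied through unchanged.
    """
    mapping = load_key(key_path)
    missing = 0
    with open(in_path, "rb") as infile, open(out_path, "wb") as outfile:
        for line in infile:
            # Split off only the first two fields; the wide tail stays one
            # piece and keeps its original line ending.
            parts = line.split(b"\t", 2)
            if len(parts) == 3:
                if parts[1] in mapping:
                    parts[1] = mapping[parts[1]]
                else:
                    missing += 1
                line = b"\t".join(parts)
            outfile.write(line)
    return missing
```

A call might look like `swap_ids(r"Z:\ped_test.txt", r"Z:\ruid_respid_test.txt", r"Z:\ped_testRESPID.txt")`. If the returned count is not zero, it is probably worth checking the key file before sharing the output.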