Count the number of differing characters between two files

Time: 2014-05-30 17:28:02

Tags: python text scripting

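I have two fairly large (~20 MB) txt files that are essentially just long strings of integers (only 0, 1, 2). I want to write a Python script that walks through the files and compares them integer by integer. In the end, I want the number of differing integers and the total number of integers in the files (they should be exactly the same length). I did some searching and it seems difflib might be useful, but I'm fairly new to Python and I'm not sure whether anything in difflib counts the differences or the number of entries.

Any help would be greatly appreciated! What I'm trying right now is below, but it only looks at one entry and then terminates, and I don't understand why.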

f1 = open("file1.txt", "r")
f2 = open("file2.txt", "r")
fileOne = f1.readlines()
fileTwo = f2.readlines()
f1.close()
f2.close()

correct = 0
x = 0
total = 0
for i in fileOne:
  if i != fileTwo[x]:
    correct +=1
  x += 1
  total +=1

if total != 0:
  percent = (correct / total) * 100
  print "The file is %.1f %% correct!" % (percent)
  print "%i out of %i symbols were correct!" % (correct, total)

2 answers:

Answer 0 (score: 0)

Not tested at all, but take a look at this; it's simpler (and more Pythonic):

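from itertools import izip  # Python 2; on Python 3 use the built-in zip

with open("file1.txt", "r") as f1, open("file2.txt", "r") as f2:
    # pair the two files character by character and record whether each pair matches
    matches = [x == y for x, y in izip(f1.read(), f2.read())]

# percentage of matching characters (True counts as 1 in sum)
print sum(matches) * 100.0 / len(matches)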
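On Python 3, where `izip` no longer exists and the built-in `zip` is already lazy, the same idea could be sketched as follows. The function name `count_differences`, the stripping of trailing whitespace, and the length check are assumptions added for illustration, not part of the answer above. It returns the two numbers the question asks for: the count of differing characters and the total.

```python
def count_differences(path1, path2):
    """Return (differing, total) character counts for two text files.

    Hypothetical helper: assumes both files fit in memory and should
    have the same length once trailing whitespace is removed.
    """
    with open(path1, "r") as f1, open(path2, "r") as f2:
        a = f1.read().rstrip()
        b = f2.read().rstrip()
    if len(a) != len(b):
        raise ValueError("files differ in length: %d vs %d" % (len(a), len(b)))
    # zip pairs the strings position by position; True counts as 1 in sum()
    differing = sum(x != y for x, y in zip(a, b))
    return differing, len(a)
```

Calling `count_differences("file1.txt", "file2.txt")` would then give a tuple such as `(differing, total)`, from which a percentage can be computed as `100.0 * differing / total` when `total` is non-zero.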

Answer 1 (score: 0)

You can use enumerate to check for characters in the strings that don't match.

If all strings are guaranteed to be the same length:

with open("file1.txt","r") as f:
    l1 = f.readlines()
with open("file2.txt","r") as f:
    l2 = f.readlines()

non_matches = 0.
total = 0.
for i,j in enumerate(l1):
    # add 1 for each non-match between line i of the two files
    non_matches += sum([1 for k,l in enumerate(j) if l2[i][k]!= l])
    total += len(j.split(","))
print non_matches,total*2
# if the strings are all the same length, just multiply total by 2
print non_matches / (total * 2) * 100.

6 40
15.0
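
One likely reason the loop in the question stops after a single comparison: `readlines()` returns a list of lines, so a file that holds all of its digits on one line yields a one-element list, and comparing list elements compares whole lines rather than characters. The sketch below is written for Python 3; `io.StringIO` stands in for the real files and the sample digits are made up for illustration. It shows the one-element list and then counts mismatches character by character within each line.

```python
import io

# Made-up sample data standing in for file1.txt and file2.txt
file_one = io.StringIO("0120120120\n")
file_two = io.StringIO("0110120220\n")

lines_one = file_one.readlines()
lines_two = file_two.readlines()

# A single-line file gives a one-element list, so a loop over it runs once
print(len(lines_one))

non_matches = 0
total = 0
for line_a, line_b in zip(lines_one, lines_two):
    a = line_a.rstrip("\n")
    b = line_b.rstrip("\n")
    # compare the two lines position by position
    non_matches += sum(1 for x, y in zip(a, b) if x != y)
    total += len(a)

print(non_matches, total)
```

Stripping the newline before counting keeps line terminators out of the total; whether that is desirable depends on how the real files are laid out.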