Question

How do I read each line of a file in Python and check whether that line also appears on another line of the same text file?

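I created hashes for 2000 images and stored them all in the same text file. To find out whether there are duplicate images, I want to cross-check the hash values of all the images against each other.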

The code I use to extract the data is listed below:

with open('hash_info.txt') as f:
    content = f.readlines()

['fbefcfdf961f1919\n', 'aecc9e9696961f2f\n', 'cc1c9c9c1c1e272f\n', 'a4ce9e9e9793134b\n', 'e2e7e7e7e7e7e763\n', 'e64fcbcfcf8f0f27\n', '9c1c3c3c3e1e1e9c\n', 'c8cc9cb43e3c3b1b\n', 'cccd9e9e9e1e1f9f\n', 'ccce9e9e9ece0e4e\n', 'a6a7cbcfcf071736\n', 'f69c9c3c3636373b\n', 'ec9c9cbc3c26272b\n', 'f0cccc9c8c0e3f3b\n', '4c9c363e3e3e1e5d\n', '9c9cbc3e3c3c376f\n', 'f5ccce9e9e9e1f2c\n', 'cccc8c9ccc9ccdca\n', 'dc98ac2c363e5e5f\n', 'f2e7e7e7e7e76746\n', '9a9a1e3e3e3e373f\n', 'cc8c9e9e8ecece8f\n', 'db9f9f1e363e9e9e\n', 'e4cece8e9ececfcf\n', 'cecede9f9bce8f8f\n', 'b8ce4e4e9f1b1b29\n', 'ece6e6e7efcf0d05\n', 'cd8e9696b732163f\n', 'cece9e9ecececfcd\n', 'cc9d9f9f9f8dcdd9\n', '992d2c2c3c3ebe9e\n', 'e6e6cece8f2d2939\n', 'eccfcfcfcf4f6f7d\n', 'e6cecfcfcfefcec6\n', 'edf8e4cecece4e0e\n', 'e9d6e6e7e7a76667\n', 'edcecfcfcfcfcecf\n', 'a5a6c6ce8e0f43c7\n', '3a3e7c7c3d3e3f2f\n', 'cc9c963c361f173f\n', '8c9c9c9d9d9d1a9a\n', 'f0cc8e9e9e9f9d9e\n', '989c3c3c1c2e6e5b\n', 'f0989c1c9e1e1adb\n', 'f09c9c9c9c9e9e9f\n', 'e6ce4e1e86333309\n', 'a6cece9e8f0f0f2f\n', 'e8cccc9cccdc8d8c\n', 'f0ecced6969f0f2d\n', 'e0d89c3c3c3d3d1f\n', 'e6e7c7cfc7c64e4f\n', 'a6cf4b0f0e073739\n', 'cececececccf4b5b\n', 'a6c6cfcfcfc6c6c6\n', 'f0fcf3e3e3e3f303\n', 'f9f2e7e7cbcfcf97\n','fbefcfdf961f1919\n', 'f3e7e5e5e7e5c7c3\n', 'b3e7e7c7c7070f1e\n', 'cb9d97963e3f3325\n', '9b1e2c1c1e1e2e2b\n', '9d9e969f9f9f9f0f\n', 'e6a7a7e7e666666c\n', 'c64e9e9b0b072727\n','fbefcfdf961f1919\n', 'c7cfcfcfcfc7ce86\n', 'e6cecfcfcfc7c745\n', 'e6e6cecececfcfcf\n', 'cbcd9f9f9e1f3a7a\n', 'ccce9ecececec646\n', 'f1c7cfdf9f970325\n', '989d9c1c1e9e9f1f\n', '9c9e1c1e9e9d9c9a\n', '5f3d7656de5d3b1f\n', '5f3d76565e5d3b1f\n']

The following is the same text file:

33393cccde1b3f7b 71fb989ed79f3b79 78b0a3a34c7c3737 67781c5e9fcc1f4c 313c2ccf4b5f5f7f ece8cc9c9696171f f4ec8c9c9c9c1e1e e8cc94b68c9c1ece d89c36161c9c1e3f ecccdacececf6d6d a4cecbcacf87173d f9f3e7ebcbc74707 d9e5c7cbd34b4f4d e4ece6e3cbdb8f1d ccde9a9ecccecfad e6e6ced293d6cfc6 cc8c9c989ccc8e8b f2ccc696cecfcfcf cc8c9a9a9ececfcd cc9c9c9cdc9c9ff3

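How I solved it

import mmap

def check_dup(hash):
    # Returns False (and reports it) if the hash already occurs in the file, else True
    with open('hash_text_file.txt', 'rb') as f:
        # Memory-map the file so it can be searched without reading it into a list
        s = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            # rstrip to remove \n; mmap.find() expects bytes on Python 3
            if s.find(hash.rstrip().encode()) != -1:
                print("Duplicate Image")
                return False
            return True
        finally:
            s.close()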
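
To see which hashes occur more than once inside the file itself, one option is to count them. This is only a minimal sketch and not the method used above: the helper name find_duplicates is hypothetical, and it assumes one hash per line, as in the readlines() output shown earlier.

```python
from collections import Counter

def find_duplicates(lines):
    """Return the hashes that occur more than once, in order of first appearance."""
    # Strip the trailing newline and skip blank lines
    counts = Counter(line.strip() for line in lines if line.strip())
    # Counter keeps insertion order on Python 3.7+
    return [h for h, n in counts.items() if n > 1]

# Small sample taken from the list in the question
sample = ['fbefcfdf961f1919\n', 'aecc9e9696961f2f\n', 'fbefcfdf961f1919\n']
duplicates = find_duplicates(sample)
print(duplicates)
```

With the data from the question, this could be called as find_duplicates(content).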

Answer 1

# Open the text document and split it into a list with each line being an element
with open("Old.txt", "r") as file:
    lines = file.read().split("\n")
new_lines = []
# Iterate over the list of lines
for line in lines:
    # Add the line only if it is not already in new_lines (the list that will
    # contain the unique lines); this makes sure no line exists twice in the list
    if line not in new_lines:
        new_lines.append(line)
# Open a new text file and write each line of the unique lines list to it
with open("New.txt", "w") as file_new:
    file_new.write("\n".join(new_lines))
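
The `line not in new_lines` test scans a list, so the loop above does work that grows with the square of the number of lines. A possible variant keeps a set of lines already seen next to the output list. The function name unique_lines is hypothetical, and the sketch works on a list of lines rather than on Old.txt directly.

```python
def unique_lines(lines):
    """Return lines with repeats removed, keeping the first occurrence of each."""
    seen = set()      # fast membership test
    result = []       # preserves the original order
    for line in lines:
        if line not in seen:
            seen.add(line)
            result.append(line)
    return result

lines = "a\nb\na\nc\nb".split("\n")
print(unique_lines(lines))
```

For 2000 hashes either version is fast enough; the set mainly matters for much larger files.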

Answer 2

I took some of your sample data, stripped the "\n" from it, converted it to a set, and tested values against it with in / not in:

data = ['fbefcfdf961f1919\n', 'aecc9e9696961f2f\n', 'cc1c9c9c1c1e272f\n',
        'a4ce9e9e9793134b\n', 'e2e7e7e7e7e7e763\n',]

# create a set from your data, lookups are faster that way
cleaned = set(x.strip("\n") for x in data)

for testMe in ['not in it', 'fbefcfdf961f1919']: # supply your list of "new" things
    if testMe in cleaned:
        print("Got a duplicate: " + testMe)
    else:
        print("Unique: " + testMe)
        # append to hash-file ("a" appends; "w+" would truncate the file first)
        with open("hash_info.txt", "a") as f: # if you have 1000 new hashes this
            f.write(testMe + "\n")  # reopens the file 1000 times (see below)

To compare a large amount of new data against the existing data, you should also put the new data into a set:

newSet = set( ... your data here ... )

and use a set operation to get everything that is not yet in the cleaned set:

thingsToAddToFile = newSet - cleaned    # subtracts all known ones from newSet, only
                                        # new ones will be in thingsToAddToFile

# add them all to your existing ones by appending them ("a" mode, so nothing is overwritten):
if thingsToAddToFile:  # skip the write when there is nothing new
    with open("hash_info.txt", "a") as f:
        f.write("\n".join(thingsToAddToFile) + "\n") # joins all in set and appends '\n' on end

See https://docs.python.org/2/library/sets.html:

x in s                            test x for membership in s
x not in s                        test x for non-membership in s
s.issubset(t)           s <= t    test whether every element in s is in t
s.issuperset(t)         s >= t    test whether every element in t is in s
s.union(t)              s | t     new set with elements from both s and t
s.intersection(t)       s & t     new set with elements common to s and t
s.difference(t)         s - t     new set with elements in s but not in t
s.symmetric_difference(t) 
                        s ^ t     new set with elements in either s or t but not both
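
As a rough illustration of a few of the operations in the table, here is a small sketch using the built-in set type. The variable names known and incoming are made up for the example, and the hash strings are taken from the sample data in the question.

```python
# Hashes already stored, and a batch of hashes to check (illustrative values)
known = {'fbefcfdf961f1919', 'aecc9e9696961f2f', 'cc1c9c9c1c1e272f'}
incoming = {'fbefcfdf961f1919', 'a4ce9e9e9793134b'}

already_known = incoming & known   # intersection: duplicates of stored hashes
to_add = incoming - known          # difference: hashes that are really new
everything = known | incoming      # union: all hashes after the update

print(sorted(already_known))
print(sorted(to_add))
print(len(everything))
```

The difference is what would be appended to the hash file, and the intersection is the list of duplicate images.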
