Question

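I am trying to create a simple program that removes duplicate lines from a file, but I am stuck. My goal is to end up removing all but one copy of each duplicated line, unlike the suggested duplicate question, so that I still keep that data. I would also like it to take a filename as input and write its output to that same filename. When I tried making both filenames the same, it just output an empty file.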

input_file = "input.txt"
output_file = "input.txt"

seen_lines = set()
outfile = open(output_file, "w")

for line in open(input_file, "r"):
    if line not in seen_lines:
        outfile.write(line)
        seen_lines.add(line)

outfile.close()

input.txt

I really love christmas
Keep the change ya filthy animal
Pizza is my fav food
Keep the change ya filthy animal
Did someone say peanut butter?
Did someone say peanut butter?
Keep the change ya filthy animal

Expected output

I really love christmas
Keep the change ya filthy animal
Pizza is my fav food
Did someone say peanut butter?

Answer 1

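The line outfile = open(output_file, "w") truncates your file no matter what else you do. The reads that follow will find an empty file. My recommendation for doing this safely is to use a temporary file: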

Open a temp file for writing
Process the input into the new output
Close both files
Move the temp file to the input file name

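This is much more robust than opening the same file twice, once for reading and once for writing. If anything goes wrong, you will still have the original along with whatever work was done so far. Your current approach can corrupt the file if anything goes wrong along the way.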

Here is a sample using tempfile.NamedTemporaryFile and a with block, which makes sure everything is closed properly even if an error occurs:

from tempfile import NamedTemporaryFile
from shutil import move

input_file = "input.txt"
output_file = "input.txt"

seen_lines = set()

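# Write to a temp file first; the original stays intact until the final move.
with NamedTemporaryFile('w', delete=False) as output, open(input_file) as input:
    for line in input:
        sline = line.rstrip('\n')  # compare without the trailing newline
        if sline not in seen_lines:
            output.write(line)
            seen_lines.add(sline)
# Both files are closed at this point, so the move is safe.
move(output.name, output_file)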

The move at the end works correctly even if the input and output names are the same, since output.name is guaranteed to be different from both.
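
A variation on the example above, as a sketch only: the same steps wrapped in a function, with the temporary file created in the same directory as the target so the final rename stays on one filesystem. The function name dedupe_in_place is hypothetical, and os.replace is swapped in for shutil.move here, which is an assumption about what suits your setup. Unlike the example above, this version always ends the output with a newline.

```python
import os
from tempfile import NamedTemporaryFile


def dedupe_in_place(path):
    """Rewrite `path`, keeping only the first occurrence of each line."""
    seen_lines = set()
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives next to the target, so the final rename
    # does not have to cross filesystems.
    with NamedTemporaryFile('w', delete=False, dir=directory) as output, \
            open(path) as source:
        for line in source:
            sline = line.rstrip('\n')
            if sline not in seen_lines:
                output.write(sline + '\n')
                seen_lines.add(sline)
    # Both files are closed here; the original is untouched until this call.
    os.replace(output.name, path)
```

Calling dedupe_in_place("input.txt") would rewrite the file from the question. Note that the resulting file takes on the temp file's permissions, which may be more restrictive than the original's.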

Note also that I strip the newline from each line before adding it to the set, since the last line of the file might not end with one.
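
To see why the stripping matters, here is a small illustrative sketch (the sample lines are made up). When a file does not end with a newline, its last line differs from an earlier copy only by the missing '\n', so comparing raw lines misses the duplicate.

```python
lines = ["Keep the change\n", "Pizza\n", "Keep the change"]  # last line lacks '\n'

# Comparing raw lines: the final entry looks unique because of the missing newline.
raw_unique = []
for line in lines:
    if line not in raw_unique:
        raw_unique.append(line)

# Comparing stripped lines: the duplicate is detected.
stripped_unique = []
for line in lines:
    sline = line.rstrip('\n')
    if sline not in stripped_unique:
        stripped_unique.append(sline)

print(len(raw_unique))       # 3
print(len(stripped_unique))  # 2
```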

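Alternative Solution

If you don't care about the order of the lines, you can simplify the process somewhat by doing everything directly in memory: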

input_file = "input.txt"
output_file = "input.txt"

with open(input_file) as input:
    unique = set(line.rstrip('\n') for line in input)
with open(output_file, 'w') as output:
    for line in unique:
        output.write(line)
        output.write('\n')

You can compare this to

with open(input_file) as input:
    unique = set(line.rstrip('\n') for line in input.readlines())
with open(output_file, 'w') as output:
    output.write('\n'.join(unique))

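The second version does essentially the same thing, but reads and writes everything in one go. One difference: it does not write a newline after the last line.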

Answer 2

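The problem is that you are trying to write to the same file you are reading from. You have at least two options:

Option 1

Use different filenames (e.g. input.txt and output.txt). In some ways this is the easiest.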
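
A minimal sketch of this option, wrapped in a function so the two names are explicit. The function name remove_duplicates is hypothetical; the logic is the loop from the question.

```python
def remove_duplicates(input_file, output_file):
    """Copy input_file to output_file, skipping lines already written.

    Assumes the two paths refer to different files.
    """
    seen_lines = set()
    with open(input_file, "r") as infile, open(output_file, "w") as outfile:
        for line in infile:
            if line not in seen_lines:
                outfile.write(line)
                seen_lines.add(line)
```

Calling remove_duplicates("input.txt", "output.txt") leaves input.txt untouched and writes the deduplicated lines to output.txt.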

Option 2

Read all the data from your input file, close that file, then reopen it for writing.

with open('input.txt', 'r') as f:
    lines = f.readlines()

seen_lines = set()
with open('input.txt', 'w') as f:
    for line in lines:
        if line not in seen_lines:
            seen_lines.add(line)
            f.write(line)

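Option 3

Open the file for both reading and writing using r+ mode. In this case you need to be careful to read the data you are going to process before you write anything. If you do everything in a single loop, the loop iterator may lose track of its position.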
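
One way this option could be sketched, assuming the file is small enough to read into memory: read everything first, rewind with seek(0), write the unique lines, then call truncate() to drop whatever is left over from the old, longer contents. The function name dedupe_with_r_plus is hypothetical.

```python
def dedupe_with_r_plus(path):
    """Remove duplicate lines from `path` using a single r+ file handle."""
    with open(path, "r+") as f:
        lines = f.readlines()   # read all data before any write
        f.seek(0)               # go back to the start of the file
        seen_lines = set()
        for line in lines:
            if line not in seen_lines:
                seen_lines.add(line)
                f.write(line)
        f.truncate()            # cut off leftover bytes from the old contents
```

Keeping the read and the write in separate phases avoids the iterator problem mentioned above, since the loop runs over the in-memory list rather than over the file.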

Answer 3

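seen_lines = []  # a list keeps first-occurrence order

with open('input.txt','r') as infile:
    lines=infile.readlines()
    for line in lines:
        line_stripped=line.strip()  # note: strips leading whitespace too
        if line_stripped not in seen_lines:
            seen_lines.append(line_stripped)

# The file is reopened for writing only after all lines have been read.
with open('input.txt','w') as outfile:
    for line in seen_lines:
        outfile.write(line)
        if line != seen_lines[-1]:
            # Text mode translates '\n' to the platform line ending itself.
            outfile.write('\n')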

Output:

I really love christmas
Keep the change ya filthy animal
Pizza is my fav food
Did someone say peanut butter?

Answer 4

I believe this is the easiest way to do what you want:

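with open('FileName.txt', 'r+') as i:
    AllLines = i.readlines()
    i.seek(0)
    i.truncate()  # rewind and discard the old contents before rewriting
    for line in dict.fromkeys(AllLines):
        i.write(line)  # each distinct line once, in original order (Python 3.7+)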

Answer 5

Try the code below, which uses a list comprehension together with str.join, set and sorted:

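input_file = "input.txt"
output_file = "input.txt"
# Read everything before opening the file for writing, since "w" truncates it.
with open(input_file, "r") as infile:
    l = [i.rstrip() for i in infile.readlines()]
# set() removes duplicates; sorting by l.index restores first-occurrence order.
with open(output_file, "w") as outfile:
    outfile.write('\n'.join(sorted(set(l), key=l.index)))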

Answer 6

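My 2 cents, in case you happen to be able to use Python 3. It uses:

A reusable Path object, which has a handy write_text() method.
An OrderedDict as the data structure, which satisfies the constraints of uniqueness and order at once.
A generator expression instead of Path.read_text(), to save on memory.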

# in-place removal of duplicate lines, while preserving order
from collections import OrderedDict
from pathlib import Path

filepath = Path("./duplicates.txt")

with filepath.open() as _file:
    no_duplicates = OrderedDict.fromkeys(line.rstrip('\n') for line in _file)

filepath.write_text("\n".join(no_duplicates))
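
On Python 3.7 and later the built-in dict also preserves insertion order, so the deduplication step can be sketched without OrderedDict. The helper name unique_in_order is hypothetical and only illustrates the idea.

```python
def unique_in_order(lines):
    """Return the distinct items of `lines`, keeping first-occurrence order."""
    return list(dict.fromkeys(lines))


print(unique_in_order(["b", "a", "b", "c", "a"]))  # ['b', 'a', 'c']
```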

How to remove duplicate lines
