Question

I want to remove duplicate line items from my file. Duplicates need to be identified by comparing a few specific fields.

myfile.txt:

productItem1 ProductName11,ProdutctPrice27,ProductModelHP11,10/06/2016,ProductDescription-abc1,,,,,,01/11/2017
productItem2 ProductName12,ProdutctPrice99,ProductModelHP12,10/06/2016,ProductDescription-abc2,,,,,,09/02/2017
productItem3 ProductName13,ProdutctPrice87,ProductModelHP13,10/06/2016,ProductDescription-abc3,,,,,,09/02/2017
productItem1 ProductName11,ProdutctPrice27,ProductModelHP11,10/06/2016,ProductDescription-abc1,,,,,,01/12/2017
productItem1 ProductName11,ProdutctPrice27,ProductModelHP11,10/06/2016,ProductDescription-abc1,,,,,,01/11/2017
productItem2 ProductName13,ProdutctPrice991,ProductModelHP123,10/06/2016,ProductDescription-abc3,,,,,,09/02/2017

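As the example above shows, I want to eliminate duplicate records; in this case productItem1 has duplicates. I want to remove duplicates based on these fields (ProductName11, ProdutctPrice27, 10/06/2016, which are index 0, index 1 and index 3).

I want to keep the record with the most recent date. In this example, 01/12/2017 is the later date for productItem1.

In my case the key can have the same value, for example productItem2, but the fields I mentioned (index 0, index 1 and index 3) are different, so those records should not be treated as duplicates.

How can we eliminate the duplicates in Python?

The output should be: newFile.txt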

productItem2 ProductName12,ProdutctPrice99,ProductModelHP12,10/06/2016,ProductDescription-abc2,,,,,,09/02/2017
productItem3 ProductName13,ProdutctPrice87,ProductModelHP13,10/06/2016,ProductDescription-abc3,,,,,,09/02/2017
productItem1 ProductName11,ProdutctPrice27,ProductModelHP11,10/06/2016,ProductDescription-abc1,,,,,,01/12/2017
productItem2 ProductName13,ProdutctPrice991,ProductModelHP123,10/06/2016,ProductDescription-abc3,,,,,,09/02/2017

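What is an elegant way to eliminate the duplicate records? I tried a shell script, but it did not give me the expected output.

If anyone can help us solve this in a Pythonic way, we would be very grateful.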

Answer 1

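First, what happened when you tried to do this from the shell? UNIX (and Linux) has the uniq command for this (it collapses adjacent duplicate lines, so the input is usually sorted first).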

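In Python, the solution depends on your needs. Do you have to preserve the original order of the records? If not, you can simply add each line (as a string) to a set. When you hit the end of the file, just write the set to the destination file.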
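A minimal sketch of the set-based approach, assuming that only exact whole-line duplicates should be dropped and that output order does not matter. The helper names `unique_lines` and `dedupe_file` are made up for illustration.

```python
def unique_lines(lines):
    """Return the distinct non-blank lines as a set (original order is lost)."""
    return {line.rstrip("\n") for line in lines if line.strip()}


def dedupe_file(src_path, dst_path):
    """Copy src_path to dst_path, dropping exact duplicate lines."""
    with open(src_path) as src:
        unique = unique_lines(src)
    with open(dst_path, "w") as dst:
        for line in unique:
            dst.write(line + "\n")
```

Calling `dedupe_file("myfile.txt", "newFile.txt")` would apply it to the files named in the question. Because whole lines are compared, two productItem1 records that differ only in the final date would both survive.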

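If you need to preserve the order, maintain a set of the items already seen. For each line, if the item is not in the set, write it to the destination file and add it to the set. If it has been seen before, do nothing.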
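The order-preserving version could be sketched as follows. The function name `dedupe_in_order` and the optional `key_indexes` parameter are assumptions added for illustration; the latter builds the comparison key from selected comma-separated fields instead of the whole line.

```python
def dedupe_in_order(lines, key_indexes=None):
    """Yield lines in input order, skipping any whose key was already seen.

    key_indexes=None compares whole lines; otherwise the key is built from the
    comma-separated fields at those positions (an assumption about the format).
    """
    seen = set()
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        if key_indexes is None:
            key = line
        else:
            fields = line.split(",")
            key = tuple(fields[i] for i in key_indexes)
        if key not in seen:
            seen.add(key)
            yield line
```

This keeps the first record seen for each key, so with `key_indexes=(0, 1, 3)` it removes the duplicates from the question but does not by itself pick the record with the latest date.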

Answer 2

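I'm also quite new to Python, but I can try to give you some pointers on where to go. Honestly, I'm not 100% sure I can help, but I'll try.

You can use a for loop to compare each item in the list with every other item in the list. If a match is found, use the replace() function, which should remove it. I hope this helps.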

Answer 3

You can try this:

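from collections import defaultdict
import itertools
import re

# Split each line on commas and whitespace. The [1:-1] slice drops the first
# and last entries produced by splitting the file on newlines.
with open('filename.txt') as f:
    data = [re.split(r'[,\s]', i) for i in f.read().split("\n")][1:-1]

# Group the remaining fields by the leading key (productItem1, productItem2, ...).
d = defaultdict(list)
for i in data:
    d[i[0]].append(i[1:])

# Within each key, group the records by their last field (the date string).
new_data = {a: [(date, list(rows)) for date, rows in itertools.groupby(sorted(b, key=lambda x: x[-1]), key=lambda x: x[-1])] for a, b in d.items()}
# When a key has several date groups, keep the one with the fewest records.
new_final_data = {a: min(b, key=lambda x: len(x[-1])) if len(b) > 1 else b[-1] for a, b in new_data.items()}

# Rebuild the lines, printing ',' in place of each empty field.
final_list = []
for a, b in new_final_data.items():
    temp1_data = [' '.join(',' if not c else c for c in row) for row in b[-1]]
    for c in temp1_data:
        final_list.append(a + " " + c)

print('\n'.join(final_list))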

Output:

productItem1 ProductName11 ProdutctPrice27 ProductModelHP11 10/06/2016 ProductDescription-abc1 , , , , , 01/12/2017
productItem2 ProductName12 ProdutctPrice99 ProductModelHP12 10/06/2016 ProductDescription-abc2 , , , , , 09/02/2017
productItem2 ProductName13 ProdutctPrice991 ProductModelHP123 10/06/2016 ProductDescription-abc3 , , , , , 09/02/2017
productItem3 ProductName13 ProdutctPrice87 ProductModelHP13 10/06/2016 ProductDescription-abc3 , , , , , 09/02/2017
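For comparison, here is a sketch that targets the stated requirement directly: key each record on comma-separated fields 0, 1 and 3, keep the record with the latest trailing date, and leave the kept lines unchanged. It assumes the last field is always a date in `%d/%m/%Y` form (the sample does not say whether it is day-first or month-first), and the function name `dedupe_keep_latest` is hypothetical.

```python
from datetime import datetime


def dedupe_keep_latest(lines, key_indexes=(0, 1, 3), date_format="%d/%m/%Y"):
    """Return the lines in input order, keeping one record per key.

    The key is built from the comma-separated fields at key_indexes; for each
    key, the record whose last field holds the latest date is kept.
    The date format is an assumption about the data.
    """
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    best = {}  # key -> (date, position of the record to keep)
    for pos, row in enumerate(rows):
        fields = row.split(",")
        key = tuple(fields[i] for i in key_indexes)
        date = datetime.strptime(fields[-1].strip(), date_format)
        # A strictly later date replaces the stored record; ties keep the first.
        if key not in best or date > best[key][0]:
            best[key] = (date, pos)
    keep = {pos for _, pos in best.values()}
    return [row for pos, row in enumerate(rows) if pos in keep]
```

Run on the six sample lines from the question, this returns the four lines listed there as the expected newFile.txt content, with the original commas intact.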

How do I remove duplicate line items from a file?