Question

I am working with a large dataset, so I only want to use the most frequent items.

A simple example of the dataset:

1 2 3 4 5 6 7
1 2
3 4 5
4 5
4
8 9 10 11 12 13 14
15 16 17 18 19 20

4 occurs 4 times,
1 occurs 2 times,
2 occurs 2 times,
5 occurs 2 times,

I would like to generate a new dataset that contains only the most common items, in this case the 4 most common ones:

Desired result:

1 2 3 4 5
1 2
3 4 5
4 5
4

I found the 50 most common items, but I am not writing them out the right way. (My output ends up being the same dataset.)

Here is my code:

from collections import Counter

with open('dataset.dat', 'r') as f:
    lines = []
    for line in f:
        lines.append(line.split())
    c = Counter(sum(lines, []))
    p = c.most_common(50)

with open('dataset-mostcommon.txt', 'w') as output:
    ..............

Can someone help me work out how to achieve this?

Answer 1

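You have to iterate over the dataset again and, for each line, print only the items that are in the set of most common items.

If the input lines are sorted, you can simply take a set intersection and print it in sorted order. If they are not, iterate over the items of each line and check each one: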

for line in dataset:
    for element in line.split():
        if element in most_common_elements:
            print(element, end=' ')
    print()

PS: For Python 2, add from __future__ import print_function at the top of the script.
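The set-intersection variant mentioned above could look roughly like the sketch below. It assumes every item is an integer written as text and that each line is sorted numerically, as in the question's example; the helper name filter_sorted_line is made up for illustration. Note that a set also drops repeated items within a line.

```python
def filter_sorted_line(line, most_common_elements):
    """Keep the items of a numerically sorted line that are in the given set."""
    kept = set(line.split()) & most_common_elements
    # A set has no order, so restore the numeric order before joining.
    # Assumption: every item can be parsed with int().
    return " ".join(sorted(kept, key=int))


# Hypothetical set of most common items, stored as strings like the split() output.
most_common_elements = {"1", "2", "3", "4", "5"}

print(filter_sorted_line("1 2 3 4 5 6 7", most_common_elements))
print(filter_sorted_line("8 9 10 11 12 13 14", most_common_elements))
```

If the lines are not sorted, or items can repeat within a line, the element-by-element loop is the safer choice because it preserves the original order.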

Answer 2

According to the documentation, c.most_common returns a list of tuples, so you can get the desired output as follows:

with open('dataset-mostcommon.txt', 'w') as output:
    for item, occurrence in p:
        # items come from line.split(), so they are strings: use %s, not %d
        output.write("%s has %d occurrences,\n" % (item, occurrence))
