Question

I'm confused about how to filter and extract duplicated rows in a pandas dataframe. For example, consider:

   col1     col2      col3   col4  col5    ID

    1        yes        0      1      2    201
    2         0         1      0      0    203
    0         0         0      0      1    202
    0         0         0      0      2    202
    1        yes        0      3      4    201

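How can I select all duplicated rows that share the same ID, ignoring a specific set of columns, and arrange them into another pandas dataframe? For this example, assume the ignored columns are the last two (col4 and col5). For instance, suppose I want to get (*):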

   col1     col2      col3   col4  col5    ID

    1        yes        0      1      2    201
    1        yes        0      3      4    201
    0         0         0      0      1    202
    0         0         0      0      2    202
    2         0         1      0      0    203

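I know I can do this with the built-in duplicated and groupby functions. However, since I'm working with a large number of columns and rows, I don't know whether this would return all the duplicated rows organized the way I need. I tried:

In: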

temp2 = ['col4','col5']
# I am doing this because I have a lot of columns in my real dataset more than 800
a_lis = list(set(df.columns) - set(temp2))
a_lis

df.groupby(df['ID']).loc[df.duplicated(keep=False, subset=a_lis),:]

Out:

AttributeError: Cannot access callable attribute 'loc' of 'DataFrameGroupBy' objects, try using the 'apply' method

The keep parameter confuses me; I don't understand how it works at all. So my question is: how do I use groupby and the keep parameter correctly to get (*)?

Answer 1

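You don't need groupby here. Just use pd.DataFrame.loc. Remember that groupby is for aggregating data via a function. It seems, however, that you want to reindex so that the duplicated rows sit at the top of the dataframe.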
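As a minimal sketch of the selection step, the sample data from the question can be rebuilt and filtered with a boolean mask passed to .loc. The names subset_cols, mask and dups are illustrative only, and columns.difference is used here as an alternative to the set subtraction in the question:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "col1": [1, 2, 0, 0, 1],
    "col2": ["yes", 0, 0, 0, "yes"],
    "col3": [0, 1, 0, 0, 0],
    "col4": [1, 0, 0, 0, 3],
    "col5": [2, 0, 1, 2, 4],
    "ID": [201, 203, 202, 202, 201],
})

# Columns to compare: everything except col4 and col5
subset_cols = df.columns.difference(["col4", "col5"]).tolist()

# Boolean mask: True for every row that has a duplicate on subset_cols
mask = df.duplicated(subset=subset_cols, keep=False)

# Select the flagged rows into a separate dataframe
dups = df.loc[mask]
print(dups)
```

Because ID is one of the compared columns, rows can only be flagged as duplicates of each other when they share the same ID, so no groupby is involved.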

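keep=False marks every row that is duplicated elsewhere in the dataframe, considering only the columns in subset. In this case, only the row with index 1 is left out of the selection.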
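To make the effect of keep more concrete, here is a small sketch on a hypothetical three-row frame (toy is not part of the question's data) comparing the three accepted values:

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 2 share the same key, row 1 is unique
toy = pd.DataFrame({"key": ["a", "b", "a"], "val": [1, 2, 3]})

# keep="first": the first occurrence is not flagged, later ones are
first = toy.duplicated(subset=["key"], keep="first")

# keep="last": the last occurrence is not flagged, earlier ones are
last = toy.duplicated(subset=["key"], keep="last")

# keep=False: every occurrence of a repeated key is flagged
every = toy.duplicated(subset=["key"], keep=False)

print(pd.DataFrame({"first": first, "last": last, "False": every}))
```

So keep does not decide which rows are retained in the result; it decides which member of each duplicate group is exempt from being flagged, and keep=False exempts none.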

import numpy as np

# calculate duplicate indices
dup_index = df[df.duplicated(keep=False, subset=a_lis)].sort_values('ID').index

# calculate non-duplicate indices
non_dup_index = df.index.difference(dup_index)

# concatenate and reindex
res = df.reindex(np.hstack((dup_index.values, non_dup_index.values)))

print(res)

   col1 col2  col3  col4  col5   ID
0     1  yes     0     1     2  201
4     1  yes     0     3     4  201
2     0    0     0     0     1  202
3     0    0     0     0     2  202
1     2    0     1     0     0  203
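An alternative sketch, assuming the same sample data, skips the index arithmetic and concatenates the two parts directly with pd.concat. The names dups and rest are illustrative, and kind="stable" is added so that rows sharing an ID keep their original relative order:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "col1": [1, 2, 0, 0, 1],
    "col2": ["yes", 0, 0, 0, "yes"],
    "col3": [0, 1, 0, 0, 0],
    "col4": [1, 0, 0, 0, 3],
    "col5": [2, 0, 1, 2, 4],
    "ID": [201, 203, 202, 202, 201],
})

# Compare on every column except col4 and col5
subset_cols = df.columns.difference(["col4", "col5"]).tolist()
mask = df.duplicated(subset=subset_cols, keep=False)

# Duplicated rows, grouped together by ID (stable sort preserves input order within an ID)
dups = df[mask].sort_values("ID", kind="stable")

# Rows that have no duplicate
rest = df[~mask]

# Duplicates first, then the remaining rows
res = pd.concat([dups, rest])
print(res)
```

If only the duplicated rows are wanted in the new dataframe, dups on its own is already that result.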

How do I filter duplicated rows grouped by index in a pandas dataframe?
