Question

I have a dataset for binary classification that looks like this:

group_id    pos_in_group    ...    target
...         ...                    ...
172          0                      0
172          1                      0
172          2                      1
172          3                      0
172         ...                    ...
172         719                     0

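It so happens that only one record in a group can have target == 1, and it is more likely to occur at the earliest positions. The model used for prediction does not account for this, however, so a group may contain several records predicted as target == 1: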

group_id    pos_in_group    ...    target
...         ...                    ...
172          0                      0
172          1                      0
172          2                      1
172          3                      0
172          4                      1
172          5                      0
172         ...                    ...
172         719                     0

With df[df['target'] == 1].groupby(['group_id'])['pos_in_group'].min() I can get the position of the first target == 1 in each group. How can I assign target == 0 to all records at higher positions in each group?

Also, how can I use 1 / df.groupby(['group_id'])['target'].sum() to scale a column in each group by a different value?

Answer 1

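If I understand you correctly: for the first question, you can use groupby with min to find the first position per group, then conditionally fill the target column with np.where: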

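import numpy as np

# First position with target == 1 in each group (indexed by group_id)
target_min = df[df.target == 1].groupby('group_id').pos_in_group.min()

# Compare each row with the first-hit position of its own group, so that
# positions from one group cannot match rows in another group
df['target'] = np.where(df.pos_in_group == df.group_id.map(target_min), 1, 0)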

print(df)
   group_id  pos_in_group  target
0       172             0       0
1       172             1       0
2       172             2       1
3       172             3       0
4       172             4       0
5       172             5       0
6       172           719       0
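
With several groups, both parts of the question can be sketched as below. This is only a minimal sketch under assumptions: the toy data, the second group 173, and the output column names target_first and target_scaled are made up for illustration and do not come from the question. It assumes that "scaling" means dividing a column by the number of predicted positives in its group, which is how I read 1 / df.groupby(['group_id'])['target'].sum().

```python
import pandas as pd

# Hypothetical toy data: two groups, each with more than one predicted positive.
df = pd.DataFrame({
    "group_id":     [172, 172, 172, 172, 172, 173, 173, 173],
    "pos_in_group": [0, 1, 2, 3, 4, 0, 1, 2],
    "target":       [0, 0, 1, 0, 1, 1, 0, 1],
})

# First question: keep only the first target == 1 in each group.
first_pos = df[df["target"] == 1].groupby("group_id")["pos_in_group"].min()
# map() aligns every row with the first-hit position of its own group;
# groups without any positive get NaN, which never compares equal.
df["target_first"] = (df["pos_in_group"] == df["group_id"].map(first_pos)).astype(int)

# Second question: scale a column by 1 / (sum of target within the group).
# transform("sum") broadcasts the per-group sum back onto every row.
group_sum = df.groupby("group_id")["target"].transform("sum")
df["target_scaled"] = df["target"] / group_sum

print(df)
```

Note that a group with no predicted positives has a sum of 0, so the division yields NaN for its rows; whether to fill those with 0 depends on what the scaled column is used for.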

DataFrame transformation using information from group_by
