Question

There are two columns, Label1 and Label2. Both hold cluster labels produced by different methods.

  Label1 Label2
0   0    1024
1   1    1024
2   2    1025
3   3    1026
4   3    1027
5   4    1028

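I want to derive a final cluster label from these two columns. Comparing rows, if two rows share either of the two labels, they belong to the same cluster.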

For example, rows 0 and 1 share a Label2 value, and rows 3 and 4 share a Label1 value, so rows 0 and 1 are in one group and rows 3 and 4 are in another. The result I want is:

   Label1 Label2 Cluster ID
0   0    1024    0
1   1    1024    0
2   2    1025    1
3   3    1026    2
4   3    1027    2
5   4    1028    3

What is the best way to do this? Any help would be appreciated.

Edit: I don't think I gave a very good example. In fact, the labels are not necessarily in any order:

  Label1 Label2
0   0    1024
1   1    1023
2   2    1025
3   3    1024
4   3    1027
5   4    1022

Answer 1

IIUC, you can group the clusters as follows:

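Take the difference between each row and the previous row, fill the first row with 0, and compute the cumulative sum, for both Label1 and Label2.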

In [2]: label1_ = df['Label1'].diff().fillna(0).cumsum()

In [3]: label2_ = df['Label2'].diff().fillna(0).cumsum()

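Concatenate these into a new DataFrame and drop the duplicate values, first for Label1 and then for Label2. Then call reset_index to get a default integer index; the original index is kept in a column named index.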

In [4]: df_ = pd.concat([label1_, label2_], axis=1).drop_duplicates(['Label1'])    \
                                                   .drop_duplicates(['Label2'])     \
                                                   .reset_index()

Assign the index values to a new column, Cluster_ID.

In [5]: df_['Cluster_ID'] = df_.index

In [6]: df_.set_index('index', inplace=True)

In [7]: df['Cluster_ID'] = df_['Cluster_ID']

Replace each NaN with the last valid value above it and convert the final result to integers.

In [8]: df.ffill().astype(int)
Out[8]: 
   Label1  Label2  Cluster_ID
0       0    1024           0
1       1    1024           0
2       2    1025           1
3       3    1026           2
4       3    1027           2
5       4    1028           3

Answer 2

Try this, using np.where and pandas.duplicated:

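import numpy as np

df = df.sort_values(['Label1', 'Label2'])
# A row starts a new cluster (1) unless one of its labels has already appeared (0)
df['Cluster'] = np.where(df.Label1.duplicated() | df.Label2.duplicated(), 0, 1).cumsum()
print(df)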

   Label1  Label2  Cluster
0       0    1024        1
1       1    1024        1
2       2    1025        2
3       3    1026        3
4       3    1027        3
5       4    1028        4

Answer 3

I'm not sure I understood your question correctly, but here is one possible way to identify the clusters:

import pandas as pd
import collections

df = pd.DataFrame(
    {'Label1': [0, 1, 2, 3, 3, 4], 'Label2': [1024, 1024, 1025, 1026, 1027, 1028]})
df['Cluster ID'] = [0] * 6

# Keep only the labels that occur more than once in each column.
# Lists are used so that .index() works below (dict views have no .index()).
counter1 = [k for k, v in collections.Counter(
    df['Label1']).items() if v > 1]
counter2 = [k for k, v in collections.Counter(
    df['Label2']).items() if v > 1]

len1 = len(counter1)
len2 = len(counter2)
index_cluster = len1 + len2

for index, row in df.iterrows():
    if row['Label2'] in counter2:
        df.loc[index, 'Cluster ID'] = counter2.index(row['Label2'])
    elif row['Label1'] in counter1:
        df.loc[index, 'Cluster ID'] = counter1.index(row['Label1']) + len2
    else:
        df.loc[index, 'Cluster ID'] = index_cluster
        index_cluster += 1

print(df)
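
For the unordered data in the question's edit, a row can be linked to one row through Label1 and to another through Label2 (rows 0, 3 and 4 there), and the if/elif above would give such rows different IDs. One way to merge those chains is a small union-find over row positions. This is only a sketch under the assumption that "same cluster" is meant transitively; the function name cluster_by_shared_labels is made up for illustration and is not part of pandas.

```python
import pandas as pd


def cluster_by_shared_labels(df, cols=("Label1", "Label2")):
    """Return one cluster ID per row; rows linked by a shared label share an ID."""
    parent = list(range(len(df)))  # every row starts as its own root

    def find(i):
        # Walk up to the root, shortening the path along the way
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for col in cols:
        first_seen = {}  # label value -> position of the first row carrying it
        for pos, value in enumerate(df[col].tolist()):
            if value in first_seen:
                ra, rb = find(first_seen[value]), find(pos)
                parent[max(ra, rb)] = min(ra, rb)  # union, keeping the smaller root
            else:
                first_seen[value] = pos

    # Renumber the roots 0, 1, 2, ... in order of first appearance
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(len(df))]


# Data from the question's edit (labels in no particular order)
df = pd.DataFrame({"Label1": [0, 1, 2, 3, 3, 4],
                   "Label2": [1024, 1023, 1025, 1024, 1027, 1022]})
df["Cluster ID"] = cluster_by_shared_labels(df)
print(df)
```

Here rows 0, 3 and 4 end up in cluster 0, because row 3 shares 1024 with row 0 and shares Label1 = 3 with row 4. No sorting is needed, since rows are matched by label value rather than by position.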

Answer 4

Here is how to do this:

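Check whether the previous row has the same value in either of the two columns.
If either value is the same, do not increment the cluster number, and append it to the cluster list.
If neither value is the same, increment the cluster number and append it to the cluster list.
Add the cluster list as a column of the DataFrame.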

Code:

import pandas as pd

df=pd.DataFrame([[0,1,2,3,4,5],[0,1,2,3,3,4],[1024,1024,1025,1026,1027,1028]]).T
cluster_num = 0
cluster_list = []
for i,row in df.iterrows():
    if i!=0:
        # check previous row
        if df.loc[i-1, 1] == row[1] or df.loc[i-1, 2] == row[2]:
            # add to previous cluster
            cluster_list.append(cluster_num)
        else:
            # create new cluster
            cluster_num+=1
            cluster_list.append(cluster_num)
    else:
        cluster_list.append(cluster_num)

#Add the list as column
df.insert(3,3,cluster_list)
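
The same previous-row comparison can probably be written without an explicit loop by using shift and cumsum. This is a sketch, not a drop-in replacement: it uses named columns (Label1, Label2, Cluster ID) instead of the integer column labels above, and like the loop it only compares adjacent rows, so it assumes related rows sit next to each other.

```python
import pandas as pd

df = pd.DataFrame({"Label1": [0, 1, 2, 3, 3, 4],
                   "Label2": [1024, 1024, 1025, 1026, 1027, 1028]})

# True where a row shares Label1 or Label2 with the row directly above it
# (the first row compares against NaN, which is never equal)
same_as_prev = (df["Label1"].eq(df["Label1"].shift())
                | df["Label2"].eq(df["Label2"].shift()))

# Every row that does not match its predecessor opens a new cluster;
# subtracting 1 makes the IDs start at 0
df["Cluster ID"] = (~same_as_prev).cumsum() - 1
print(df)
```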
