pandas: get unique values row by row

Time: 2019-07-12 09:38:35

Tags: pandas unique

I want to get the unique values of each row, taken across multiple columns.

Sample data:

col_a|col_b|col_c|col_d
-----------------------
apple|null|apple|null
bob|bob|null|bob
chris|chris|null|null

Expected output:

new_col
-------
apple
bob
chris

4 answers:

Answer 0 (score: 1)

You can try the following:

data['new_col'] = data.stack().groupby(level=0).apply(lambda x: x.unique().tolist())

Example 1:

   col_a col_b  col_c col_d
0  apple   NaN  apple   NaN
1    bob   bob    NaN   bob

Output:

   col_a col_b  col_c col_d  new_col
0  apple   NaN  apple   NaN  [apple]
1    bob   bob    NaN   bob    [bob]

Example 2:

   col_a col_b  col_c col_d
0  apple   bob  apple   NaN
1    bob   bob    NaN   bob

Output:

   col_a col_b  col_c col_d       new_col
0  apple   bob  apple   NaN  [apple, bob]
1    bob   bob    NaN   bob         [bob]

Example 3:

   col_a  col_b  col_c col_d
0  apple    NaN  apple   NaN
1    bob    bob    NaN   bob
2  chris  chris    NaN   NaN

Output:

   col_a  col_b  col_c col_d  new_col
0  apple    NaN  apple   NaN  [apple]
1    bob    bob    NaN   bob    [bob]
2  chris  chris    NaN   NaN  [chris]
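
For reference, here is a self-contained sketch of the stack/groupby approach above, run on the three rows from the question. The explicit dropna() is an addition of mine, not part of the original one-liner: whether stack() discards missing values by default depends on the pandas version, so the sketch does not rely on it.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "col_a": ["apple", "bob", "chris"],
    "col_b": [np.nan, "bob", "chris"],
    "col_c": ["apple", np.nan, np.nan],
    "col_d": [np.nan, "bob", np.nan],
})

# stack() reshapes the frame into one long Series indexed by
# (row label, column name).
# The explicit dropna() is an assumption added here so the result does
# not depend on the default NaN handling of stack().
stacked = data.stack().dropna()

# Level 0 of the stacked index is the original row label, so grouping
# on it collects the values of each row.
data["new_col"] = stacked.groupby(level=0).apply(lambda s: s.unique().tolist())

print(data["new_col"].tolist())
```

A row that contains only missing values has no entries left after stacking, so it would get NaN in new_col rather than an empty list.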

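Answer 1 (score: 1)

This is just a variation on the answer above. I have not tested the first answer thoroughly, but it seems to work on this example. The idea is to apply a function row by row (hence axis=1) and collect the unique values of each row. Note that unique() returns a NumPy array here; chain .tolist() if you need a plain list.

import numpy as np
import pandas as pd

test = pd.DataFrame({'col1': ['apple', 'bob'],
                     'col2': [np.nan, 'bob'],
                     'col3': ['apple', np.nan],
                     'col4': [np.nan, 'bob']})
# axis=1 passes each row to the lambda as a Series
test['new_col'] = test.apply(lambda row: row.dropna().unique(), axis=1)

Output: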

col1    col2    col3    col4    new_col
apple   NaN    apple     NaN    [apple]
bob     bob    NaN       bob    [bob]
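
The question's expected output shows plain values (apple, bob) rather than one-element lists. A minimal sketch of one way to get that with the same row-wise apply follows; the helper name row_unique is hypothetical, and collapsing to a scalar only when a row has exactly one distinct value is my assumption about the intent.

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({
    "col1": ["apple", "bob"],
    "col2": [np.nan, "bob"],
    "col3": ["apple", np.nan],
    "col4": [np.nan, "bob"],
})


def row_unique(row):
    """Hypothetical helper: unique non-null values of a single row."""
    values = row.dropna().unique().tolist()
    # Collapse to a scalar only when the row holds exactly one distinct
    # value; otherwise keep the full list so nothing is silently dropped.
    return values[0] if len(values) == 1 else values


test["new_col"] = test.apply(row_unique, axis=1)
print(test["new_col"].tolist())
```

With this data every row has a single distinct value, so new_col holds strings. A row with several distinct values would keep its list, which makes such rows easy to spot.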

Answer 2 (score: 1)

An alternative:

import numpy as np
import pandas as pd

data = pd.DataFrame(
    {
        "col_a": ["apple", "bob"],
        "col_b": [np.nan, "bob"],
        "col_c": ["apple", np.nan],
        "col_d": [np.nan, "bob"],
    }
)
# Print the unique non-null values of each row
for i, row in data.iterrows():
    print(row[row.notnull()].unique())
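
The loop above only prints the values. If they should end up in a column, as the question asks, one option is to collect them while iterating. This is a sketch under that assumption, not something the original answer shows:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    {
        "col_a": ["apple", "bob"],
        "col_b": [np.nan, "bob"],
        "col_c": ["apple", np.nan],
        "col_d": [np.nan, "bob"],
    }
)

# One list of unique non-null values per row, in row order.
uniques = [row.dropna().unique().tolist() for _, row in data.iterrows()]

# Wrapping the lists in a Series with the frame's own index keeps each
# list as a single cell and aligns it with the right row.
data["new_col"] = pd.Series(uniques, index=data.index)
print(data["new_col"].tolist())
```

iterrows() builds a Series for every row, so this is likely to be slow on large frames.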

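Answer 3 (score: 1)

I think a simple apply would work.

df.apply(lambda row: row[~row.isna()].unique().tolist(), axis=1)

This line says: for each row, keep only the values that are not NaN, take the unique values among them, and convert the result to a list. axis=1 is probably the piece you were missing at first. :)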

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'a' : [1, 2, 3],
    'b' : [np.nan, 5, 6]
})

df['unique'] = df.apply(lambda row:row[~row.isna()].unique().tolist(), axis=1) 
print(df)
#   a    b      unique
#0  1  NaN       [1.0]
#1  2  5.0  [2.0, 5.0]
#2  3  6.0  [3.0, 6.0]
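
The lists above contain 1.0 rather than 1 because each row mixes the integer column a with the float column b, so the row Series is upcast to float before unique() runs. If the original types matter, one possible workaround is to cast the frame to object first. This is my own sketch, not part of the answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [np.nan, 5, 6],
})

# Casting to object keeps each cell's own Python type, so a row is no
# longer upcast to a common float dtype.
as_objects = df.astype(object)

df["unique"] = as_objects.apply(
    lambda row: row[~row.isna()].unique().tolist(), axis=1
)
print(df["unique"].tolist())
```

Values from column a stay integers, while values from column b remain floats because that column was float to begin with.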