Changing column values based on a criterion in pandas

Time: 2015-04-19 21:49:00

Tags: python pandas

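I want to replace the NaN values in "second" with "yes" or "no", depending on which of the two has the larger count within each group of the "first" column. If the counts are equal, the value should be "yes". For example, here is my original DataFrame.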

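import numpy as np
import pandas as pd

test = pd.DataFrame({'first':['a','a','b','c','b','c','a','c','b','a','b','c','c','d','d','d'],
                     'second':['yes','yes','no','no',np.nan,np.nan,'no','yes',np.nan,np.nan,'yes','no','no',np.nan,np.nan,np.nan]})

# DataFrame.sort no longer exists in current pandas; sort_values replaces it
test = test.sort_values(['first'])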

test

   first second
1      a    yes
6      a     no
9      a    NaN
0      a    yes
4      b    NaN
10     b    yes
2      b     no
8      b    NaN
5      c    NaN
3      c     no
11     c     no
12     c     no
7      c    yes
14     d    NaN
15     d    NaN
13     d    NaN

I would like my new DataFrame to look like this:

   first second
1      a    yes
6      a     no
9      a    yes
0      a    yes
4      b    yes
10     b    yes
2      b     no
8      b    yes
5      c     no
3      c     no
11     c     no
12     c     no
7      c    yes
14     d    NaN
15     d    NaN
13     d    NaN

2 answers:

Answer 0 (score: 1)

Here is one option. Start with the test frame:

test = pd.DataFrame({'first':['a','a','b','c','b','c','a','c','b','a','b','c'],
                     'second':['yes','yes','no','no',np.nan,np.nan,'no','yes',np.nan,np.nan,'yes','no']})
test = test.sort_values(['first'])  # DataFrame.sort no longer exists in current pandas
test

    first   second
0   a       yes
1   a       yes
6   a       no
9   a       NaN
4   b       NaN
10  b       yes
8   b       NaN
2   b       no
3   c       no
5   c       NaN
11  c       no
7   c       yes

Option 1

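Do some grouping, then sort, to create a new DataFrame (testCounts). Note: I sort "second" in descending order so that, when the counts are equal, "yes" appears first within the group.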

s = test.groupby(['first',"second"])["first"].agg("count")
s.name = "count"
testCounts = s.reset_index().sort_values(["first","count","second"],ascending=[True,False,False])
testCounts
    first   second  count
1   a       yes     2
0   a       no      1
3   b       yes     1
2   b       no      1
4   c       no      2
5   c       yes     1

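Next, we use boolean indexing to select the NaN rows. We then map a lambda function over "first" that filters testCounts down to the matching group and takes the first row of the result.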

rowIndex = test["second"].isnull()
test.loc[rowIndex,"second"] = test["first"].map(lambda s : 
                              testCounts[testCounts["first"] == s]["second"].iloc[0])
test

    first   second
0   a       yes
1   a       yes
6   a       no
9   a       yes
4   b       yes
10  b       yes
8   b       yes
2   b       no
3   c       no
5   c       no
11  c       no
7   c       yes

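Option 2
Starting from the frame above, we group to get counts, as in Option 1. Next, we build a dictionary by sorting, grouping, and taking the first row of each group.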

s = test.groupby(['first',"second"])["first"].agg("count")
s.name = "count"
# Parentheses are needed so the chained call can continue on the next line
d = (s.reset_index().sort_values(["first","count","second"],ascending=[True,False,False])
      .groupby("first").first()["second"].to_dict())
d

{'a': 'yes', 'b': 'yes', 'c': 'no'}

Use boolean indexing as before, and map the dict (d) over "first".

rowIndex = test["second"].isnull()
test.loc[rowIndex,"second"] = test["first"].map(d)
test
    first   second
0   a       yes
1   a       yes
6   a       no
9   a       yes
4   b       yes
10  b       yes
8   b       yes
2   b       no
3   c       no
5   c       no
11  c       no
7   c       yes
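
The question's frame also contains a group "d" whose values are all NaN, which this answer's smaller frame leaves out. The following is a minimal sketch, not the answer's own code, of the dictionary approach applied to the full frame. It assumes a pandas version that provides `sort_values`, and the name `filled` is hypothetical.

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({
    "first": list("aabcbcacbabccddd"),
    "second": ["yes", "yes", "no", "no", np.nan, np.nan, "no", "yes",
               np.nan, np.nan, "yes", "no", "no", np.nan, np.nan, np.nan],
})

# Most frequent non-null value per group; ties resolve to "yes" because
# "second" is sorted in descending order before taking the first row.
d = (
    test.groupby(["first", "second"]).size().reset_index(name="count")
        .sort_values(["first", "count", "second"], ascending=[True, False, False])
        .groupby("first")["second"].first()
        .to_dict()
)

# Group "d" has no non-null values, so it has no key in the dictionary.
# map() yields NaN for it and fillna() leaves those rows untouched.
filled = test.copy()
filled["second"] = filled["second"].fillna(filled["first"].map(d))
print(filled.sort_values("first"))
```

Mapping a dictionary returns NaN for missing keys, so the all-NaN group stays NaN, as in the output the question asks for.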

Answer 1 (score: 1)

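# Assumption: `g` and `first_index` are not defined in the answer as posted; the two definitions below are a guess
g = test.groupby('first')['second'].value_counts()
first_index = g.index.get_level_values('first').unique()
def replace_na(first_value):
    return test[test['first']==first_value]['second'].fillna(g[first_value].index[0])
pd.concat([replace_na(v) for v in first_index])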
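
The same per-group fill can also be sketched with `groupby(...).transform`, which avoids the manual `concat`. This is an alternative technique, not what the answer above does, and the helper name `fill_group` is hypothetical. It makes the "yes wins ties" rule explicit and leaves groups with no non-null values unchanged.

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({
    "first": list("aabcbcacbabccddd"),
    "second": ["yes", "yes", "no", "no", np.nan, np.nan, "no", "yes",
               np.nan, np.nan, "yes", "no", "no", np.nan, np.nan, np.nan],
})

def fill_group(s):
    """Fill NaN in one group with its most frequent value ("yes" on a tie)."""
    counts = s.value_counts()          # NaN is excluded by default
    if counts.empty:                   # group has only NaN: nothing to fill with
        return s
    top = counts[counts == counts.max()].index
    fill = "yes" if "yes" in top else top[0]
    return s.fillna(fill)

# transform returns a Series aligned with the original index.
filled = test.groupby("first")["second"].transform(fill_group)
print(test.assign(second=filled).sort_values("first"))
```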