Summarizing data in a pandas DataFrame

Date: 2016-03-01 19:00:37

Tags: python python-3.x pandas

My DataFrame looks like this:

respondent_id,group_number,member_id
1,1,3
1,1,4
1,2,1
....

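My goal is to output two counts for each respondent ID: the number of groups that include the respondent as a member ID, and the number of groups that do not.

For example, the table above would output: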

respondent_id,my_groups,other_groups
1,1,1

My best guess is to do something like:

rg_g = df.groupby(['respondent_id','group_number'])
rg_g.apply(lambda g: g.respondent_id in g.id.values)

But I don't know where to go from there.

2 answers:

Answer 0: (score: 1)

Updated answer (this is not the best code, but it works):

Initialization:

import numpy as np
import pandas as pd

test_data = pd.DataFrame(np.random.randint(5, size=(10, 3)), columns=['respondent_id', 'group_number', 'member_id'])
# member_id must be float to hold missing values
test_data['member_id'] = test_data['member_id'].astype(float)
# blank out member_id on some rows; the frame only has rows 0-9, so index 10 is not set
test_data.loc[[3, 5, 7, 8, 9], 'member_id'] = np.nan

Code:

# calculate the groups where respondent have the member_id 
d_nn = test_data[test_data.member_id.notnull()] 
# or for example: test_data[test_data.member_id != 0] 
d_is_n = test_data[test_data.member_id.isnull()]
d_nn = pd.DataFrame({'count' : d_nn.groupby( [ "respondent_id","group_number"] ).size()}).reset_index()
d_is_n = pd.DataFrame({'count' : d_is_n.groupby( [ "respondent_id","group_number"] ).size()}).reset_index()
d_nn['is_member'] = 1
d_is_n['is_member'] = 0


# merge
result = d_nn.copy()
for idx1 in range(len(d_is_n)):
    merge = True
    for idx2 in range(len(d_nn)):
        if d_nn.iloc[idx2].respondent_id == d_is_n.iloc[idx1].respondent_id and \
            d_nn.iloc[idx2].group_number == d_is_n.iloc[idx1].group_number:
            merge = False
    if merge:
        # DataFrame.append was removed in pandas 2.0; concatenate the one-row slice instead
        temp_d = d_is_n.iloc[[idx1]]
        result = pd.concat([result, temp_d], ignore_index=True)

#group by respondent_id and is_member
result = pd.DataFrame({'group_number' : result.groupby( [ "respondent_id", "is_member"] ).size()}).reset_index()
print(result)
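
As a possible alternative to the nested loops in the merge step, the same "keep only the rows of d_is_n whose keys are missing from d_nn" logic can be sketched with a left merge and `indicator=True`. The two small frames below are made-up stand-ins for `d_nn` and `d_is_n`, and `marked`, `only_null` and `summary` are names chosen here for illustration only.

```python
import pandas as pd

# Hypothetical stand-ins for the d_nn and d_is_n frames built above.
d_nn = pd.DataFrame({
    "respondent_id": [1, 2],
    "group_number": [1, 1],
    "count": [2, 1],
    "is_member": [1, 1],
})
d_is_n = pd.DataFrame({
    "respondent_id": [1, 1],
    "group_number": [1, 2],
    "count": [1, 3],
    "is_member": [0, 0],
})

keys = ["respondent_id", "group_number"]

# Left-merge on the keys; indicator=True adds a "_merge" column that says
# whether each d_is_n row also has a matching key pair in d_nn.
marked = d_is_n.merge(d_nn[keys], on=keys, how="left", indicator=True)

# Keep only the rows whose key pair does not appear in d_nn.
only_null = marked.loc[marked["_merge"] == "left_only"].drop(columns="_merge")

result = pd.concat([d_nn, only_null], ignore_index=True)

# Number of groups per respondent and membership flag.
summary = (
    result.groupby(["respondent_id", "is_member"])
    .size()
    .reset_index(name="group_number")
)
print(summary)
```

This avoids the row-by-row comparison, which should matter mostly on larger frames.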

Answer 1: (score: 1)

So, this is what I ended up doing. Probably not ideal, but it seems to work. :)

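import pandas as pd

rg = pd.read_csv('./in_file.csv')
rg_g = rg.groupby(['respondent_id', 'group_number'])
# keep the rows of groups that do / do not list the respondent as a member
in_g = rg_g.filter(lambda g: (g.respondent_id == g.member_id).any())
out_g = rg_g.filter(lambda g: not (g.respondent_id == g.member_id).any())
# count distinct groups per respondent in each subset
my_count = in_g.groupby('respondent_id').group_number.nunique().rename('my_groups')
other_count = out_g.groupby('respondent_id').group_number.nunique().rename('other_groups')
# respondents missing from one subset get a count of 0
pd.concat([my_count, other_count], axis=1).fillna(0).astype(int).to_csv('./out_file.csv')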
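
For reference, here is a minimal sketch of the same idea without `filter`, run on the three sample rows from the question. It assumes the membership column is called `member_id`, as in the sample data; `is_self`, `has_self` and `counts` are names introduced here for illustration.

```python
import pandas as pd

# The three sample rows from the question.
df = pd.DataFrame({
    "respondent_id": [1, 1, 1],
    "group_number": [1, 1, 2],
    "member_id": [3, 4, 1],
})

# True on rows where the respondent is listed as a member.
is_self = df["member_id"].eq(df["respondent_id"])

# One boolean per (respondent_id, group_number): does the group contain the respondent?
has_self = is_self.groupby([df["respondent_id"], df["group_number"]]).any()

# Per respondent, count the groups with and without the respondent.
counts = pd.DataFrame({
    "my_groups": has_self.groupby(level="respondent_id").sum(),
    "other_groups": (~has_self).groupby(level="respondent_id").sum(),
}).reset_index()
print(counts)
```

Because both counts come from the same boolean Series, a respondent who has no groups of one kind gets a 0 in that column.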