Bucketing users and counting births/deaths

Time: 2017-05-25 17:44:48

Tags: python pandas

Let's assume I have a pandas DataFrame of this type (minimal example):

myDf = pd.DataFrame({'user': ['A', 'B', 'C', 'D', 'E']*2, 'date': ['2017-05-25']*5 + ['2017-05-26']*5, 'nVisits': [10, 2, 3, 0, 0, 6, 0, 4, 8, 1]})

The table looks like this:

date        nVisits user
5/25/2017   10      A
5/25/2017   2       B
5/25/2017   3       C
5/25/2017   0       D
5/25/2017   0       E
5/26/2017   6       A
5/26/2017   0       B
5/26/2017   4       C
5/26/2017   8       D
5/26/2017   1       E

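(1) Each day, I want to classify my users into 4 buckets: 0 visits, 1 visit, 2-4 visits, and 5+ visits. So I want to create a summary dataframe that looks like this: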

date        group      nVisits  nObs
5/25/2017   zero       0        2
5/25/2017   one        0        0
5/25/2017   twoToFour  5        2
5/25/2017   fivePlus   10       1
5/26/2017   zero       0        1
5/26/2017   one        1        1
5/26/2017   twoToFour  4        1
5/26/2017   fivePlus   14       2

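This dataframe is basically the number of observations in each bucket and the total number of visits in each bucket, where the bucket a user belongs to is updated daily.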

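(2) I want to list all customers who are born or die, where a birth is a customer going from 0 visits to >= 1 visits, and a death is a customer going from >= 1 visits to 0 visits.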

In this specific example, the new dataframe would look like this:

date        event_type  user    nVisitsAtBirthDeath
5/26/2017   death       B       2
5/26/2017   birth       D       8
5/26/2017   birth       E       1

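This dataframe is basically a comparison of today with the previous day: which users went from 0 visits to 1 or more visits, and which users went from 1 or more visits to 0 visits.

Could you help me get started on doing this efficiently? My original dataframe is fairly large, so looping in Python runs too slowly.

4 Answers:

Answer 0: (score: 4)

I would use the pd.cut() method: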

In [29]: df['group'] = pd.cut(df.nVisits,
                              [-1, 0, 1, 4, np.inf], 
                              labels=['zero','one','twoToFour','fivePlus'])

In [30]: df
Out[30]:
         date  nVisits user      group
0  2017-05-25       10    A   fivePlus
1  2017-05-25        2    B  twoToFour
2  2017-05-25        3    C  twoToFour
3  2017-05-25        0    D       zero
4  2017-05-25        0    E       zero
5  2017-05-26        6    A   fivePlus
6  2017-05-26        0    B       zero
7  2017-05-26        4    C  twoToFour
8  2017-05-26        8    D   fivePlus
9  2017-05-26        1    E        one
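
The bin edges above rely on pd.cut() closing each interval on the right by default. A minimal sketch to check how boundary values are assigned; the `sample` values are made up for illustration and are not from the question:

```python
import numpy as np
import pandas as pd

# Hypothetical boundary values chosen to probe every bin edge.
sample = pd.Series([0, 1, 2, 4, 5, 100])

# With the default right=True, the bins are (-1, 0], (0, 1], (1, 4], (4, inf].
groups = pd.cut(sample,
                [-1, 0, 1, 4, np.inf],
                labels=['zero', 'one', 'twoToFour', 'fivePlus'])

print(list(groups))
```

Starting the first edge at -1 is what lets a count of 0 fall into the 'zero' bucket.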

Answer 1: (score: 2)

One way is to use np.where()

myDf['group'] = np.where(myDf.nVisits >= 5, 'fiveplus', np.where(myDf.nVisits == 0, 'zero', np.where(myDf.nVisits == 1, 'one', 'twotofour')))

    date        nVisits user    group
0   2017-05-25  10      A       fiveplus
1   2017-05-25  2       B       twotofour
2   2017-05-25  3       C       twotofour
3   2017-05-25  0       D       zero
4   2017-05-25  0       E       zero
5   2017-05-26  6       A       fiveplus
6   2017-05-26  0       B       zero
7   2017-05-26  4       C       twotofour
8   2017-05-26  8       D       fiveplus
9   2017-05-26  1       E       one
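
As a possible alternative to nesting np.where() calls, np.select() takes a list of conditions and a matching list of labels, which may be easier to read when there are several buckets. A sketch under the same bucket definitions as the question, with 'twotofour' as the fallback label:

```python
import numpy as np
import pandas as pd

myDf = pd.DataFrame({'user': ['A', 'B', 'C', 'D', 'E'] * 2,
                     'date': ['2017-05-25'] * 5 + ['2017-05-26'] * 5,
                     'nVisits': [10, 2, 3, 0, 0, 6, 0, 4, 8, 1]})

# Conditions are checked in order; the first one that matches wins.
conditions = [myDf.nVisits == 0, myDf.nVisits == 1, myDf.nVisits >= 5]
choices = ['zero', 'one', 'fiveplus']

# Anything matching none of the conditions (2 to 4 visits) gets the default.
myDf['group'] = np.select(conditions, choices, default='twotofour')
print(myDf)
```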

Answer 2: (score: 2)

Solution 1:

df1 = myDf.assign(group=pd.cut(myDf.nVisits,[0,1,2,5,np.inf],right=False,labels=['zero','one','twotoFour','fivePlus']))

Output:

         date  nVisits user      group
0  2017-05-25       10    A   fivePlus
1  2017-05-25        2    B  twotoFour
2  2017-05-25        3    C  twotoFour
3  2017-05-25        0    D       zero
4  2017-05-25        0    E       zero
5  2017-05-26        6    A   fivePlus
6  2017-05-26        0    B       zero
7  2017-05-26        4    C  twotoFour
8  2017-05-26        8    D   fivePlus
9  2017-05-26        1    E        one

df2 = df1.groupby(['date','group']).agg({'nVisits':'sum','user':'count'}).reset_index()

print(df2)

         date      group  user  nVisits
0  2017-05-25   fivePlus     1       10
1  2017-05-25  twotoFour     2        5
2  2017-05-25       zero     2        0
3  2017-05-26   fivePlus     2       14
4  2017-05-26        one     1        1
5  2017-05-26  twotoFour     1        4
6  2017-05-26       zero     1        0
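
The grouped output above has no row for buckets with no users (for example 'one' on 2017-05-25), while the summary in the question shows such rows with zeros. One way to get them, sketched here under the assumption that named aggregation is available (pandas 0.25 or later), is to reindex against every (date, group) pair. The names `labels`, `full_index` and `summary` are illustrative only:

```python
import numpy as np
import pandas as pd

myDf = pd.DataFrame({'user': ['A', 'B', 'C', 'D', 'E'] * 2,
                     'date': ['2017-05-25'] * 5 + ['2017-05-26'] * 5,
                     'nVisits': [10, 2, 3, 0, 0, 6, 0, 4, 8, 1]})

labels = ['zero', 'one', 'twoToFour', 'fivePlus']

# Bucket the visits, then store the labels as plain strings so the
# grouped index can be matched against a plain MultiIndex below.
grouped = myDf.assign(
    group=pd.cut(myDf.nVisits, [0, 1, 2, 5, np.inf],
                 right=False, labels=labels).astype(str))

summary = (grouped.groupby(['date', 'group'])
                  .agg(nVisits=('nVisits', 'sum'), nObs=('user', 'count')))

# Every (date, group) combination, in the bucket order from the question.
full_index = pd.MultiIndex.from_product(
    [sorted(myDf['date'].unique()), labels], names=['date', 'group'])

# Missing combinations are filled with 0 visits and 0 observations.
summary = summary.reindex(full_index, fill_value=0).reset_index()
print(summary)
```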

Solution 2:

df2 = df1.assign(nVisitsAtBirthDeath=df1.groupby('user').filter(lambda x: x.nVisits.eq(0).any()).groupby('user')['nVisits'].diff()).dropna()

df3 = df2.assign(event=np.where(df2.nVisitsAtBirthDeath<0,'Death','Birth'))

print(df3)

Output:

         date  nVisits user     group  nVisitsAtBirthDeath  event
6  2017-05-26        0    B      zero                 -2.0  Death
8  2017-05-26        8    D  fivePlus                  8.0  Birth
9  2017-05-26        1    E       one                  1.0  Birth

Answer 3: (score: 1)

1. Solution for the first item

def label(visits):
    if visits == 0:
        return 'zero'
    if visits == 1:
        return 'one'
    if visits < 5:
        return 'twoToFour'
    return 'fivePlus'
myDf['group'] = myDf['nVisits'].apply(label)

2. Solution for the second item

myDf['last_day_visits'] = myDf.groupby('user').nVisits.shift(1).fillna(0)
def event_type(row):
    if row['nVisits'] > 0 and row['last_day_visits'] == 0:
        return 'birth'
    if row['nVisits'] == 0 and row['last_day_visits'] > 0:
        return 'death'

myDf['event_type'] = myDf.apply(event_type, axis=1)
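
Since the question mentions that the real dataframe is large, the row-wise apply() above could become slow. A vectorized sketch of the same comparison follows, with two assumptions that differ from the code above: a user's first observed day is not counted as a birth (the previous value is left as NaN rather than filled with 0), and for a death the reported count is the previous day's visits, as in the expected table in the question. The names `prev`, `birth`, `death` and `events` are illustrative:

```python
import numpy as np
import pandas as pd

myDf = pd.DataFrame({'user': ['A', 'B', 'C', 'D', 'E'] * 2,
                     'date': ['2017-05-25'] * 5 + ['2017-05-26'] * 5,
                     'nVisits': [10, 2, 3, 0, 0, 6, 0, 4, 8, 1]})

# ISO-formatted date strings sort chronologically, so a plain sort is enough here.
df = myDf.sort_values(['user', 'date']).copy()

# Previous day's visits per user; NaN on each user's first observed day.
df['prev'] = df.groupby('user')['nVisits'].shift(1)

# Comparisons against NaN are False, so first-day rows produce no event.
birth = (df['prev'] == 0) & (df['nVisits'] > 0)
death = (df['prev'] > 0) & (df['nVisits'] == 0)

df['event_type'] = np.select([birth, death], ['birth', 'death'], default='')
# Births report today's visits, deaths report the previous day's visits.
df['nVisitsAtBirthDeath'] = np.where(birth, df['nVisits'], df['prev'])

events = (df.loc[birth | death,
                 ['date', 'event_type', 'user', 'nVisitsAtBirthDeath']]
            .astype({'nVisitsAtBirthDeath': int})
            .sort_values(['date', 'user'])
            .reset_index(drop=True))
print(events)
```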