Question

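I'm working with some transaction data (unique at the transaction level) that I want to group and aggregate by customer number. The two metrics I want to look at are "average amount spent (AMTD)" and "average amount earned (AMTC)" per customer. However, I only have a single column holding the total amount of each transaction (amt). I created two new columns, conditional on whether the transaction amount is negative or positive: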

df_num = df_num.withColumn('AMTD',when(df_num.amt < 0, df_num.amt).otherwise(np.nan))
df_num = df_num.withColumn('AMTC',when(df_num.amt > 0, df_num.amt).otherwise(np.nan))

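I thought "nan" might be ignored by the aggregate function when taking the mean, since it stands for a missing value; however, the result is just "nan". Is there a way, within the aggregate function, to take the mean and specify that "nan" or 0 should be ignored?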

aggregate = {'Freq': 'count', 'amt': 'mean', 'AMTsd': 'std', 'AMTMax': 'max', 'AMTMin':'min', 'AMTD':'mean', 'AMTC':'mean',\
         'Cheque':'sum', 'POS':'sum','Bill':'sum', 'Deposit':'sum','Withdrawal':'sum','Transfer':'sum', 'CDMemo':'sum',\
         'CDBackdated':'sum','Loan':'sum', 'Wire':'sum', 'Other':'sum'}

Note: I am using a PySpark DataFrame here.

Answer 1

I seem to have answered my own question. The following appears to work:

df_num = df_num.withColumn('AMTD',when(df_num.amt < 0, df_num.amt).otherwise(""))
df_num = df_num.withColumn('AMTC',when(df_num.amt > 0, df_num.amt).otherwise(""))
