How do I include missing items when using groupby in Pandas?

Time: 2018-04-04 09:05:36

Tags: python pandas

Suppose I have a DataFrame with the following columns: date, time, day, month, year, description, price, type, manufacturer

Using pandas and value_counts(), I can get the count of each unique item in a column:

df.manufacturer.value_counts()

Also, using groupby I can get the mean price for each day in the data:

df.groupby("day").price.mean()

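The problem is that there are 7 days in total, but my data may contain only 5 or 6 of them, so I need to add the missing days with a mean of zero or None.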

More generally, if I have a specific list of items, how do I include the missing ones when I perform a value_counts or groupby operation?

2 answers:

Answer 0: (score: 0)

I think you can convert the days to categoricals, so that groupby + mean returns NaN for the missing categories:

import pandas as pd

df = pd.DataFrame({
    'day': ['Monday','Tuesday','Tuesday','Tuesday','Thursday'],
    'price': list(range(5))
})
print(df)
        day  price
0    Monday      0
1   Tuesday      1
2   Tuesday      2
3   Tuesday      3
4  Thursday      4

cats = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
df['day'] = pd.Categorical(df['day'], categories=cats, ordered=True)
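# observed=False keeps categories that never occur in the data (made explicit rather than relying on the default)
print(df.groupby("day", as_index=False, observed=False).price.mean())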
         day  price
0     Monday    0.0
1    Tuesday    2.0
2  Wednesday    NaN
3   Thursday    4.0
4     Friday    NaN
5   Saturday    NaN
6     Sunday    NaN
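Whether the missing days appear depends on the observed argument of groupby. As far as I know its default has not been the same across pandas versions, so it seems safer to pass it explicitly. A minimal sketch, reusing the sample data above and assuming a pandas version that accepts observed:

```python
import pandas as pd

cats = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
df = pd.DataFrame({
    'day': pd.Categorical(['Monday', 'Tuesday', 'Tuesday', 'Tuesday', 'Thursday'],
                          categories=cats, ordered=True),
    'price': list(range(5)),
})

# observed=False keeps every declared category, so absent days show up as NaN
all_days = df.groupby('day', observed=False)['price'].mean()

# observed=True keeps only the categories that actually occur in the data
seen_days = df.groupby('day', observed=True)['price'].mean()

print(all_days)
print(seen_days)
```

Here all_days has one row per weekday, while seen_days only has rows for Monday, Tuesday and Thursday.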

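Another solution that covers all possible categories is to reindex the groupby result (shown here with day still a plain string column, i.e. before the conversion to Categorical):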

cats = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

print(df.groupby("day").price.mean().reindex(cats))
day
Monday       0.0
Tuesday      2.0
Wednesday    NaN
Thursday     4.0
Friday       NaN
Saturday     NaN
Sunday       NaN
Name: price, dtype: float64

print(df.groupby("day").price.mean().reindex(cats, fill_value=0))
day
Monday       0.0
Tuesday      2.0
Wednesday    0.0
Thursday     4.0
Friday       0.0
Saturday     0.0
Sunday       0.0
Name: price, dtype: float64
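The question also asks about value_counts, and the same reindex idea should carry over to it. A minimal sketch; the manufacturer names and the all_manufacturers list are made up for illustration, only the column name manufacturer comes from the question:

```python
import pandas as pd

# Hypothetical sample data; only the column name comes from the question
df = pd.DataFrame({'manufacturer': ['Acme', 'Acme', 'Globex']})

# Hypothetical full list of manufacturers that should appear in the result
all_manufacturers = ['Acme', 'Globex', 'Initech']

# Count the values that occur, then add the missing labels with a count of 0
counts = df['manufacturer'].value_counts().reindex(all_manufacturers, fill_value=0)
print(counts)
```

The result follows the order of all_manufacturers, and Initech, which never occurs in the data, gets a count of 0.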

Answer 1: (score: 0)

You should use Categorical Data. Here is a minimal example.

import pandas as pd

df = pd.DataFrame([['Monday', 5], ['Monday', 6], ['Monday', 3],
                   ['Tuesday', 1], ['Tuesday', 8]],
                  columns=['day', 'value'])

# list days in order
day_list = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# convert to categorical
df['day'] = df['day'].astype('category')

# set categories and ordered=True
df['day'] = df['day'].cat.set_categories(day_list, ordered=True)

# perform groupby, keeping unobserved categories (observed=False), and fillna with 0
res = df.groupby('day', observed=False).mean().fillna(0)

Result:

              value
day                
Monday     4.666667
Tuesday    4.500000
Wednesday  0.000000
Thursday   0.000000
Friday     0.000000
Saturday   0.000000
Sunday     0.000000
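For the value_counts half of the question, a categorical column should behave the same way: categories that never occur are still reported, with a count of 0. A minimal sketch using the same days as the example above (the Series name days is just for illustration):

```python
import pandas as pd

day_list = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Same sample days as in the example above
days = pd.Series(['Monday', 'Monday', 'Monday', 'Tuesday', 'Tuesday'])

# Declare the full set of categories up front
days = days.astype(pd.CategoricalDtype(categories=day_list, ordered=True))

# sort=False skips sorting by frequency; days that never occur get a count of 0
counts = days.value_counts(sort=False)
print(counts)
```

No fillna step is needed here, because a count for an absent category is 0 rather than NaN.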