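I have a dataframe that summarizes data for a handful of days, and I want to add rows for the days that are missing.
I am following another post, Add missing dates to pandas dataframe, but unfortunately it overwrites my results (maybe the functionality has changed slightly?). The code is below: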
import random
import datetime as dt
import numpy as np
import pandas as pd
def generate_row(year, month, day):
    while True:
        date = dt.datetime(year=year, month=month, day=day)
        data = np.random.random(size=4)
        yield [date] + list(data)
# days I have data for
dates = [(2000, 1, 1), (2000, 1, 2), (2000, 2, 4)]
generators = [generate_row(*date) for date in dates]
# get 5 data points for each
data = [next(generator) for generator in generators for _ in range(5)]
df = pd.DataFrame(data, columns=['date'] + ['f'+str(i) for i in range(1,5)])
# df
groupby_day = df.groupby(pd.PeriodIndex(data=df.date, freq='D'))
results = groupby_day.sum()
idx = pd.date_range(min(df.date), max(df.date))
results.reindex(idx, fill_value=0)
Answer 0 (score: 8)
You need to use period_range rather than date_range:
In [11]: idx = pd.period_range(min(df.date), max(df.date))
...: results.reindex(idx, fill_value=0)
...:
Out[11]:
f1 f2 f3 f4
2000-01-01 2.049157 1.962635 2.756154 2.224751
2000-01-02 2.675899 2.587217 1.540823 1.606150
2000-01-03 0.000000 0.000000 0.000000 0.000000
2000-01-04 0.000000 0.000000 0.000000 0.000000
2000-01-05 0.000000 0.000000 0.000000 0.000000
2000-01-06 0.000000 0.000000 0.000000 0.000000
2000-01-07 0.000000 0.000000 0.000000 0.000000
2000-01-08 0.000000 0.000000 0.000000 0.000000
2000-01-09 0.000000 0.000000 0.000000 0.000000
2000-01-10 0.000000 0.000000 0.000000 0.000000
2000-01-11 0.000000 0.000000 0.000000 0.000000
2000-01-12 0.000000 0.000000 0.000000 0.000000
2000-01-13 0.000000 0.000000 0.000000 0.000000
2000-01-14 0.000000 0.000000 0.000000 0.000000
2000-01-15 0.000000 0.000000 0.000000 0.000000
2000-01-16 0.000000 0.000000 0.000000 0.000000
2000-01-17 0.000000 0.000000 0.000000 0.000000
2000-01-18 0.000000 0.000000 0.000000 0.000000
2000-01-19 0.000000 0.000000 0.000000 0.000000
2000-01-20 0.000000 0.000000 0.000000 0.000000
2000-01-21 0.000000 0.000000 0.000000 0.000000
2000-01-22 0.000000 0.000000 0.000000 0.000000
2000-01-23 0.000000 0.000000 0.000000 0.000000
2000-01-24 0.000000 0.000000 0.000000 0.000000
2000-01-25 0.000000 0.000000 0.000000 0.000000
2000-01-26 0.000000 0.000000 0.000000 0.000000
2000-01-27 0.000000 0.000000 0.000000 0.000000
2000-01-28 0.000000 0.000000 0.000000 0.000000
2000-01-29 0.000000 0.000000 0.000000 0.000000
2000-01-30 0.000000 0.000000 0.000000 0.000000
2000-01-31 0.000000 0.000000 0.000000 0.000000
2000-02-01 0.000000 0.000000 0.000000 0.000000
2000-02-02 0.000000 0.000000 0.000000 0.000000
2000-02-03 0.000000 0.000000 0.000000 0.000000
2000-02-04 1.856158 2.892620 2.986166 2.793448
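The same fix on a smaller, deterministic frame makes the result easier to check by hand. This is only a sketch: the three-row frame and the single column f1 are made up for illustration, and the value column is selected explicitly so the sum does not touch the date column.

```python
import pandas as pd

# Hypothetical data: two rows on Jan 1, one row on Jan 4, nothing in between.
df = pd.DataFrame({
    "date": pd.to_datetime(["2000-01-01", "2000-01-01", "2000-01-04"]),
    "f1": [1.0, 2.0, 3.0],
})

# Grouping by a PeriodIndex produces a result indexed by a PeriodIndex.
results = df.groupby(pd.PeriodIndex(df["date"], freq="D"))[["f1"]].sum()

# Build the full daily range with the same index type, then reindex.
idx = pd.period_range(df["date"].min(), df["date"].max(), freq="D")
filled = results.reindex(idx, fill_value=0)
print(filled)
```

Because idx and results.index are both daily periods, the existing sums line up with their labels and only the two missing days are filled with 0.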
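This is because your groupby uses a PeriodIndex rather than datetimes, so the labels produced by date_range never match the index of results:
df.groupby(pd.PeriodIndex(data=df.date, freq='D'))
Alternatively, you can use pd.Grouper:
df.groupby(pd.Grouper(key="date", freq='D'))
which gives a DatetimeIndex.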
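A minimal sketch of the pd.Grouper route, again on a made-up three-row frame (the column name f1 and the timestamps are assumptions for illustration). With a daily frequency the grouper bins by calendar day, so days between the first and last observation already appear in the result, and date_range can then be used to extend the range further if needed.

```python
import pandas as pd

# Hypothetical data with times of day, to show that rows are binned per day.
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2000-01-01 08:00", "2000-01-01 17:30", "2000-01-04 12:00"]
    ),
    "f1": [1.0, 2.0, 3.0],
})

# The result is indexed by a DatetimeIndex; empty days in between sum to 0.
daily = df.groupby(pd.Grouper(key="date", freq="D")).sum()

# date_range labels now match, so reindexing to a wider range works.
idx = pd.date_range("2000-01-01", "2000-01-06", freq="D")
extended = daily.reindex(idx, fill_value=0)
print(extended)
```

The design point is simply to keep the index type of the grouped result and the index passed to reindex the same: PeriodIndex with period_range, or DatetimeIndex with date_range.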
Answer 1 (score: 4)
Following the hint from cᴏʟᴅsᴘᴇᴇᴅ in the comments, resample fits well here.
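resample: Convenience method for frequency conversion and resampling of time series. The object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or datetime-like values must be passed to the on or level keyword.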
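One way this could look, assuming the dates live in a regular column named date as in the question (the small frame and its values are hypothetical). Passing on="date" tells resample which column holds the datetime-like values, so no index needs to be set first.

```python
import pandas as pd

# Hypothetical data: two rows on Jan 1, one row on Jan 4.
df = pd.DataFrame({
    "date": pd.to_datetime(["2000-01-01", "2000-01-01", "2000-01-04"]),
    "f1": [1.0, 2.0, 3.0],
})

# Resample to daily frequency using the "date" column, summing each day.
# Days without any rows appear in the result with a sum of 0.
daily = df.resample("D", on="date").sum()
print(daily)
```

This does the grouping and the gap filling in one step for the span between the first and last date; a reindex with date_range would only be needed to cover days outside that span.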