Expand rows by group based on start and end dates

Time: 2016-11-16 08:23:33

Tags: python pandas

I have the following pandas DataFrame:

import numpy as np
import pandas as pd

dfw = pd.DataFrame({"id": ["A", "B"],
                    "start_date": pd.to_datetime(["2012-01-01", "2013-02-13"], format="%Y-%m-%d"),
                    "end_date": pd.to_datetime(["2012-04-17", "2014-11-18"], format="%Y-%m-%d")})

Result:

   end_date id start_date
2012-04-17  A 2012-01-01
2014-11-18  B 2013-02-13

I am looking for the most efficient way to transform this DataFrame into the following one:

dates = np.empty(0, dtype="datetime64[M]")
dates = np.append(dates, pd.date_range(start="2012-01-01", end="2012-05-01", freq="MS").astype("object"))
dates = np.append(dates, pd.date_range(start="2013-02-01", end="2014-12-01", freq="MS").astype("object"))
dfl = pd.DataFrame({"id": np.repeat(["A", "B"], [5, 23]),
                    "counter": np.concatenate((np.arange(0, 5, dtype="float"), np.arange(0, 23, dtype="float"))),
                    "date": pd.to_datetime(dates, format="%Y-%m-%d")})

Result:

counter       date id
    0.0 2012-01-01  A
    1.0 2012-02-01  A
    2.0 2012-03-01  A
    3.0 2012-04-01  A
    4.0 2012-05-01  A
    0.0 2013-02-01  B
    1.0 2013-03-01  B
    2.0 2013-04-01  B
    3.0 2013-05-01  B
    4.0 2013-06-01  B
    5.0 2013-07-01  B
    6.0 2013-08-01  B
    7.0 2013-09-01  B
    8.0 2013-10-01  B
    9.0 2013-11-01  B
   10.0 2013-12-01  B
   11.0 2014-01-01  B
   12.0 2014-02-01  B
   13.0 2014-03-01  B
   14.0 2014-04-01  B
   15.0 2014-05-01  B
   16.0 2014-06-01  B
   17.0 2014-07-01  B
   18.0 2014-08-01  B
   19.0 2014-09-01  B
   20.0 2014-10-01  B
   21.0 2014-11-01  B
   22.0 2014-12-01  B

A naive solution I have come up with so far is the following function:

def expand(df):
    dates = np.empty(0, dtype="datetime64[ns]")
    ids = np.empty(0, dtype="object")
    counter = np.empty(0, dtype="float")
    for name, group in df.groupby("id"):
        start_date = group["start_date"].min()
        start_date = pd.to_datetime(np.array(start_date, dtype="datetime64[M]").tolist())
        end_date = group["end_date"].max()
        end_date = end_date + pd.Timedelta(1, unit="M")
        end_date = pd.to_datetime(np.array(end_date, dtype="datetime64[M]").tolist())
        tmp = pd.date_range(start=start_date, end=end_date, freq="MS", closed=None).values
        dates = np.append(dates, tmp)
        ids = np.append(ids, np.repeat(group.id.values[0], len(tmp)))
        counter = np.append(counter, np.arange(0, len(tmp)))

    dfl = pd.DataFrame({"id": ids, "counter": counter, "date": dates})
    return dfl

But it is not very fast:

%timeit expand(dfw)
100 loops, best of 3: 4.84 ms per loop

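1 answer:

Answer 0 (score: 2):

I usually try to avoid itertuples, but in some cases it can be more intuitive. If needed, you can get fine-grained control over the endpoints through the keyword arguments of pd.date_range (for example, whether an endpoint is included).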

In [27]: result = pd.concat([pd.Series(r.id,pd.date_range(r.start_date, r.end_date)) for r in dfw.itertuples()]).reset_index()

In [28]: result.columns = ['date', 'id']

In [29]: result
Out[29]:
          date   id
0   2012-01-01    A
1   2012-01-02    A
2   2012-01-03    A
3   2012-01-04    A
4   2012-01-05    A
5   2012-01-06    A
..         ...  ...
746 2014-11-13    B
747 2014-11-14    B
748 2014-11-15    B
749 2014-11-16    B
750 2014-11-17    B
751 2014-11-18    B

[752 rows x 2 columns]

In [26]: %timeit pd.concat([pd.Series(r.id,pd.date_range(r.start_date, r.end_date)) for r in dfw.itertuples()]).reset_index()
100 loops, best of 3: 2.15 ms per loop

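I'm not sure what the point of making this super fast is; you would typically perform this kind of expansion only once.

You want month starts, so here is that version.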

In [23]: result = pd.concat([pd.Series(r.id, pd.date_range(r.start_date, r.end_date + pd.offsets.MonthBegin(1), freq='MS')) for r in dfw.itertuples()]).reset_index()

In [24]: result.columns = ['date', 'id']

In [25]: result
Out[25]:
         date   id
0  2012-01-01    A
1  2012-02-01    A
2  2012-03-01    A
3  2012-04-01    A
4  2012-05-01    A
5  2013-03-01    B
6  2013-04-01    B
7  2013-05-01    B
8  2013-06-01    B
9  2013-07-01    B
10 2013-08-01    B
11 2013-09-01    B
12 2013-10-01    B
13 2013-11-01    B
14 2013-12-01    B
15 2014-01-01    B
16 2014-02-01    B
17 2014-03-01    B
18 2014-04-01    B
19 2014-05-01    B
20 2014-06-01    B
21 2014-07-01    B
22 2014-08-01    B
23 2014-09-01    B
24 2014-10-01    B
25 2014-11-01    B
26 2014-12-01    B
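
The question also asks for a counter that restarts at 0 for each id, which the expansion above does not produce. A minimal sketch of one way to add it, assuming the expanded columns are named date and id and that a float counter is wanted as in the question (groupby(...).cumcount() is a standard pandas method; the rest mirrors the expression above):

```python
import pandas as pd

dfw = pd.DataFrame({
    "id": ["A", "B"],
    "start_date": pd.to_datetime(["2012-01-01", "2013-02-13"]),
    "end_date": pd.to_datetime(["2012-04-17", "2014-11-18"]),
})

# Same expansion as above: one Series of ids per input row,
# indexed by the month starts between start_date and end_date.
result = pd.concat([
    pd.Series(r.id, pd.date_range(r.start_date,
                                  r.end_date + pd.offsets.MonthBegin(1),
                                  freq="MS"))
    for r in dfw.itertuples()
]).reset_index()
result.columns = ["date", "id"]

# Number the rows within each id, starting from 0.
# Cast to float only to match the dtype used in the question.
result["counter"] = result.groupby("id").cumcount().astype("float")
```

Because cumcount numbers rows in their existing order, this relies on the dates already being sorted within each id, which is the case for the output of pd.date_range.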

You can adjust the dates like this:

In [17]: pd.Timestamp('2014-01-17')-pd.offsets.MonthBegin(1)
Out[17]: Timestamp('2014-01-01 00:00:00')

In [18]: pd.Timestamp('2014-01-31')-pd.offsets.MonthBegin(1)
Out[18]: Timestamp('2014-01-01 00:00:00')

In [19]: pd.Timestamp('2014-02-01')-pd.offsets.MonthBegin(1)
Out[19]: Timestamp('2014-01-01 00:00:00')
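
Note that subtracting MonthBegin(1) moves a date that is already the first of a month back by a whole month (2014-02-01 becomes 2014-01-01 above). If the intent is to snap each date to the start of its own month, one possible sketch is to go through a monthly Period. The helper name month_start is hypothetical and only for illustration; the sample frame is the one from the question:

```python
import pandas as pd

def month_start(ts):
    """Hypothetical helper: first day of the month that contains ts."""
    return pd.Timestamp(ts).to_period("M").to_timestamp()

# A date already on the first of the month stays where it is.
snapped = [month_start(d) for d in ["2014-01-17", "2014-01-31", "2014-02-01"]]

dfw = pd.DataFrame({
    "id": ["A", "B"],
    "start_date": pd.to_datetime(["2012-01-01", "2013-02-13"]),
    "end_date": pd.to_datetime(["2012-04-17", "2014-11-18"]),
})

# Expand each row from the start of its first month through the
# month start that follows end_date, as in the expansion above.
expanded = pd.concat([
    pd.DataFrame({
        "id": r.id,
        "date": pd.date_range(month_start(r.start_date),
                              month_start(r.end_date) + pd.DateOffset(months=1),
                              freq="MS"),
    })
    for r in dfw.itertuples()
], ignore_index=True)
```

With the start snapped this way, id B begins at 2013-02-01 rather than 2013-03-01, which is what the desired output in the question shows.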