Get ranges from a sparse DatetimeIndex

Time: 2016-07-07 09:47:53

Tags: python datetime pandas dataframe date-range

I have this kind of pandas DataFrame for each user in a large database.

[Image: screenshot of the DataFrame]

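Each row is a period of length [start_date, end_date], but sometimes 2 consecutive rows are actually the same period: the end_date equals the following start_date (underlined in red). Sometimes the periods even overlap on more than 1 date.

I would like to get the "real periods" by combining the rows that correspond to the same period.

What I have tried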

import pandas as pd

def split_range(name):
    df_user = de_201512_echant[de_201512_echant.name == name]
    # -- Create a daily index covering the whole observation window
    # -- (kept as a DatetimeIndex so it matches the per-row ranges below)
    t_date = pd.DataFrame(index=pd.date_range("2005-01-01", "2015-12-12"))
    for row in range(df_user.shape[0]):
        start_date = df_user.iloc[row].start_date
        end_date = df_user.iloc[row].end_date
        # -- Skip rows with a missing start or end date
        if pd.notnull(start_date) and pd.notnull(end_date):
            # -- One indicator column per row: 1 on every day of the period
            t = pd.DataFrame(index=pd.date_range(start_date, end_date))
            t["period_%s" % row] = 1
            t_date = pd.merge(t_date, t, right_index=True, left_index=True, how="left")

    return t_date

This produces a DataFrame in which each column is a period (1 if the date falls within the range, NaN otherwise):

t_date
Out[29]: 
            period_0  period_1  period_2  period_3  period_4  period_5  \
2005-01-01       NaN       NaN       NaN       NaN       NaN       NaN   
2005-01-02       NaN       NaN       NaN       NaN       NaN       NaN   
2005-01-03       NaN       NaN       NaN       NaN       NaN       NaN   
2005-01-04       NaN       NaN       NaN       NaN       NaN       NaN   
2005-01-05       NaN       NaN       NaN       NaN       NaN       NaN   
2005-01-06       NaN       NaN       NaN       NaN       NaN       NaN   
2005-01-07       NaN       NaN       NaN       NaN       NaN       NaN   
2005-01-08       NaN       NaN       NaN       NaN       NaN       NaN   
2005-01-09       NaN       NaN       NaN       NaN       NaN       NaN   
2005-01-10       NaN       NaN       NaN       NaN       NaN       NaN   
2005-01-11       NaN       NaN       NaN       NaN       NaN       NaN  

Then, if I sum all the columns (periods), I get almost exactly what I want:

full_spell = t_date.sum(axis=1)
full_spell.loc[full_spell == 1]

Out[31]: 
2005-11-14    1.0
2005-11-15    1.0
2005-11-16    1.0
2005-11-17    1.0
2005-11-18    1.0
2005-11-19    1.0
2005-11-20    1.0
2005-11-21    1.0
2005-11-22    1.0
2005-11-23    1.0
2005-11-24    1.0
2005-11-25    1.0
2005-11-26    1.0
2005-11-27    1.0
2005-11-28    1.0
2005-11-29    1.0
2005-11-30    1.0
2006-01-16    1.0
2006-01-17    1.0
2006-01-18    1.0
2006-01-19    1.0
2006-01-20    1.0
2006-01-21    1.0
2006-01-22    1.0
2006-01-23    1.0
2006-01-24    1.0
2006-01-25    1.0
2006-01-26    1.0
2006-01-27    1.0
2006-01-28    1.0

2015-07-06    1.0
2015-07-07    1.0
2015-07-08    1.0
2015-07-09    1.0
2015-07-10    1.0
2015-07-11    1.0
2015-07-12    1.0
2015-07-13    1.0
2015-07-14    1.0
2015-07-15    1.0
2015-07-16    1.0
2015-07-17    1.0
2015-07-18    1.0
2015-07-19    1.0
2015-08-02    1.0
2015-08-03    1.0
2015-08-04    1.0
2015-08-05    1.0
2015-08-06    1.0
2015-08-07    1.0
2015-08-08    1.0
2015-08-09    1.0
2015-08-10    1.0
2015-08-11    1.0
2015-08-12    1.0
2015-08-13    1.0
2015-08-14    1.0
2015-08-15    1.0
2015-08-16    1.0
2015-08-17    1.0
dtype: float64

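But I cannot find a way to cut this sparse DatetimeIndex into its separate time ranges and end up with the output I want: the original DataFrame containing only the "real" periods.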

This may not be the most efficient approach, so if you have an alternative, please do not hesitate!
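One possible way to cut a sparse index of days into ranges is to start a new group wherever the gap to the previous day is larger than one day. The sketch below is only an illustration under that assumption: `index_to_ranges` and `sample_days` are hypothetical names that do not appear in the question, and the sample dates merely stand in for `full_spell.index`.

```python
import pandas as pd

def index_to_ranges(days):
    """Collapse a sparse collection of days into contiguous [start, end] ranges."""
    # Normalise to a sorted, duplicate-free Series of timestamps
    days = pd.Series(pd.DatetimeIndex(days).unique().sort_values())
    # A new run starts wherever the gap to the previous day is not exactly one day
    # (the first diff is NaT, which compares as "not equal", so row 0 opens a run)
    run_id = days.diff().ne(pd.Timedelta(days=1)).cumsum()
    # One output row per run: its first and last day
    return (
        days.groupby(run_id)
        .agg(start_date="min", end_date="max")
        .reset_index(drop=True)
    )

# Hypothetical sample standing in for full_spell.index
sample_days = pd.to_datetime(
    ["2005-11-14", "2005-11-15", "2005-11-16", "2006-01-16", "2006-01-17"]
)
ranges = index_to_ranges(sample_days)
print(ranges)
```

With the question's data this would presumably be called as `index_to_ranges(full_spell[full_spell >= 1].index)`. Filtering on `>= 1` rather than `== 1` seems safer, because a day covered by two rows (for example an end_date equal to the next start_date) sums to 2 and would otherwise be dropped.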

1 Answer:

Answer 0 (score: 0)

I found a more efficient way, using apply:

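import pandas as pd

def get_range(row):
    '''returns the list of days from a "start_date" to an "end_date"'''
    start_date = row["start_date"]
    end_date = row["end_date"]
    period = pd.date_range(start_date, end_date, freq="1D")

    return list(period)

# -- Apply get_range() to each row of the initial df (rows with a missing
# -- date are dropped first, since pd.date_range() rejects NaT)
t_all = df.dropna(subset=["start_date", "end_date"]).apply(get_range, axis=1)
# -- Flatten the per-row lists into a single column of days
t_all = t_all.explode().rename("days_in_period").to_frame()
# -- Drop overlapping dates
t_all.drop_duplicates(inplace=True)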
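For comparison, the periods could also be merged directly on their start and end dates, without expanding each row into individual days. The following is a minimal sketch of that alternative, not the method used in this answer: rows are sorted by start_date, and a row opens a new period only if it starts after the latest end_date seen so far. `merge_spells` and the `spells` sample are hypothetical; only the column names `start_date` and `end_date` come from the question.

```python
import pandas as pd

def merge_spells(df):
    """Merge rows whose [start_date, end_date] ranges touch or overlap."""
    df = df.dropna(subset=["start_date", "end_date"]).sort_values("start_date")
    # Latest end date seen among all earlier rows (NaT for the first row)
    prev_end = df["end_date"].cummax().shift()
    # A row opens a new period only when it starts strictly after that date
    spell_id = (df["start_date"] > prev_end).cumsum().rename("spell_id")
    return (
        df.groupby(spell_id)
        .agg(start_date=("start_date", "min"), end_date=("end_date", "max"))
        .reset_index(drop=True)
    )

# Hypothetical sample: rows 0-1 share a boundary date, rows 2-3 overlap
spells = pd.DataFrame({
    "start_date": pd.to_datetime(["2005-11-14", "2005-11-20", "2006-01-16", "2006-01-20"]),
    "end_date": pd.to_datetime(["2005-11-20", "2005-11-30", "2006-01-22", "2006-01-28"]),
})
merged = merge_spells(spells)
print(merged)
```

Because the comparison is strict, a row whose start_date equals the previous end_date is merged into the same period, which matches the case described in the question. For a table holding many users, the same function could presumably be run once per user, for instance inside a groupby on the name column.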