Question

Suppose I have a DataFrame of timestamped user events, df1:

import pandas as pd

df1 = pd.DataFrame([
    {
        'id':1,
        'user_id':1,
        'time':pd.to_datetime('2017-01-01'),
    },
    {
        'id':2,
        'user_id':1,
        'time':pd.to_datetime('2017-01-02'),
    },
    {
        'id':3,
        'user_id':1,
        'time':pd.to_datetime('2017-02-01'),
    },    
    {
        'id':4,
        'user_id':2,
        'time':pd.to_datetime('2017-01-01'),
    },    
    {
        'id':5,
        'user_id':1,
        'time':pd.to_datetime('2017-01-15'),
    },
])

And another table of events (e.g. bookings), df2:

df2 = pd.DataFrame(
    [
        {
            'user_id':1,
            'time':pd.to_datetime('2017-01-02'),
            'booking_code':'AA1'
        },
        {
            'user_id':1,
            'time':pd.to_datetime('2017-01-10'),
            'booking_code':'AA2'
        },
        {
            'user_id':1,
            'time':pd.to_datetime('2017-03-10'),
            'booking_code':'AA3'
        },
        {
            'user_id':2,
            'time':pd.to_datetime('2016-12-10'),
            'booking_code':'AA4'
        },
        {
            'user_id':2,
            'time':pd.to_datetime('2017-03-10'),
            'booking_code':'AA5'
        },
        {
            'user_id':3,
            'time':pd.to_datetime('2017-03-10'),
            'booking_code':'AA6'
        },        
    ]
)

(The example dfs are long so that they cover the different cases.)

What I want to do is, for each row in df1, find the next event in df2 associated with that user.

I.e. in words: for user_id = 1, time = 2017-01-01, the 'next event' in df2 would be booking_code = 'AA1', time = 2017-01-02.

So the result I'm looking for is:

    time_1      user_id     next_booking_code   next_booking_time
id              
1   2017-01-01  1           AA1                 2017-01-02
2   2017-01-02  1           AA2                 2017-01-10
3   2017-02-01  1           AA3                 2017-03-10
4   2017-01-01  2           AA5                 2017-03-10
5   2017-01-15  1           AA3                 2017-03-10

The solution I've come up with so far is as follows:

#sort bookings by time
df2.sort_values('time',inplace=True)
#merge bookings with events, on user_id
df3 = df1.merge(
    df2,
    how='left',
    on = 'user_id'
)

#filter to bookings which are after the event
df3 = df3[
    df3.time_y > df3.time_x
]
#group by id to get one row per event
df3 = df3.groupby('id')
#get the first row for each event
df4 = df3.first()

#df4 is now the result we're after

Now, this works perfectly on this toy dataset, but once the events data reaches ~10^6 rows, this approach stops working.
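To make the scaling problem concrete: a merge on user_id pairs every event of a user with every booking of that user, so the intermediate frame grows with the product of the two per-user counts. The sketch below estimates that row count before running the merge. The names events, bookings and estimated_rows are illustrative and not from the question; the data is a compact copy of df1 and df2.

```python
import pandas as pd

# Compact copies of the question's df1 (events) and df2 (bookings).
events = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'user_id': [1, 1, 1, 2, 1],
    'time': pd.to_datetime(['2017-01-01', '2017-01-02', '2017-02-01',
                            '2017-01-01', '2017-01-15']),
})
bookings = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 3],
    'time': pd.to_datetime(['2017-01-02', '2017-01-10', '2017-03-10',
                            '2016-12-10', '2017-03-10', '2017-03-10']),
    'booking_code': ['AA1', 'AA2', 'AA3', 'AA4', 'AA5', 'AA6'],
})

events_per_user = events.groupby('user_id').size()
bookings_per_user = bookings.groupby('user_id').size()

# Align booking counts to the users that appear in events.
matches = bookings_per_user.reindex(events_per_user.index, fill_value=0)

# A left merge emits events x bookings rows per user; an event whose user
# has no bookings still produces one row, hence the clip to at least 1.
estimated_rows = int((events_per_user * matches.clip(lower=1)).sum())

print(estimated_rows)
```

On real data, a few heavy users with thousands of events and thousands of bookings each can dominate this total, which is a plausible reason the merge-then-filter approach runs out of memory.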

Another way I tried is to go row by row using df.apply(). Something like:

#use indexes for speedier retrieval
bookings = df2.set_index(['user_id','time']).sort_index()
def get_next_booking(row):
    # note: this slice also includes a booking at exactly row.time
    return bookings.loc[row.user_id].loc[row.time:].iloc[0].booking_code

df1['next_booking_code'] = df1.apply(get_next_booking, axis=1)

This is also very slow on large data.

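This feels like one of those things that has a correct, more performant way of doing it, but I haven't found it yet, and I'd rather not move this process to SQL.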

Answer 1

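This needs some pre/post-processing to get your desired output, but recent pandas (version 0.19) added a new function, merge_asof, that performs these kinds of joins efficiently (the direction argument used here was added in 0.20). The docs are here http://pandas.pydata.org/pandas-docs/stable/generated/pandas.merge_asof.html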

# `asof` field must be sorted
df1 = df1.sort_values('time')
df2 = df2.sort_values('time')
df2['next_booking_time'] = df2['time']

res = pd.merge_asof(df1, df2, on='time', by='user_id', 
                    direction='forward', allow_exact_matches=False)

res.sort_values('id')
Out[29]: 
   id       time  user_id booking_code next_booking_time
0   1 2017-01-01        1          AA1        2017-01-02
2   2 2017-01-02        1          AA2        2017-01-10
4   3 2017-02-01        1          AA3        2017-03-10
1   4 2017-01-01        2          AA5        2017-03-10
3   5 2017-01-15        1          AA3        2017-03-10
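For completeness, here is one possible way to do the pre/post-processing mentioned above so that the result matches the layout requested in the question (indexed by id, with time_1, next_booking_code and next_booking_time columns). It is a sketch rather than the only way to do it; the names events, bookings, left, right and result are illustrative.

```python
import pandas as pd

# Compact copies of the question's df1 (events) and df2 (bookings).
events = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'user_id': [1, 1, 1, 2, 1],
    'time': pd.to_datetime(['2017-01-01', '2017-01-02', '2017-02-01',
                            '2017-01-01', '2017-01-15']),
})
bookings = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 3],
    'time': pd.to_datetime(['2017-01-02', '2017-01-10', '2017-03-10',
                            '2016-12-10', '2017-03-10', '2017-03-10']),
    'booking_code': ['AA1', 'AA2', 'AA3', 'AA4', 'AA5', 'AA6'],
})

# Pre-processing: merge_asof needs both frames sorted by the `on` key.
# The booking time is copied because the right-hand `time` column is
# consumed as the join key and does not appear in the output.
left = events.sort_values('time')
right = (bookings.sort_values('time')
         .assign(next_booking_time=lambda d: d['time'])
         .rename(columns={'booking_code': 'next_booking_code'}))

# direction='forward' picks the first booking at or after the event;
# allow_exact_matches=False makes that strictly after.
result = pd.merge_asof(left, right, on='time', by='user_id',
                       direction='forward', allow_exact_matches=False)

# Post-processing: rename, index by event id, and order the columns.
result = (result.rename(columns={'time': 'time_1'})
          .set_index('id')
          .sort_index()
          [['time_1', 'user_id', 'next_booking_code', 'next_booking_time']])

print(result)
```

Because merge_asof behaves like a left join, an event with no later booking for its user stays in the result with missing values in the two booking columns.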

Answer 2

One suggestion to improve speed is to set the index and sort it:

df1.set_index(["user_id"], inplace=True)
df1.sort_index(inplace=True)
df2.set_index(["user_id"], inplace=True)
df2.sort_index(inplace=True)
df3 = df1.merge( df2,how='left',left_index=True, right_index=True)
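The index-based merge above stops at the join, so the following sketch carries it through to the "next booking" result using the same filter-and-take-first steps as in the question. The suffixes and the names events, bookings, merged, later and next_booking are assumptions made for illustration.

```python
import pandas as pd

# Compact copies of the question's df1 and df2, indexed and sorted by user_id.
events = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'user_id': [1, 1, 1, 2, 1],
    'time': pd.to_datetime(['2017-01-01', '2017-01-02', '2017-02-01',
                            '2017-01-01', '2017-01-15']),
}).set_index('user_id').sort_index()
bookings = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 3],
    'time': pd.to_datetime(['2017-01-02', '2017-01-10', '2017-03-10',
                            '2016-12-10', '2017-03-10', '2017-03-10']),
    'booking_code': ['AA1', 'AA2', 'AA3', 'AA4', 'AA5', 'AA6'],
}).set_index('user_id').sort_index()

# Join on the sorted user_id indexes; explicit suffixes keep the two
# `time` columns distinguishable.
merged = (events.merge(bookings, how='left',
                       left_index=True, right_index=True,
                       suffixes=('_event', '_booking'))
          .rename_axis('user_id')
          .reset_index())

# Keep bookings strictly after the event, then take the earliest per event.
later = merged[merged['time_booking'] > merged['time_event']]
next_booking = later.sort_values('time_booking').groupby('id').first()

print(next_booking[['user_id', 'time_event', 'booking_code', 'time_booking']])
```

Note that this still builds every event/booking pair per user before filtering, so a sorted index may speed up key matching but does not shrink the intermediate frame.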

Efficient way to join on open intervals (finding the "next" row in another df)
