Pandas merge on timestamps that do not match exactly

Date: 2016-01-19 15:21:54

Tags: python pandas

What methods are available for merging on columns whose timestamps do not match exactly?

DF1:

date        start_time           employee_id  session_id
01/01/2016  01/01/2016 06:03:13  7261824      871631182

DF2:

date        start_time           employee_id  session_id
01/01/2016  01/01/2016 06:03:37  7261824      871631182

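I can join on ['date', 'employee_id', 'session_id'], but sometimes the same employee has several identical sessions on the same date, which produces duplicates. I could drop the rows where this happens, but then I would lose valid sessions.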

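Is there an efficient way to join when the DF1 timestamp is less than 5 minutes away from the DF2 timestamp and session_id and employee_id also match? If a matching record exists, its timestamp will always be slightly later than DF1's, because the event is triggered at some point afterwards.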

['employee_id', 'session_id', 'timestamp<5minutes']

Edit - I assumed someone would have run into this problem before.

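I was thinking of doing this:

  1. Take the timestamp on each dataframe
  2. Create a column that is the timestamp + 5 minutes (rounded)
  3. Create a column that is the timestamp - 5 minutes (rounded)
  4. Create a 10-minute interval string to join the files on
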
    from datetime import timedelta

    df1['low_time'] = df1['start_time'] - timedelta(minutes=5)
    df1['high_time'] = df1['start_time'] + timedelta(minutes=5)
    df1['interval_string'] = df1['low_time'].astype(str) + df1['high_time'].astype(str)
    
  5. Does anyone know how to round these interval bounds to the nearest 5 minutes?

    02:59:37 - 5 min = 02:55:00

    02:59:37 + 5 min = 03:05:00

    interval_string = '02:55:00-03:05:00'

    pd.merge(df1, df2, how='left', on=['employee_id', 'session_id', 'date', 'interval_string'])
    

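Does anyone know how to round times like this? It seems like it could work: you still match on date, employee and session, and then look for times that fall in essentially the same 10-minute interval or range.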

3 Answers:

Answer 0: (score: 12)

I would try this pandas method:

pandas.merge_asof()

The parameters you will be interested in are direction, tolerance, left_on and right_on.

Building on @Igor's answer:

import pandas as pd
from pandas import read_csv
from io import StringIO

# datetime column (combination of date + start_time)
dtc = [['date', 'start_time']]

# index column (above combination)
ixc = 'date_start_time'

df1 = read_csv(StringIO(u'''
date,start_time,employee_id,session_id
01/01/2016,02:03:00,7261824,871631182
01/01/2016,06:03:00,7261824,871631183
01/01/2016,11:01:00,7261824,871631184
01/01/2016,14:01:00,7261824,871631185
'''), parse_dates=dtc)

df2 = read_csv(StringIO(u'''
date,start_time,employee_id,session_id
01/01/2016,02:03:00,7261824,871631182
01/01/2016,06:05:00,7261824,871631183
01/01/2016,11:04:00,7261824,871631184
01/01/2016,14:10:00,7261824,871631185
'''), parse_dates=dtc)



df1['date_start_time'] = pd.to_datetime(df1['date_start_time'])
df2['date_start_time'] = pd.to_datetime(df2['date_start_time'])

# converting this to the index so we can preserve the date_start_time columns so you can validate the merging logic
df1.index = df1['date_start_time']
df2.index = df2['date_start_time']
# the magic happens below, check the direction and tolerance arguments
tol = pd.Timedelta('5 minute')
pd.merge_asof(left=df1,right=df2,right_index=True,left_index=True,direction='nearest',tolerance=tol)

output

date_start_time date_start_time_x   employee_id_x   session_id_x    date_start_time_y   employee_id_y   session_id_y

2016-01-01 02:03:00 2016-01-01 02:03:00 7261824 871631182   2016-01-01 02:03:00 7261824.0   871631182.0
2016-01-01 06:03:00 2016-01-01 06:03:00 7261824 871631183   2016-01-01 06:05:00 7261824.0   871631183.0
2016-01-01 11:01:00 2016-01-01 11:01:00 7261824 871631184   2016-01-01 11:04:00 7261824.0   871631184.0
2016-01-01 14:01:00 2016-01-01 14:01:00 7261824 871631185   NaT NaN NaN
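
The call above matches on the timestamp index alone. Since the question also requires employee_id and session_id to match, and says the DF2 timestamp is always the later one, a variant using the `by` argument and `direction='forward'` may be closer to what was asked. The following is only a sketch under those assumptions; the column name `start_time_df2` is made up here so that the matched right-hand timestamp stays visible in the result.

```python
import pandas as pd

times1 = ["2016-01-01 02:03:00", "2016-01-01 06:03:00",
          "2016-01-01 11:01:00", "2016-01-01 14:01:00"]
times2 = ["2016-01-01 02:03:00", "2016-01-01 06:05:00",
          "2016-01-01 11:04:00", "2016-01-01 14:10:00"]
sessions = [871631182, 871631183, 871631184, 871631185]

df1 = pd.DataFrame({"start_time": pd.to_datetime(times1),
                    "employee_id": [7261824] * 4,
                    "session_id": sessions})
df2 = pd.DataFrame({"start_time": pd.to_datetime(times2),
                    "employee_id": [7261824] * 4,
                    "session_id": sessions})

# Copy the right-hand timestamp (hypothetical column name) so it
# survives the merge alongside the left-hand one.
df2["start_time_df2"] = df2["start_time"]

# merge_asof requires both frames to be sorted by the "on" column.
merged = pd.merge_asof(
    df1.sort_values("start_time"),
    df2.sort_values("start_time"),
    on="start_time",
    by=["employee_id", "session_id"],   # exact-match keys
    direction="forward",                # df2 is at or after df1
    tolerance=pd.Timedelta("5min"),
)
print(merged)
```

Rows of df1 with no df2 row inside the tolerance keep their own values and get NaT in `start_time_df2`; in this sample that is only the 14:01 session.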

Answer 1: (score: 5)

Consider the following mini version of your problem:

from io import StringIO
from pandas import read_csv, to_datetime

# how close do sessions have to be to be considered equal? (in minutes)
threshold = 5

# datetime column (combination of date + start_time)
dtc = [['date', 'start_time']]

# index column (above combination)
ixc = 'date_start_time'

df1 = read_csv(StringIO(u'''
date,start_time,employee_id,session_id
01/01/2016,02:03:00,7261824,871631182
01/01/2016,06:03:00,7261824,871631183
01/01/2016,11:01:00,7261824,871631184
01/01/2016,14:01:00,7261824,871631185
'''), parse_dates=dtc)

df2 = read_csv(StringIO(u'''
date,start_time,employee_id,session_id
01/01/2016,02:03:00,7261824,871631182
01/01/2016,06:05:00,7261824,871631183
01/01/2016,11:04:00,7261824,871631184
01/01/2016,14:10:00,7261824,871631185
'''), parse_dates=dtc)

which gives

>>> df1
      date_start_time  employee_id  session_id
0 2016-01-01 02:03:00      7261824   871631182
1 2016-01-01 06:03:00      7261824   871631183
2 2016-01-01 11:01:00      7261824   871631184
3 2016-01-01 14:01:00      7261824   871631185
>>> df2
      date_start_time  employee_id  session_id
0 2016-01-01 02:03:00      7261824   871631182
1 2016-01-01 06:05:00      7261824   871631183
2 2016-01-01 11:04:00      7261824   871631184
3 2016-01-01 14:10:00      7261824   871631185

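When merging, you want df2[0:3] to be treated as duplicates of df1[0:3] (since each pair is less than 5 minutes apart), but df1[3] and df2[3] to be treated as separate sessions.

Solution 1: Interval matching

This is essentially what you suggested in your edit. You want to map the timestamps in both tables to a 10-minute interval centered on the timestamp, rounded to the nearest 5 minutes.

Each interval can be represented uniquely by its midpoint, so you can merge the dataframes on the timestamp rounded to the nearest 5 minutes. For example: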

import numpy as np

# threshold (half the width of the 10-minute interval) in nanoseconds
threshold_ns = threshold * 60 * 1e9

# compute "interval" to which each session belongs
df1['interval'] = to_datetime(np.round(df1.date_start_time.astype(np.int64) / threshold_ns) * threshold_ns)
df2['interval'] = to_datetime(np.round(df2.date_start_time.astype(np.int64) / threshold_ns) * threshold_ns)

# join
cols = ['interval', 'employee_id', 'session_id']
print(df1.merge(df2, on=cols, how='outer')[cols])

prints

             interval  employee_id  session_id
0 2016-01-01 02:05:00      7261824   871631182
1 2016-01-01 06:05:00      7261824   871631183
2 2016-01-01 11:00:00      7261824   871631184
3 2016-01-01 14:00:00      7261824   871631185
4 2016-01-01 11:05:00      7261824   871631184
5 2016-01-01 14:10:00      7261824   871631185

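Note that this is not quite correct. The sessions df1[2] and df2[2] are not treated as duplicates even though they are only 3 minutes apart. This is because they fall on different sides of an interval boundary.

Solution 2: One-to-one matching

Here is another approach, which relies on the condition that each session in df1 has either zero or one duplicate in df2.

We replace each timestamp in df1 with the closest timestamp in df2 that matches on employee_id and session_id and is less than 5 minutes away.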

from datetime import timedelta

# get closest match from "df2" to row from "df1" (as long as it's below the threshold)
def closest(row):
    matches = df2.loc[(df2.employee_id == row.employee_id) &
                      (df2.session_id == row.session_id)]

    deltas = matches.date_start_time - row.date_start_time
    deltas = deltas.loc[deltas <= timedelta(minutes=threshold)]

    try:
        return matches.loc[deltas.idxmin()]
    except ValueError:  # no items
        return row

# replace timestamps in "df1" with closest timestamps in "df2"
df1 = df1.apply(closest, axis=1)

# join
cols = ['date_start_time', 'employee_id', 'session_id']
print(df1.merge(df2, on=cols, how='outer')[cols])

prints

      date_start_time  employee_id  session_id
0 2016-01-01 02:03:00      7261824   871631182
1 2016-01-01 06:05:00      7261824   871631183
2 2016-01-01 11:04:00      7261824   871631184
3 2016-01-01 14:01:00      7261824   871631185
4 2016-01-01 14:10:00      7261824   871631185

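This approach is significantly slower, because you have to search all of df2 for each row in df1. What I have written could probably be optimized further, but it will still take a long time on large datasets.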
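
One possible way to avoid the row-wise apply, sketched here as an alternative rather than as the answer's own method, is to merge on the exact keys first and then filter the candidate pairs by their time difference. It assumes the df2 timestamp is never earlier than the df1 one, as stated in the question; the `_df1` and `_df2` suffixes are chosen here for illustration.

```python
import pandas as pd

threshold = pd.Timedelta(minutes=5)
keys = ["employee_id", "session_id"]
sessions = [871631182, 871631183, 871631184, 871631185]

df1 = pd.DataFrame({
    "date_start_time": pd.to_datetime([
        "2016-01-01 02:03:00", "2016-01-01 06:03:00",
        "2016-01-01 11:01:00", "2016-01-01 14:01:00"]),
    "employee_id": [7261824] * 4,
    "session_id": sessions,
})
df2 = pd.DataFrame({
    "date_start_time": pd.to_datetime([
        "2016-01-01 02:03:00", "2016-01-01 06:05:00",
        "2016-01-01 11:04:00", "2016-01-01 14:10:00"]),
    "employee_id": [7261824] * 4,
    "session_id": sessions,
})

# Pair every df1 row with every df2 row that shares the same keys.
pairs = df1.merge(df2, on=keys, suffixes=("_df1", "_df2"))

# Keep only pairs where df2 is at or after df1 and within the threshold.
delta = pairs["date_start_time_df2"] - pairs["date_start_time_df1"]
close = pairs[(delta >= pd.Timedelta(0)) & (delta <= threshold)]
print(close)
```

The intermediate `pairs` frame grows with the number of rows that share the same keys, so this trades memory for speed when one employee has many identical sessions.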

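Answer 2: (score: 0)

I would suggest using the built-in pandas Series dt rounding function to round both dataframes to a common time, for example to every 5 minutes. The times will then always have the form 01:00:00, then 01:05:00, and so on. That way both dataframes will have comparable time values on which to perform the merge.

See the documentation and examples here: pandas.Series.dt.round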
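
As a small illustration of that suggestion, the sketch below rounds two pairs of timestamps taken from the sample data in Answer 1 and checks whether each pair lands in the same 5-minute bucket. The variable names are made up for this example.

```python
import pandas as pd

# Two sessions from each frame: 2 minutes apart and 3 minutes apart.
t1 = pd.Series(pd.to_datetime(["2016-01-01 06:03:00", "2016-01-01 11:01:00"]))
t2 = pd.Series(pd.to_datetime(["2016-01-01 06:05:00", "2016-01-01 11:04:00"]))

# Round both sides to the nearest 5 minutes.
r1 = t1.dt.round("5min")
r2 = t2.dt.round("5min")

# True where both timestamps fall into the same rounded bucket.
same_bucket = r1 == r2
print(r1.tolist())
print(r2.tolist())
print(same_bucket.tolist())
```

The first pair rounds to 06:05 on both sides, but the second rounds to 11:00 and 11:05, so rounding alone can miss timestamps that sit close together on opposite sides of a bucket boundary.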
