Question

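I have a DataFrame with a timestamp index (tens of thousands of rows) and a list of timestamps corresponding to certain events. I need to flag every row in the DataFrame that falls within n minutes before any event, so I wrote the following code: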

for timestamp in events:
    df.loc[timestamp - timespan : timestamp, 'is_before_event'] = True

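It turned out to be slow, so I tried first building an index of all the rows that have to change and then doing a single assignment for all of them: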

temp_index = df.index[:0]  # start from an empty index of the same type
for timestamp in events:
    temp_index = temp_index.append(df.loc[timestamp - timespan : timestamp].index)
df.loc[df.index.isin(temp_index), 'is_before_event'] = True

This code runs at least 100 times faster than my first attempt.

Why is that, and what is the correct way to do the assignment in this case?

Answer 1

I think you can assign the boolean mask directly to the column, without loc, if you need both True and False values.

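numpy.concatenate is also needed to join all the indexes, and numpy.unique removes the duplicates (this step is optional when the result is only passed to isin).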

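temp_index = []
for timestamp in events:
    # collect the index labels of each window that ends at an event
    temp_index.append(df.loc[timestamp - timespan : timestamp].index)
# a single assignment: True where the label is in any window, False elsewhere
df['is_before_event'] = df.index.isin(np.concatenate(temp_index))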

Sample (the list comprehension is equivalent to the solution above):

import numpy as np
import pandas as pd

rng = pd.date_range('2017-04-03', periods=20, freq='min')  # one row per minute
df = pd.DataFrame({'a': range(20)}, index=rng)

events = pd.to_datetime(['2017-04-03 00:03:00', '2017-04-03 00:09:45'])
t = pd.Timedelta('00:03:00')

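# index labels of every window [event - t, event]
temp_index = [df.loc[timestamp - t : timestamp].index for timestamp in events]
# join the windows and drop labels that appear in more than one of them
idx = np.unique(np.concatenate(temp_index))
df['is_before_event'] = df.index.isin(idx)
print(df)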
                      a  is_before_event
2017-04-03 00:00:00   0             True
2017-04-03 00:01:00   1             True
2017-04-03 00:02:00   2             True
2017-04-03 00:03:00   3             True
2017-04-03 00:04:00   4            False
2017-04-03 00:05:00   5            False
2017-04-03 00:06:00   6            False
2017-04-03 00:07:00   7             True
2017-04-03 00:08:00   8             True
2017-04-03 00:09:00   9             True
2017-04-03 00:10:00  10            False
2017-04-03 00:11:00  11            False
2017-04-03 00:12:00  12            False
2017-04-03 00:13:00  13            False
2017-04-03 00:14:00  14            False
2017-04-03 00:15:00  15            False
2017-04-03 00:16:00  16            False
2017-04-03 00:17:00  17            False
2017-04-03 00:18:00  18            False
2017-04-03 00:19:00  19            False
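
The two approaches from the question can be compared on synthetic data. The sketch below is only an illustration: the frame size, the event spacing, and the helper names mark_per_event and mark_once are assumptions, not part of the original post. It checks that both approaches produce the same column and prints timings, which will vary with the pandas version and the machine.

```python
import time

import numpy as np
import pandas as pd

# Synthetic data (assumed sizes): 50,000 one-minute rows and 100 events.
rng = pd.date_range('2017-04-03', periods=50_000, freq='min')
events = rng[::500] + pd.Timedelta(seconds=45)
timespan = pd.Timedelta(minutes=3)


def mark_per_event(index, events, timespan):
    """One .loc assignment per event, as in the question's first attempt."""
    df = pd.DataFrame({'a': np.arange(len(index))}, index=index)
    df['is_before_event'] = False
    for timestamp in events:
        df.loc[timestamp - timespan : timestamp, 'is_before_event'] = True
    return df


def mark_once(index, events, timespan):
    """Collect the window labels first, then assign the whole column once."""
    df = pd.DataFrame({'a': np.arange(len(index))}, index=index)
    parts = [df.loc[timestamp - timespan : timestamp].index for timestamp in events]
    df['is_before_event'] = df.index.isin(np.concatenate(parts))
    return df


start = time.perf_counter()
looped = mark_per_event(rng, events, timespan)
t_looped = time.perf_counter() - start

start = time.perf_counter()
single = mark_once(rng, events, timespan)
t_single = time.perf_counter() - start

# Both approaches should flag exactly the same rows.
same = looped['is_before_event'].equals(single['is_before_event'])
print(f'per-event assignment: {t_looped:.4f}s')
print(f'single assignment:    {t_single:.4f}s')
print(f'identical result:     {same}')
```

Measuring on your own frame is the more reliable guide, since the gap depends on how many events there are and how wide each window is.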

A similar solution:

temp_index = [df.loc[timestamp - t : timestamp].index for timestamp in events]
idx = np.unique(np.concatenate(temp_index))
# initialise the column to False, then set all matching rows with one .loc call
df['is_before_event'] = False
df.loc[idx, 'is_before_event'] = True
print(df)
                      a  is_before_event
2017-04-03 00:00:00   0             True
2017-04-03 00:01:00   1             True
2017-04-03 00:02:00   2             True
2017-04-03 00:03:00   3             True
2017-04-03 00:04:00   4            False
2017-04-03 00:05:00   5            False
2017-04-03 00:06:00   6            False
2017-04-03 00:07:00   7             True
2017-04-03 00:08:00   8             True
2017-04-03 00:09:00   9             True
2017-04-03 00:10:00  10            False
2017-04-03 00:11:00  11            False
2017-04-03 00:12:00  12            False
2017-04-03 00:13:00  13            False
2017-04-03 00:14:00  14            False
2017-04-03 00:15:00  15            False
2017-04-03 00:16:00  16            False
2017-04-03 00:17:00  17            False
2017-04-03 00:18:00  18            False
2017-04-03 00:19:00  19            False
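
If the Python-level loop over the events itself becomes a bottleneck, one possible alternative (not from the original answer) is to avoid slicing altogether: for every row, look up the next event at or after its timestamp with numpy.searchsorted and test whether it is within the time span. This is a minimal sketch; the function name mark_before_events is hypothetical, and it assumes the index and the events are both tz-naive.

```python
import numpy as np
import pandas as pd


def mark_before_events(index, events, timespan):
    """Return a boolean array: True where some event lies in [ts, ts + timespan]."""
    ts = pd.DatetimeIndex(index).values.astype('datetime64[ns]')
    ev = np.sort(pd.DatetimeIndex(events).values.astype('datetime64[ns]'))
    if len(ev) == 0:
        return np.zeros(len(ts), dtype=bool)
    # Position of the first event that is >= each row timestamp.
    pos = np.searchsorted(ev, ts, side='left')
    has_next = pos < len(ev)
    # Clip so rows after the last event still index safely; they are masked out.
    next_event = ev[np.minimum(pos, len(ev) - 1)]
    within = (next_event - ts) <= pd.Timedelta(timespan).to_timedelta64()
    return has_next & within


# Same sample data as in the answer above.
rng = pd.date_range('2017-04-03', periods=20, freq='min')
df = pd.DataFrame({'a': range(20)}, index=rng)
events = pd.to_datetime(['2017-04-03 00:03:00', '2017-04-03 00:09:45'])
t = pd.Timedelta('00:03:00')

df['is_before_event'] = mark_before_events(df.index, events, t)
print(df)
```

The window here is closed on both ends, matching the label-based slice `timestamp - t : timestamp` used in the answer.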

Pandas .loc multiple assignments vs. a single assignment
