Question

I have a dataset with 4 columns, but I only need two of them: one is wayId and the other is speed.

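The sample dataset I am working with has about 25 million rows. Each wayId has multiple speed values, so for each unique wayId I need to find the outlying speeds and delete the entire rows that contain them.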

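Suppose that for wayId 1 the speeds are 24, 32, 8, 28, 25, 55. Here the rows containing 8 and 55 must be removed from the dataset. I have written 2-3 programs for this, but they are all too slow. I need this processing to finish in 1-2 seconds. In one program I also used multiprocessing, and it was still slow.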
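To make the example concrete, here is a minimal sketch of the z-score arithmetic for that single wayId. The cut-off of 1.0 standard deviations is an assumption chosen for illustration; the question does not state which threshold is intended.

```python
import pandas as pd

# Speeds for wayId 1, taken from the example above.
speeds = pd.Series([24, 32, 8, 28, 25, 55], dtype="float64")

# Assumed cut-off (not stated in the question): 1 standard deviation.
threshold = 1.0

# Standard score of each speed within the group (pandas uses the sample std, ddof=1).
z = (speeds - speeds.mean()) / speeds.std()

kept = speeds[z.abs() <= threshold]
removed = speeds[z.abs() > threshold]

print(kept.tolist())     # speeds that stay
print(removed.tolist())  # speeds flagged as outliers
```

Under this assumed threshold, 8 and 55 are the only values more than one standard deviation (about 15.3) away from the group mean (about 28.7), so they are the two that get flagged.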

This code also takes about 130 seconds. Here is a glimpse of the code I have written:

stds = 0
z = data[['wayId','speed']].groupby('wayId').transform(lambda group : (group - group.mean()).div(group.std()>stds))


outliers = z.abs() > stds

data=data[outliers.any(axis=1)]

data =  data.sort_values('wayId',ascending=True)

print(data.head(12).to_dict)

And it gives this output:
<bound method DataFrame.to_dict of
           wayId  speed  savingTime  reverse
14880   64579671   18.5  1555391776    False
71176   64579671   18.5  1555391536    False
42482   64579671   18.5  1555391655    False
99383   64579671   18.5  1555391415    False
127647  64579671   34.5  1555391295    False
33390   64579691   48.5  1555391655    False
228657  64579691   73.5  1555390816    False
153167  64579691   65.5  1555391175    False
60517   64579691   48.5  1555391536    False
96053   64579691   48.5  1555391415    False
12322   64579691   48.5  1555391776    False
125722  64579691   48.5  1555391295    False>

How can I remove outliers from a large dataset in 1-2 seconds using Python?
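One possible direction, sketched under assumptions rather than benchmarked: compute the per-wayId mean and standard deviation with pandas' built-in groupby aggregations instead of a Python lambda, then filter with a single boolean mask. The function name `drop_speed_outliers`, the default threshold of 1.0, and the sample frame are hypothetical; only the wayId 1 speeds come from the question. Whether this reaches 1-2 seconds on 25 million rows has not been measured here.

```python
import pandas as pd


def drop_speed_outliers(df: pd.DataFrame, threshold: float = 1.0) -> pd.DataFrame:
    """Keep rows whose speed is within `threshold` std devs of its wayId mean."""
    grouped = df.groupby("wayId")["speed"]
    # Built-in aggregations are broadcast back to the original row order.
    mean = grouped.transform("mean")
    std = grouped.transform("std")
    deviation = (df["speed"] - mean).abs()
    # Groups with a single row (std is NaN) or identical speeds (std is 0)
    # have no meaningful z-score, so their rows are kept.
    keep = std.isna() | (std == 0) | (deviation <= threshold * std)
    return df[keep]


# Tiny made-up frame for illustration (assumed data apart from wayId 1).
sample = pd.DataFrame({
    "wayId": [1, 1, 1, 1, 1, 1, 2, 2, 3],
    "speed": [24, 32, 8, 28, 25, 55, 40.0, 40.0, 60.0],
})
cleaned = drop_speed_outliers(sample)
print(cleaned)
```

Comparing `deviation` against `threshold * std` avoids dividing by a zero standard deviation, which is why the z-score is not computed explicitly in this sketch.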

0 answers: