Question

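I'm trying to find outliers in seconds by standard deviation. I have two dataframes, shown below. The outliers I'm looking for are values more than 1.5 standard deviations from the mean for that day of the week. My current code is below the dataframes.

df1: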

name    dateTime              Seconds
joe     2015-02-04 12:12:12   54321.0202
john    2015-01-02 13:13:13   12345.0101
joe     2015-02-04 12:12:12   54321.0202
john    2015-01-02 13:13:13   12345.0101
joe     2015-02-04 12:12:12   54321.0202
john    2015-01-02 13:13:13   12345.0101
joe     2015-02-04 12:12:12   54321.0202
john    2015-01-02 13:13:13   12345.0101
joe     2015-02-04 12:12:12   54321.0202
john    2015-01-02 13:13:13   12345.0101
joe     2015-02-04 12:12:12   54321.0202
joe     2015-01-02 13:13:13   12345.0101

Current output: df2

name   day   standardDev        mean           count
Joe    mon   22326.502700       40900.730647   1886
       tue   9687.486726        51166.213836   159
john   mon   10072.707891       41380.035108   883
       tue   5499.475345        26985.938776   196

Expected output:

df2

name   day   standardDev        mean           count     events
Joe    mon   22326.502700       40900.730647   1886      [2015-02-04 12:12:12, 2015-02-04 12:12:13]
       tue   9687.486726        51166.213836   159       [2015-02-04 12:12:12, 2015-02-04 12:12:14]
john   mon   10072.707891       41380.035108   883       [2015-01-02 13:13:13, 2015-01-02 13:13:15]
       tue   5499.475345        26985.938776   196       [2015-01-02 13:13:13, 2015-01-02 13:13:18]

Code:

import glob

import numpy as np
import pandas as pd

# folderPath is the directory holding the CSV files (defined elsewhere)
allFiles = glob.glob(folderPath + "/*.csv")
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None, names=['EventTime', "IpAddress", "Hostname", "TargetUserName", "AuthenticationPackageName", "TargetDomainName", "EventReceivedTime"])
    df = df.iloc[1:]  # skip the first row (presumably each file's header line); .ix was removed from pandas
    list_.append(df)
df = pd.concat(list_)
df['DateTime'] = pd.to_datetime(df['EventTime'])
df['day_of_week'] = df.DateTime.dt.strftime('%a')
# seconds elapsed since midnight
df['seconds'] = pd.to_timedelta(df.DateTime.dt.time.astype(str)).dt.seconds
# nested-dict agg is no longer supported; named aggregation gives the same three columns
print(df.groupby(['TargetUserName', 'day_of_week'])['seconds'].agg(
    mean='mean', std=lambda x: np.std(x), count='count'))

Answer 1

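This is a slight adaptation of an example from the pandas docs. I didn't create columns for the mean and std, but you could easily add them if you want to see them.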

import numpy as np
import pandas as pd

np.random.seed(1111)
df=pd.DataFrame({ 'name':     ['joe','john']*30, 
                  'dateTime': pd.date_range('1-1-2015',periods=60),
                  'Seconds':  np.random.randn(60)+5000. })

grp = df.groupby(['name',df.dateTime.dt.dayofweek])['Seconds']
df['zscore'] = grp.transform( lambda x: (x-x.mean())/x.std())

df[ df['zscore'].abs() > 1.5 ]
Out[79]: 
        Seconds   dateTime  name    zscore
1   4998.927011 2015-01-02  john -1.522488
42  5001.275866 2015-02-12   joe  1.636829
58  4999.124550 2015-02-28   joe -1.624945
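
If you do want the per-group mean and std next to each row, the same groupby can be reused with `transform`. This is only a minimal sketch on the same toy data; the column names `grp_mean` and `grp_std` are made up for illustration. Note that pandas' `std` is the sample standard deviation (ddof=1), while `np.std` as used in the question defaults to ddof=0, so the two can differ slightly.

```python
import numpy as np
import pandas as pd

np.random.seed(1111)
df = pd.DataFrame({'name': ['joe', 'john'] * 30,
                   'dateTime': pd.date_range('1-1-2015', periods=60),
                   'Seconds': np.random.randn(60) + 5000.})

grp = df.groupby(['name', df.dateTime.dt.dayofweek])['Seconds']

# Broadcast each group's statistics back onto its own rows
# (grp_mean / grp_std are illustrative column names).
df['grp_mean'] = grp.transform('mean')
df['grp_std'] = grp.transform('std')  # sample std, ddof=1

# The |zscore| > 1.5 test, written out as a distance from the group mean.
is_outlier = (df['Seconds'] - df['grp_mean']).abs() > 1.5 * df['grp_std']
outliers = df[is_outlier]
print(outliers)
```

Keeping the mean and std as columns makes it easy to check by eye why a given row was flagged.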

df.head(10)
Out[80]:
       Seconds   dateTime  name    zscore
0  4998.699990 2015-01-01   joe -0.959960
1  4998.927011 2015-01-02  john -1.522488
2  5000.790199 2015-01-03   joe  0.263690
3  4999.121735 2015-01-04  john -1.005137
4  5001.501822 2015-01-05   joe  1.132407
5  4999.976071 2015-01-06  john  0.678951
6  5000.275949 2015-01-07   joe  0.650297
7  4999.033607 2015-01-08  john -0.964222
8  4998.419685 2015-01-09   joe -1.328744
9  4999.796325 2015-01-10  john  1.224198
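
The expected output in the question also has an `events` column listing the outlier timestamps for each name and day. One possible way to get there is sketched below, assuming a frame with the `name`, `dateTime`, and `Seconds` columns used above; the helper column `day` and the variable names `stats`, `events`, and `df2` are illustrative only.

```python
import numpy as np
import pandas as pd

np.random.seed(1111)
df = pd.DataFrame({'name': ['joe', 'john'] * 30,
                   'dateTime': pd.date_range('1-1-2015', periods=60),
                   'Seconds': np.random.randn(60) + 5000.})

# Abbreviated weekday name, as in the question's day_of_week column.
df['day'] = df['dateTime'].dt.strftime('%a')
keys = ['name', 'day']

# Per-group summary: one row per (name, day).
stats = df.groupby(keys)['Seconds'].agg(
    standardDev='std', mean='mean', count='count')

# Row-level z-score relative to the row's own (name, day) group.
z = df.groupby(keys)['Seconds'].transform(lambda x: (x - x.mean()) / x.std())

# Collect the timestamps of the outlier rows into one list per group.
events = (df.loc[z.abs() > 1.5]
            .groupby(keys)['dateTime']
            .agg(list)
            .rename('events'))

# Left join: groups without any outlier keep NaN in the events column.
df2 = stats.join(events)
print(df2)
```

The join is on the shared (name, day) index, so the summary statistics and the event lists line up without an explicit merge key.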
