NumPy's standard deviation method gives a divide-by-zero error

Asked: 2017-01-02 17:12:50

Tags: python numpy

I wrote a function to normalise a set of features for a machine learning algorithm. It takes a rectangular 2D NumPy array features and returns a regularised version reg_features (I am using the Boston housing prices data from scikit-learn for training). The exact code:

import tensorflow as tf
import numpy as np
from sklearn.datasets import load_boston
from pprint import pprint

def regularise(features):

    # Regularised features:
    reg_features = np.zeros(features.shape)

    for x in range(len(features)):
        for y in range(len(features[x])):

            reg_features[x][y] = (features[x][y] - np.mean(features[:, y])) / np.std(features[:, y])

    return reg_features

# Get the data
total_features, total_prices = load_boston(True)

# Keep 300 samples for training
train_features = regularise(total_features[:300])        # Works OK
train_prices = total_prices[:300]

# Keep 100 samples for validation
valid_features = regularise(total_features[300:400])     # Works OK
valid_prices = total_prices[300:400]

# Keep remaining samples as test set
test_features = regularise(total_features[400:])         # Does not work
test_prices = total_prices[400:]
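(As an aside, the same column-wise normalisation can be written without the explicit double loop by using NumPy broadcasting; this is a sketch equivalent to regularise above, not part of my original code:)

```python
import numpy as np

def regularise_vectorised(features):
    # Subtract each column's mean and divide by each column's standard
    # deviation; broadcasting applies this to every row at once.
    return (features - features.mean(axis=0)) / features.std(axis=0)

data = np.array([[1.0, 2.0],
                 [3.0, 4.0],
                 [5.0, 6.0]])
print(regularise_vectorised(data))
```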

Note that I only get this error on the last call to regularise(), the one passed total_features[400:]:


/Users/RohanSaxena/Documents/projects/sdc/tensor/reg.py:11: RuntimeWarning: invalid value encountered in double_scalars
  reg_features[x][y] = (features[x][y] - np.mean(features[:, y])) / np.std(features[:, y])

The rest of this question concerns that last call. To check whether one of the standard deviations is zero, I do:

for y in range(len(features[0])):
    if np.std(features[:, y]) == 0.:
        print(np.std(features[:, y]))

This prints all zeros, i.e.:

0.0
0.0
...
0.0

a total of features[0].size times. That would mean the standard deviation of every column in features is zero.

Now this seems very strange, so I print every single standard deviation to be sure:

for y in range(len(features[0])):
    print(np.std(features[:, y]))

and I get all non-zero values:

10.9976293017
23.3483275632
6.63216140033
....
8.00329244499

How is this possible? Just before, prefixed with an if condition, this same code gave me all zeros, and now it gives non-zero values! This makes no sense to me. Any help is appreciated.

2 Answers

Answer 0 (score: 1)

The problem is caused by the subset of the data, total_features[400:]. If you look at that data, you'll see that the columns total_features[400:, 1] and total_features[400:, 3] are all 0. That breaks your code, because for those columns both the mean and the standard deviation are 0, and the result is 0/0.
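A minimal synthetic illustration of that failure mode (a sketch on made-up data, not the Boston data itself):

```python
import numpy as np

# Column 1 is all zeros -- the same situation as total_features[400:, 1]
# and total_features[400:, 3] in the question.
data = np.array([[1.0, 0.0],
                 [2.0, 0.0],
                 [3.0, 0.0]])

print(np.mean(data[:, 1]))   # 0.0
print(np.std(data[:, 1]))    # 0.0

# Normalising any entry of that column computes (0 - 0) / 0 = nan,
# which is what triggers the "invalid value encountered" RuntimeWarning.
with np.errstate(invalid='ignore'):
    print((data[0, 1] - np.mean(data[:, 1])) / np.std(data[:, 1]))   # nan
```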

Instead of writing your own normalisation function, you could use sklearn.preprocessing.scale. That function handles constant columns by returning columns that are all 0.

You can easily verify that scale performs the same calculation as regularise:

In [68]: test
Out[68]: 
array([[ 15.,   1.,   0.],
       [  3.,   4.,   5.],
       [  6.,   7.,   8.],
       [  9.,  10.,  11.],
       [ 12.,  13.,   1.]])

In [69]: regularise(test)
Out[69]: 
array([[ 1.41421356, -1.41421356, -1.20560706],
       [-1.41421356, -0.70710678,  0.        ],
       [-0.70710678,  0.        ,  0.72336423],
       [ 0.        ,  0.70710678,  1.44672847],
       [ 0.70710678,  1.41421356, -0.96448564]])

In [70]: from sklearn.preprocessing import scale

In [71]: scale(test)
Out[71]: 
array([[ 1.41421356, -1.41421356, -1.20560706],
       [-1.41421356, -0.70710678,  0.        ],
       [-0.70710678,  0.        ,  0.72336423],
       [ 0.        ,  0.70710678,  1.44672847],
       [ 0.70710678,  1.41421356, -0.96448564]])

The following shows how the two functions handle a column of zeros:

In [72]: test[:,2] = 0

In [73]: test
Out[73]: 
array([[ 15.,   1.,   0.],
       [  3.,   4.,   0.],
       [  6.,   7.,   0.],
       [  9.,  10.,   0.],
       [ 12.,  13.,   0.]])

In [74]: regularise(test)
/Users/warren/miniconda3/bin/ipython:9: RuntimeWarning: invalid value encountered in double_scalars
Out[74]: 
array([[ 1.41421356, -1.41421356,         nan],
       [-1.41421356, -0.70710678,         nan],
       [-0.70710678,  0.        ,         nan],
       [ 0.        ,  0.70710678,         nan],
       [ 0.70710678,  1.41421356,         nan]])

In [75]: scale(test)
Out[75]: 
array([[ 1.41421356, -1.41421356,  0.        ],
       [-1.41421356, -0.70710678,  0.        ],
       [-0.70710678,  0.        ,  0.        ],
       [ 0.        ,  0.70710678,  0.        ],
       [ 0.70710678,  1.41421356,  0.        ]])
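If you'd rather keep a hand-rolled function, one common pattern (a sketch, not from the original answer) is to substitute 1 for any zero standard deviation, so constant columns come out as all zeros and mimic the behaviour of scale:

```python
import numpy as np

def regularise_safe(features):
    # Column-wise mean and standard deviation.
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    # Replace zero stds with 1 so constant columns divide to 0, not nan.
    safe_std = np.where(std == 0, 1.0, std)
    return (features - mean) / safe_std

test = np.array([[15.,  1., 0.],
                 [ 3.,  4., 0.],
                 [ 6.,  7., 0.],
                 [ 9., 10., 0.],
                 [12., 13., 0.]])
print(regularise_safe(test))   # last column is all 0.0, and no warning
```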

Answer 1 (score: 0)

Usually when this happens, the first guess is that you are dividing the numerator by an int that is larger than it (rather than by a float), so the result comes out 0. But that does not appear to be the case here.
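(For illustration: in Python 3, / is always true division, so that pitfall only appears with floor division // or with / in Python 2:)

```python
# True division always produces a float in Python 3.
print(1 / 2)    # 0.5

# Floor division truncates towards negative infinity -- this is the
# "integer division gives 0" behaviour described above.
print(1 // 2)   # 0
```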

Sometimes the division does not happen term by term as you expect, but as a vector operation. However, that is not the case here either.

The problem here is how you are referencing the dataframe:

reg_features[x][y]

When working with a dataframe and assigning a value to a specific cell, you want to use the function loc.

You can read more about it here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html
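(For illustration, a minimal sketch of cell assignment with .loc, assuming a pandas DataFrame; note that the question's reg_features is actually a NumPy array, where chained [x][y] indexing behaves differently than on a DataFrame:)

```python
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})

# .loc[row_label, column_label] addresses a single cell directly,
# avoiding chained indexing such as df['a'][0] = ....
df.loc[0, 'a'] = 10.0
print(df.loc[0, 'a'])   # 10.0
```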