Question

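I want to remove the 9 orange outlier points from the plot below. To do that, I need to compute an accuracy score for each orange point and pick the 9 lowest. How can I do this? I know of functions that compute the accuracy of the whole prediction, but is there a way to compute it for each point?

I tried the following, but the x and y values I get from it don't match the outliers in the plot. (I'm using sklearn linear regression.)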

score_array = []
for i in range(len(x_train)):
    #reshaping to fit the predict() function
    x = np.array(x_train[i]).reshape(1, -1)
    pred = clf.predict(x)
    # calculating square difference of y_expected and y_predicted
    score = y_train[i]**2 - pred**2
    score_array.append(score) # array containing score for each dot
# larger the difference between squares, higher chance of being an outlier
# sorting array in descending order
score_array = sorted(score_array, key = float, reverse = True)
# first 9 members will have largest difference of squares
# outlier_score array contains score of 9 dots we want to remove
outlier_score = score_array[0:9]
outlier_array_x = []; outlier_array_y = []
# we traverse again to see which x and y result in highest scores
for i in range(len(x_train)):
    x = np.array(x_train[i]).reshape(1, -1)
    pred = clf.predict(x)
    score = y_train[i]**2 - pred**2
    # if the score for current index i is in outlier_score, we get x and y values
    if score in outlier_score:
        outlier_array_x.append(x_train[i])
        outlier_array_y.append(y_train[i])

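Edit: Thanks to the people below, I solved that problem, but now I'm having trouble removing the points. The old arrays each have length 90 and the new arrays have length 81, as expected, but when I plot the graph the 9 outliers are still there.

What is the best way to remove specific values from an array? I tried doing it like this, but then the x and y values got shuffled, which produced a completely different graph.

Edit 2:

I removed the elements using this loop: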

j = 0
for i in index_array:
    i = i - j
    del x_train[i]
    del y_train[i]
    j += 1

Answer 1

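y_train[i]**2 - pred**2 is not the distance between actual and expected. (Is y_train always greater than pred? Why would this distance measure be lowest for the outliers you pointed out?)

Try (y_train[i] - pred)**2 to get the actual distance.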
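
To make the difference concrete, here is a minimal sketch comparing the two expressions. The numbers are made-up toy values standing in for y_train and the output of clf.predict, not data from the question.

```python
# Hypothetical toy values standing in for y_train and clf.predict(x_train).
y_true = [1.0, 2.0, 3.0, -4.0]
y_pred = [1.5, 2.0, 1.0, 4.0]

# Difference of squares: signed, and zero whenever |y| == |pred|,
# even if the prediction is far off (see the last point).
diff_of_squares = [y**2 - p**2 for y, p in zip(y_true, y_pred)]

# Squared error: never negative, and it grows with the actual gap.
squared_error = [(y - p) ** 2 for y, p in zip(y_true, y_pred)]

print(diff_of_squares)  # [-1.25, 0.0, 8.0, 0.0]
print(squared_error)    # [0.25, 0.0, 4.0, 64.0]
```

The last point is off by 8 yet scores 0.0 under the difference of squares, which is why that score can miss obvious outliers.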

Answer 2

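Accuracy tells you how many data points were classified correctly. That makes no sense for a single data point, or for regression. You can use other functions instead, such as the mean squared error or any other "distance" from the prediction to the actual value.

Your score value is doing something along those lines. So you need to find the points with the largest score. You already have a score_array that you can sort and use directly. Then you don't need to recompute the predictions and look up floating-point values in an array.

Note that with L = [0.9, 0.1, 0.3, 0.4] you can use enumerate(L) to pair up the indices and the scores/values in L: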

>>> sorted(enumerate(L), key = lambda iv: iv[1], reverse = True)
[(0, 0.9), (3, 0.4), (2, 0.3), (1, 0.1)]

Then you can skip the first n of these. For example:

>>> sorted(enumerate(L), key = lambda iv: iv[1], reverse = True)[2:]
[(2, 0.3), (1, 0.1)]

So, instead of

score_array = sorted(score_array, key = float, reverse = True)

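try

score_array = sorted(enumerate(score_array), key = lambda iv: iv[1], reverse = True)

You can then drop the first few of these, since each entry carries the index of the corresponding x and y values. You could even throw away anything beyond a certain distance.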
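
As a rough sketch of the "beyond a certain distance" idea: instead of dropping a fixed number of points, keep only the indices whose score is at or below a cutoff. The threshold value here is an arbitrary assumption for illustration; you would have to pick one that suits your data.

```python
scores = [0.9, 0.1, 0.3, 0.4]  # the example list L from above
threshold = 0.35               # hypothetical cutoff, chosen only for this demo

# Indices of the points to keep, in their original order.
keep_idx = [i for i, s in enumerate(scores) if s <= threshold]

# Indices of the points treated as outliers.
drop_idx = [i for i, s in enumerate(scores) if s > threshold]

print(keep_idx)  # [1, 2]
print(drop_idx)  # [0, 3]
```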

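Edit:

We have established that you need the square of the error, not the difference of the squares, as the other answer points out.

To get the new training set, use the indices in score_array, which now holds (index, value) tuples, like this: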

 x_train = [x_train[x[0]] for x in score_array]

and do the same for the corresponding y values.
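
Putting the pieces together, here is a minimal sketch that removes the n highest-scoring points from both lists at once, so x and y stay paired and keep their original order. The data, the scores, and the names n_outliers, x_clean and y_clean are assumptions made up for this example; in the question, n_outliers would be 9 and the scores would come from (y_train[i] - pred)**2.

```python
# Hypothetical toy data; the third point is the obvious outlier.
x_train = [[0.0], [1.0], [2.0], [3.0], [4.0]]
y_train = [0.1, 1.0, 9.0, 3.1, 3.9]
score_array = [0.01, 0.0, 49.0, 0.01, 0.01]  # made-up per-point squared errors
n_outliers = 1

# Rank (index, score) pairs from largest to smallest score.
ranked = sorted(enumerate(score_array), key=lambda pair: pair[1], reverse=True)

# Collect the indices of the worst n points in a set for fast lookup.
drop = {i for i, _ in ranked[:n_outliers]}

# Filter both lists with the same index set so the pairs stay aligned.
x_clean = [x for i, x in enumerate(x_train) if i not in drop]
y_clean = [y for i, y in enumerate(y_train) if i not in drop]

print(x_clean)  # [[0.0], [1.0], [3.0], [4.0]]
print(y_clean)  # [0.1, 1.0, 3.1, 3.9]
```

Filtering by index avoids having to shift indices while deleting. Note that a model fitted earlier is unaffected by this, so clf would need to be fitted again on the cleaned lists before predicting or plotting.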

Removing outliers for linear regression (Python)
