Question

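I have a dataset with roughly 90k samples and 140 attributes. After splitting it into training and test sets, I am trying to build an xgboost model. My accuracy is very high, around 99%, which makes me suspect something is wrong. I took the first 100 samples from the test set and computed their Euclidean distances to every sample in the training set. I found that many rows in the training set have nearly identical values. Now I want to remove these rows from the training set. How can I do this? Is there a library function for it (compute pairwise Euclidean distances and drop the rows whose distance falls below a threshold)? Please help. Doing this for the first 100 rows is fine, but how do I do it for the whole test set? Is there an efficient way?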

Answer 1

You can use scikit-learn's euclidean_distances to compute the pairwise distances and find similar rows in the training set.

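Once you have the 2D distance array, you can use numpy's fill_diagonal to get rid of the zero distances that come from comparing each row with itself.
Then define a threshold and create a boolean array that marks the distances below it.
To get the indices of the True values, use np.where, which returns the y and x indices.
Now zip the y and x indices, and use a set comprehension together with sorted to remove mirrored duplicates (such as [0,5] and [5,0]).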

Finally, you have the pairs of rows whose distance is below the given threshold:

import pandas as pd
import numpy as np
from sklearn import metrics

# setup dummy data
df = pd.DataFrame(np.random.random(size=(30, 5)))
print(df.head())

          0         1         2         3         4
0  0.778678  0.041665  0.149135  0.171045  0.522252
1  0.993003  0.503661  0.799485  0.279497  0.735382
2  0.153082  0.897404  0.279562  0.561585  0.213728
3  0.376735  0.445812  0.931879  0.450042  0.154132
4  0.517949  0.779655  0.486816  0.785099  0.194537

# get distances
distances = metrics.pairwise.euclidean_distances(df)

# set self-distances to NaN
np.fill_diagonal(distances, np.nan)

# define threshold
threshold = 0.3

# get indices
y_index, x_index = np.where(distances < threshold)

# get unique indices
close_indices = {tuple(sorted(x)) for x in zip(y_index, x_index)}
print(close_indices)

>>> {(0, 26), (1, 10), (4, 12), (10, 18), (12, 14), (13, 27)}

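You can now iterate over close_indices and drop just one row from each pair. However, some rows may appear in more than one pair, so you have to handle that case.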
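One possible way to handle rows that appear in several pairs is a greedy pass: walk through the sorted pairs and drop the second row of a pair only if neither row has already been dropped. This is a minimal sketch, not the only reasonable policy; the helper name rows_to_drop is hypothetical, and the pairs are hard-coded from the example output above.

```python
import numpy as np
import pandas as pd


def rows_to_drop(close_pairs):
    """Greedily choose one row to drop per close pair (hypothetical helper)."""
    dropped = set()
    for first, second in sorted(close_pairs):
        # If one row of the pair is already gone, the pair is resolved.
        if first in dropped or second in dropped:
            continue
        dropped.add(second)
    return dropped


# Pairs copied from the example output above.
close_indices = {(0, 26), (1, 10), (4, 12), (10, 18), (12, 14), (13, 27)}
to_drop = rows_to_drop(close_indices)
print(sorted(to_drop))

# np.where returns positions, so map them to index labels before dropping.
df = pd.DataFrame(np.random.random(size=(30, 5)))
df_reduced = df.drop(index=df.index[sorted(to_drop)])
print(df_reduced.shape)
```

Note that with this policy rows 1 and 18 both survive: each was close to row 10, which is dropped, but they were not flagged as close to each other.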

Edit - MemoryError

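As the DataFrame grows, the Euclidean distance array quickly becomes too large to fit into memory, because its size is n * n. For 90k rows stored as float64, that is 90,000 * 90,000 * 8 bytes, or about 65 GB. To avoid this, you can iterate over the rows one at a time and compute the Euclidean distances from each row to the rows that follow it. Each resulting distance array then has at most n entries. In addition, using a generator keeps memory usage low, since the pairs are produced lazily. However, this solution is slower, because we have to iterate over every row.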

def iterate_distances(sub_df, threshold):

    def compute(sub_row):
        distances = metrics.pairwise.euclidean_distances(sub_df.iloc[sub_row, :].values.reshape(1, -1), 
                                                         sub_df.iloc[sub_row + 1:, :])

        y_index, x_index = np.where(distances < threshold)
        return ((sub_row, x + sub_row + 1) for x in x_index)

    row_count = sub_df.shape[0]

    return (index_pair 
            for row in range(row_count-1)
            for index_pair in compute(row))

result = iterate_distances(df, threshold)

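result is a generator. You can loop over it as usual, but only once. To display the results, you can use print(list(result)). You can improve performance by iterating over blocks of rows instead of a single row at a time.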
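The block-wise variant mentioned above could look roughly like the sketch below. It compares each block of rows with all rows from the start of that block onward, so the largest temporary array has chunk_size * n entries. The function name close_pairs_chunked and the chunk_size parameter are assumptions made for illustration; they are not part of the original answer.

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances


def close_pairs_chunked(X, threshold, chunk_size=1000):
    """Yield (i, j) with i < j whose Euclidean distance is below threshold."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Distances from the rows of this block to every row from `start` on.
        block = euclidean_distances(X[start:stop], X[start:])
        rows, cols = np.where(block < threshold)
        for r, c in zip(rows, cols):
            i, j = start + int(r), start + int(c)
            # Skip self-comparisons and mirrored pairs inside the block.
            if i < j:
                yield (i, j)


# Small deterministic example: two tight pairs and one isolated point.
points = np.array([[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.05], [10.0, 10.0]])
pairs = sorted(close_pairs_chunked(points, threshold=0.5, chunk_size=2))
print(pairs)
```

A larger chunk_size means fewer Python-level iterations at the cost of more memory per step, so it is worth tuning to the available RAM.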
