Silhouette index for choosing the appropriate number of clusters in K-means clustering

Time: 2014-01-14 09:28:05

Tags: java cluster-analysis k-means

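I am using the Silhouette Index to choose the appropriate number of clusters in KMeans clustering. The code for the Silhouette Index is here. Based on that code, I wrote my own (see below). The problem is that for any dataset, the preferred number of clusters always equals the maximum, which is 15 in this case. Is there a bug in my code?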

private double getSilhouetteIndex(double[][] distanceMatrix,ClusterEvaluation ceval)
{
    double si_index = 0;
    double[] ca = ceval.getClusterAssignments();
    double[] d_arr = new double[ca.length];
    List<Double> si_indexes = new ArrayList<Double>();

    for (int i=0; i<ca.length; i++)
    {
        // STEP 1. Compute the average distance between the i-th point and all other points of a given cluster
        double a = averageDist(distanceMatrix,ca,i,1);

        // STEP 2. Compute the average distance between the i-th point and all points of other clusters
        for (int j=0; j<ca.length; j++)
        {
            double d = averageDist(distanceMatrix,ca,j,2);
            d_arr[j] = d;
        }

        // STEP 3. Compute the distance from the i-th point to the nearest cluster to which it does not belong
        double b = d_arr[0];
        for (Double _d : d_arr)
        {
            if (_d < b)
                b = _d;
        }

        // STEP 4. Compute the Silhouette index for the i-th point
        double si = (b - a)/Math.max(a,b);

        si_indexes.add(si);
    }

    // STEP 5. Compute the average index over all observations
    double sum = 0;
    for(Double _si : si_indexes)
    {
         sum += _si;
    }
    si_index = sum/si_indexes.size();

    return si_index;
}

private double averageDist(double[][] distanceMatrix, double[] ca, int id, int calc)
{       
    double avgDist = 0;
    double sum = 0;
    int len = 0;

    // Distances inside the cluster
    if (calc == 1)
    {
        for (int i = 0; i<ca.length; i++)
        {
            if (ca[i] == ca[id] && i != id)
            {
                sum += distanceMatrix[id][i];
                len++;
            }
        }
    }
    // Distances outside the cluster
    else
    {
        for (int i = 0; i<ca.length; i++)
        {
            if (ca[i] != ca[id] && i != id)
            {
                sum += distanceMatrix[id][i];
                len++;
            }
        }
    }

    avgDist = sum/len;

    return avgDist;
}

1 answer:

Answer 0: (score: 0)

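For the Silhouette Index, as far as I understand, when you compute the average distance to points outside the cluster, it should actually use the points from the nearest neighbor cluster rather than all points outside the cluster.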
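
To make that concrete, here is a minimal sketch of how the per-point computation could look. It is not a drop-in replacement for the question's method: the class name SilhouetteSketch and the use of an int[] for cluster labels (instead of the double[] returned by getClusterAssignments()) are assumptions made for illustration. For each point i it computes a(i) as the mean distance to the other members of its own cluster, and b(i) as the smallest mean distance from i to the members of any other single cluster. Points in singleton clusters get a silhouette of 0, which is the usual convention. Note also that in the question's code the Step 2 loop calls averageDist with j rather than i, so b ends up being the same value for every point; the sketch always measures distances from point i.

```java
import java.util.HashMap;
import java.util.Map;

final class SilhouetteSketch {

    // Mean silhouette over all points.
    // distanceMatrix: symmetric n x n matrix of pairwise distances.
    // assignments[i]: cluster label of point i (int labels are an assumption).
    static double silhouetteIndex(double[][] distanceMatrix, int[] assignments) {
        int n = assignments.length;
        if (n == 0) {
            return 0.0;
        }
        double total = 0.0;

        for (int i = 0; i < n; i++) {
            // Sum of distances and number of points, per cluster, as seen from point i.
            Map<Integer, Double> sums = new HashMap<>();
            Map<Integer, Integer> counts = new HashMap<>();
            for (int j = 0; j < n; j++) {
                if (j == i) {
                    continue;
                }
                sums.merge(assignments[j], distanceMatrix[i][j], Double::sum);
                counts.merge(assignments[j], 1, Integer::sum);
            }

            int own = assignments[i];
            if (!counts.containsKey(own)) {
                continue; // singleton cluster: s(i) = 0 by convention
            }

            // a(i): mean distance to the other members of i's own cluster.
            double a = sums.get(own) / counts.get(own);

            // b(i): smallest mean distance from i to any single other cluster.
            double b = Double.POSITIVE_INFINITY;
            for (Map.Entry<Integer, Double> e : sums.entrySet()) {
                int cluster = e.getKey();
                if (cluster == own) {
                    continue;
                }
                b = Math.min(b, e.getValue() / counts.get(cluster));
            }
            if (Double.isInfinite(b)) {
                continue; // only one cluster exists: treat s(i) as 0
            }

            double denom = Math.max(a, b);
            total += (denom == 0.0) ? 0.0 : (b - a) / denom;
        }
        return total / n;
    }
}
```

With this definition, splitting the data into more and more clusters no longer automatically pushes the index up, because b(i) is measured against the closest neighboring cluster, and that cluster gets closer as the number of clusters grows.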