K-means|| a.k.a. Scalable K-Means++ - implementation problems

Time: 2016-04-24 15:02:43

Tags: java algorithm cluster-analysis k-means

Edit: code updated, comments, performance information

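I am trying to write K-means|| in Java (http://vldb.org/pvldb/vol5/p622_bahmanbahmani_vldb2012.pdf), but it is not working well. The longer run time compared with standard K-means does not surprise me. What puzzles me more is why my program's detection rate is lower when it is trained with K-means|| than when it is trained with standard K-means. How can choosing the cluster points deliberately be worse than choosing them by chance?

Update: I found some errors while my internet connection was down. k-means|| now performs as well as standard k-means, but no better.

I am fairly sure my code is wrong, but after several hours of searching I cannot see where I made the mistake (frankly, I am quite new to this topic).

So I hope you can see what I am doing wrong. Here is the code for my seeding options: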

    public void training(int stop, int numberIt, double epsilon, boolean advanced){
    double d=Double.MAX_VALUE,s=0;
    int nearestprototype=0;
    int [] myprototype=new int[trainingsSet.size()];
    Random random=new Random();
    //
    long t1=System.currentTimeMillis();
    if(!advanced){//standard random k-means seeding; random data points are chosen as prototypes
    for(int i=0; i<k; i++){
        int rand = random.nextInt(trainingsSet.size());
        prototypes[i]=trainingsSet.getVectorAtIndex(rand);

    }
    }else{ //state-of-the-art k-means|| a.k.a k-means++ scalable seeding; explanation here: http://vldb.org/pvldb/vol5/p622_bahmanbahmani_vldb2012.pdf

    prototypes[0]=trainingsSet.getVectorAtIndex(random.nextInt(trainingsSet.size())); //first prototype, chosen randomly
    Vector<DataVector>kproto=new Vector<DataVector>(); //saves the prototypes
    kproto.add(prototypes[0]);
    for(int i=0;i<trainingsSet.size();i++){ //gets distance to all data points, sum it up
        s+=trainingsSet.getVectorAtIndex(i).distance2(kproto.elementAt(0));
    }
    double it=Math.floor(Math.log(s)); // calculates how often the loop for step 4 and 5 is executed 
    for(int c=0; c<it; c++){
        double[]psi=new double[trainingsSet.size()];//saves minimum squared distance to a prototype for every element
        for(int i=0; i<trainingsSet.size();i++){
            double min=Double.POSITIVE_INFINITY;
            for(int j=0;j<kproto.size();j++){
                double dist=trainingsSet.getVectorAtIndex(i).distance2(kproto.elementAt(j));
                if(min>dist){
                    min=dist;
                }
            }
            psi[i]=min; //keep full precision; casting to int truncated small distances to 0
        }
        double phi_c=0;
        for(int i=0; i<trainingsSet.size();i++)
            phi_c+=psi[i]; //sums up squared distances

        for(int f=0; f<trainingsSet.size();f++){
            double p_x=5*psi[f]/phi_c; //oversampling factor 0.1*k (k is 50 in my case)
            if(p_x>random.nextDouble()){
                kproto.addElement(trainingsSet.getVectorAtIndex(f));//adds data point to the prototype set with a probability 
                //depending on its distance to the next prototype
            }
        }
    }
    int[]w=new int[kproto.size()]; //every prototype gets a value in w; the value is increased if the prototype has a minimum distance to a data point
    for(int i=0; i<trainingsSet.size();i++){
        double min=trainingsSet.getVectorAtIndex(i).distance2(kproto.elementAt(0));
        int index=0; //a data point at distance 0 from prototype 0 still counts toward w[0]
        for(int j=1; j<kproto.size();j++){
            double save=trainingsSet.getVectorAtIndex(i).distance2(kproto.elementAt(j));
            if(min>save){
                min=save;
                index=j;
            }
        }
        w[index]++;
    }
    int[]wtotal=new int[kproto.size()]; //wtotal sums the w values up
    for(int i=0;i<kproto.size();i++){
        for(int st=0; st<=i;st++){
            wtotal[i]+=w[st];
        }
    }
    int[]cselect=new int[k];//cselect saves the final prototypes

    int stoppoint=0;
    boolean repeat=false; //repeat lets the choosing process repeat if the prototype has already been selected

    for(int kk=0;kk<k;kk++){
        do{ 

        repeat=false;   
        int stopper=random.nextInt(wtotal[kproto.size()-1]);//randomly choose an int and check in which interval it lies
        stoppoint=0;
        for(int st=0;st<wtotal.length;st++){
            if(stopper<wtotal[st]){ //first cumulative weight above stopper marks the interval
                stoppoint=st;
                break;
            }
        }

        for(int i=0; i<kk;i++){
            if(cselect[i]==stoppoint)
                repeat=true;
        }
        }while(repeat);
        //are all prototypes overwritten?

        prototypes[kk]=kproto.get(stoppoint);//the number of the interval is connected to a prototype; the prototype is added to the final set of prototypes "prototypes"
        cselect[kk]=stoppoint;

    }
    }
    long t2=System.currentTimeMillis();
    System.out.println(advanced+" Init time: "+(t2-t1));
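
The final selection above draws prototypes by weight alone, while the paper's last step reclusters the weighted candidates into k clusters. One common choice for that step is a weighted k-means++ pass over the candidate set. The following is only a minimal sketch under that assumption: `WeightedSeeding`, `chooseCenters`, `sampleIndex`, and the use of plain `double[]` points in place of `DataVector` are hypothetical names introduced for illustration and are not part of the code above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

/** Hypothetical helper (not from the code above): weighted k-means++ over the candidate set. */
final class WeightedSeeding {

    /** Squared Euclidean distance between two points of equal dimension. */
    static double dist2(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    /** Draws an index with probability proportional to score[i]; returns -1 if no score is positive. */
    static int sampleIndex(double[] score, Random rnd) {
        double total = 0;
        for (double v : score) total += v;
        double r = rnd.nextDouble() * total;
        double running = 0;
        int last = -1;
        for (int i = 0; i < score.length; i++) {
            if (score[i] <= 0) continue;
            running += score[i];
            last = i;
            if (r < running) return i;
        }
        return last; // only reached through floating-point round-off or when all scores are zero
    }

    /** Picks k distinct candidate indices; weights[i] is the number of data points closest to candidate i. */
    static List<Integer> chooseCenters(double[][] candidates, double[] weights, int k, Random rnd) {
        int m = candidates.length;
        if (k > m) throw new IllegalArgumentException("need at least k candidates");
        List<Integer> chosen = new ArrayList<>();
        boolean[] taken = new boolean[m];
        double[] minDist2 = new double[m];
        Arrays.fill(minDist2, Double.POSITIVE_INFINITY);
        double[] score = weights.clone(); // first draw: proportional to weight only
        while (chosen.size() < k) {
            int idx = sampleIndex(score, rnd);
            if (idx < 0) { // all remaining scores are zero (e.g. duplicate candidates): take any unused one
                idx = 0;
                while (taken[idx]) idx++;
            }
            taken[idx] = true;
            chosen.add(idx);
            // later draws: proportional to weight times squared distance to the nearest chosen center
            for (int i = 0; i < m; i++) {
                minDist2[i] = Math.min(minDist2[i], dist2(candidates[i], candidates[idx]));
                score[i] = taken[i] ? 0 : weights[i] * minDist2[i];
            }
        }
        return chosen;
    }
}
```

Because a chosen candidate has distance 0 to itself, its score drops to 0 and it cannot be drawn twice, so no retry loop is needed. The method throws if fewer than k candidates were collected, which can happen when the oversampling factor or the number of rounds is small.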

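The performance figures show that both options (standard, k-means||) reach the same level of correct clustering (about 85%). However, the run time of the initialization differs. For standard k-means, seeding is almost instantaneous, while k-means|| takes 600-900 ms (1000 data points). After that, the standard expectation/maximization convergence takes about the same time for both (roughly 1900-2500 ms). This is irritating, because k-means|| should converge faster.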
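
Part of the seeding time may come from the sampling loop recomputing, in every round, the distance from each data point to every prototype collected so far. A possible alternative is to cache each point's nearest squared distance and compare it only against newly added prototypes, which also yields the cost phi as a by-product, so the quality of two seedings can be compared directly. This is a minimal sketch under assumptions: `NearestDistanceCache`, `addCenter`, and `nearestDist2` are hypothetical names, and points are plain `double[]` rather than `DataVector`.

```java
import java.util.Arrays;

/** Hypothetical helper (not from the code above): caches each point's squared distance to its nearest center. */
final class NearestDistanceCache {
    private final double[][] points;
    private final double[] minDist2;

    NearestDistanceCache(double[][] points) {
        this.points = points;
        this.minDist2 = new double[points.length];
        Arrays.fill(minDist2, Double.POSITIVE_INFINITY); // no center registered yet
    }

    /** Registers one new center and returns the updated cost phi (sum of nearest squared distances). */
    double addCenter(double[] center) {
        double phi = 0;
        for (int i = 0; i < points.length; i++) {
            double d = 0;
            for (int j = 0; j < center.length; j++) {
                double diff = points[i][j] - center[j];
                d += diff * diff;
            }
            if (d < minDist2[i]) minDist2[i] = d; // only the new center can improve the cached minimum
            phi += minDist2[i];
        }
        return phi;
    }

    /** Cached squared distance from point i to its nearest registered center. */
    double nearestDist2(int i) {
        return minDist2[i];
    }
}
```

With this cache, each new prototype costs one pass over the data instead of one pass per prototype already collected. Calling `addCenter` for every prototype of a finished seeding gives its initial cost, which is a more direct comparison between random seeding and k-means|| than the final detection rate.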

I hope you can spot some errors, or, if I am expecting something from k-means|| that it does not deliver, explain that to me. Thanks for your help!

0 answers:

No answers yet