Question

I have the following dataset (obtained here):

          item survivalpoints weight
1  pocketknife             10      1
2        beans             20      5
3     potatoes             15     10
4       unions              2      1
5 sleeping bag             30      7
6         rope             10      5
7      compass             30      1

I can cluster this dataset into three clusters with kmeans(), using a binary string to select my initial centers. For example:

## 1 represents the initial centers
chromosome = c(1,1,1,0,0,0,0)
## exclude the first column (kmeans only supports continuous data)
cl <- kmeans(dataset[, -1], dataset[chromosome == 1, -1])
## check the memberships
cl$cluster
# [1] 1 3 3 1 2 1 2
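
To make the snippet above reproducible, here is a minimal sketch that rebuilds the data frame from the table. The name `dataset` matches the code above and the values are copied from the table; the `centers` variable is only introduced here for readability.

```r
## Rebuild the example data from the table above (values copied verbatim)
dataset <- data.frame(
  item = c("pocketknife", "beans", "potatoes", "unions",
           "sleeping bag", "rope", "compass"),
  survivalpoints = c(10, 20, 15, 2, 30, 10, 30),
  weight = c(1, 5, 10, 1, 7, 5, 1)
)

## A binary chromosome: 1 marks a row used as an initial center
chromosome <- c(1, 1, 1, 0, 0, 0, 0)

## Numeric columns only; the selected rows become the starting centers
centers <- dataset[chromosome == 1, -1]
cl <- kmeans(dataset[, -1], centers = centers)

## One cluster label per item, and one center per 1 in the chromosome
length(cl$cluster)
nrow(cl$centers)
```

The number of 1s in the chromosome fixes the number of clusters, because each selected row becomes one starting center.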

Using this basic concept, I tried to run the search with the GA package, where I am trying to optimize (minimize) the Davies-Bouldin (DB) index.
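
For reference, the DB index is commonly written as below; lower values indicate more compact, better separated clusters, which is why it is minimized. The symbols are assumptions chosen for illustration: k is the number of clusters, c_i the centroid of cluster i, S_i a measure of scatter within cluster i (for example the average distance of its points to c_i), and d(c_i, c_j) the distance between two centroids. The exact scatter and distance definitions depend on the implementation.

```latex
\mathrm{DB} = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{S_i + S_j}{d(c_i, c_j)}
```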

library(GA)           ## for ga() function
library(clusterSim)   ## for index.DB() function

## defining my fitness function (Davies-Bouldin)
DBI <- function(x) {
    ## converting matrix to vector to access each row
    binary_rep <- split(x, row(x))
    ## evaluate the fitness of each chromosome
    for (each in 1:nrow(x)) {
        cl <- kmeans(dataset, dataset[binary_rep[[each]] == 1, -1])
        dbi <- index.DB(dataset, cl$cluster, centrotypes = "centroids")
        ## minimizing db
        return(-dbi)
    }
}

g <- ga(type = "binary", fitness = DBI, popSize = 100, nBits = nrow(dataset))

Of course (I have no idea what is going on), I got this error message: Warning messages: Error in row(x) : a matrix-like object is required as argument to 'row'

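Here are my questions:

How do I use the GA package correctly to solve my problem?
How can I make sure that randomly generated chromosomes contain as many 1s as the number of clusters k (e.g., if k = 3, the chromosome must contain exactly three 1s)?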

Answer 1

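I can't comment on whether combining k-means with a GA makes sense, but I can point out the problem with your fitness function: ga() passes the fitness function one chromosome at a time as a plain vector, not a matrix holding the whole population, which is why row(x) fails. In addition, an error occurs when all genes are switched on or all are switched off, so the fitness is only computed when neither is the case: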

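DBI <- function(x) {
  ## x is a single chromosome (a binary vector)
  if(sum(x)==nrow(dataset) | sum(x)==0){
    ## all genes on or all off: skip the clustering
    score <- 0
  } else {
    ## rows flagged with 1 are the initial centers
    cl <- kmeans(dataset[, -1], dataset[x==1, -1])
    dbi <- index.DB(dataset[,-1], cl=cl$cluster, centrotypes = "centroids")
    score <- dbi$DB
  }

  return(score)
}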

g <- ga(type = "binary", fitness = DBI, popSize = 100, nBits = nrow(dataset))
plot(g)

g@solution
g@fitnessValue

It looks like several gene combinations yield the same "best" fitness value.
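
As for the second question (exactly k ones per chromosome), one possible approach is sketched below; it is an illustration under stated assumptions, not a tested recipe. It seeds ga() with a custom initial population through its `population` argument and penalizes any chromosome that crossover or mutation later pushes away from k ones. Because ga() maximizes the fitness, the sketch returns the negated DB value. The names `random_k_chromosome`, `k_population`, `DBI_k`, and the `penalty` value are hypothetical, and `maxiter = 20` is chosen only to keep the run short.

```r
## Example data, copied from the table in the question
dataset <- data.frame(
  item = c("pocketknife", "beans", "potatoes", "unions",
           "sleeping bag", "rope", "compass"),
  survivalpoints = c(10, 20, 15, 2, 30, 10, 30),
  weight = c(1, 5, 10, 1, 7, 5, 1)
)

k <- 3           ## assumed number of clusters
penalty <- -1e6  ## assumed fitness for invalid chromosomes

## Hypothetical helper: random binary vector with exactly k ones
random_k_chromosome <- function(n_bits, k) {
  x <- integer(n_bits)
  x[sample.int(n_bits, k)] <- 1L
  x
}

## Hypothetical initial-population generator for ga(): one chromosome per row
k_population <- function(object, ...) {
  t(replicate(object@popSize, random_k_chromosome(object@nBits, k)))
}

## Fitness: negated DB index (ga() maximizes), penalty unless exactly k ones
DBI_k <- function(x) {
  if (sum(x) != k) return(penalty)
  tryCatch({
    cl <- kmeans(dataset[, -1], dataset[x == 1, -1])
    db <- clusterSim::index.DB(dataset[, -1], cl = cl$cluster,
                               centrotypes = "centroids")$DB
    if (!is.finite(db)) penalty else -db
  }, error = function(e) penalty)  ## e.g. kmeans reporting an empty cluster
}

## Run the search only if both packages are installed
if (requireNamespace("GA", quietly = TRUE) &&
    requireNamespace("clusterSim", quietly = TRUE)) {
  g <- GA::ga(type = "binary", fitness = DBI_k, population = k_population,
              popSize = 100, maxiter = 20, nBits = nrow(dataset),
              monitor = FALSE)
  g@solution
}
```

The penalty only discourages invalid chromosomes; custom crossover and mutation operators that preserve the number of 1s would be a stricter alternative.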

Optimizing K-means clustering using a genetic algorithm
