How to identify the sequences in each cluster?

Time: 2014-01-24 21:22:34

Tags: r cluster-analysis data-manipulation traminer

Using the biofam dataset that ships with TraMineR:

library(TraMineR)
data(biofam)
lab <- c("P","L","M","LM","C","LC","LMC","D")
biofam.seq <- seqdef(biofam[,10:25], states=lab)
head(biofam.seq)
     Sequence                                    
1167 P-P-P-P-P-P-P-P-P-LM-LMC-LMC-LMC-LMC-LMC-LMC
514  P-L-L-L-L-L-L-L-L-L-L-LM-LMC-LMC-LMC-LMC    
1013 P-P-P-P-P-P-P-L-L-L-L-L-LM-LMC-LMC-LMC      
275  P-P-P-P-P-L-L-L-L-L-L-L-L-L-L-L             
2580 P-P-P-P-P-L-L-L-L-L-L-L-L-LMC-LMC-LMC       
773  P-P-P-P-P-P-P-P-P-P-P-P-P-P-P-P 

I can run a cluster analysis:

library(cluster)
couts <- seqsubm(biofam.seq, method = "TRATE")
biofam.om <- seqdist(biofam.seq, method = "OM", indel = 3, sm = couts)
clusterward <- agnes(biofam.om, diss = TRUE, method = "ward")
cluster3 <- cutree(clusterward, k = 3)
cluster3 <- factor(cluster3, labels = c("Type 1", "Type 2", "Type 3"))

However, in the process, the unique IDs in biofam.seq have been replaced by a list of numbers from 1 to N:

head(cluster3, 10)
[1] Type 1 Type 2 Type 2 Type 2 Type 2 Type 3 Type 3 Type 2 Type 1
[10] Type 2
Levels: Type 1 Type 2 Type 3

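Now I would like to know which sequences are in each cluster, so that I can apply other functions to get the mean length, entropy, subsequences, dissimilarity, etc. within each cluster. What I need to do is:

  1. Map the old IDs to the new IDs
  2. Put the sequences of each cluster into a separate sequence object
  3. Run the statistics I want on each new sequence object

How can I accomplish steps 2 and 3 in the list above?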

2 answers:

Answer 0: (score: 1)

I think this will answer your question. I used code I found at http://www.bristol.ac.uk/cmm/software/support/workshops/materials/solutions-to-r.pdf to create biofam.seq, because what you suggested did not work for me.

# create data
library(TraMineR)
data(biofam)
bf.states  <- c("Parent", "Left", "Married", "Left/Married", "Child",
                "Left/Child", "Left/Married/Child", "Divorced")
bf.shortlab <- c("P","L","M","LM","C","LC", "LMC", "D")
biofam.seq  <- seqdef(biofam[, 10:25], states = bf.shortlab,
                                       labels = bf.states)

# cluster
library(cluster)
couts <- seqsubm(biofam.seq, method = "TRATE")
biofam.om <- seqdist(biofam.seq, method = "OM", indel = 3, sm = couts)
clusterward <- agnes(biofam.om, diss = TRUE, method = "ward")
cluster3 <- cutree(clusterward, k = 3)
cluster3 <- factor(cluster3, labels = c("Type 1", "Type 2", "Type 3"))

First, I use split to create a list of row indices for each cluster, and then use that list in an lapply loop to create a list of subsets of biofam.seq:

# create a list of sequences
idx.list <- split(seq_len(nrow(biofam)), cluster3)
seq.list <- lapply(idx.list, function(idx)biofam.seq[idx, ])

Finally, you can run an analysis on each subset of sequences using lapply or sapply:

# compute statistics on each sub-sequence (just an example)
cluster.sizes <- sapply(seq.list, FUN = nrow)

where FUN can be any function you would normally run on a single sequence object.
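
As one possible way to fill in FUN, the sketch below computes the mean longitudinal entropy and the mean sequence length per cluster with TraMineR's seqient and seqlength. It rests on a few assumptions that are not in the original post: only the first 300 individuals are used so the run stays short, alphabet = 0:7 is set explicitly so that all eight labels still apply to the subset, and the names bf, bf.seq, cl, mean.entropy and mean.length are purely illustrative.

```r
library(TraMineR)
library(cluster)

data(biofam)

# Assumption: a subset of 300 individuals, chosen only to keep the run short.
bf  <- biofam[1:300, ]
lab <- c("P", "L", "M", "LM", "C", "LC", "LMC", "D")

# alphabet = 0:7 is given explicitly so that the eight labels match even if
# some state never occurs in the subset.
bf.seq <- seqdef(bf[, 10:25], alphabet = 0:7, states = lab)

# Same clustering steps as in the answer, applied to the subset.
couts <- seqsubm(bf.seq, method = "TRATE")
bf.om <- seqdist(bf.seq, method = "OM", indel = 3, sm = couts)
cl    <- cutree(agnes(bf.om, diss = TRUE, method = "ward"), k = 3)
cl    <- factor(cl, labels = c("Type 1", "Type 2", "Type 3"))

# One sequence object per cluster.
idx.list <- split(seq_len(nrow(bf.seq)), cl)
seq.list <- lapply(idx.list, function(idx) bf.seq[idx, ])

# Per-cluster summaries: seqient() returns one entropy value per sequence,
# seqlength() one length per sequence; both are averaged within each cluster.
mean.entropy <- sapply(seq.list, function(s) mean(seqient(s)))
mean.length  <- sapply(seq.list, function(s) mean(seqlength(s)))

print(mean.entropy)
print(mean.length)
```

Any other function that accepts a state sequence object could take the place of seqient or seqlength inside the sapply call.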

Answer 1: (score: 1)

For example, the state sequence object for the first cluster can be obtained simply with

bio1.seq <- biofam.seq[cluster3=="Type 1",]
summary(bio1.seq)
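
The ID mapping asked about in the question can be handled in the same spirit. The sketch below relies on the cluster memberships being returned in the same row order as the sequence object, so the original IDs can be reattached as names. It assumes a subset of the first 300 individuals and an explicit alphabet = 0:7 (both chosen here only so the example runs quickly), and the names bf, bf.seq, cl, membership, ids.type1 and type1.seq are illustrative rather than taken from the original post.

```r
library(TraMineR)
library(cluster)

data(biofam)

# Assumption: a subset of 300 individuals, chosen only to keep the run short.
bf  <- biofam[1:300, ]
lab <- c("P", "L", "M", "LM", "C", "LC", "LMC", "D")
bf.seq <- seqdef(bf[, 10:25], alphabet = 0:7, states = lab)

couts <- seqsubm(bf.seq, method = "TRATE")
bf.om <- seqdist(bf.seq, method = "OM", indel = 3, sm = couts)
cl    <- cutree(agnes(bf.om, diss = TRUE, method = "ward"), k = 3)
cl    <- factor(cl, labels = c("Type 1", "Type 2", "Type 3"))

# Memberships follow the row order of bf.seq, so the original row names
# can be attached to the membership vector.
names(cl) <- rownames(bf.seq)

# Lookup table from original ID to cluster.
membership <- data.frame(id = names(cl), cluster = cl)
head(membership)

# Logical subsetting keeps the original row names of the sequence object.
ids.type1 <- names(cl)[cl == "Type 1"]
type1.seq <- bf.seq[cl == "Type 1", ]
head(rownames(type1.seq))
```

With the names attached, cl["1167"] would look up the cluster of the individual with that ID, provided the ID is present in the subset.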