R: How to group nearly similar words

Time: 2015-04-24 02:29:53

Tags: r

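I have a set of DNA sequence words of 8 letters each. There are about 50,000 words, for example "AAAAAAAA", "TTTTTTTT", "AAAAACGC", "AAAACCTG" and so on. I want to group the words so that all words sharing 6 identical letters (that is, differing by at most 2 substitutions) end up together. Could someone help me with this?

Words within 2 substitutions of each other should go into one cluster, and words that differ by more than 2 substitutions into another. For example, "AAAAACCA" could belong to either "AAAAAAAA" or "AAAACCCA". However, it should go into the "AAAACCCA" cluster, because it is only 1 substitution away from that word, compared with 2 from "AAAAAAAA". Similarly, "AAAAAAAG" could belong to "AAAAAAAA" or to "AAAAAAAC", but not to both at the same time. I hope this makes my point clear; please comment if anything needs further clarification. Thanks.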

# `sample` is the asker's own character vector holding all the words
words <- sample[1:25]
group <- lapply(words, function(x) list(x, words[agrep(x, words, max.distance = list(all = 2, insertions = 0, deletions = 0, substitutions = 2))]))
group
[[1]]
[[1]][[1]]
[1] "AAAAAAAA"

[[1]][[2]]
 [1] "AAAAAAAA" "AAAAAAAC" "AAAAAAAG" "AAAAAAAT" "AAAAAACA" "AAAAAACC" "AAAAAACG" "AAAAAACT"
 [9] "AAAAAAGA" "AAAAAAGC" "AAAAAAGG" "AAAAAAGT" "AAAAAATA" "AAAAAATC" "AAAAAATG" "AAAAAATT"
[17] "AAAAACAA" "AAAAACAC" "AAAAACAG" "AAAAACAT" "AAAAACCA" "AAAAACGA"


[[2]]
[[2]][[1]]
[1] "AAAAAAAC"

[[2]][[2]]
 [1] "AAAAAAAA" "AAAAAAAC" "AAAAAAAG" "AAAAAAAT" "AAAAAACA" "AAAAAACC" "AAAAAACG" "AAAAAACT"
 [9] "AAAAAAGA" "AAAAAAGC" "AAAAAAGG" "AAAAAAGT" "AAAAAATA" "AAAAAATC" "AAAAAATG" "AAAAAATT"
[17] "AAAAACAA" "AAAAACAC" "AAAAACAG" "AAAAACAT" "AAAAACCC"


[[3]]
[[3]][[1]]
[1] "AAAAAAAG"

[[3]][[2]]
 [1] "AAAAAAAA" "AAAAAAAC" "AAAAAAAG" "AAAAAAAT" "AAAAAACA" "AAAAAACC" "AAAAAACG" "AAAAAACT"
 [9] "AAAAAAGA" "AAAAAAGC" "AAAAAAGG" "AAAAAAGT" "AAAAAATA" "AAAAAATC" "AAAAAATG" "AAAAAATT"
[17] "AAAAACAA" "AAAAACAC" "AAAAACAG" "AAAAACAT" "AAAAACCG"

How can I reduce the redundancy in the output?

1 Answer:

Answer 0 (score: 4)

With a call to adist, you could do:

words <- c("AAAAAAAA", "TTTTTTTT", "AAAAAAGC", "AAAACCAA")
lapply(words, function(x) words[adist(x, words) < 3])
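One caveat worth keeping in mind: adist computes a generalized Levenshtein distance, which also counts insertions and deletions, so two equal-length words can be 2 edits apart while differing at many more positions. If only substitutions should count, a position-by-position (Hamming) comparison is one alternative. The following is a minimal sketch under that assumption; hamming_matrix is a hypothetical helper, not a base R function, and it requires unique words of equal length. A full distance matrix is only practical for modest inputs, since roughly 50,000 words would mean about 2.5 billion entries.

```r
# adist() allows insertions and deletions, so a shifted word looks "close":
adist("ACGTACGT", "CGTACGTA")   # 2 edits, although all 8 positions differ

# Hypothetical helper: pairwise Hamming distances (substitutions only)
# for a vector of unique, equal-length words.
hamming_matrix <- function(words) {
  stopifnot(length(unique(nchar(words))) == 1L, !anyDuplicated(words))
  # Character matrix: one row per word, one column per position
  chars <- do.call(rbind, strsplit(words, ""))
  n <- nrow(chars)
  d <- matrix(0L, n, n, dimnames = list(words, words))
  for (j in seq_len(ncol(chars))) {
    # Add 1 wherever two words disagree at position j
    d <- d + outer(chars[, j], chars[, j], "!=")
  }
  d
}

words <- c("AAAAAAAA", "TTTTTTTT", "AAAAAAGC", "AAAACCAA")
h <- hamming_matrix(words)
# Same grouping as above, but counting substitutions only
groups <- lapply(words, function(x) words[h[x, ] <= 2])
groups
```

For the four example words this gives the same groups as the adist call, because none of them are shifted copies of one another.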

You could also do the same grouping with agrep, but it is likely to be much slower:

words <- c("AAAAAAAA", "TTTTTTTT", "AAAAAAGC", "AAAACCAA")
d <- lapply(words, function(x) {
  # Allow at most 2 substitutions and no insertions or deletions
  list(match.word = x,
       six.letter.grp = words[agrep(x, words,
         max.distance = list(all = 2, insertions = 0, deletions = 0, substitutions = 2))])
})
d

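This outputs the following list, which shows each word being matched along with all the words it matches, including the word itself; you can adapt the output to whatever you need: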

[[1]]
[[1]]$match.word
[1] "AAAAAAAA"

[[1]]$six.letter.grp
[1] "AAAAAAAA" "AAAAAAGC" "AAAACCAA"


[[2]]
[[2]]$match.word
[1] "TTTTTTTT"

[[2]]$six.letter.grp
[1] "TTTTTTTT"


[[3]]
[[3]]$match.word
[1] "AAAAAAGC"

[[3]]$six.letter.grp
[1] "AAAAAAAA" "AAAAAAGC"


[[4]]
[[4]]$match.word
[1] "AAAACCAA"

[[4]]$six.letter.grp
[1] "AAAAAAAA" "AAAACCAA"
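As one example of adapting the output, the word itself can be dropped from its own group with setdiff. This is only a sketch of that single adjustment, reusing the same agrep call; the variable name hits is an assumption made for illustration.

```r
words <- c("AAAAAAAA", "TTTTTTTT", "AAAAAAGC", "AAAACCAA")

d <- lapply(words, function(x) {
  # Words within 2 substitutions of x (still includes x itself)
  hits <- words[agrep(x, words,
    max.distance = list(all = 2, insertions = 0, deletions = 0, substitutions = 2))]
  # Remove the query word so only its neighbours remain
  list(match.word = x, six.letter.grp = setdiff(hits, x))
})
d
```

A word with no neighbours, such as "TTTTTTTT" here, ends up with an empty character vector.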

For a more compact list structure, you could try:

d <- lapply(words, function(x) words[agrep(x, words,
         max.distance=list(all=2, insertions=0, deletions=0, substitutions=2))])
names(d) <- words
d
#$AAAAAAAA
#[1] "AAAAAAAA" "AAAAAAGC" "AAAACCAA"
#
#$TTTTTTTT
#[1] "TTTTTTTT"
# 
#$AAAAAAGC
#[1] "AAAAAAAA" "AAAAAAGC"
#
#$AAAACCAA
#[1] "AAAAAAAA" "AAAACCAA"
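
The lists above still repeat words across groups. To reduce that redundancy so that each word lands in exactly one cluster, one possible approach (not the method used above, just a sketch) is a two-pass assignment: first pick cluster centers that are more than 2 substitutions apart from each other, then attach every word to its nearest center. The helpers hamming_to and assign_clusters and the parameter max_subs are hypothetical names, and the result depends on the order of the input words.

```r
# Hypothetical helper: Hamming distance from one word to each word in a vector
hamming_to <- function(x, words) {
  xs <- strsplit(x, "")[[1]]
  vapply(strsplit(words, ""), function(w) sum(w != xs), integer(1))
}

# Hypothetical helper: assign each word to exactly one cluster
assign_clusters <- function(words, max_subs = 2L) {
  words <- unique(words)
  centers <- character(0)
  # Pass 1: a word becomes a new center if no existing center is within max_subs
  for (w in words) {
    if (length(centers) == 0L || all(hamming_to(w, centers) > max_subs)) {
      centers <- c(centers, w)
    }
  }
  # Pass 2: each word goes to its nearest center (the first one wins on ties)
  nearest <- vapply(words, function(w) {
    centers[which.min(hamming_to(w, centers))]
  }, character(1))
  split(words, factor(nearest, levels = centers))
}

words <- c("AAAAAAAA", "AAAACCCA", "AAAAACCA")
clusters <- assign_clusters(words)
clusters
```

With these three words, "AAAAACCA" is placed with "AAAACCCA" (1 substitution) rather than "AAAAAAAA" (2 substitutions), which matches the behaviour asked for in the question.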