Finding combinations of vectors from pairs of list items

Time: 2018-06-25 15:30:39

Tags: r optimization combinations

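I have a named list representing a set of biological pathways, where the names are the pathway names and each vector in the list holds the proteins that belong to that pathway. A small example: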

ann <- structure(list(`GO:0000010` = c("Q33DR2", "Q9CZQ1", "D6RHT8", 
"F6ZCX7", "B8JJX0", "Q33DR3", "F6T4Z4", "E0CYM9"), `GO:0000016` = c("Q5XLR9", 
"Q3TZ78", "F8VPT3"), `GO:0000026` = c("Q8BTP0", "Q3TZM9", "A0A077K846", 
"F6R220", "A0A077K9W9"), `GO:0000032` = c("Q924M7", "Q3V100", 
"F6Q3K8", "Q921Z9"), `GO:0000033` = c("Q9DBE8", "F6RBY3", "Q8BMZ4", 
"Q8K2A8", "F6XUH0", "D6RCW8", "Q6P8H8", "Q3URN2")), .Names = c("GO:0000010", 
"GO:0000016", "GO:0000026", "GO:0000032", "GO:0000033"))

I am interested in pairs of pathways:

pairs <- t(combn(names(ann), 2))

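For each pair of pathways, I want all possible combinations of proteins in which protein #1 is in pathway #1 and protein #2 is in pathway #2. The desired output is a list of two-column matrices, where column #1 contains proteins from pathway #1 and column #2 contains proteins from pathway #2. So far, I have this: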

protein_pairs <- purrr::map2(pairs[, 1], pairs[, 2], ~ as.matrix(expand.grid(ann[[.x]], ann[[.y]])))
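
To make the target shape concrete, here is a small sketch for a single pair of pathways. The list `toy` and its protein IDs are made up for illustration and are not part of the data above; `stringsAsFactors = FALSE` is set only to keep the columns as plain character vectors.

```r
# Hypothetical toy data: two tiny "pathways" with made-up protein IDs
toy <- list(pathA = c("P1", "P2", "P3"), pathB = c("Q1", "Q2"))

# All (protein in pathA, protein in pathB) combinations as a character matrix
one_pair <- as.matrix(expand.grid(toy[["pathA"]], toy[["pathB"]],
                                  stringsAsFactors = FALSE))

dim(one_pair)   # number of rows is length(pathA) * length(pathB)
one_pair
```

Each element of `protein_pairs` has this layout, with one row per protein combination.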

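However, because the total number of pairs I am interested in is very large (typically > 1,000), mapping expand.grid over all possible pairs takes a long time, on the order of hours.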

Is there a faster way to get all possible protein combinations for each pair of biological pathways in this list?

2 answers:

Answer 0: (score: 1)

I think rep.int() will run much faster than expand.grid here, as discussed in another question.

Try the following:

expand.grid.jc <- function(seq1,seq2) {
  cbind(Var1 = rep.int(seq1, length(seq2)), 
        Var2 = rep.int(seq2, rep.int(length(seq1),length(seq2))))
}
protein_pairs <- purrr::map2(pairs[, 1], pairs[, 2], ~ as.matrix(expand.grid.jc(ann[[.x]], ann[[.y]])))
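
As a quick sanity check, the sketch below compares the rep.int-based helper against `expand.grid` on a small input. The vectors `x` and `y` are made-up values; the helper is repeated only so the snippet runs on its own.

```r
# Same helper as in the answer, repeated so this snippet is self-contained
expand.grid.jc <- function(seq1, seq2) {
  cbind(Var1 = rep.int(seq1, length(seq2)),
        Var2 = rep.int(seq2, rep.int(length(seq1), length(seq2))))
}

x <- c("a", "b", "c")   # made-up inputs
y <- c("u", "v")

fast <- expand.grid.jc(x, y)
ref  <- as.matrix(expand.grid(x, y, stringsAsFactors = FALSE))

# Element-wise comparison: same combinations in the same row order
same_values <- all(dim(fast) == dim(ref)) && all(fast == ref)
same_values
```

Because `cbind()` already returns a matrix, the `as.matrix()` wrapper around `expand.grid.jc()` should be redundant, although it is harmless.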

Cheers!

Answer 1: (score: 1)

If you are after speed, you can easily implement an Rcpp version:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
CharacterMatrix fast2Expand(CharacterVector x, CharacterVector y) {

    unsigned long int lenX = x.size(), lenY = y.size();
    CharacterMatrix result = no_init_matrix(lenX * lenY, 2);

    for (std::size_t i = 0, count = 0; i < lenY; ++i) {
        for (std::size_t j = 0; j < lenX; ++j, ++count){
            result(count, 0) = x[j];
            result(count, 1) = y[i];
        }
    }

    return result;
}
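
One way to compile and call the function from R is `Rcpp::sourceCpp()` with the C++ source passed as a string. This is a sketch that assumes Rcpp and a working C++ toolchain are installed; the two input vectors are made-up values.

```r
library(Rcpp)  # assumption: Rcpp and a C++ compiler are available

# Compile from an inline string; sourceCpp() also accepts a path to a .cpp file
sourceCpp(code = '
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
CharacterMatrix fast2Expand(CharacterVector x, CharacterVector y) {

    unsigned long int lenX = x.size(), lenY = y.size();
    CharacterMatrix result = no_init_matrix(lenX * lenY, 2);

    for (std::size_t i = 0, count = 0; i < lenY; ++i) {
        for (std::size_t j = 0; j < lenX; ++j, ++count){
            result(count, 0) = x[j];
            result(count, 1) = y[i];
        }
    }

    return result;
}
')

# The first column cycles through x fastest, matching the expand.grid row order
res <- fast2Expand(c("a", "b", "c"), c("u", "v"))  # made-up inputs
dim(res)
```

After compilation, `fast2Expand` is available in the R session like any other function and can be used inside `purrr::map2()`.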

It is roughly 10x faster than the original and about 20% faster than the rep.int version (for this example):

library(microbenchmark)

microbenchmark(OP = purrr::map2(pairs[, 1], pairs[, 2], ~ as.matrix(expand.grid(ann[[.x]], ann[[.y]]))),
               Rcpp = purrr::map2(pairs[, 1], pairs[, 2], ~ fast2Expand(ann[[.x]], ann[[.y]])),
               repInt = purrr::map2(pairs[, 1], pairs[, 2], ~ as.matrix(expand.grid.jc(ann[[.x]], ann[[.y]]))))
Unit: microseconds
  expr      min        lq      mean    median        uq      max neval
    OP 1104.700 1136.4370 1536.4048 1188.9990 1481.4940 6730.960   100
  Rcpp  105.505  126.9975  149.9009  138.1195  150.2015  663.146   100
repInt  133.044  151.0175  223.9815  165.5435  203.5335 1269.194   100

Here is a contrived example based on the OP's data, purely to compare efficiency on larger inputs:

annBig <- lapply(1:5, function(x) rep(ann[[x]], 100))
names(annBig) <- names(ann)

microbenchmark(OP = purrr::map2(pairs[, 1], pairs[, 2], ~ as.matrix(expand.grid(annBig[[.x]], annBig[[.y]]))),
               Rcpp = purrr::map2(pairs[, 1], pairs[, 2], ~ fast2Expand(annBig[[.x]], annBig[[.y]])),
               repInt = purrr::map2(pairs[, 1], pairs[, 2], ~ as.matrix(expand.grid.jc(annBig[[.x]], annBig[[.y]]))), times = 20)
Unit: milliseconds
  expr       min        lq      mean    median       uq      max neval
    OP 522.56536 533.39393 562.60750 555.45345 588.4514 640.8584    20
  Rcpp  48.12683  56.17155  92.30095  92.23838 125.8065 142.2949    20
repInt  80.28625 107.32329 140.32793 152.13732 160.9656 193.1310    20