Question

I have a data frame target with columns SNP and value:

target <- data.frame("SNP" = c("rs2", "rs4", "rs6", "rs19", "rs8", "rs9"),
                     "value" = 1:6)

I also have 3 other data frames, each with columns SNP and int, stored in a list:

ref1 <- data.frame("SNP" = c("rs1", "rs2", "rs8"), "int" = c(5, 7, 88))
ref2 <- data.frame("SNP" = c("rs9", "rs4", "rs3"), "int" = c(23, 4, 43))
ref3 <- data.frame("SNP" = c("rs10", "rs6", "rs5"), "int" = c(53, 22, 76))
mylist <- list(ref1, ref2, ref3)

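I want to add a new column int to target, where each value is the int from whichever of ref1/ref2/ref3 has the same SNP. For example, the first int value in target should be 7, because row 2 of ref1 has SNP rs2 and int 7.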

I tried the following code:

library(dplyr)  # provides %>% and left_join()

for (i in 1:3) {
    target <- target %>%
                left_join(mylist[[i]], by = "SNP")
}

The matching was fast. However, I got 3 new columns back (one per reference data frame) instead of 1.

I then used the following code:

target[, "ref"] <- NA
for (i in 1:3) {
    common <- Reduce(intersect, list(target$SNP, mylist[[i]]$SNP))

    tar.pos <- match(common, target$SNP)
    ref.pos <- match(common, mylist[[i]]$SNP)

    target$ref[tar.pos] <- mylist[[i]]$int[ref.pos]
}

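In my real data, I have 22 reference data frames, each with 1 to 6 million rows. I would rather match and join one ref at a time than combine all refs into one big data frame. When I tried the second approach above on my real data, I noticed that the match function runs very slowly, which is why I would prefer a smarter way of doing this. I found that left_join is very efficient even on my large data. Unfortunately, its output is not what I want.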

I would like to get this done quickly, preferably within the tidyverse. Any suggestions on how to modify the first approach, or on any other smarter method?

Answer 1

If binding all the data in mylist and merging it into target would take too much memory, you can use purrr::reduce to merge the data frames one at a time.

library(tidyverse)

reduce(mylist,
       ~ left_join(.x, .y, by = "SNP") %>%
         mutate(int = coalesce(int.x, int.y)) %>%
         select(-c(int.x, int.y)),
       .init = mutate(target, int = NA_real_))

#    SNP value int
# 1  rs2     1   7
# 2  rs4     2   4
# 3  rs6     3  22
# 4 rs19     4  NA
# 5  rs8     5  88
# 6  rs9     6  23
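
To see why the coalesce step is needed, it may help to run a single step of the reduction by hand. This is only a sketch on the example data from the question; start, step1 and step2 are illustrative names, not part of the answer above.

```r
library(dplyr)

target <- data.frame(SNP = c("rs2", "rs4", "rs6", "rs19", "rs8", "rs9"),
                     value = 1:6)
ref1 <- data.frame(SNP = c("rs1", "rs2", "rs8"), int = c(5, 7, 88))

# Start with an empty int column, as .init does in the reduce() call
start <- mutate(target, int = NA_real_)

# Both sides now have an int column, so left_join() suffixes them
step1 <- left_join(start, ref1, by = "SNP")
names(step1)
# "SNP" "value" "int.x" "int.y"

# Keep the value found so far (int.x), otherwise take the new match (int.y)
step2 <- step1 %>%
  mutate(int = coalesce(int.x, int.y)) %>%
  select(-c(int.x, int.y))
step2$int
# 7 NA NA NA 88 NA
```

After this step the data frame again has a single int column, so the next reference can be joined in exactly the same way.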

Answer 2

With the tidyverse, we can also do:

library(dplyr)
bind_rows(mylist) %>%
  right_join(target, by = "SNP")
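
Binding all references first assumes that each SNP occurs in at most one reference; otherwise the join returns more than one row for that SNP. A quick check, sketched here in base R on the example data (all_refs and dup_snps are illustrative names):

```r
ref1 <- data.frame(SNP = c("rs1", "rs2", "rs8"), int = c(5, 7, 88))
ref2 <- data.frame(SNP = c("rs9", "rs4", "rs3"), int = c(23, 4, 43))
ref3 <- data.frame(SNP = c("rs10", "rs6", "rs5"), int = c(53, 22, 76))
mylist <- list(ref1, ref2, ref3)

# Stack all references into one data frame
all_refs <- do.call(rbind, mylist)

# SNPs that appear more than once across the references
dup_snps <- unique(all_refs$SNP[duplicated(all_refs$SNP)])
length(dup_snps)
# 0
```

If dup_snps is not empty, you would have to decide which reference takes priority before joining.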

Answer 3

You can combine mylist into a single data frame and then merge it with target:

merge(target, do.call(rbind, mylist), by = "SNP", all.x = TRUE)

#   SNP value int
#1 rs19     4  NA
#2  rs2     1   7
#3  rs4     2   4
#4  rs6     3  22
#5  rs8     5  88
#6  rs9     6  23

Or using dplyr:

library(dplyr)
left_join(target, bind_rows(mylist), by = "SNP")

Or with data.table:

library(data.table)
rbindlist(mylist)[target, on = 'SNP']
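
If you would rather not bind all references into one table, a data.table update join can fill the column one reference at a time. The following is a sketch under that assumption, not a benchmarked solution; target_dt is an illustrative name, and as.data.table() copies each reference, which may matter at 1 to 6 million rows.

```r
library(data.table)

target <- data.frame(SNP = c("rs2", "rs4", "rs6", "rs19", "rs8", "rs9"),
                     value = 1:6)
ref1 <- data.frame(SNP = c("rs1", "rs2", "rs8"), int = c(5, 7, 88))
ref2 <- data.frame(SNP = c("rs9", "rs4", "rs3"), int = c(23, 4, 43))
ref3 <- data.frame(SNP = c("rs10", "rs6", "rs5"), int = c(53, 22, 76))
mylist <- list(ref1, ref2, ref3)

target_dt <- as.data.table(target)
target_dt[, int := NA_real_]  # column to be filled in place

for (ref in mylist) {
  # Update join: for rows of target_dt whose SNP is found in ref,
  # overwrite int with ref's int (i.int); other rows are left untouched
  target_dt[as.data.table(ref), on = "SNP", int := i.int]
}

target_dt$int
# 7 4 22 NA 88 23
```

The row order of target is preserved, and only one reference is joined at any time.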

Add a column to a data frame based on values from multiple data frames in dplyr
