Mapping common names across two different datasets

Asked: 2015-04-10 20:07:30

Tags: r bioconductor

I have two data frames. I want to find the alternate gene name for every gene in data_frame_1 by comparing it against data_frame_2.

 data_frame_1

  chr   start   end     CNA       Genes       No.of.Gene
   1    13991   1401    gain    Cfh,Gm26048       2
   1    14011   1490    gain    Zfp788,Rik        2

data_frame_2

Associated_Gene_Name  Chromosome_Name  Gene_Start  Gene_End  Associated_Gene_Name_1  Chromosome_Name_1  Gene_Start_1  Gene_End_1
Cfh                   1                13900       14100     CFH                     3                  43900         54100
Gm26048               1                13998       14010     TFE                     1                  76710         76790
Zfp788                2                43970       44180     ELF                     4                  131950        133100
Rik                   3                202100      202600    RIK                     5                  881100        1036800


data_frame_result

  chr   start   end     CNA       Genes       No.of.Gene    Associated.Gene.name_1
   1    13991   1401    gain    Cfh,Gm26048       2          CFH,TFE
   1    14011   1490    gain    Zfp788,Rik        2          ELF,RIK

1 Answer:

Answer 0 (score: 4)

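Storing multiple values separated by commas really does make things messy. Here is a chain that "normalizes" the data to one value per row so that you can do a standard merge. I use the magrittr library to chain the commands.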

#test data
data_frame_1<-data.frame(
  Genes=c("Cfh,Gm26048","Gm5852,Gm5773","Elf","Ttn")
)

data_frame_2<-data.frame(
  Genes_1=c("Cfh","Gm26048","Gm5852","Gm5773","Elf","Ttn"),
  Alternate_Gene_name = c("CFH","FGFR","NAA","TFE","ELF","TTN")
)


library(magrittr)

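# Stack a list of vectors into a two-column data frame:
# idx = the list element's name (or position), val = one value per row
idxstack <- function(x, idx=if(!is.null(names(x))) {names(x)} else {seq_along(x)}) 
    do.call(rbind, Map(function(a,b) cbind.data.frame(idx=a,val=b), idx, x))

as.character(data_frame_1$Genes) %>%
    {setNames(strsplit(., split=","), .)} %>%    # split on commas, name each piece by its original string
    idxstack %>%                                 # one gene per row
    merge(data_frame_2, by.x="val", by.y="Genes_1", all.x=TRUE) %>%   # look up the alternate names
    aggregate(Alternate_Gene_name~idx, ., paste0, collapse=",") %>%   # re-join the names per original string
    merge(data_frame_1,., by.x="Genes", by.y="idx")                   # attach to the original rows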

which returns

          Genes Alternate_Gene_name
1   Cfh,Gm26048            CFH,FGFR
2           Elf                 ELF
3 Gm5852,Gm5773             TFE,NAA
4           Ttn                 TTN
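
Note that in the output above the alternate names for "Gm5852,Gm5773" come back as "TFE,NAA", because merge() sorts by the key and the original gene order is lost. If that order matters, one possible alternative is a plain named-vector lookup in base R. The following is only a sketch under assumptions: the two data frames are cut-down reconstructions of the question's tables (just the columns needed), and map_genes is a hypothetical helper name, not a function from any package.

```r
# Assumed, reduced versions of the question's tables (only the relevant columns)
data_frame_1 <- data.frame(
  Genes = c("Cfh,Gm26048", "Zfp788,Rik"),
  stringsAsFactors = FALSE
)
data_frame_2 <- data.frame(
  Associated_Gene_Name   = c("Cfh", "Gm26048", "Zfp788", "Rik"),
  Associated_Gene_Name_1 = c("CFH", "TFE", "ELF", "RIK"),
  stringsAsFactors = FALSE
)

# Named character vector: names are the original symbols, values the alternates
lookup <- setNames(data_frame_2$Associated_Gene_Name_1,
                   data_frame_2$Associated_Gene_Name)

# Hypothetical helper: translate each comma-separated entry, keeping gene order
map_genes <- function(genes, lookup) {
  parts <- strsplit(as.character(genes), ",", fixed = TRUE)
  vapply(parts, function(g) {
    alt <- unname(lookup[trimws(g)])        # NA for symbols missing from the lookup
    paste(alt[!is.na(alt)], collapse = ",") # drop unmatched symbols, re-join the rest
  }, character(1))
}

data_frame_1$Associated.Gene.name_1 <- map_genes(data_frame_1$Genes, lookup)
data_frame_1
```

Symbols that have no match in the lookup are silently dropped here; whether that is the right behaviour (rather than keeping an explicit NA) depends on the data, so treat it as a design choice to revisit.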