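Unique rows with multiple comma-separated entries in R

Time: 2011-11-10 17:48:42

Tags: r reshape

Background: I am annotating SNPs from a GWAS in an organism without much annotation. I am using the chained tBLASTn table from UCSC together with biomaRt to map each SNP to one or more probable genes.

I have a data frame that looks like this: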

            SNP   hu_mRNA     gene
 chr1.111642529 NM_002107    H3F3A
 chr1.111642529 NM_005324    H3F3B
 chr1.111801684 BC098118     <NA>
 chr1.111925084 NM_020435    GJC2
  chr1.11801605 AK027740     <NA>
  chr1.11801605 NM_032849    C13orf33
 chr1.151220354 NM_018913    PCDHGA10
 chr1.151220354 NM_018918    PCDHGA5

What I would like to end up with is a single row per SNP, with gene and hu_mRNA comma-separated. This is what I am after:

            SNP            hu_mRNA    gene
 chr1.111642529 NM_002107,NM_005324   H3F3A
 chr1.111801684  BC098118,NM_020435   GJC2
  chr1.11801605  AK027740,NM_032849   C13orf33
 chr1.151220354 NM_018913,NM_018918   PCDHGA10,PCDHGA5

Now, I know I could do this with a flick of the wrist in perl, but I would really like to do it all in R. Any suggestions?

5 answers:

Answer 0: (score: 8)

You can use aggregate together with paste for each column, and merge the results at the end:

x <- structure(list(
  SNP = structure(c(1L, 1L, 2L, 3L, 4L, 4L, 5L, 5L),
    .Label = c("chr1.111642529", "chr1.111801684", "chr1.111925084", "chr1.11801605", "chr1.151220354"),
    class = "factor"),
  hu_mRNA = structure(c(3L, 4L, 2L, 7L, 1L, 8L, 5L, 6L),
    .Label = c("AK027740", "BC098118", "NM_002107", "NM_005324", "NM_018913", "NM_018918", "NM_020435", "NM_032849"),
    class = "factor"),
  gene = structure(c(4L, 5L, 1L, 3L, 1L, 2L, 6L, 7L),
    .Label = c("<NA>", "C13orf33", "GJC2", "H3F3A", "H3F3B", "PCDHGA10", "PCDHGA5"),
    class = "factor")),
  .Names = c("SNP", "hu_mRNA", "gene"), class = "data.frame", row.names = c(NA, -8L))

a1 <- aggregate(hu_mRNA~SNP,data=x,paste,sep=",")
a2 <- aggregate(gene~SNP,data=x,paste,sep=",")
merge(a1,a2)

             SNP              hu_mRNA              gene
1 chr1.111642529 NM_002107, NM_005324      H3F3A, H3F3B
2 chr1.111801684             BC098118              <NA>
3 chr1.111925084            NM_020435              GJC2
4  chr1.11801605  AK027740, NM_032849    <NA>, C13orf33
5 chr1.151220354 NM_018913, NM_018918 PCDHGA10, PCDHGA5
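One caveat with the aggregate/paste approach in this answer, as far as I can tell: paste(v, sep = ",") does not join the elements of a single vector, so the aggregated columns come back as list columns that merely print with ", " between their elements. If real comma-separated strings are wanted, collapse is the argument that joins them. A minimal sketch on a trimmed three-row subset of the question's data (the names ids and res are just for illustration):

```r
ids <- c("NM_002107", "NM_005324")
joined_sep      <- paste(ids, sep = ",")       # still a vector of two strings
joined_collapse <- paste(ids, collapse = ",")  # a single comma-separated string

# Trimmed subset of the question's data, as plain character columns
x <- data.frame(
  SNP     = c("chr1.111642529", "chr1.111642529", "chr1.111925084"),
  hu_mRNA = c("NM_002107", "NM_005324", "NM_020435"),
  gene    = c("H3F3A", "H3F3B", "GJC2"),
  stringsAsFactors = FALSE
)

# Extra arguments after FUN are passed on to paste()
a1 <- aggregate(hu_mRNA ~ SNP, data = x, FUN = paste, collapse = ",")
a2 <- aggregate(gene ~ SNP, data = x, FUN = paste, collapse = ",")
res <- merge(a1, a2)
print(res)
```

With collapse, hu_mRNA and gene in res are ordinary character columns, which makes them easier to write out with write.table or to join against other tables.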


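Answer 1: (score: 8)

You can do this in one line with plyr, since this is a classic split-apply-combine problem. You split by SNP, apply paste with collapse, and combine the pieces back into a data frame.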

library(plyr)  # makes .() and colwise() available
ddply(x, .(SNP), colwise(paste), collapse = ",")
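For readers who want to see the split-apply-combine steps spelled out, here is a rough base R equivalent. It is only a sketch of the pattern, not what plyr does internally, and it runs on a trimmed subset of the question's data; pieces, collapsed and result are names made up for the illustration.

```r
# Trimmed subset of the question's data
x <- data.frame(
  SNP     = c("chr1.111642529", "chr1.111642529", "chr1.111925084"),
  hu_mRNA = c("NM_002107", "NM_005324", "NM_020435"),
  gene    = c("H3F3A", "H3F3B", "GJC2"),
  stringsAsFactors = FALSE
)

# Split: one small data frame per SNP
pieces <- split(x, x$SNP)

# Apply: collapse every non-key column of a piece into a single string
collapsed <- lapply(pieces, function(d) {
  data.frame(
    SNP     = d$SNP[1],
    hu_mRNA = paste(d$hu_mRNA, collapse = ","),
    gene    = paste(d$gene, collapse = ","),
    stringsAsFactors = FALSE
  )
})

# Combine: stack the one-row pieces back into a single data frame
result <- do.call(rbind, collapsed)
rownames(result) <- NULL
print(result)
```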

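If you want to reshape data in R with a flick of the wrist, learn plyr and reshape2 :). Here is another flick-of-the-wrist solution using data.table, which is really useful if you are dealing with large amounts of data.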

data.table::data.table(x)[,lapply(.SD, paste, collapse = ","),'SNP']

Answer 2: (score: 4)

First, set up the test data. Note that we use as.is = TRUE so that the columns are of class "character" rather than "factor".

Lines <- "SNP   hu_mRNA     gene
 chr1.111642529 NM_002107    H3F3A
 chr1.111642529 NM_005324    H3F3B
 chr1.111801684 BC098118     <NA>
 chr1.111925084 NM_020435    GJC2
  chr1.11801605 AK027740     <NA>
  chr1.11801605 NM_032849    C13orf33
 chr1.151220354 NM_018913    PCDHGA10
 chr1.151220354 NM_018918    PCDHGA5"
cat(Lines, "\n", file = "data.txt")
DF <- read.table("data.txt", header = TRUE, na.strings = "<NA>", as.is = TRUE)

Now try this aggregate statement:

> aggregate(. ~ SNP, DF, toString)
             SNP              hu_mRNA              gene
1 chr1.111642529 NM_002107, NM_005324      H3F3A, H3F3B
2 chr1.111925084            NM_020435              GJC2
3  chr1.11801605            NM_032849          C13orf33
4 chr1.151220354 NM_018913, NM_018918 PCDHGA10, PCDHGA5
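Note that the output above has lost every row whose gene is NA, including the whole chr1.111801684 SNP, because the formula method of aggregate defaults to na.action = na.omit. If the goal is to keep every SNP and simply leave missing genes out of the comma-separated string, something along these lines should work. It is a sketch on a trimmed subset of the data; the helper paste_non_na is a made-up name.

```r
# Trimmed subset of the question's data, with real NA values in gene
DF <- data.frame(
  SNP     = c("chr1.111801684", "chr1.11801605", "chr1.11801605",
              "chr1.151220354", "chr1.151220354"),
  hu_mRNA = c("BC098118", "AK027740", "NM_032849", "NM_018913", "NM_018918"),
  gene    = c(NA, NA, "C13orf33", "PCDHGA10", "PCDHGA5"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop missing values, then join the rest with commas
paste_non_na <- function(v) paste(v[!is.na(v)], collapse = ",")

# na.action = na.pass stops aggregate() from discarding rows that contain NA
res <- aggregate(. ~ SNP, data = DF, FUN = paste_non_na, na.action = na.pass)
print(res)
```

A SNP with no annotated gene at all ends up with an empty string in gene rather than disappearing from the result.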

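Answer 3: (score: 1)

This can also be solved with the melt and dcast operations of reshape2. With this approach, melt first converts the data to "long" format, and dcast then casts it back to one row per SNP while combining the values with paste(..., collapse = ",").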
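The code for this answer did not survive on this page, so what follows is only a guess at what the melt/dcast approach might look like, not the answerer's original code. It assumes reshape2 is installed and uses a trimmed subset of the question's data; long and wide are illustrative names.

```r
library(reshape2)

# Trimmed subset of the question's data
x <- data.frame(
  SNP     = c("chr1.111925084", "chr1.151220354", "chr1.151220354"),
  hu_mRNA = c("NM_020435", "NM_018913", "NM_018918"),
  gene    = c("GJC2", "PCDHGA10", "PCDHGA5"),
  stringsAsFactors = FALSE
)

# Long format: one row per (SNP, variable, value)
long <- melt(x, id.vars = "SNP")

# Cast back to one row per SNP; values that land in the same cell are
# combined by paste(), and collapse = "," is passed through to it
wide <- dcast(long, SNP ~ variable, value.var = "value",
              fun.aggregate = paste, collapse = ",")
print(wide)
```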

Answer 4: (score: 0)

Here is a dplyr solution, which is IMHO the most readable:

library(dplyr)

x %>%
  group_by(SNP) %>%
  summarize(
    genes = paste(gene, collapse = ','),
    hu_mRNA = paste(hu_mRNA, collapse = ',')
  )

Result:

Source: local data frame [5 x 3]

             SNP            genes             hu_mRNA
          (fctr)            (chr)               (chr)
1 chr1.111642529      H3F3A,H3F3B NM_002107,NM_005324
2 chr1.111801684             <NA>            BC098118
3 chr1.111925084             GJC2           NM_020435
4  chr1.11801605    <NA>,C13orf33  AK027740,NM_032849
5 chr1.151220354 PCDHGA10,PCDHGA5 NM_018913,NM_018918