Question

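I have three data frames containing a lot of information, with the following row names:

ENSG00000000971 ENSG00000000971 ENSG00000000971
ENSG00000004139 ENSG00000004139 ENSG00000003987
ENSG00000005001 ENSG00000004848 ENSG00000004848
ENSG00000005102 ENSG00000002330 ENSG00000002330
ENSG00000005486 ENSG00000005102 ENSG00000006047
...             ...             ...

What I want to do is find all entries (row names) that are common to at least two of the data frames. That is, the final result should be a single list like this:

ENSG00000000971
ENSG00000004139
ENSG00000004848
ENSG00000005102
ENSG00000002330

How can I do this? I tried the following:

shared.DESeq2.edgeR = data.frame(row.names(res.DESeq2) %in% row.names(res.edgeR))
shared.DESeq2.limma = data.frame(row.names(res.DESeq2) %in% row.names(res.limma))
shared.edgeR.limma = data.frame(row.names(res.edgeR) %in% row.names(res.limma))
shared = merge(merge(shared.DESeq2.edgeR, shared.DESeq2.limma), shared.edgeR.limma)

...where res.DESeq2, res.edgeR and res.limma are the three data frames, but this takes a very long time to run (I didn't even let it finish, so I don't know whether it actually works). I have some code that does this for the elements common to all three data frames, but I am also interested in the elements common to two or more of them, and I can't really find a good way to do that. Any ideas?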

Answer 1

Try this example:

#dummy data, with real data we would do: res.DESeq2_rn <-row.names(res.DESeq2)
res.DESeq2_rn <- letters[1:4]
res.edgeR_rn <- letters[3:8]
res.limma_rn <- letters[c(1,3,8,10)]

#get counts
res <- table(c(res.DESeq2_rn, res.edgeR_rn, res.limma_rn))
res
# a b c d e f g h j 
# 2 1 3 2 1 1 1 2 1 

#result
names(res)[ res>=2 ]
#[1] "a" "c" "d" "h"
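
The same counting idea can be wrapped in a small helper that takes any number of name vectors. This is only a sketch: the function name in_at_least and its min_sets argument are made up for illustration and do not come from any package. It applies unique() to each vector first, so a name repeated inside one vector is counted once (data frame row names are already unique, so that step only matters for arbitrary vectors).

```r
# Hypothetical helper: names present in at least `min_sets` of the input vectors
in_at_least <- function(sets, min_sets = 2) {
  # unique() per vector, so repeats inside one vector are counted only once
  counts <- table(unlist(lapply(sets, unique)))
  names(counts)[counts >= min_sets]
}

# same dummy data as above; with real data use row.names(res.DESeq2) etc.
sets <- list(
  DESeq2 = letters[1:4],
  edgeR  = letters[3:8],
  limma  = letters[c(1, 3, 8, 10)]
)

shared2 <- in_at_least(sets)      # in two or more vectors: "a" "c" "d" "h"
shared3 <- in_at_least(sets, 3)   # in all three vectors:   "c"
```

Raising min_sets to the number of vectors gives the "common to all" case, so the same function covers both situations described in the question.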

Edit: benchmarking shows that @vaettchen's solution is the winner!

library(microbenchmark)
library(ggplot2)
# create a large random character vector (this takes a lot of time!)
set.seed(123)
myNames <- sapply(1:1000000,
                  function(i)paste( sample( letters, 8, replace = TRUE ), collapse = "" ))
A <- sample(myNames,1000)
B <- sample(myNames,2000)
C <- sample(myNames,3000)

#benchmarking 3 options
myBench <- microbenchmark(
  Which={
    res <- c(A,B,C)
    out1 <- unique( res[ which( duplicated( res ) ) ] ) },
  Table={ 
    res <- c(A,B,C)
    y <- table( res )
    out2 <- names( y )[ y >= 2 ] },
  Intersect={ 
    out3 <- 
      unique(
        c(intersect(A,B),
          intersect(A,C),
          intersect(B,C)))},
  times=1000)

print(myBench)
qplot(y=time, data=myBench, colour=expr) + scale_y_log10()

Unit: microseconds
      expr       min         lq       mean     median         uq       max neval cld
     Which   266.837   280.4190   527.8266   288.2680   301.2475  59255.34  1000  a 
     Table 32167.286 32739.5945 34851.2260 33072.0825 33524.2550 108176.22  1000   b
 Intersect   450.965   472.3965   667.3316   484.7725   499.8650  60266.54  1000  a

Answer 2

Another approach, using @zx8754's sample data:

# dummy data
res.DESeq2 <- letters[ 1:4 ]
res.edgeR <- letters[ 3:8 ]
res.limma <- letters[ c( 1, 3, 8, 10 ) ]

# combine into one vector                  
res <- c( res.DESeq2, res.edgeR, res.limma )
res
# [1] "a" "b" "c" "d" "c" "d" "e" "f" "g" "h" "a" "c" "h" "j"

# result
unique( res[ which( duplicated( res ) ) ] )
# [1] "c" "d" "a" "h"

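Edit

@zx8754's answer has been accepted, and for good reasons: it is clean and elegant. Purely out of curiosity, I looked at the performance difference between his approach and mine on a large sample, and found it interesting enough to post: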

# create a large random character vector (this takes a lot of time!)
res <- rep( "x", 1000000 )
for( i in 1:1000000) 
    res[ i ] <- paste( sample( letters, 8, replace = TRUE ), collapse = "" )
head( res )
# [1] "vsvkljgr" "ulxhqnas" "upqqtrdk" "pynuaihp" "srjtnvqm" "mxnlytvd"

# vaettchen:
system.time( x <- unique( res[ which( duplicated( res ) ) ] ) )
#  user  system elapsed 
# 0.173   0.000   0.171 
x
# [1] "zlzlwinb" "wielycpx"

# zx8754
system.time( { y <- table( res ); z <- names( y )[ y >= 2 ] } )
#   user  system elapsed
# 18.945   0.020  19.058 
z
# [1] "wielycpx" "zlzlwinb"

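For sufficiently large data or repeated calls, the difference may matter. A brief explanation of my code:

duplicated( res ) creates a logical vector of the same length as res, with TRUE for every element that has already appeared earlier in the vector and FALSE otherwise
which( ... ) converts that into a vector of the indices where the value is TRUE
res[ ... ] extracts the actual character values of res at those index positions,
unique( ... ) reduces each character value to a single occurrence, which is the answer @Sajber is looking for (as far as I understand)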
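
The four steps can be traced one at a time on the small dummy vector from above. This is just an illustrative breakdown; the intermediate names dup, idx, vals and out are introduced here and are not part of the original one-liner. It also assumes each input vector contains no repeats of its own, which holds for data frame row names.

```r
res <- c("a", "b", "c", "d", "c", "d", "e", "f", "g", "h", "a", "c", "h", "j")

dup  <- duplicated(res)  # TRUE for every second or later occurrence of a value
idx  <- which(dup)       # positions of the TRUE values: 5 6 11 12 13
vals <- res[idx]         # values at those positions: "c" "d" "a" "c" "h"
out  <- unique(vals)     # each value once: "c" "d" "a" "h"

# which() is optional here: a logical vector can index res directly
out_direct <- unique(res[duplicated(res)])
```

Note that "c" shows up twice in vals because it occurs three times in res, which is why the final unique() is needed.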

Common elements in data frames