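Improving the efficiency of table intersection

Time: 2017-02-02 14:34:03

Tags: r intersect

I have 2 tables. Both are in the form of a chromosome plus the start and end coordinates on that chromosome. The first table contains genes; the second contains short sequences that may or may not fall within those genes. In my real dataset, the genes table has about 50,000 rows and the sequences table about 7,000,000 rows, and both tables have various extra columns. I want to find the overlaps between the two tables.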

chromosome=as.character(rep(c(1,2,3,4,5), each=10000))
start=floor(runif(50000, min=0, max=50000000))
end=start+floor(runif(10000, min=0, max=10000))
genes=cbind(chromosome, start, end)

startseq=floor(runif(7000000, min=0, max=50000000))
endseq=startseq+4
sequences=cbind(chromosome, startseq, endseq)

I tried to find all intersections with the following approach:

for (g in 1:nrow(sequences)) {
  seqrow=as.vector(sequences[g,])  
  rownr=which(genes[,1]==seqrow[1] & genes[,2] < seqrow[2] & genes[,3] > seqrow[3])
  print(rownr)
}

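I plan to use these row numbers to perform operations on the extra columns in my real dataset. The problem is that the procedure described above is quite slow. In what ways can I speed up this intersection?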
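One detail of the loop above affects correctness as well as speed: cbind() on a character vector and numeric vectors produces a character matrix, so `<` and `>` compare the coordinates as strings. The following is a minimal sketch on toy data, assuming the tables are kept as data frames instead so the coordinates stay numeric. The toy values are hypothetical, and the loop is still the slow row-by-row approach; it only illustrates the comparison.

```r
# Toy data (hypothetical values); data frames keep the coordinates numeric.
genes <- data.frame(
  chromosome = c("1", "1", "2"),
  start      = c(100, 9000, 100),
  end        = c(200, 12000, 300),
  stringsAsFactors = FALSE
)
sequences <- data.frame(
  chromosome = c("1", "1", "2"),
  startseq   = c(150, 10000, 400),
  endseq     = c(154, 10004, 404),
  stringsAsFactors = FALSE
)

# String comparison and numeric comparison disagree on these coordinates.
string_cmp  <- "9000" < "10000"
numeric_cmp <- 9000 < 10000

# Same test as the loop in the question, but on numeric columns:
# for each sequence, the rows of genes that strictly contain it.
hits <- lapply(seq_len(nrow(sequences)), function(i) {
  which(genes$chromosome == sequences$chromosome[i] &
        genes$start < sequences$startseq[i] &
        genes$end   > sequences$endseq[i])
})
print(hits)
```

Each element of `hits` holds the gene row numbers for one sequence, and is empty when the sequence falls in no gene.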

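1 answer:

Answer 0: (score: 1)

You want Bioconductor for this task, specifically the GenomicRanges package. The findOverlaps call below returns an object of class "Hits", which contains the indices of the overlapping ranges. You can also use the intersect function, but that returns the intersecting intervals rather than the ids of the intersecting sequences. In short, Bioconductor and GenomicRanges have many useful set functions, and they are very fast.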

## Install GenomicRanges from Bioconductor.
## BiocManager replaces the older biocLite() installer script.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("GenomicRanges")


library(GenomicRanges)

set.seed(8675309)

## Genes: 50,000 ranges, 10,000 per chromosome
gene_chr <- as.character(rep(c(1,2,3,4,5), each=10000))
start <- floor(runif(50000, min=0, max=50000000))
end <- start+floor(runif(10000, min=0, max=10000))  # 10,000 widths, recycled

## Sequences: 7,000,000 short ranges on randomly chosen chromosomes
startseq <- floor(runif(7000000, min=0, max=50000000))
endseq <- startseq+4
seq_chr <- as.character(sample(c(1,2,3,4,5), size = 7000000, replace=TRUE))

## Separate chromosome vectors, so each GRanges gets labels of matching length
genes <- GRanges(seqnames = gene_chr, ranges = IRanges(start = start, end = end))
seqs <- GRanges(seqnames = seq_chr, ranges = IRanges(start = startseq, end = endseq))

## Indices of overlapping (sequence, gene) pairs
x <- findOverlaps(seqs, genes)
head(x)

## Example output (exact hits depend on the simulated data):

#Hits object with 6 hits and 0 metadata columns:
#      queryHits subjectHits
#      <integer>   <integer>
#  [1]         2       41673
#  [2]         2       47476
#  [3]         3       20048
#  [4]         4        9624
#  [5]         4        5662
#  [6]         4        1531
#  -------
#  queryLength: 7000000
#  subjectLength: 50000
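
To get from the Hits object to the row numbers the question asks for, queryHits() and subjectHits() return the matching indices into the query and subject ranges. The sketch below uses a few toy ranges. The gene_id and seq_id columns are hypothetical stand-ins for the extra columns in the real tables, and type = "within" is an assumption about the intended overlap rule (the sequence lies inside the gene), which is close to, but not identical with, the strict inequalities in the question's loop.

```r
library(GenomicRanges)

# Toy genes with a hypothetical extra column, gene_id.
genes <- GRanges(
  seqnames = c("1", "1", "2"),
  ranges   = IRanges(start = c(100, 500, 100), end = c(200, 900, 300)),
  gene_id  = c("geneA", "geneB", "geneC")
)

# Toy sequences, each 5 bases wide, with a hypothetical extra column, seq_id.
seqs <- GRanges(
  seqnames = c("1", "1", "2", "2"),
  ranges   = IRanges(start = c(150, 198, 120, 400), width = 5),
  seq_id   = c("s1", "s2", "s3", "s4")
)

# Keep only sequences that lie entirely inside a gene.
hits <- findOverlaps(seqs, genes, type = "within")

# Use the hit indices to pair up the extra columns of both tables.
result <- data.frame(
  seq_id  = seqs$seq_id[queryHits(hits)],
  gene_id = genes$gene_id[subjectHits(hits)],
  stringsAsFactors = FALSE
)
print(result)
```

Sequence s2 spans the end of geneA, so it is dropped under "within"; the default type = "any" would keep it. Which rule is appropriate depends on how partial overlaps should be treated in the real data.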