Is there a faster way to subset a sparse matrix than '['?

Asked: 2016-09-23 23:04:52

Tags: r matrix sparse-matrix

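I'm the maintainer of the seqMeta package and am looking for ideas on how to speed up a bottleneck: splitting a large matrix into a large number of pieces.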

Background

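The seqMeta package is used to analyze genetic data. You have a group of subjects (n_subjects) and a number of genetic markers (n_snps), which gives an n_subjects x n_snps matrix (Z). There is also a data frame that tells you which snps are grouped together for analysis (typically, which snps make up a given gene).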

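While Z can be large, it is quite sparse. Typically fewer than 10%, and sometimes only about 2%, of the values are non-zero. A sparse matrix representation seems like the obvious choice to save space.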
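To make the space argument concrete, here is a minimal sketch comparing sparse and dense storage. The matrix `Z_small`, its dimensions, and the 2% density are illustrative assumptions, not the real seqMeta data; `rsparsematrix()` from the Matrix package is used only to generate test data.

```r
library(Matrix)

set.seed(42)
# Hypothetical small example: 1,000 x 2,000 matrix with ~2% non-zero entries
Z_small <- rsparsematrix(nrow = 1000, ncol = 2000, density = 0.02)

# Compressed sparse column (CSC) storage keeps only the non-zero values (x),
# their 0-based row indices (i), and one pointer per column boundary (p)
n_stored   <- length(Z_small@x)   # number of stored values
n_pointers <- length(Z_small@p)   # ncol + 1 column pointers

# Compare memory use against the equivalent dense matrix
sparse_bytes <- as.numeric(object.size(Z_small))
dense_bytes  <- as.numeric(object.size(as.matrix(Z_small)))
dense_bytes / sparse_bytes        # how many times larger the dense copy is
```

The exact ratio depends on the density and on the matrix dimensions, so treat the number as indicative only.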

Current project: nsubjects ~15,000 and nsnps ~2 million, with over 200,000 splits.

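As the data size continues to grow, I have found that the time-limiting factor tends to be the number of groupings rather than the actual size of the data. (See the example below: for the same data, runtime is a linear function of n_splits.)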

Simplified example

library(Matrix)

set.seed(1)

n_subjects <- 1e3
n_snps <- 1e5
sparcity <- 0.05


n <- floor(n_subjects*n_snps*sparcity) 

# create our simulated data matrix
Z <- Matrix(0, nrow = n_subjects, ncol = n_snps, sparse = TRUE)
pos <- sample(1:(n_subjects*n_snps), size = n, replace = FALSE)
vals <- rnorm(n)
Z[pos] <- vals

# create the data frame on how to split
# in the real data set the grouping size is between 1 and ~1500
n_splits <- 500
sizes <- sample(2:20, size = n_splits, replace = TRUE)  
lkup <- data.frame(gene_name=rep(paste0("g", 1:n_splits), times = sizes),
                   snps = sample(n_snps, size = sum(sizes)))

# simple function that gets called on the split
# the real function creates a cols x cols dense upper triangular matrix
# similar to a covariance matrix
simple_fun <- function(Z, cols) {sum(Z[ , cols])}

# split our matrix based on the lookup table
system.time(
res <- tapply(lkup[ , "snps"], lkup[ , "gene_name"], FUN=simple_fun, Z=Z, simplify = FALSE)
)

##    user  system elapsed 
##    3.21    0.00    3.21  

n_splits <- 1000
sizes <- sample(2:20, size = n_splits, replace = TRUE)  
lkup <- data.frame(gene_name=rep(paste0("g", 1:n_splits), times = sizes),
                   snps = sample(n_snps, size = sum(sizes)))

# split our matrix based on the lookup table
system.time(
res <- tapply(lkup[ , "snps"], lkup[ , "gene_name"], FUN=simple_fun, Z=Z, simplify = FALSE)
)

##    user  system elapsed 
##    6.38    0.00    6.38

n_splits <- 5000
sizes <- sample(2:20, size = n_splits, replace = TRUE)  
lkup <- data.frame(gene_name=rep(paste0("g", 1:n_splits), times = sizes),
                   snps = sample(n_snps, size = sum(sizes)))

# split our matrix based on the lookup table
system.time(
res <- tapply(lkup[ , "snps"], lkup[ , "gene_name"], FUN=simple_fun, Z=Z, simplify = FALSE)
)

##    user  system elapsed 
##   31.65    0.00   31.66

Question: Is there a faster way to subset the matrix than '['? Or is there another approach that I have missed?
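One direction that might be worth testing, offered only as a sketch: since each call ultimately needs a small dense block, the columns could be read straight from the slots of the compressed sparse column object, bypassing the S4 `[` method. The helper `extract_dense_cols()` below is hypothetical (it is not part of seqMeta or Matrix), it assumes `Z` is a `dgCMatrix`, and it has not been benchmarked against the timings above, so whether it is actually faster on the real data is an open question.

```r
library(Matrix)

# Hypothetical helper: build a dense matrix from selected columns of a
# dgCMatrix by reading its slots directly instead of calling `[`.
extract_dense_cols <- function(Z, cols) {
  stopifnot(is(Z, "dgCMatrix"))
  p <- Z@p   # 0-based column pointers, length ncol + 1
  i <- Z@i   # 0-based row indices of the stored values
  x <- Z@x   # stored values
  out <- matrix(0, nrow = nrow(Z), ncol = length(cols))
  for (k in seq_along(cols)) {
    j <- cols[k]
    n_nz <- p[j + 1L] - p[j]          # stored entries in column j
    if (n_nz > 0L) {
      idx <- (p[j] + 1L):p[j + 1L]    # positions of column j in i and x
      out[i[idx] + 1L, k] <- x[idx]
    }
  }
  out
}

# Check on simulated data (assumed sizes) that both routes agree
set.seed(1)
Z_demo <- rsparsematrix(nrow = 200, ncol = 1000, density = 0.05)
cols <- c(3L, 17L, 999L)
dense_direct  <- extract_dense_cols(Z_demo, cols)
dense_bracket <- as.matrix(Z_demo[, cols])
all.equal(dense_direct, dense_bracket, check.attributes = FALSE)
```

With a helper like this, `simple_fun` could call `sum(extract_dense_cols(Z, cols))` in place of `sum(Z[ , cols])`; timing both versions with `system.time()` on the same `lkup` would show whether the change helps.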

0 Answers:

No answers yet.