Faster way to slice a (raw) vector?

Question

I am looking for a fast (ideally constant-time) way to slice large chunks out of large raw vectors in R. For example:

obj <- raw(2^32)
obj[seq_len(2^31 - 1)]

Even with ALTREP, base R takes far too long.

system.time(obj[seq_len(2^31 - 1)])
#>    user  system elapsed 
#>  19.470  38.853 148.288

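Why?

Because I am trying to speed up storr in order to speed up drake. I want storr to save long raw vectors faster. writeBin() is super fast, but it still cannot handle vectors more than 2^31 - 1 bytes long. So I want to save the data in manageable chunks, as described here. This almost works, but creating the chunks is too slow and duplicates too much data in memory.

Idea

Let's create a function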

slice_raw <- function(obj, from, to) {
  # ???
}

that is essentially equivalent to

obj[seq(from, to, by = 1L)]

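but O(1) in both time and memory. In theory, all we should need to do is:

1. Pass obj to a C function.
2. Create a new pointer to the first byte of obj.
3. Increment the new pointer to the start of the slice.
4. Create a RAWSXP of the appropriate length (less than 2^31 bytes) on top of the new pointer.
5. Return the RAWSXP.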
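
Outside of R, steps 2 and 3 amount to building a non-owning view: a pointer into the existing buffer plus a length. The following is a minimal standalone C++ sketch of that idea only. The names RawView and slice_view are hypothetical, and a std::vector stands in for the R vector. It does not show how to wrap such a pointer in a RAWSXP, which is the part this question is actually asking about.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical non-owning view: a pointer into someone else's buffer
// plus a length. Nothing is allocated and nothing is copied.
struct RawView {
  const std::uint8_t* data;
  std::size_t size;
};

// O(1) slice using 1-based, inclusive positions, like obj[from:to] in R.
// The view is only valid while `buf` is alive and is not resized.
inline RawView slice_view(const std::vector<std::uint8_t>& buf,
                          std::size_t from, std::size_t to) {
  assert(from >= 1 && from <= to && to <= buf.size());
  return RawView{buf.data() + (from - 1), to - from + 1};
}
```

The cost of slice_view does not depend on the slice length, but the caller has to keep the original buffer alive for as long as the view is in use. That lifetime question is what makes the R version harder than the plain C++ one.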

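I have a background in C, but I am struggling to get full control of R's internals. I want to access the C pointer inside a SEXP so that I can do basic pointer arithmetic and create R vectors of known length from bare C pointers. The resources I have found on R's C internals do not seem to explain how to wrap or unwrap pointers. Do we need Rcpp?

The following rough sketch shows what I am trying to do.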

library(inline)
sig <- c(
  x = "raw",         # Long raw vector with more than 2^31 - 1 bytes.
  start = "integer", # Should probably be R_xlen_t.
  bytes = "integer"  # <= 2^31 - 1. Ideally coercible to R_xlen_t.
)
body <- "
Rbyte* result;           // Just a reference. Want to avoid copying data.
result = RAW(x) + start; // Trying to do ordinary pointer arithmetic.
return asRaw(result);    // Want to return a raw vector of length `bytes`.
"
slice_raw <- cfunction(sig = sig, body = body)

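Edit: more potential workarounds

Thanks to Dirk for getting me thinking about this. For data that is small enough, we can use fst to save a single-column data frame in which the column is the raw vector we actually care about. Used this way, fst is faster than writeBin().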

library(fst)
wrapper <- data.frame(actual_data = raw(2^31 - 1))
system.time(write_fst(wrapper, tempfile()))
#>    user  system elapsed 
#>   0.362   0.019   0.103
system.time(writeBin(wrapper$actual_data, tempfile()))
#>    user  system elapsed 
#>   0.314   1.340   1.689

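Created on 2019-06-16 by the reprex package (v0.3.0)

Unfortunately, it is hard to create a data frame with 2^31 or more rows. One approach is to first convert the raw vector to a matrix, which avoids the usual integer overflow because (2^31 - 1)^2 bytes is several exabytes.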

library(fst)
x <- raw(2^32)
m <- matrix(x, nrow = 2^16, ncol = 2^16)
system.time(write_fst(as.data.frame(m), tempfile()))
#>    user  system elapsed 
#>   8.776   1.459   9.519

We still leave saveRDS() in the dust, but we no longer beat writeBin(). The conversion from matrix to data frame is slow, and I am not sure it would scale well.

library(fst)
x <- raw(2^30)
m <- matrix(x, nrow = 2^15, ncol = 2^15)
system.time(write_fst(as.data.frame(m), tempfile()))
#>    user  system elapsed 
#>   1.998   0.408   2.409
system.time(writeBin(as.raw(m), tempfile()))
#>    user  system elapsed 
#>   0.329   0.839   1.397

If, as Dirk suggested, we could use R_xlen_t to index the rows of a data frame, we might be able to avoid any conversion.

Answer 1

Although current support for data.frames with long vector columns is not great, you can still use fst to serialize a long vector:

# method for writing a raw vector to disk
write_raw <- function(x, path, compress = 50) {

  # create a list and add required attributes
  y <- list(X = x)
  attributes(y) <- c(attributes(y), class = "data.frame")

  # serialize and compress to disk
  fst::write_fst(y, path, compress)
}

# create raw vector of length >2^31
x <- rep(as.raw(0:255), 2^23 + 10)

# write raw vector
write_raw(x, "raw_vector.fst", 100)

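With this scheme, there is no need to split the vector into multiple parts (which, as you already pointed out, would slow down serialization considerably). The raw vector can be read back without any copying or slicing: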

# method for reading a raw vector from disk
read_raw <- function(path) {

  # read from disk
  z <- fst::read_fst(path)

  z$X
}

z <- read_raw("raw_vector.fst")

fst::hash_fst(x) == fst::hash_fst(z)
#> [1] TRUE TRUE

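(Note that you currently need the development version of fst to read with long vector support.)

In your setup, you will always serialize the complete raw vector to disk as a whole (just like saveRDS()). Because you do not need random access to the stored vector, the metadata stored in the fst file is probably a bit overkill. You could also test a setup in which you compress the raw vector with compress_fst() and then store the result using saveRDS(raw_vec, compress = FALSE).

The advantage of such a setup is that the compressor can work on larger chunks, which increases the compression ratio (the effect can be significant). Using larger chunks can also speed up compression.

On the other hand, a disadvantage is that you no longer compress during the write to disk, as write_fst() does, so serialization might become slower. You also lose random access, but you do not really need that anyway.

If you implement a two-step process (first compress the data, then serialize it), you can let the user choose between different compressors (for example, slower compressors with a very high compression ratio for slow disks).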

Answer 2

I faced the same challenge. Here is a small Rcpp function that does the job:

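#include <Rcpp.h>
#include <cstring>  // std::memcpy

// Copy `size` bytes of `x`, starting at the 1-based position `offset`.
// [[Rcpp::export]]
Rcpp::RawVector raw_slice(
  const Rcpp::RawVector &x,
  const R_xlen_t offset,
  const R_xlen_t size) {

  if (offset < 1 || size < 0 || size > x.size() - offset + 1) {
    Rcpp::stop("slice is out of bounds");
  }
  Rcpp::RawVector result = Rcpp::no_init(size);
  if (size > 0) std::memcpy(&result[0], &x[offset - 1], size);
  return result;
}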
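
The function above copies the slice with memcpy, so it is O(size) rather than O(1). If the end goal is only to get a long buffer onto disk in pieces of limited size, the copy can be avoided by doing the write itself on the C++ side. The following is a minimal standalone sketch of that idea, assuming plain C stdio and no Rcpp. The name write_in_chunks and the parameter max_chunk are hypothetical and are not part of storr, fst, or Rcpp.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Hypothetical helper: write `size` bytes starting at `data` to `path`,
// passing at most `max_chunk` bytes to each fwrite call. Every chunk is
// addressed by pointer arithmetic on the original buffer, so no
// intermediate copy of the data is made. Returns the bytes written.
inline std::size_t write_in_chunks(const std::uint8_t* data, std::size_t size,
                                   const char* path, std::size_t max_chunk) {
  assert(data != nullptr || size == 0);
  if (max_chunk == 0) return 0;
  std::FILE* f = std::fopen(path, "wb");
  if (f == nullptr) return 0;
  std::size_t written = 0;
  while (written < size) {
    const std::size_t n = std::min(max_chunk, size - written);
    const std::size_t got = std::fwrite(data + written, 1, n, f);
    written += got;
    if (got != n) break;  // short write: stop and report what was written
  }
  std::fclose(f);
  return written;
}
```

A caller would compare the return value with size to detect a failed or partial write. In an R package, data and size would presumably come from the raw vector itself, but that wiring is not shown here.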
