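I have been running some tests to find the most efficient way to replace NAs in a data frame.
I started by comparing alternative solutions for replacing NAs with 0 on a data set with 1 million rows and 12 columns.
I put all the pipe-capable ones into microbenchmark and got the results below.
Question 1: Is there a way to test a subset left-assignment statement (e.g. df1[is.na(df1)] <- 0) inside the benchmark function?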
library(dplyr)
library(tidyr)
library(microbenchmark)
set.seed(24)
df1 <- as.data.frame(matrix(sample(c(NA, 1:5), 1e6 *12, replace=TRUE),
dimnames = list(NULL, paste0("var", 1:12)), ncol=12))
op <- microbenchmark(
mut_all_ifelse = df1 %>% mutate_all(funs(ifelse(is.na(.), 0, .))),
mut_at_ifelse = df1 %>% mutate_at(funs(ifelse(is.na(.), 0, .)), .cols = c(1:12)),
# df1[is.na(df1)] <- 0 would sit here, but I can't make it work inside this function
replace = df1 %>% replace(., is.na(.), 0),
mut_all_replace = df1 %>% mutate_all(funs(replace(., is.na(.), 0))),
mut_at_replace = df1 %>% mutate_at(funs(replace(., is.na(.), 0)), .cols = c(1:12)),
replace_na = df1 %>% replace_na(list(var1 = 0, var2 = 0, var3 = 0, var4 = 0, var5 = 0, var6 = 0, var7 = 0, var8 = 0, var9 = 0, var10 = 0, var11 = 0, var12 = 0)),
times = 1000L
)
print(op) #standard data frame of the output
Unit: milliseconds
            expr       min       lq     mean   median       uq       max neval
  mut_all_ifelse 769.87848 844.5565 871.2476 856.0941 895.4545 1274.5610  1000
   mut_at_ifelse 713.48399 847.0322 875.9433 861.3224 899.7102 1006.6767  1000
         replace 258.85697 311.9708 334.2291 317.3889 360.6112  455.7596  1000
 mut_all_replace  96.81479 164.1745 160.6151 167.5426 170.5497  219.5013  1000
  mut_at_replace  96.23975 166.0804 161.9302 169.3984 172.7442  219.0359  1000
      replace_na 103.04600 161.2746 156.7804 165.1649 168.3683  210.9531  1000
boxplot(op) #boxplot of output
library(ggplot2) #nice log plot of the output
qplot(y=time, data=op, colour=expr) + scale_y_log10()
To test the subset assignment operator, I had originally run these tests with system.time().
> set.seed(24)
> Book1 <- as.data.frame(matrix(sample(c(NA, 1:5), 1e8 *12, replace=TRUE),
+ dimnames = list(NULL, paste0("var", 1:12)), ncol=12))
> system.time({
+ Book1 %>% mutate_all(funs(ifelse(is.na(.), 0, .))) })
user system elapsed
52.79 24.66 77.45
>
> system.time({
+ Book1 %>% mutate_at(funs(ifelse(is.na(.), 0, .)), .cols = c(1:12)) })
user system elapsed
52.74 25.16 77.91
>
> system.time({
+ Book1[is.na(Book1)] <- 0 })
user system elapsed
16.65 7.86 24.51
>
> system.time({
+ Book1 %>% replace_na(list(var1 = 0, var2 = 0, var3 = 0, var4 = 0, var5 = 0, var6 = 0, var7 = 0, var8 = 0, var9 = 0,var10 = 0, var11 = 0, var12 = 0)) })
user system elapsed
3.54 2.13 5.68
>
> system.time({
+ Book1 %>% mutate_at(funs(replace(., is.na(.), 0)), .cols = c(1:12)) })
user system elapsed
3.37 2.26 5.63
>
> system.time({
+ Book1 %>% mutate_all(funs(replace(., is.na(.), 0))) })
user system elapsed
3.33 2.26 5.58
>
> system.time({
+ Book1 %>% replace(., is.na(.), 0) })
user system elapsed
3.42 1.09 4.51
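In these tests, base replace() comes first.
In the benchmark, replace falls back in the ranking, while tidyr's replace_na() wins (by a nose).
Repeatedly running the single tests, and trying data frames of different shapes and sizes, always puts base replace() in the lead.
Question 2: How can its benchmark performance be the only result so far out of line with the simple test results?
More puzzling still:
Question 3: How can mutate_all/_at(replace()) work faster than a plain replace()?
Many people have reported this: http://datascience.la/dplyr-and-a-very-basic-benchmark/ (and all the links in that article), but I still haven't found an explanation of why, beyond the use of hashing and C++.
Special thanks to Tyler Rinker: https://www.r-bloggers.com/microbenchmarking-with-r/ and akrun: https://stackoverflow.com/a/41530071/5088194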
Answer 0 (score: 4)
You can include a complex or multi-statement expression in microbenchmark by wrapping it in {}, which basically turns it into a single expression:
microbenchmark(expr1 = { df1[is.na(df1)] = 0 },
exp2 = { tmp = 1:10; tmp[3] = 0L; tmp2 = tmp + 12L; tmp2 ^ 2 },
times = 10)
#Unit: microseconds
# expr min lq mean median uq max neval cld
# expr1 124953.716 137244.114 158576.030 142405.685 156744.076 284779.353 10 b
# exp2 2.784 3.132 17.748 23.142 24.012 38.976 10 a
The side effects of this are worth noting:
tmp
#[1] 1 2 0 4 5 6 7 8 9 10
in contrast to, say:
rm(tmp)
microbenchmark(expr1 = { df1[is.na(df1)] = 0 },
exp2 = local({ tmp = 1:10; tmp[3] = 0L; tmp2 = tmp + 12L; tmp2 ^ 2 }),
times = 10)
#Unit: microseconds
# expr min lq mean median uq max neval cld
# expr1 127250.18 132935.149 165296.3030 154509.553 169917.705 314820.306 10 b
# exp2 10.44 12.181 42.5956 54.636 57.072 97.789 10 a
tmp
#Error: object 'tmp' not found
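The same leak can be reproduced without microbenchmark. The following is a minimal base R sketch (the vector x is a made-up toy example, not part of the benchmark above): a bare braced block assigns in the calling environment, while local() assigns in a throwaway one.

```r
# Hypothetical toy vector, used only for illustration
x <- c(1L, NA, 3L)

# local() evaluates in a fresh environment: the subassignment creates a
# modified copy of x there and leaves the outer x alone
res <- local({ x[is.na(x)] <- 0L; x })
anyNA(res)  # FALSE
anyNA(x)    # TRUE, the outer x still has its NA

# a bare braced block runs in the current environment and changes x itself
{ x[is.na(x)] <- 0L }
anyNA(x)    # FALSE
```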
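With the side effects of the benchmark in mind, we see that the first operation, which removes the NA values, leaves fairly light work for the alternatives that follow: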
# re-assign because we changed it before
set.seed(24)
df1 = as.data.frame(matrix(sample(c(NA, 1:5), 1e6 * 12, TRUE),
dimnames = list(NULL, paste0("var", 1:12)), ncol = 12))
unique(sapply(df1, typeof))
#[1] "integer"
any(sapply(df1, anyNA))
#[1] TRUE
system.time({ df1[is.na(df1)] <- 0 })
# user system elapsed
# 0.39 0.14 0.53
whereas the earlier benchmark had left us with:
unique(sapply(df1, typeof))
#[1] "double"
any(sapply(df1, anyNA))
#[1] FALSE
When the input contains no NAs, replacing them should amount to doing nothing to the input.
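A sketch of that idea, assuming a per-column anyNA() check is acceptable. replace_na_guarded is a hypothetical helper name, not one of the benchmarked alternatives:

```r
# Hypothetical helper: only touch columns that actually contain NAs
replace_na_guarded <- function(x, val) {
  for (j in seq_along(x)) {
    col <- x[[j]]
    # anyNA() avoids building the full logical vector that is.na() returns
    if (anyNA(col)) {
      col[is.na(col)] <- val
      x[[j]] <- col
    }
  }
  x
}

toy <- data.frame(a = c(1L, NA, 3L), b = 4:6)  # toy data, for illustration only
out <- replace_na_guarded(toy, 0L)
out$a          # 1 0 3
typeof(out$b)  # "integer", column b was skipped entirely
```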
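Apart from that, note that in all of your alternatives you are subassigning a "double" (typeof(0)) to "integer" column vectors (sapply(df1, typeof)). I do not think there is any case, among the alternatives above, in which df1 is modified in place (once a "data.frame" has been created, information is stored so that its vector columns are copied if they are modified). Still, coercing to "double" and storing as "double" is a slight but avoidable overhead. Before replacing elements in an "integer" vector, R will either allocate and copy (for an "integer" replacement) or allocate and coerce (for a "double" replacement). In addition, after the first coercion (caused by the side effects of a benchmark, as noted above), R operates on "double"s, which involves slower manipulations than on "integer"s. I could not find a straightforward R way to investigate that difference, but in a nutshell (at the risk of not being totally accurate) we can simulate these operations with: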
# simulate R's copying of int to int
# allocate a new int and copy
int2int = inline::cfunction(sig = c(x = "integer"), body = '
SEXP ans = PROTECT(allocVector(INTSXP, LENGTH(x)));
memcpy(INTEGER(ans), INTEGER(x), LENGTH(x) * sizeof(int));
UNPROTECT(1);
return(ans);
')
# R's coercing of int to double
# 'coerceVector', internally, allocates a double and coerces to populate it
int2dbl = inline::cfunction(sig = c(x = "integer"), body = '
SEXP ans = PROTECT(coerceVector(x, REALSXP));
UNPROTECT(1);
return(ans);
')
# simulate R's copying from double to double
dbl2dbl = inline::cfunction(sig = c(x = "double"), body = '
SEXP ans = PROTECT(allocVector(REALSXP, LENGTH(x)));
memcpy(REAL(ans), REAL(x), LENGTH(x) * sizeof(double));
UNPROTECT(1);
return(ans);
')
And in a benchmark:
x.int = 1:1e7; x.dbl = as.numeric(x.int)
microbenchmark(int2int(x.int), int2dbl(x.int), dbl2dbl(x.dbl), times = 50)
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# int2int(x.int) 16.42710 16.91048 21.93023 17.42709 19.38547 54.36562 50 a
# int2dbl(x.int) 35.94064 36.61367 47.15685 37.40329 63.61169 78.70038 50 b
# dbl2dbl(x.dbl) 33.51193 34.18427 45.30098 35.33685 63.45788 75.46987 50 b
To conclude(!) the whole previous note: replacing 0 with 0L will save some time...
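The type change itself is easy to see at the R level. A small sketch on a toy vector (the names x and n are only for illustration):

```r
x <- c(1L, NA, 3L)  # toy integer vector

typeof(replace(x, is.na(x), 0))   # "double": the whole vector gets coerced
typeof(replace(x, is.na(x), 0L))  # "integer": the type is preserved

# a double vector also needs roughly twice the memory of an integer one
n <- 1e6
object.size(integer(n))
object.size(double(n))
```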
Finally, to replicate the benchmark in a fairer way, we could use:
library(dplyr)
library(tidyr)
library(microbenchmark)
set.seed(24)
df1 = as.data.frame(matrix(sample(c(NA, 1:5), 1e6 * 12, TRUE),
dimnames = list(NULL, paste0("var", 1:12)), ncol = 12))
with the alternatives wrapped in functions:
stopifnot(ncol(df1) == 12) #some of the alternatives are hardcoded to 12 columns
mut_all_ifelse = function(x, val) x %>% mutate_all(funs(ifelse(is.na(.), val, .)))
mut_at_ifelse = function(x, val) x %>% mutate_at(funs(ifelse(is.na(.), val, .)), .cols = c(1:12))
baseAssign = function(x, val) { x[is.na(x)] <- val; x }
baseFor = function(x, val) { for(j in 1:ncol(x)) x[[j]][is.na(x[[j]])] = val; x }
base_replace = function(x, val) x %>% replace(., is.na(.), val)
mut_all_replace = function(x, val) x %>% mutate_all(funs(replace(., is.na(.), val)))
mut_at_replace = function(x, val) x %>% mutate_at(funs(replace(., is.na(.), val)), .cols = c(1:12))
myreplace_na = function(x, val) x %>% replace_na(list(var1 = val, var2 = val, var3 = val, var4 = val, var5 = val, var6 = val, var7 = val, var8 = val, var9 = val, var10 = val, var11 = val, var12 = val))
Test that the results are equal before benchmarking:
identical(mut_all_ifelse(df1, 0), mut_at_ifelse(df1, 0))
#[1] TRUE
identical(mut_at_ifelse(df1, 0), baseAssign(df1, 0))
#[1] TRUE
identical(baseAssign(df1, 0), baseFor(df1, 0))
#[1] TRUE
identical(baseFor(df1, 0), base_replace(df1, 0))
#[1] TRUE
identical(base_replace(df1, 0), mut_all_replace(df1, 0))
#[1] TRUE
identical(mut_all_replace(df1, 0), mut_at_replace(df1, 0))
#[1] TRUE
identical(mut_at_replace(df1, 0), myreplace_na(df1, 0))
#[1] TRUE
Test with coercion to "double":
benchnum = microbenchmark(mut_all_ifelse(df1, 0),
mut_at_ifelse(df1, 0),
baseAssign(df1, 0),
baseFor(df1, 0),
base_replace(df1, 0),
mut_all_replace(df1, 0),
mut_at_replace(df1, 0),
myreplace_na(df1, 0),
times = 10)
benchnum
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# mut_all_ifelse(df1, 0) 1368.5091 1441.9939 1497.5236 1509.2233 1550.1416 1629.6959 10 c
# mut_at_ifelse(df1, 0) 1366.1674 1389.2256 1458.1723 1464.5962 1503.4337 1553.7110 10 c
# baseAssign(df1, 0) 532.4975 548.9444 586.8198 564.3940 655.8083 667.8634 10 b
# baseFor(df1, 0) 169.6048 175.9395 206.7038 189.5428 197.6472 308.6965 10 a
# base_replace(df1, 0) 518.7733 547.8381 597.8842 601.1544 643.4970 666.6872 10 b
# mut_all_replace(df1, 0) 169.1970 183.5514 227.1978 194.0903 291.6625 346.4649 10 a
# mut_at_replace(df1, 0) 176.7904 186.4471 227.3599 202.9000 303.4643 309.2279 10 a
# myreplace_na(df1, 0) 172.4926 177.8518 199.1469 186.3645 192.1728 297.0419 10 a
Test without coercion to "double":
benchint = microbenchmark(mut_all_ifelse(df1, 0L),
mut_at_ifelse(df1, 0L),
baseAssign(df1, 0L),
baseFor(df1, 0L),
base_replace(df1, 0L),
mut_all_replace(df1, 0L),
mut_at_replace(df1, 0L),
myreplace_na(df1, 0L),
times = 10)
benchint
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# mut_all_ifelse(df1, 0L) 1291.17494 1313.1910 1377.9265 1353.2812 1417.4389 1554.6110 10 c
# mut_at_ifelse(df1, 0L) 1295.34053 1315.0308 1372.0728 1353.0445 1431.3687 1478.8613 10 c
# baseAssign(df1, 0L) 451.13038 461.9731 477.3161 471.0833 484.9318 528.4976 10 b
# baseFor(df1, 0L) 98.15092 102.4996 115.7392 107.9778 136.2227 139.7473 10 a
# base_replace(df1, 0L) 428.54747 451.3924 471.5011 470.0568 497.7088 516.1852 10 b
# mut_all_replace(df1, 0L) 101.66505 102.2316 137.8128 130.5731 161.2096 243.7495 10 a
# mut_at_replace(df1, 0L) 103.79796 107.2533 119.1180 112.1164 127.7959 166.9113 10 a
# myreplace_na(df1, 0L) 100.03431 101.6999 120.4402 121.5248 137.1710 141.3913 10 a
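To put a number on the difference, the medians from the two tables can be compared directly. This is only a sketch on the figures printed above (hard-coded here, in milliseconds); with times = 10 the ratios are rough and will vary between runs and machines.

```r
# Medians (ms) copied from the "double" and "integer" tables above
med_dbl <- c(mut_all_ifelse = 1509.2233, mut_at_ifelse = 1464.5962,
             baseAssign = 564.3940, baseFor = 189.5428,
             base_replace = 601.1544, mut_all_replace = 194.0903,
             mut_at_replace = 202.9000, myreplace_na = 186.3645)
med_int <- c(mut_all_ifelse = 1353.2812, mut_at_ifelse = 1353.0445,
             baseAssign = 471.0833, baseFor = 107.9778,
             base_replace = 470.0568, mut_all_replace = 130.5731,
             mut_at_replace = 112.1164, myreplace_na = 121.5248)

# a ratio above 1 means the 0L variant had the lower median in these runs
speedup <- round(med_dbl / med_int, 2)
sort(speedup, decreasing = TRUE)
```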
A simple way to visualize this:
boxplot(benchnum, ylim = range(min(summary(benchint)$min, summary(benchnum)$min),
max(summary(benchint)$max, summary(benchnum)$max)))
boxplot(benchint, add = TRUE, border = "red", axes = FALSE)
legend("topright", c("coerce", "not coerce"), fill = c("black", "red"))
Note that df1 is unchanged after all of this (check with str(df1)).
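One possible reading of the tables, which is my assumption and not something measured above: the whole-frame variants (baseAssign, base_replace) first build a logical matrix as large as the data with is.na(df1) and then go through data frame subassignment, while baseFor and the mutate_* variants only ever handle one column vector at a time. A toy sketch of the two routes (df is made-up data):

```r
df <- data.frame(a = c(1L, NA, 3L), b = c(NA, 5L, 6L))  # toy data

# whole-frame route: is.na() on a data frame returns a logical matrix
m <- is.na(df)
is.matrix(m)  # TRUE
whole <- replace(df, m, 0L)

# column-wise route: each step works on a plain vector
bycol <- df
for (j in seq_along(bycol)) bycol[[j]][is.na(bycol[[j]])] <- 0L

identical(whole, bycol)
```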