Optimizing comparisons in R

Date: 2016-02-03 13:36:43

Tags: r performance optimization comparison

I'm still new to R programming and I need to optimize part of my code. I'll explain below how it works.

My current code is too slow

myfunc <- function(dt){
    indexes = which(dt$time == CURRENT)

    for(i in indexes){
        # columns foo, bar & baz are used to build rowname
        # and colnames
        linename = paste(dt$foo[i], "_", dt$bar[i], sep="")
        colname  = dt$baz[i]

        # related_var is the name of an other global var
        # and value is the corresponding value in
        # related_var[linename, colname]
        dt$value[i] = get(dt$related_var[i])[[linename, colname]]
    }
    return(dt)
}

How do I use it?

This is not the actual part of my code, so I have just simplified it

CURRENT = 0
MAX     = 1000
for(i in 1:MAX){
    doSomeStuffOnGlobalVars()
    # get datas from global var for this CURRENT
    dt = myfunc(dt)
    CURRENT = CURRENT + 1
}

Some explanations

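This function is called for every value of CURRENT (1, 2, 3, 4, 5, ... 1000). For each row of dt where dt$time == CURRENT we want to update $value, and the catch is that the variable "varname" is modified once per CURRENT.
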
dt : a data.table ordered by time in the form of
    foo   bar   baz   time   related_var   value
    1     1   "toto"  1      "varname"      NA
    1     2   "toto"  1      "varname"      NA
    2     1   "tata"  1      "varname"      NA
    2     8   "toto"  1      "varname"      NA
    ...

related_var : contains the name of a global data.frame which has its
    colnames defined by baz in dt 
    rownames defined by a combination of foo & bar (foo_bar) in dt


example of "varname" variable:
          toto   tata
    1_1    1.6    2
    1_2    42   1337
    ...    ...    ...
    10_10    3.14   1.61

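I have already made some changes (I was using a data.frame before switching to data.table, and eval(parse(...)) before get), but it is still slow (about 5 s for a dt of about 5000 rows). If you have ideas (R-specific or purely algorithmic), I would like to know how to optimize it.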

N.B. Tell me if this is too cryptic.

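Edit: I found that the slow part is dt$value[i] = get(dt$related_var[i])[[linename, colname]]. If I do a simple assignment such as justAvar = get(dt$related_var[i])[[linename, colname]], it is much faster. So now my question is: "How does R access an element by index? If I want index = 15, does R walk through all 14 preceding elements?"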
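On the indexing question in the edit: as far as I know, R atomic vectors are stored contiguously, so reading element 15 does not walk through the 14 elements before it. A more likely cost is the assignment dt$value[i] = ..., which goes through `$<-` and may copy data on every iteration. Below is a minimal sketch, assuming the data.table package is installed and using a made-up toy table (the names new_values and indexes are only for illustration), of doing the per-row write with data.table's set(), which assigns by reference:

```r
library(data.table)

# Toy data, invented for illustration (not the asker's real table)
dt <- data.table(time = rep(1:2, each = 3), value = NA_real_)
new_values <- c(1.6, 42, 2)

indexes <- which(dt$time == 1)
for (k in seq_along(indexes)) {
    # set() updates a single cell in place instead of going through `$<-`
    set(dt, i = indexes[k], j = "value", value = new_values[k])
}
print(dt)
```

This keeps the loop structure of the question and only swaps the assignment, so it is a way to check whether the write is really the bottleneck before restructuring anything else.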

1 answer:

Answer 0 (score: 0)

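First, I would precompute linename; I suspect that over the whole loop it ends up being computed for almost the entire data table anyway. This uses data.table's by-reference magic. Second, inline and simplify the function. Finally, use the data.table [i, j, by] approach.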

dt <- ...

dt[, linename := paste(foo, "_", bar, sep="")]

CURRENT <- 0
MAX     <- 1000
for(i in 1:MAX) {
    doSomeStuffOnGlobalVars()
    # get datas from global var for this CURRENT

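    # group by related_var so get() receives a single name, then look up
    # every (row name, column name) pair at once with matrix indexing
    dt[time == CURRENT,
       value := as.matrix(get(related_var))[cbind(linename, baz)],
       by = related_var]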
    CURRENT <- CURRENT + 1
}
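To make the idea above concrete, here is a small self-contained sketch on invented data. It is not the asker's real setup: the lookup table contents are made up, and a named list called tables (a hypothetical name) stands in for get() on global variables so the example does not depend on the global environment. It assumes the value column is numeric (NA_real_).

```r
library(data.table)

# Hypothetical lookup table standing in for the global data.frame "varname"
varname <- data.frame(toto = c(1.6, 42, 7, 9), tata = c(2, 1337, 5, 8),
                      row.names = c("1_1", "1_2", "2_1", "2_8"))
# Assumption: a named list replaces get(), keyed by the related_var name
tables <- list(varname = as.matrix(varname))

dt <- data.table(foo = c(1, 1, 2, 2), bar = c(1, 2, 1, 8),
                 baz = c("toto", "toto", "tata", "toto"),
                 time = 1, related_var = "varname", value = NA_real_)

# Build the row key once, by reference
dt[, linename := paste(foo, bar, sep = "_")]

CURRENT <- 1
# One vectorised lookup per related table: each (row name, column name)
# pair selects a single cell of the matrix
dt[time == CURRENT,
   value := tables[[related_var]][cbind(linename, baz)],
   by = related_var]
print(dt)
```

Grouping by related_var means each group sees exactly one table name, and the two-column character matrix built by cbind() indexes the lookup matrix by its dimnames, so no per-row loop is needed.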

Update

Useful reading: http://www.r-bloggers.com/strategies-to-speedup-r-code/

Update II

You can also set a key on dt before the loop

setkey(dt, time)
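As a rough illustration of what the key buys you, the sketch below uses an invented toy table. Once time is the key, subsetting with .(CURRENT) in i is a join on the key, which data.table resolves by binary search rather than by scanning the whole column.

```r
library(data.table)

# Toy table, invented for illustration
dt <- data.table(time = c(3, 1, 2, 1, 3), value = NA_real_)

setkey(dt, time)   # sorts dt by time and marks time as the key

CURRENT <- 1
# .(CURRENT) joins on the key, so matching rows are located by binary search
dt[.(CURRENT), value := 0]
print(dt)
```

Only value is updated here; assigning to the key column itself would drop the key, so time is best left untouched inside the loop.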