如何遍历列的所有组合并在R中按组应用功能?

时间:2018-11-01 00:37:00

标签: r data.table

我有以下名为data.table的{​​{1}}

dt

我的目标是通过 set.seed(1) dt <- data.table(expand.grid(c("a","b"),1:2,1:2,c("M","N","O","P","Q"))) dt$perf <- rnorm(nrow(dt),0,.01) colnames(dt) <- c("ticker","par1","par2","row_names","perf") 遍历par1par2的所有组合,并选择最大化row_names的组合。 让我们看一下数据,以便在视觉上更有意义。

cumprod(mean(perf)+1)-1

对于每个dt[order(row_names,ticker,par1,par2)] ticker par1 par2 row_names perf 1: a 1 1 M 0.011462284 2: a 1 2 M -0.004252677 3: a 2 1 M 0.005727396 4: a 2 2 M -0.003892372 5: b 1 1 M -0.024030962 6: b 1 2 M 0.009510128 7: b 2 1 M 0.003747244 8: b 2 2 M -0.002843307 ticker,我们有row_names2 x 2 = 4的{​​{1}}组合,即par1

我想计算与{{1}关联的par2的{​​{1}},并将所有(1,1) (1,2) (2,1) (2,2)mean的所有其他组合关联。使用上图中的数字,

perf

然后,我们对ticker = a, par1 = 1, par2 = 1重复此过程,并对perf进行所有其他组合。

我们将对ticker = bres a_perf b_perf 1: 0.01146228 -0.024030962 2: 0.01146228 0.009510128 3: 0.01146228 0.003747244 4: 0.01146228 -0.002843307 apply(res,1,mean) [1] -0.006284339 0.010486206 0.007604764 0.004309488 与所有ticker = a, par1 = 1, par2 = 2的所有组合重复此过程。

编辑:::使用@earch的建议,我们得到以下信息:

ticker = b

在这里,我想为par1选择par2。一种实现方法是,如果我以后不关心引用索引的话,

row_names

这将是我如何计算完成所需的最终结果:

tmp <- lapply(split(dt, dt$row_names), calcCombMeans)
$M
   a.row b.row          mean
1      1     2 -0.0022140524
2      3     2 -0.0032599264
3      5     2  0.0025657555
4      7     2  0.0033553619
5      1     4  0.0048441350
6      3     4  0.0037982609
7      5     4  0.0096239429
8      7     4  0.0104135493
9      1     6 -0.0072346110
10     3     6 -0.0082804850
11     5     6 -0.0024548031
12     7     6 -0.0016651967
13     1     8  0.0005593545
14     3     8 -0.0004865195
15     5     8  0.0053391624
16     7     8  0.0061287688

我还没有决定我是否需要max(mean)信息(为了复制特定row_names M,N,O,P,Q的精确计算,我可能会决定),但是使用此res <- sapply(1:length(tmp),function(i) which.max(tmp[[i]]$perf)) [1] 8 6 3 12 16 ,我可以通过执行以下操作来计算我的res <- rbindlist(tmp,id="row_names") res <- res[,list(best=max(perf),best_idx = which.max(perf)),by=row_names] row_names best best_idx 1: M 0.010413549 8 2: N 0.009508122 6 3: O 0.009314068 3 4: P 0.008883106 12 5: Q 0.009316006 16

best_idx

@earch的确可以帮助您查看计算所有这些组合的过程。我想知道通过使用row_names的功能是否有更有效的解决方案。我的真实数据集比这个数据集(数百万行)大得多,并且组合将开始造成损失。

编辑#2 :::在能够逐步完成该过程之后,我想出了一个非常快速的解决方案!

res

使用cumRet使我可以按组和代码组合计算最大性能。然后,完成后,我可以按res[,cumRet:= cumprod(best+1)-1] > res row_names best best_idx cumRet 1: M 0.010413549 8 0.01041355 2: N 0.009508122 6 0.02002068 3: O 0.009314068 3 0.02952123 4: P 0.008883106 12 0.03866657 5: Q 0.009316006 16 0.04834280 分组。并得到相同的结果!

data.table

2 个答案:

答案 0 :(得分:1)

我不确定累积乘积要取什么值,但这是一个函数,用于计算row_names中a和b的所有性能组合之间的平均值。它应该为您提供完成任务所需的条件:

calcCombMeans <- function(dt) {
  a.rows <- which(dt$ticker == "a")
  b.rows <- which(dt$ticker == "b")
  rep.rows <- expand.grid(a.row = a.rows, b.row = b.rows)
  rep.rows$mean <- sapply(1:nrow(rep.rows), function(i) {
    mean(dt$perf[unlist(rep.rows[i, ])])
  })
  dt$means <- lapply(1:nrow(dt), function(i) {
    if(dt$ticker[i] == "a") {
      filter(rep.rows, a.row == i)$mean
    } else {
      filter(rep.rows, b.row == i)$mean
    }
  })
  dt
}

do.call(rbind, lapply(split(dt, dt$row_names), calcCombMeans))

    ticker par1 par2 row_names          perf
 1:      a    1    1         M -0.0062645381
 2:      b    1    1         M  0.0018364332
 3:      a    2    1         M -0.0083562861
 4:      b    2    1         M  0.0159528080
 5:      a    1    2         M  0.0032950777
 6:      b    1    2         M -0.0082046838
 7:      a    2    2         M  0.0048742905
 8:      b    2    2         M  0.0073832471
 9:      a    1    1         N  0.0057578135
10:      b    1    1         N -0.0030538839
11:      a    2    1         N  0.0151178117
12:      b    2    1         N  0.0038984324
13:      a    1    2         N -0.0062124058
14:      b    1    2         N -0.0221469989
15:      a    2    2         N  0.0112493092
16:      b    2    2         N -0.0004493361
17:      a    1    1         O -0.0001619026
18:      b    1    1         O  0.0094383621
19:      a    2    1         O  0.0082122120
20:      b    2    1         O  0.0059390132
21:      a    1    2         O  0.0091897737
22:      b    1    2         O  0.0078213630
23:      a    2    2         O  0.0007456498
24:      b    2    2         O -0.0198935170
25:      a    1    1         P  0.0061982575
26:      b    1    1         P -0.0005612874
27:      a    2    1         P -0.0015579551
28:      b    2    1         P -0.0147075238
29:      a    1    2         P -0.0047815006
30:      b    1    2         P  0.0041794156
31:      a    2    2         P  0.0135867955
32:      b    2    2         P -0.0010278773
33:      a    1    1         Q  0.0038767161
34:      b    1    1         Q -0.0005380504
35:      a    2    1         Q -0.0137705956
36:      b    2    1         Q -0.0041499456
37:      a    1    2         Q -0.0039428995
38:      b    1    2         Q -0.0005931340
39:      a    2    2         Q  0.0110002537
40:      b    2    2         Q  0.0076317575
    ticker par1 par2 row_names          perf
                                                      means
 1: -0.0022140524, 0.0048441350,-0.0072346110, 0.0005593545
 2:     -0.002214052,-0.003259926, 0.002565755, 0.003355362
 3: -0.0032599264, 0.0037982609,-0.0082804850,-0.0004865195
 4:         0.004844135,0.003798261,0.009623943,0.010413549
 5:      0.002565755, 0.009623943,-0.002454803, 0.005339162
 6:     -0.007234611,-0.008280485,-0.002454803,-0.001665197
 7:      0.003355362, 0.010413549,-0.001665197, 0.006128769
 8:  0.0005593545,-0.0004865195, 0.0053391624, 0.0061287688
 9:      0.001351965, 0.004828123,-0.008194593, 0.002654239
10:      0.001351965, 0.006031964,-0.004633145, 0.004097713
11:      0.006031964, 0.009508122,-0.003514594, 0.007334238
12:      0.004828123, 0.009508122,-0.001156987, 0.007573871
13:     -0.004633145,-0.001156987,-0.014179702,-0.003330871
14:     -0.008194593,-0.003514594,-0.014179702,-0.005448845
15:      0.004097713, 0.007573871,-0.005448845, 0.005399987
16:      0.002654239, 0.007334238,-0.003330871, 0.005399987
17:      0.004638230, 0.002888555, 0.003829730,-0.010027710
18:         0.004638230,0.008825287,0.009314068,0.005092006
19:      0.008825287, 0.007075613, 0.008016787,-0.005840653
20:         0.002888555,0.007075613,0.007564393,0.003342332
21:      0.009314068, 0.007564393, 0.008505568,-0.005351872
22:         0.003829730,0.008016787,0.008505568,0.004283506
23:      0.005092006, 0.003342332, 0.004283506,-0.009573934
24:     -0.010027710,-0.005840653,-0.005351872,-0.009573934
25:      0.002818485,-0.004254633, 0.005188837, 0.002585190
26:      0.002818485,-0.001059621,-0.002671394, 0.006512754
27:     -0.001059621,-0.008132739, 0.001310730,-0.001292916
28: -0.0042546332,-0.0081327395,-0.0097445122,-0.0005603642
29: -0.0026713940,-0.0097445122,-0.0003010425,-0.0029046889
30:  0.0051888365, 0.0013107303,-0.0003010425, 0.0088831056
31:  0.0065127541,-0.0005603642, 0.0088831056, 0.0062794591
32:      0.002585190,-0.001292916,-0.002904689, 0.006279459
33:  0.0016693329,-0.0001366148, 0.0016417911, 0.0057542368
34:      0.001669333,-0.007154323,-0.002240475, 0.005231102
35:     -0.007154323,-0.008960271,-0.007181865,-0.003069419
36: -0.0001366148,-0.0089602706,-0.0040464226, 0.0034251540
37:     -0.002240475,-0.004046423,-0.002268017, 0.001844429
38:      0.001641791,-0.007181865,-0.002268017, 0.005203560
39:         0.005231102,0.003425154,0.005203560,0.009316006
40:      0.005754237,-0.003069419, 0.001844429, 0.009316006
                                                      means

答案 1 :(得分:0)

编辑#2 :::在能够逐步完成该过程之后,我想出了一个非常快速的解决方案!

tmp <- dt[,list(par1=par1[which.max(perf)],par2=par2[which.max(perf)],
                                           perf=max(perf)),
                                           by=list(ticker,row_names)]
res <- tmp[,list(perf=mean(perf),par1= paste(par1,collapse=","),
                                          par2=paste(par2,collapse=",")),by=row_names]

使用data.table使我可以按组和代码组合计算最大性能。然后,完成后,我可以按row_names分组。并得到相同的结果!

> res
   row_names        perf par1 par2
1:         M 0.010413549  2,2  2,1
2:         N 0.009508122  2,2  1,1
3:         O 0.009314068  1,1  2,1
4:         P 0.008883106  2,1  2,2
5:         Q 0.009316006  2,2  2,2