Question

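This question describes the setup of my problem well.

However, instead of a second value I have a factor named algorithm. My data frame looks like this (note that duplicate values are possible even within a group):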

algorithm <- c("global", "distributed", "distributed", "none", "global", "global", "distributed", "none", "none")
v <- c(5, 2, 6, 7, 3, 1, 10, 2, 2)
df <- data.frame(algorithm, v)
df
    algorithm  v
1      global  5
2 distributed  2
3 distributed  6
4        none  7
5      global  3
6      global  1
7 distributed 10
8        none  2
9        none  2

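I want to sort the data frame by v, but obtain each entry's sorted position relative to its group (algorithm). This position should then be added to the original data frame (so I do not need to rearrange it), because I want to use ggplot to plot the computed position as x and the value as y (grouped by algorithm, i.e. each algorithm is one set of points).

So the result should look like this: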

    algorithm  v  groupIndex
1      global  5  3
2 distributed  2  1
3 distributed  6  2
4        none  7  3
5      global  3  2
6      global  1  1
7 distributed 10  3
8        none  2  1
9        none  2  2

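So far I know that I can sort the data first by algorithm and then by value, or the other way around. I guess that in a second step I have to compute the index within each group? Is there an easy way to do that?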

df[order(df$algorithm, df$v), ]
    algorithm  v
2 distributed  2
3 distributed  6
7 distributed 10
6      global  1
5      global  3
1      global  5
8        none  2
9        none  2
4        none  7

Edit: there is no guarantee that every group has the same number of entries!

Answer 1

A double application of order within each group should cover it:

ave(df$v, df$algorithm, FUN=function(x) order(order(x)) )
#[1] 3 1 2 3 2 1 3 1 2

This is equivalent to:

ave(df$v, df$algorithm, FUN=function(x) rank(x,ties.method="first") )
#[1] 3 1 2 3 2 1 3 1 2
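
To see why the double order and rank with ties.method="first" agree, here is a minimal sketch on a single group. The vector x is just the "none" group from the question, and the intermediate names o and r are made up for illustration:

```r
# Values of the "none" group from the question, including a tie.
x <- c(7, 2, 2)

# order(x) lists the positions of x from the smallest to the largest value.
o <- order(x)   # 2 3 1

# Ordering that permutation again gives, for each element of x,
# its position in the sorted sequence; tied values keep their
# original left-to-right order.
r <- order(o)   # 3 1 2

# Same result as rank() with ties broken by first appearance.
stopifnot(all(r == rank(x, ties.method = "first")))
```

ave() simply applies this per-group computation to each level of algorithm and puts the results back in the original row order.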

This, in turn, means that if you are concerned about speed, you can use frank from data.table:

library(data.table)
setDT(df)[, grpidx := frank(v, ties.method="first"), by=algorithm]
df
#     algorithm  v grpidx
#1:      global  5      3
#2: distributed  2      1
#3: distributed  6      2
#4:        none  7      3
#5:      global  3      2
#6:      global  1      1
#7: distributed 10      3
#8:        none  2      1
#9:        none  2      2

Answer 2

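Here is one way. I think you can use with_order() to order the values by v within each group. You can specify the ranking by passing row_number() as the function. This way you can skip the step of arranging the data for each group, which you would need when using order().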

library(dplyr)
group_by(df, algorithm) %>%
mutate(groupInd = with_order(order_by = v, fun = row_number, x = v))

#    algorithm     v groupInd
#       <fctr> <int>    <int>
#1      global     5        3
#2 distributed     2        1
#3 distributed     6        2
#4        none     7        3
#5      global     3        2
#6      global     1        1
#7 distributed    10        3
#8        none     2        1
#9        none     2        2
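
For the plotting step mentioned in the question, a rough sketch could look like the following. It assumes ggplot2 is installed, rebuilds the example data so it runs on its own, and computes the index with base R's ave(); the column name groupIndex and the choice of geom_point() are illustrative assumptions, not something the question prescribes:

```r
library(ggplot2)  # assumed to be installed

# Example data from the question.
algorithm <- c("global", "distributed", "distributed", "none", "global",
               "global", "distributed", "none", "none")
v <- c(5, 2, 6, 7, 3, 1, 10, 2, 2)
df <- data.frame(algorithm, v)

# Position of each value within its algorithm group (ties by first appearance).
df$groupIndex <- ave(df$v, df$algorithm,
                     FUN = function(x) rank(x, ties.method = "first"))

# Within-group position on x, value on y, one colour per algorithm.
p <- ggplot(df, aes(x = groupIndex, y = v, colour = algorithm)) +
  geom_point()
```

Calling print(p) draws the plot. Because the index is computed per group, groups with different numbers of entries simply span different ranges on the x axis.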

Sort data by column, adding an index within each group
