Question

Given the df below,

I need to compute the median of the variable metric for each of the teams tm1, tm2 and tm3, per locid, day, hour combination,

then keep only those locid, day, hour combinations for which the median of metric is the same across teams tm1, tm2 and tm3.

set.seed(100)
df <- data.frame(
    locid = sample(c(1111,1122,1133), 20, replace=TRUE),
    day = sample(c(1:3), 20, replace=TRUE),
    hour = sample(c(1:4), 20, replace=TRUE),
    team = sample(c("tm1", "tm2", "tm3"), 20, replace=TRUE),
    metric = sample(1:5, 20, replace=TRUE )
)

My attempt:

library(dplyr)

df_medians <- df %>%
  # list the grouping columns; `locid + day + hour + team` would try to add them
  group_by(locid, day, hour, team) %>%
  summarise(metric_median = median(metric))

This gives the median per team for each locid + day + hour. I now need to find the locid + day + hour combos that have the same median value across teams tm1, tm2 and tm3.

df_medians %>% group_by(locid, day, hour, team) %>% summarise(??what here??)

I was trying with dplyr, but a base R solution is fine too.

As a simpler example, consider the data below, which has measurements from two different locations for two teams.

+-------+------+-------+-------+---------+
| locid |  day |  hour |  team |  metric |
+-------+------+-------+-------+---------+
|  1111 |    1 |     1 |  tm1  |       3 |
|  1111 |    1 |     1 |  tm1  |       2 |
|  1111 |    1 |     1 |  tm1  |       1 |

|  1111 |    1 |     1 |  tm2  |       1 |
|  1111 |    1 |     1 |  tm2  |       2 |
|  1111 |    1 |     1 |  tm2  |       3 |

|  1122 |    1 |     1 |  tm1  |       3 |
|  1122 |    1 |     1 |  tm1  |       2 |
|  1122 |    1 |     1 |  tm1  |       1 |

|  1122 |    1 |     1 |  tm2  |       1 |
|  1122 |    1 |     1 |  tm2  |       2 |
|  1122 |    1 |     1 |  tm2  |       1 |
+-------+------+-------+-------+---------+

Step 1 - compute the median by group

+-------+------+-------+-------+-------------+
| locid |  day |  hour |  team |  metric_med |
+-------+------+-------+-------+-------------+
|  1111 |    1 |     1 |  tm1  |       2     |
|  1111 |    1 |     1 |  tm2  |       2     |
|  1122 |    1 |     1 |  tm1  |       2     |
|  1122 |    1 |     1 |  tm2  |       1     |
+-------+------+-------+-------+-------------+

Step 2 - compare medians across teams within each group (locid + day + hour). Only (1111, 1, 1) has the same metric_med across the teams tm1 and tm2.

+-------+------+-------+-------------+
| locid |  day |  hour |  metric_med |
+-------+------+-------+-------------+
|  1111 |    1 |     1 |       2     |
+-------+------+-------+-------------+
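For anyone who wants to reproduce the simpler example, here is a minimal sketch that types the first table in by hand and recomputes step 1. The object names `simple` and `step1` are placeholders introduced here, not part of the original data.

```r
library(dplyr)

# The simpler example from the first table, entered by hand
simple <- data.frame(
  locid  = rep(c(1111, 1122), each = 6),
  day    = 1,
  hour   = 1,
  team   = rep(rep(c("tm1", "tm2"), each = 3), times = 2),
  metric = c(3, 2, 1,  1, 2, 3,  3, 2, 1,  1, 2, 1)
)

# Step 1: median of metric per locid, day, hour and team
step1 <- simple %>%
  group_by(locid, day, hour, team) %>%
  summarise(metric_med = median(metric)) %>%
  ungroup()

print(step1)
```

The four rows of `step1` should match the step 1 table: a median of 2 everywhere except tm2 at locid 1122, where it is 1.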

Answer 1

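One approach is to spread the team medians for each locid, day and hour group into a single row and then compare the columns. This solution works for more than two teams and for more complex conditions.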

library(dplyr)
library(tidyr)

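df %>% 
  group_by(locid, day, hour, team) %>% 
  summarize(median = median(metric)) %>%
  # one column per team, one row per locid, day, hour
  spread(team, median) %>% 
  # compares two teams; add further conditions for more teams
  filter(tm1 == tm2)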
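To illustrate how the spread approach might extend to three teams, here is a sketch on a small hand-made data set. The data frame `df3` and its values are invented for illustration only; the only change to the method is the extra condition in `filter()`.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: at locid 1111 all three teams share a median, at 1122 they do not
df3 <- data.frame(
  locid  = rep(c(1111, 1122), each = 6),
  day    = 1,
  hour   = 1,
  team   = rep(c("tm1", "tm2", "tm3"), times = 4),
  metric = c(2, 2, 2, 4, 4, 4,  1, 1, 5, 3, 3, 5)
)

wide <- df3 %>%
  group_by(locid, day, hour, team) %>%
  summarize(median = median(metric)) %>%
  ungroup() %>%
  spread(team, median)          # columns tm1, tm2, tm3

# Equality is transitive, so two comparisons cover all three teams
same <- wide %>% filter(tm1 == tm2, tm2 == tm3)

print(same)
```

Note that a combination missing one of the teams gets an NA in that column, and `filter()` drops rows where the condition evaluates to NA.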

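Another possible solution is to arrange the summarised results by locid, day and hour, and then compare the median in each row with its lag. This solution only works when there are exactly two teams.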

df %>% 
  group_by(locid, day, hour, team) %>% 
  summarize(median = median(metric)) %>%
  arrange(locid, day, hour) %>% 
  # still grouped by locid, day, hour, so lag() looks at the other team's row
  filter(median == lag(median))
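Since the lag comparison is limited to two teams, a possible variant that does not depend on the number of teams is to count distinct medians per locid, day, hour with `n_distinct()`. This is a sketch on invented data (`df3` is hypothetical), not the answerer's method.

```r
library(dplyr)

# Hypothetical data: at locid 1111 all three teams share a median, at 1122 they do not
df3 <- data.frame(
  locid  = rep(c(1111, 1122), each = 6),
  day    = 1,
  hour   = 1,
  team   = rep(c("tm1", "tm2", "tm3"), times = 4),
  metric = c(2, 2, 2, 4, 4, 4,  1, 1, 5, 3, 3, 5)
)

same_median <- df3 %>%
  group_by(locid, day, hour, team) %>%
  summarise(median = median(metric)) %>%
  ungroup() %>%
  group_by(locid, day, hour) %>%
  # keep combos whose team medians collapse to a single distinct value
  filter(n_distinct(median) == 1) %>%
  ungroup()

print(same_median)
```

A combination observed for only one team also passes this filter; adding a condition such as `n() == 3` inside `filter()` would require all three teams to be present.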

Answer 2

Let's reinterpret "all the same" to mean "zero variance or a single observation". So:

df %>%
  # per locid, day, hour, team
  group_by(locid, day, hour, team) %>%
  # compute median
  summarize(team_median = median(metric)) %>%
  # ungroup before specifying new grouping
  ungroup() %>%
  # for locid, day, hour
  group_by(locid, day, hour) %>%
  # find the medians that were the same for all teams
  # 'the same' here is taken to mean no variance
  # or having a single observation
  # note that, although logical vector TRUE | NA does yield TRUE
  # this is only because it must yield TRUE.
  # As another example, FALSE | NA, yields NA.
  # As a guard against team_medians that are NA, I add a coalesce wrapper.
  # I've decided that missing team_medians represent non-cases, YMMV
  summarize(all_equal = coalesce(n() == 1 | var(team_median) == 0, FALSE)) %>%
  filter(all_equal == TRUE) %>%
  select(-all_equal)
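The NA handling described in those comments can be checked in isolation. The following sketch uses made-up values and hypothetical variable names to show why a single observation passes and why a missing median falls back to FALSE.

```r
library(dplyr)

# Three-valued logic: TRUE | NA is TRUE, but FALSE | NA stays NA
true_or_na  <- TRUE | NA
false_or_na <- FALSE | NA

# The variance of a single value is NA, so the n() == 1 test has to carry the result
single_obs <- (1 == 1) | (var(5) == 0)

# Two medians, one of them missing: var() is NA, FALSE | NA is NA,
# and coalesce() replaces that NA with FALSE
with_missing <- coalesce((2 == 1) | (var(c(2, NA)) == 0), FALSE)

print(c(true_or_na, false_or_na, single_obs, with_missing))
```

Whether a group with a missing median should count as "not equal" is a judgment call, as the answer itself notes.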

compare aggregate value across groups
