Some (basic) questions about dplyr from the DataCamp introduction

Asked: 2015-10-06 06:46:40

Tags: r dplyr

A few basic questions from DataCamp's introductory dplyr course.

Why does:

hflights %>% 
  group_by(UniqueCarrier,Dest) %>%
  summarize(n=n()) %>%
  mutate(rank=rank(n)) %>%
  filter(rank==1)

produce a different answer from:

hflights %>% 
  group_by(UniqueCarrier, Dest) %>%
  summarise(n = n()) %>%
  mutate(rank = rank(desc(n))) %>%
  filter(rank == 1)

The only difference is the ranking order, but shouldn't the filter be independent of the order in which the items are ranked?

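Second, why does mean(ArrDelay > 0) in the code below give the proportion of flights with ArrDelay > 0? Shouldn't it just give the average delay of all flights that have a positive delay?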

hflights %>% 
  filter(!is.na(ArrDelay)) %>%
  group_by(UniqueCarrier) %>%
  summarize(p_delay=mean(ArrDelay>0)) %>%
  mutate(rank=rank(p_delay)) %>%
  arrange(rank)

Thanks!

1 answer:

Answer 0: (score: 2)

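I don't really understand the first question. Why would you expect the same results? Have a look at what desc actually does, e.g. desc(1:3): it reverses the sort order, so the ranks come out reversed. After summarize, the data are still grouped by UniqueCarrier, so the ranks are computed within each carrier. With rank(n), rank == 1 marks the destination with the fewest flights; with rank(desc(n)), it marks the destination with the most.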

rank(1:3)
## [1] 1 2 3
rank(desc(1:3))
## [1] 3 2 1
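To see how this changes what filter(rank == 1) keeps, here is a minimal base-R sketch. The destination counts are made up for illustration, and it uses -n in place of desc(n), on the assumption that for plain numeric input the two order values the same way.

```r
# Hypothetical flight counts per destination for a single carrier (made-up numbers)
n <- c(AUS = 5, DFW = 50, ATL = 20)

# For numeric input, -n sorts the same way as dplyr's desc(n)
rank_asc  <- rank(n)    # smallest count gets rank 1
rank_desc <- rank(-n)   # largest count gets rank 1

# The same condition, rank == 1, selects different destinations
least_flown <- names(n)[rank_asc == 1]
most_flown  <- names(n)[rank_desc == 1]

print(least_flown)
print(most_flown)
```

One caveat: rank() averages tied values by default, so if several destinations tie for the smallest (or largest) count, none of them gets a rank of exactly 1 and the filter returns nothing for that carrier.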

For your second question: ArrDelay > 0 is a logical vector. When you take the mean of a logical vector, R first converts it to numeric (TRUE -> 1, FALSE -> 0), so the mean is the proportion of TRUEs. To get the mean delay among flights with a positive delay instead, subset the delays first:

hflights %>% 
  filter(!is.na(ArrDelay)) %>%
  group_by(UniqueCarrier) %>%
  summarize(p_delay=mean(ArrDelay[ArrDelay>0])) %>%
  mutate(rank=rank(p_delay)) %>%
  arrange(rank)
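The difference between the two summaries is easy to check on a small vector. This is only a sketch with made-up delay values, not data from hflights:

```r
# Hypothetical arrival delays in minutes (made-up values)
delays <- c(-5, 10, 0, 20)

# Comparison gives a logical vector: FALSE TRUE FALSE TRUE
is_late <- delays > 0

# mean() coerces TRUE/FALSE to 1/0, so this is the share of late flights
share_late <- mean(is_late)

# Subsetting first gives the average delay among late flights only
mean_late_delay <- mean(delays[delays > 0])

print(share_late)
print(mean_late_delay)
```

Here two of the four flights are late, so the first value is 0.5, while the second averages only 10 and 20 and comes out as 15.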