Question

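I want to use dplyr to parameterise the following computation, which finds the values of Sepal.Length that are associated with more than one value of Sepal.Width: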

library(dplyr)

iris %>%
    group_by(Sepal.Length) %>%
    summarise(n.uniq=n_distinct(Sepal.Width)) %>%
    filter(n.uniq > 1)

Normally I would write something like this:

not.uniq.per.group <- function(data, group.var, uniq.var) {
    data %>%
        group_by(group.var) %>%
        summarise(n.uniq=n_distinct(uniq.var)) %>%
        filter(n.uniq > 1)
}

However, this approach throws an error because dplyr uses non-standard evaluation. How should this function be written?

Answer 1

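You need to use the standard evaluation versions of the dplyr functions (just append '_' to the function names, i.e. group_by_ and summarise_) and pass strings to your function, which you then need to turn into symbols. To parameterise the argument of summarise_, you need to use interp(), which is defined in the lazyeval package. Concretely:

library(dplyr)
library(lazyeval)

not.uniq.per.group <- function(df, grp.var, uniq.var) {
    df %>%
        group_by_(grp.var) %>%
        summarise_(n_uniq=interp(~n_distinct(v), v=as.name(uniq.var))) %>%
        filter(n_uniq > 1)
}

not.uniq.per.group(iris, "Sepal.Length", "Sepal.Width")

Note that in recent versions of dplyr, the standard evaluation versions of the dplyr functions have been "soft deprecated" in favor of non-standard evaluation.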

See the Programming with dplyr vignette for more information on working with non-standard evaluation.
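For readers who still want to pass column names as strings without the underscore variants, one option is the `.data` pronoun, which dplyr makes available inside its data-masking verbs. The following is a minimal sketch, not part of the original answer; the function name `not_uniq_per_group_chr` is hypothetical.

```r
library(dplyr)

# Hypothetical helper: column names arrive as strings and are looked up
# in the data frame through the .data pronoun.
not_uniq_per_group_chr <- function(df, grp.var, uniq.var) {
  df %>%
    group_by(.data[[grp.var]]) %>%
    summarise(n_uniq = n_distinct(.data[[uniq.var]])) %>%
    filter(n_uniq > 1)
}

res <- not_uniq_per_group_chr(iris, "Sepal.Length", "Sepal.Width")
```

Subsetting `.data` with `[[` only ever refers to a column of the data frame, so a variable of the same name in the calling environment is not picked up by accident.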

Answer 2

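Like the old dplyr versions up to 0.5, the new dplyr has facilities for both standard evaluation (SE) and non-standard evaluation (NSE). But they are expressed differently than before.

If you want an NSE function, you pass bare expressions and use enquo to capture them as quosures. If you want an SE function, just pass quosures (or symbols) directly, then unquote them in the dplyr call. Here is the SE solution to the question: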

library(tidyverse)
library(rlang)

f1 <- function(df, grp.var, uniq.var) {
   df %>%
       group_by(!!grp.var) %>%
       summarise(n_uniq = n_distinct(!!uniq.var)) %>%
       filter(n_uniq > 1)  
}

a <- f1(iris, quo(Sepal.Length), quo(Sepal.Width))
b <- f1(iris, sym("Sepal.Length"), sym("Sepal.Width"))
identical(a, b)
#> [1] TRUE

Note how the SE version lets you work with string arguments: just convert them to symbols first using sym(). For more information, see the Programming with dplyr vignette.

Answer 3

In the devel version of dplyr (soon to be released as 0.6.0), we can also use a slightly different syntax for passing the variables.

f1 <- function(df, grp.var, uniq.var) {
   grp.var <- enquo(grp.var)
   uniq.var <- enquo(uniq.var)

   df %>%
       group_by(!!grp.var) %>%
       summarise(n_uniq = n_distinct(!!uniq.var)) %>%
       filter(n_uniq > 1)
}

res2 <- f1(iris, Sepal.Length, Sepal.Width) 
res1 <- not.uniq.per.group(iris, "Sepal.Length", "Sepal.Width")
identical(res1, res2)
#[1] TRUE

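Here enquo takes the arguments and returns them as quosures (similar to substitute in base R) by evaluating the function arguments lazily. Inside summarise, we ask for them to be unquoted (!! or UQ) so that they are evaluated.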
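The comparison with substitute can be made concrete with a small sketch. The helper names `show_substitute`, `show_enquo` and `capture` are hypothetical and only serve to illustrate the point; the rlang functions used (`enquo`, `as_label`, `is_quosure`, `eval_tidy`) are assumed to be available from a reasonably recent rlang.

```r
library(rlang)

# Base R: substitute() captures the unevaluated argument expression.
show_substitute <- function(x) deparse(substitute(x))

# rlang: enquo() captures the expression together with its environment.
show_enquo <- function(x) as_label(enquo(x))

show_substitute(Sepal.Length)  # "Sepal.Length"
show_enquo(Sepal.Length)       # "Sepal.Length"

# The captured quosure can be evaluated later against a data frame,
# which is roughly what dplyr does once the quosure is unquoted with !!.
capture <- function(x) enquo(x)
q <- capture(Sepal.Length)
is_quosure(q)                  # TRUE
vals <- eval_tidy(q, data = iris)
```

Both helpers recover the same label, but the quosure additionally remembers where the expression came from, which is why it can be unquoted safely inside another function.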

Answer 4

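In the current version of dplyr (0.7.4), the standard evaluation versions of the functions (with '_' appended to the function name, e.g. group_by_) are deprecated. Instead, you should rely on tidyeval when writing functions.

Here is an example of how your function would look then: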

# definition of your function
not.uniq.per.group <- function(data, group.var, uniq.var) {
  # enquotes variables to be used with dplyr-functions
  group.var <- enquo(group.var)
  uniq.var <- enquo(uniq.var)

  # use '!!' before parameter names in dplyr-functions
  data %>%
    group_by(!!group.var) %>%
    summarise(n.uniq=n_distinct(!!uniq.var)) %>%
    filter(n.uniq > 1)
}

# call of your function
not.uniq.per.group(iris, Sepal.Length, Sepal.Width)

If you want to learn about the details, there is an excellent vignette by the dplyr team on how this works.

Answer 5

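I wrote a function in the past that does something similar to what you are doing, except that it explores all the columns outside the primary key and looks for multiple unique values per group.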

find_dups = function(.table, ...) {
  require(dplyr)
  require(tidyr)
  # get column names of primary key
  pk <- .table %>% select(...) %>% names
  other <- names(.table)[!(names(.table) %in% pk)]
  # group by primary key,
  # get number of rows per unique combo,
  # filter for duplicates,
  # get number of distinct values in each column,
  # gather to get df of 1 row per primary key, other column,
  # filter for where a columns have more than 1 unique value,
  # order table by primary key
  .table %>%
    group_by(...) %>%
    mutate(cnt = n()) %>%
    filter(cnt > 1) %>%
    select(-cnt) %>%
    summarise_each(funs(n_distinct)) %>%
    gather_('column', 'unique_vals', other) %>%
    filter(unique_vals > 1) %>%
    arrange(...) %>%
    return
  # Final dataframe:
  ## One row per primary key and column that creates duplicates.
  ## Last column indicates how many unique values of
  ## the given column exist for each primary key.
}

This function also works with the pipe operator:

dat %>% find_dups(key1, key2)
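Since summarise_each(), funs() and gather_() have been deprecated in later dplyr and tidyr releases, the same idea can be sketched with across() and pivot_longer(). This is an assumption-laden rewrite rather than the author's code: the name `find_dups_across` is hypothetical, and it assumes dplyr 1.0 or later and tidyr 1.0 or later.

```r
library(dplyr)
library(tidyr)

# Hypothetical modern variant of find_dups(); primary key columns go in `...`.
find_dups_across <- function(.table, ...) {
  # column names of the primary key
  pk <- .table %>% select(...) %>% names()

  .table %>%
    group_by(...) %>%
    # keep only keys that occur more than once
    filter(n() > 1) %>%
    # number of distinct values in every non-key column, per key
    summarise(across(everything(), n_distinct), .groups = "drop") %>%
    # one row per key and non-key column
    pivot_longer(-all_of(pk), names_to = "column", values_to = "unique_vals") %>%
    # keep columns that take more than one value for a key
    filter(unique_vals > 1) %>%
    arrange(...)
}

res <- find_dups_across(iris, Species)
```

With iris and Species as the key, every measurement column varies within each species, so the result has one row per species and measurement column.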

Answer 6

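You can avoid lazyeval by using do to call an anonymous function and then using get. This solution can be used more generally to compute multiple aggregations. I usually write the function separately.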

library(dplyr)

not.uniq.per.group <- function(df, grp.var, uniq.var) {
  df %>%
    group_by_(grp.var) %>%
    do((function(., uniq.var) {
      with(., data.frame(n_uniq = n_distinct(get(uniq.var))))
    }      
  )(., uniq.var)) %>%
  filter(n_uniq > 1)
}

not.uniq.per.group(iris, "Sepal.Length", "Sepal.Width")

Answer 7

Here is the way to do it with the curly-curly {{ pseudo-operator, available from rlang 0.4:

library(dplyr)

not.uniq.per.group <- function(data, group.var, uniq.var) {
  data %>%
    group_by({{group.var}}) %>%
    summarise(n.uniq=n_distinct({{uniq.var}})) %>%
    filter(n.uniq > 1)
}

iris %>% not.uniq.per.group(Sepal.Length, Sepal.Width)
#> # A tibble: 25 x 2
#>    Sepal.Length n.uniq
#>           <dbl>  <int>
#>  1          4.4      3
#>  2          4.6      4
#>  3          4.8      3
#>  4          4.9      5
#>  5          5        8
#>  6          5.1      6
#>  7          5.2      4
#>  8          5.4      4
#>  9          5.5      6
#> 10          5.6      5
#> # ... with 15 more rows
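As a quick check that {{ }} is shorthand for the enquo() plus !! pattern used in the earlier answers, the two styles can be compared side by side. The function names `with_embrace` and `with_enquo` are hypothetical; this is a sketch assuming a dplyr version that supports {{ }}.

```r
library(dplyr)

# Embracing: capture and unquote in one step.
with_embrace <- function(data, group.var, uniq.var) {
  data %>%
    group_by({{ group.var }}) %>%
    summarise(n.uniq = n_distinct({{ uniq.var }})) %>%
    filter(n.uniq > 1)
}

# The older two-step pattern: capture with enquo(), unquote with !!.
with_enquo <- function(data, group.var, uniq.var) {
  group.var <- enquo(group.var)
  uniq.var <- enquo(uniq.var)
  data %>%
    group_by(!!group.var) %>%
    summarise(n.uniq = n_distinct(!!uniq.var)) %>%
    filter(n.uniq > 1)
}

a <- with_embrace(iris, Sepal.Length, Sepal.Width)
b <- with_enquo(iris, Sepal.Length, Sepal.Width)
same <- isTRUE(all.equal(a, b))
```

The embraced form is usually easier to read; the explicit enquo() form is still useful when the captured quosure has to be inspected or reused before it is unquoted.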

Pass arguments to dplyr functions
