Question

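I am trying to compute multiple summary statistics for a data frame.

I tried dplyr's {{1}}. However, the results come back as a single flat row, with the function names appended as suffixes.

Is there a direct way, using summarise_each or base R, to get the results in a data frame whose columns are the columns of the original data frame and whose rows are the summary functions?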


Answer 1

How about:

library(dplyr)  # provides %>%, group_by() and summarise_all()
library(tidyr)
gather(df) %>% group_by(key) %>% summarise_all(funs(min, max))

# A tibble: 3 × 3
    key   min   max
  <chr> <dbl> <dbl>
1     A     2    92
2     B   111   194
3     C     0     1
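
The result above has one row per column of df. If the orientation asked for in the question is preferred (summary functions as rows, variables as columns), a minimal base R sketch follows. It is not part of the original answer, and the example values in df are made up for illustration.

```r
# Small example data frame (values are assumptions for illustration)
df <- data.frame(A = c(2, 40, 92),
                 B = c(111, 150, 194),
                 C = c(0, 1, 1))

# Each call returns a named vector; sapply() binds them into a matrix
# with one row per function and one column per variable
res <- sapply(df, function(x) c(min = min(x), max = max(x)))
res
```

Wrapping res in data.frame() turns the matrix into a data frame with the same layout.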

Answer 2

Why not simply use sapply together with summary?

sapply(df, summary)

which gives:

            A     B    C
Min.     1.00 112.0 0.00
1st Qu. 23.75 134.5 0.00
Median  57.00 148.5 1.00
Mean    50.15 149.9 0.55
3rd Qu. 77.50 167.2 1.00
Max.    94.00 191.0 1.00

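To get a data frame, simply wrap the sapply call in data.frame(): data.frame(sapply(df, summary)). If you want to keep the names of the summary statistics in a column, you can extract them with rownames(), e.g. df$rn <- rownames(df), or use the keep.rownames argument of data.table: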

library(data.table)
dt <- data.table(sapply(df, summary), keep.rownames = TRUE)

which gives:

> dt
        rn     A     B   C
1:    Min. 11.00 113.0 0.0
2: 1st Qu. 21.50 126.8 0.0
3:  Median 55.00 138.0 0.5
4:    Mean 53.65 145.2 0.5
5: 3rd Qu. 83.25 160.5 1.0
6:    Max. 98.00 193.0 1.0
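
The base R route mentioned above (wrapping the sapply call in data.frame() and copying the row names into a column) can be sketched as follows. The example data and the column name rn are assumptions chosen to mirror the data.table output.

```r
# Example data (assumed; same shape as the data used elsewhere in this thread)
set.seed(1)
df <- data.frame(A = sample(1:100, 20),
                 B = sample(110:200, 20),
                 C = sample(c(0, 1), 20, replace = TRUE))

# sapply() returns a matrix: one row per statistic, one column per variable
res <- data.frame(sapply(df, summary))

# Copy the statistic names into an ordinary column, then drop the row names
res$rn <- rownames(res)
rownames(res) <- NULL

# Put the name column first
res <- res[, c("rn", names(df))]
res
```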

Answer 3

Using data.frame as you suggested, along with the purrr library:

library(purrr)
out <- df %>% map(~summary(.)) %>% rbind.data.frame
row.names(out) <- c("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.")
####             A     B   C
#### Min.     7.00 110.0 0.0
#### 1st Qu. 36.75 132.5 0.0
#### Median  53.50 143.5 0.5
#### Mean    55.45 151.8 0.5
#### 3rd Qu. 82.00 167.0 1.0
#### Max.    99.00 199.0 1.0

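There you go. I should just mention that this code only works when the input data.frame contains exclusively numeric variables. If there is, for example, a character or factor variable, it will return an error, because the output of summary is completely different.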
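
One way to sidestep the numeric-only restriction is to drop the non-numeric columns before summarising. This is a sketch in base R, not part of the original answer. The example data frame and its grp column are assumptions.

```r
# Example data with one non-numeric column (assumed for illustration)
df <- data.frame(A = c(1, 5, 9),
                 B = c(110, 150, 190),
                 grp = c("x", "y", "z"),
                 stringsAsFactors = FALSE)

# Keep only the numeric columns; Filter() treats the data frame as a list of columns
num_df <- Filter(is.numeric, df)

# Rows are the summary statistics, columns are the numeric variables
out <- sapply(num_df, summary)
out
```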

Answer 4

This is not the only way, but you could use dplyr, together with tidyr and stringr (or similar tools for tidying up the characters), to reshape the data.frame as needed.

Answer 5

A way that does not use tidyr or dplyr:

library(magrittr)  # provides the %>% pipe used below
df <- data.frame(A = sample(1:100, 20), 
                 B = sample(110:200, 20), 
                 C = sample(c(0, 1), 20, replace = TRUE))
df %>% 
    lapply(summary) %>% 
    do.call("rbind", .)

Output:

  Min. 1st Qu. Median   Mean 3rd Qu. Max.
A    9    32.5   50.5  49.65   70.25   84
B  116   137.2  162.5 157.70  178.20  196
C    0     0.0    0.0   0.45    1.00    1

If you want to do this with dplyr, try:

library(dplyr)
library(tidyr)  # provides gather()
df %>% 
    gather(attribute, value) %>% 
    group_by(attribute) %>% 
    do(as.data.frame(as.list(summary(.$value))))

Output:

Source: local data frame [3 x 7]
Groups: attribute [3]

  attribute  Min. X1st.Qu. Median   Mean X3rd.Qu.  Max.
      <chr> <dbl>    <dbl>  <dbl>  <dbl>    <dbl> <dbl>
1         A     9     32.5   50.5  49.65    70.25    84
2         B   116    137.2  162.5 157.70   178.20   196
3         C     0      0.0    0.0   0.45     1.00     1

Answer 6

Thank you very much for your help! After some picking and choosing, I went with the following approach.

# Dataframe 
df = data.frame(A = sample(1:100, 20), 
                B = sample(110:200, 20), 
                C = sample(c(0,1), 20, replace = T))

# Add summary functions to a list
summaryFns = list(
  NA.n  = function(x) sum(is.na(x)),
  NA.percent = function(x) sum(is.na(x))/length(x),
  unique.n = function(x) ifelse(sum(is.na(x)) > 0, length(unique(x)) - 1, length(unique(x))),
  min = function(x) min(x, na.rm=TRUE),
  max = function(x) max(x, na.rm=TRUE))


# Summarise data frame with each function 
# Using dplyr:
library(dplyr)
sapply(summaryFns, function(fn){df %>% summarise_all(fn)})
#   NA.n NA.percent unique.n min max
# A 0    0          20       1   98 
# B 0    0          20       114 200
# C 0    0          2        0   1  

# Using base-r:
sapply(summaryFns, function(fn){sapply(df, fn)})
#     NA.n NA.percent unique.n min max
# A    0          0       20   1  98
# B    0          0       20 114 200
# C    0          0        2   0   1

I think this is the most direct and flexible approach. Further comments, modifications, and suggestions are appreciated.

dplyr - multiple summary functions
