Question

I have a dataset and want to count three cases:

Coins greater than 20
Coins equal to 20
Coins less than 20

Here is a sample from a larger dataset:

Plan   Year   Coins   Copay
 A     2018     20      10
 B     2014     15       5
 C     2012     30       0
 D     2017     30      10
 E     2018      5      10
 F     2018     20       0
 G     2018     20       0
 H     2016     20      10
 I     2014     10       3
 J     2017     20       7

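So I would like to get the following figures (based on the conditions above and the sample dataset):

20% (2 of the 10 plans meet the condition: C, D)
50% (5 of the 10 plans meet the condition: A, F, G, H, J)
30% (3 of the 10 plans meet the condition: B, E, I)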

Answer 1

I like cut for binning and table for counting. prop.table turns the counts into proportions.

prop.table(table(cut(your_data$Coins, breaks = c(-Inf, 19.5, 20.5, Inf))))

This gives you only the proportions. You can set custom labels in cut; see its help page for details.

Using Ell's sample data:

df <- data.frame("coins" = c(20,15,30,30,5,20,20,20,10,20))
prop.table(table(cut(df$coins, breaks = c(-Inf, 19.5, 20.5, Inf))))
# (-Inf,19.5] (19.5,20.5] (20.5, Inf] 
#         0.3         0.5         0.2

If you want the results as percentages rather than proportions, you can append * 100.
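As a minimal sketch of the two tweaks mentioned above (custom labels in cut and the * 100 conversion), something like the following should work. The label names and the variable names coins, bins and pct are chosen here for illustration and are not part of the original answer.

```r
# Sample Coins values from the question
coins <- c(20, 15, 30, 30, 5, 20, 20, 20, 10, 20)

# Bin into three labelled groups; the label text is illustrative only
bins <- cut(coins,
            breaks = c(-Inf, 19.5, 20.5, Inf),
            labels = c("below 20", "equal to 20", "above 20"))

# Convert the proportions to percentages
pct <- prop.table(table(bins)) * 100
pct
# bins
#    below 20 equal to 20    above 20 
#          30          50          20 
```

The breaks at 19.5 and 20.5 only isolate the value 20 because Coins is integer-valued in the sample.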

Answer 2

For the three conditions, we can do this with map

library(tidyverse)
map(c('>', "==", "<"), ~ df1 %>% 
               filter(get(.x)(Coins, 20)) %>%
                     pull(Plan))
#[[1]]
#[1] "C" "D"

#[[2]]
#[1] "A" "F" "G" "H" "J"

#[[3]]
#[1] "B" "E" "I"

If we need the proportions

map(c('>', "==", "<"), ~ df1 %>%
       filter(get(.x)(Coins, 20)) %>% 
       count(Plan) %>% 
       mutate(Prop = 100 *n/sum(n)) %>%
       select(-n))
#[[1]]
# A tibble: 2 x 2
#  Plan   Prop
#  <chr> <dbl>
#1 C      50.0
#2 D      50.0

#[[2]]
# A tibble: 5 x 2
#  Plan   Prop
#  <chr> <dbl>
#1 A      20.0
#2 F      20.0
#3 G      20.0
#4 H      20.0
#5 J      20.0

#[[3]]
# A tibble: 3 x 2
#  Plan   Prop
#  <chr> <dbl>
#1 B      33.3
#2 E      33.3
#3 I      33.3

If the OP meant grouping over the full dataset

df1 %>%
   group_by(grp = case_when(Coins < 20 ~ 'grp1', Coins ==20 ~ 'grp2', TRUE ~ 'grp3')) %>%
   summarise(Plan = toString(unique(Plan)), prop = n()) %>%
   mutate(prop = 100 * prop/sum(prop)) %>%
   ungroup %>%
   select(-grp) 
# A tibble: 3 x 2
#   Plan           prop
#   <chr>         <dbl>
#1 B, E, I        30.0
#2 A, F, G, H, J  50.0
#3 C, D           20.0

Answer 3

I use the length function as a very simple option

100*(length(df$coins[df$coins > 20]) /length(df$coins))
100*(length(df$coins[df$coins == 20])/length(df$coins))
100*(length(df$coins[df$coins < 20]) /length(df$coins))

which gives

> 100*(length(df$coins[df$coins > 20]) /length(df$coins))
[1] 20
> 100*(length(df$coins[df$coins == 20])/length(df$coins))
[1] 50
> 100*(length(df$coins[df$coins < 20]) /length(df$coins))
[1] 30

If you do this a lot, you can wrap it in a function, which you can then reuse for other columns (d) and/or other values of interest (p)

perc <- function(d, p) {
  return(c(100 * (length(d[d > p]) / length(d)),
           100 * (length(d[d == p]) / length(d)),
           100 * (length(d[d < p]) / length(d))))
}
perc(df$coins, 20)
perc(df$coins, 90)
perc(df$copay, 10)

This is based on the reproducible data frame

df <- data.frame("plan" = LETTERS[1:10], "coins" = c(20,15,30,30,5,20,20,20,10,20), "copay" = c(10,5,0,10,10,0,0,10,3,7))

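Side note: given the variety of answers you received, I was curious to compare the approaches used. I think it is really great to see the different methods people come up with!

Running each 10,000 times on the data frame provided, there are some fairly large differences in speed (using the code as posted at the time of writing). Akrun's and Hpesoj626's solutions took 37 and 40 seconds respectively, Gregor's took 2.1 seconds, and mine took 0.61 seconds. Moreover, if it is wrapped into a function as I suggest, 10,000 runs take only 0.15 seconds.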
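For anyone who wants to repeat this kind of comparison, a rough sketch using base R's system.time follows. The helper names by_cut and by_length are hypothetical and introduced only for this illustration; they are not the exact code that was timed above, and the elapsed times will differ from machine to machine.

```r
coins <- c(20, 15, 30, 30, 5, 20, 20, 20, 10, 20)

# Hypothetical helpers wrapping two of the approaches from this thread
by_cut <- function(x) {
  as.numeric(prop.table(table(cut(x, breaks = c(-Inf, 19.5, 20.5, Inf))))) * 100
}
by_length <- function(x, p = 20) {
  100 * c(length(x[x < p]), length(x[x == p]), length(x[x > p])) / length(x)
}

# Both helpers should agree before their speed is compared
stopifnot(isTRUE(all.equal(by_cut(coins), by_length(coins))))

# Elapsed seconds for 10,000 repetitions of each approach
n_runs <- 10000
time_cut    <- system.time(for (i in seq_len(n_runs)) by_cut(coins))
time_length <- system.time(for (i in seq_len(n_runs)) by_length(coins))
elapsed <- c(cut = time_cut[["elapsed"]], length = time_length[["elapsed"]])
elapsed
```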

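Gregor's uses fewer characters and is therefore a shorter script, and personally I find it very elegant (although if you do this many times for different values or columns, a function would be the shortest approach). My only concern is how it handles continuous data: imagine coins could take the value 20.0000000000001; then you would have to code the breaks as ... -Inf, 19.99999999999, 20.0000000000001, Inf ... In other words, you have to be very careful about how you implement it.

As Gregor pointed out, my version would need more modification if you wanted more intervals.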
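One possible way around the break-tuning concern is to compare against the threshold directly and classify each value by the sign of the difference. This is only a sketch: the value 20.0000000000001 is inserted into the sample as a hypothetical continuous observation, and the names threshold, side and pct are made up for the example.

```r
# Sample data with one hypothetical near-20 continuous value (7th element)
coins <- c(20, 15, 30, 30, 5, 20, 20.0000000000001, 20, 10, 20)
threshold <- 20

# sign() gives -1, 0 or 1 for below, equal to, or above the threshold
side <- factor(sign(coins - threshold),
               levels = c(-1, 0, 1),
               labels = c("below", "equal", "above"))

pct <- prop.table(table(side)) * 100
pct
# side
# below equal above 
#    30    40    30 
```

With breaks at 19.5 and 20.5, the same near-20 value would land in the middle bin instead. Exact equality on floating-point data has its own pitfalls, so a tolerance may be preferable when values come from arithmetic rather than direct entry.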

Answer 4

I think akrun has covered everything you are looking for. But inspired by Gregor's answer, you can also use findInterval, and then do some of the magic akrun did.

df1 <- df %>% mutate(Group = findInterval(Coins, c(20, 20.5)))
df1 <- df1 %>% left_join(df1 %>% 
                    group_by(Group) %>% 
                    summarise(n = n()) %>% 
                    mutate(Prop = n / sum(n) * 100)) %>%
  select(-Group, -n)
df1

#    Plan Year Coins Copay Prop
# 1     A 2018    20    10   50
# 2     B 2014    15     5   30
# 3     C 2012    30     0   20
# 4     D 2017    30    10   20
# 5     E 2018     5    10   30
# 6     F 2018    20     0   50
# 7     G 2018    20     0   50
# 8     H 2016    20    10   50
# 9     I 2014    10     3   30
# 10    J 2017    20     7   50
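To make the grouping step easier to follow, here is a small base R sketch of what findInterval returns for the sample Coins values with the same break points. The variable names coins, group and pct are illustrative and not part of the answer's pipeline.

```r
coins <- c(20, 15, 30, 30, 5, 20, 20, 20, 10, 20)

# 0: below 20; 1: from 20 up to (not including) 20.5; 2: 20.5 or above
group <- findInterval(coins, c(20, 20.5))
group
# [1] 1 0 2 2 0 1 1 1 0 1

# Share of rows in each group, as percentages
pct <- prop.table(table(group)) * 100
pct
# group
#  0  1  2 
# 30 50 20 
```

As with the cut breaks, the upper break of 20.5 relies on Coins being integer-valued.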

You can also split on Prop and use enframe to get the list of plans that share the same Prop.

df1 %>% split(.$Prop) %>%
  enframe() %>% 
  mutate(Plan = map(value, ~toString(paste(.x$Plan)))) %>% 
  unnest(Plan) %>%
  select(-value) %>%
  rename(Prop = name) %>%
  select(Plan, Prop)

#   Plan          Prop 
#   <chr>         <chr>
# 1 C, D          20   
# 2 B, E, I       30   
# 3 A, F, G, H, J 50
