Question

Let this be my data:

my.data<-data.frame(name=c("a","b","b","c","c","c"))

I need a variable that gives, for each name, its relative frequency in the dataset. It would look like this:

  name    target
1    a 0.1666667
2    b 0.3333333
3    b 0.3333333
4    c 0.5000000
5    c 0.5000000
6    c 0.5000000

I tried computing dummy variables for each name and then used these dummies to calculate new variables that indicate the relative frequency of each name in the dataset. See below:

temp_dummies<-data.frame(spatstat::dummify(my.data$name))
my.data<-cbind.data.frame(my.data, temp_dummies)
rm(temp_dummies)

library(dplyr)  # provides the %>% pipe used below

my.data %>%
  dplyr::mutate(a_per=mean(a),
                b_per=mean(b),
                c_per=mean(c)) -> my.data

Now I need to extract the relative frequencies for each name and aggregate it back to get my target variable. I guess I should do something like this below but I don't know what to mutate.

my.data %>%
  dplyr::group_by(name) %>%
  dplyr::mutate(...) -> my.data

Questions:

How would I get my target variable using dplyr? Am I on the right track?
Is there an easier way to achieve the same result?
Might it be possible to write a function that does all of this stuff automatically? It seems like a pretty standard problem that we should be able to fix by simply applying a function(x) to name.

Answer 1

With base R, you can use the following one-liner:

my.data$target <- (table(my.data$name)/nrow(my.data))[ my.data$name ]

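Explanation, with the code split over a few lines:

We use the table function to get the number of occurrences of each name and divide it by the number of rows in the data frame, obtained with nrow. You can then look up each row's name in the resulting table. That value is stored in the corresponding row of the new column.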

t <- table(my.data$name)/nrow(my.data)
my.data$target <- t[ my.data$name ]
my.data

  name    target
1    a 0.1666667
2    b 0.3333333
3    b 0.3333333
4    c 0.5000000
5    c 0.5000000
6    c 0.5000000
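
On the question of whether this can be wrapped in a function: the lookup above could be packaged roughly as follows. This is only a sketch; the name rel_freq is made up for illustration and is not from any package. It converts the input to character first so that the lookup is by name whether the column is a factor or a character vector, and it returns a plain numeric vector instead of a table object.

```r
# Hypothetical helper (not part of the original answer): for each element
# of x, return the share of elements in x that have the same value.
rel_freq <- function(x) {
  x <- as.character(x)            # look up by name for factors and characters alike
  freq <- table(x) / length(x)    # relative frequency of each distinct value
  as.numeric(freq[x])             # one value per element, table class dropped
}

my.data <- data.frame(name = c("a", "b", "b", "c", "c", "c"))
my.data$target <- rel_freq(my.data$name)
my.data
```

Note that table drops NA by default, so any NA in x would get NA as its relative frequency in this sketch.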

Answer 2

We can use add_count to get the count of each name and then divide it by the number of rows using n(). Note that the result is stored in the column n rather than in a column named target.

library(dplyr)

my.data %>%
   add_count(name) %>%
   mutate(n = n/n())

#  name      n
#  <fct> <dbl>
#1 a     0.167
#2 b     0.333
#3 b     0.333
#4 c     0.5  
#5 c     0.5  
#6 c     0.5
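
For the group_by attempt in the question, one possible way to fill in the mutate call is sketched below. Inside a grouped mutate, n() returns the size of the current group, so the total number of rows has to come from outside the grouping; here it is stored beforehand in total_rows, a variable name chosen only for this example.

```r
library(dplyr)

my.data <- data.frame(name = c("a", "b", "b", "c", "c", "c"))

# Total number of rows, captured before grouping
total_rows <- nrow(my.data)

my.data <- my.data %>%
  group_by(name) %>%
  mutate(target = n() / total_rows) %>%  # n() is the group size here
  ungroup()

my.data
```

This keeps the column name target from the question and drops the grouping again, so later steps operate on the whole data frame.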

Calculate relative frequencies of factor levels in dataset
