Count the unique elements in each column of a data frame

Time: 2018-06-05 07:45:55

Tags: r

I have the data frame below, which has 59 columns:

circleid  name  birthday  56 more...
1         1    1       
2         2    10
2         5     68
2         1    10
1         1    1

The result I want:

circleid  distinct_name  distinct_birthday  56 more...
1         1              1       
2         3              2


quiz <- read.csv("https://raw.githubusercontent.com/pranavn91/PhD/master/Expt/circles-removed-na.csv", header = T)

What I have so far:

library(plyr)
ddply(quiz,~circleid,summarise,number_of_distinct_name=length(unique(name)))

This works for one column. How do I get the same result for every column of the data frame? My attempt:

columns <- colnames(quiz)

for (i in c(1:58)
{
final <- ddply(quiz,~circleid,summarise,number_of_distinct_name=length(unique(columns[i])))


}
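
The loop above likely fails for three reasons: the for() call is missing its closing parenthesis, columns[i] is a character string holding a column name (so unique() only ever sees one value), and final is overwritten on every pass. As a minimal sketch that avoids these problems without extra packages, base R's aggregate() can apply one function to every non-grouping column. The toy data frame and the helper name count_distinct are assumptions made for illustration; they are not part of the original question.

```r
# Toy data mirroring the example in the question (the real file has 59 columns)
quiz <- data.frame(
  circleid = c(1, 2, 2, 2, 1),
  name     = c(1, 2, 5, 1, 1),
  birthday = c(1, 10, 68, 10, 1)
)

# Hypothetical helper: number of distinct values in a vector
count_distinct <- function(x) length(unique(x))

# Apply the helper to every column except the grouping column
value_cols <- setdiff(names(quiz), "circleid")
result <- aggregate(quiz[value_cols], by = quiz["circleid"], FUN = count_distinct)
print(result)
```

Note that length(unique(x)) counts NA as a distinct value, which may or may not be what is wanted for the real data.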

3 Answers:

Answer 0 (score: 1)

This works with data.table:

library(data.table)
quiz <- fread("https://raw.githubusercontent.com/pranavn91/PhD/master/Expt/circles-removed-na.csv", header = T)
unique_vals <- quiz[, lapply(.SD, uniqueN), by = circleid]
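
To make the one-liner easier to follow, here is a small self-contained sketch of the same call on toy data, assuming the data.table package is installed. The toy table is an assumption that copies the example rows from the question rather than the real CSV.

```r
library(data.table)

# Toy data copied from the example in the question
quiz <- data.table(
  circleid = c(1, 2, 2, 2, 1),
  name     = c(1, 2, 5, 1, 1),
  birthday = c(1, 10, 68, 10, 1)
)

# .SD is the per-group subset holding every column except the `by` column;
# lapply() runs uniqueN() (a count of distinct values) on each of them
unique_vals <- quiz[, lapply(.SD, uniqueN), by = circleid]
print(unique_vals)
```

Because .SD excludes the grouping column by default, no column names need to be listed, which is what lets this scale to all 59 columns.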

Answer 1 (score: 1)

With the dplyr package this is simple. My original answer used length(unique(.)), but @akrun pointed me to n_distinct(.) in a comment.

library(dplyr)

quiz %>%
  group_by(circleid) %>%
  summarise_all(n_distinct)
## A tibble: 2 x 3
#  circleid  name birthday
#     <int> <int>    <int>
#1        1     1        1
#2        2     3        2

Data.

quiz <- read.table(text = "
circleid  name  birthday
1         1    1       
2         2    10
2         5     68
2         1    10
1         1    1
", header = TRUE)
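
In dplyr 1.0.0 and later, summarise_all() is marked as superseded in favour of across(). A hedged sketch of the equivalent call, assuming a dplyr version of at least 1.0.0 and using the same toy data as above:

```r
library(dplyr)

# Toy data copied from the example in the question
quiz <- data.frame(
  circleid = c(1, 2, 2, 2, 1),
  name     = c(1, 2, 5, 1, 1),
  birthday = c(1, 10, 68, 10, 1)
)

# Within a grouped summarise(), everything() selects all non-grouping columns,
# so n_distinct() is applied to each of them once per circleid
result <- quiz %>%
  group_by(circleid) %>%
  summarise(across(everything(), n_distinct))
print(result)
```

Both spellings should give the same counts; across() is simply the form the current dplyr documentation recommends.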

Answer 2 (score: 1)

You can use dplyr:

result<-quiz%>%
  group_by(circleid)%>%
  summarise_all(n_distinct)

microbenchmark comparison of data.table (x1) and dplyr (x2):

 microbenchmark(x1=quiz[, lapply(.SD, function(x) length(unique(x))), by = circleid],
                x2=quiz%>%
                  group_by(circleid)%>%
                  summarise_all(n_distinct),times=100)
Unit: milliseconds
 expr       min        lq      mean    median        uq       max neval cld
   x1 150.06392 155.02227 158.75775 156.49328 158.38887 224.22590   100   b
   x2  41.07139  41.90953  42.95186  42.54135  43.97387  49.91495   100  a