R data.table: group by without aggregation

Time: 2015-05-28 04:13:57

Tags: r data.table

How can I get a data.table in R to return just the set of grouping values, without applying any aggregation function? Say I have:

test<-data.table(x=c(rep("a",2),rep("b",3)),y=1:5)

I only want to get back:

a
b

When I use:

test[,,by=x]

I get back:

   x y
1: a 1
2: a 2
3: b 3
4: b 4
5: b 5

When I do:

test[,x,by=x]

I get back:

   x x
1: a a
2: b b

I know I can use:

test[,.(unique(x))]

But that doesn't seem like the right way to do it. Besides, what if I want to return two grouping columns?

4 Answers:

Answer 0: (score: 6)

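I do this by applying unique() to a data.table that contains only the subset of grouping columns I'm interested in. Passing a data.table to unique(), as shown below, triggers a call to unique.data.table(), which works just as well with two or more columns as it does with one: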

unique(test[, list(x)])  ## or unique(test[, "x", with=FALSE])
#    x
# 1: a
# 2: b

## Add another column to see that unique.data.table() works fine in that case as well 
test[, z:=c(1,1,1,2,2)]
unique(test[, .(x,z)])   ## .() is data.table shorthand for list()
#    x z
# 1: a 1
# 2: b 1
# 3: b 2
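The same idea can be wrapped in a small function when the grouping columns are held in a character vector. This is only a sketch: `distinct_groups` is a hypothetical helper name, not part of data.table, and it assumes every name in `cols` is a column of the table.

```r
library(data.table)

test <- data.table(x = c(rep("a", 2), rep("b", 3)), y = 1:5)
test[, z := c(1, 1, 1, 2, 2)]

# Hypothetical helper: one row per distinct combination of the columns in `cols`.
# with = FALSE makes data.table treat `cols` as a vector of column names.
distinct_groups <- function(dt, cols) {
  unique(dt[, cols, with = FALSE])
}

groups_x  <- distinct_groups(test, "x")          # one grouping column
groups_xz <- distinct_groups(test, c("x", "z"))  # two grouping columns
print(groups_x)
print(groups_xz)
```

Because the result is itself a data.table, it can be joined back to the original table or passed on to further `[` calls.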

Answer 1: (score: 2)

I wrote an R function to do this. It is available on my GitHub, but I'll also provide it here.

https://github.com/seanpili/R_PROC_TRANSPOSE

# This function mimics one of the features of SAS's PROC TRANSPOSE:
# allowing the user to do a group_by statement without aggregating the data,
# producing a dataframe
# where each row represents one of the groups that is produced, and the columns
# represent an observation in one of those groups.
library(dplyr)

transp <- function(input,uniq_var,compare_var,transposed_column_names = 'measurement'){
  if(is.factor(input[,uniq_var])){
    input[uniq_var] = sapply(input[uniq_var],as.character)
  }
  #' input is the dataframe/data.table that you want to perform the operation on, uniq_var is the variable that you are grouping by, compare_var is the variable that is being measured in each of the groups, and transposed_column_names is just an optional string for the user to call each of their columns (will be concatenated with an observation number, i.e. if you input 'distance', it will name the observations 'distance_1','distance_2','distance_3'... etc.)
  list_df <- input %>% group_by(input[,uniq_var]) %>% do(newcol = t(.[compare_var]))
  # it gets us the aggregates we want, BUT all of our columns are stored in a list 
  # instead of in separate columns.... so we need to create a new dataframe with the dimensions 
  # rows = the number of unique values that we are "grouping" by, noted here by uniq_var and the number of columns will be 
  # the maximum number of observations that are assigned to one of those groups.

  # so first we will create the skeleton of the matrix, and then use a user defined function 
  # to fill it with the correct values 
  new_df <- matrix(rep(NA,(max(count(input,input[,uniq_var])[,2])*dim(list_df)[1])),nrow = dim(list_df)[1])
  new_df <- data.frame(new_df)
  new_df <- cbind(list_df[,1],new_df)
  # I am writing a function inside of a function because for loops can take a while
  # when doing operations on multiple columns of a dataframe
  func2 <- function(input,thing = new_df){

    # here, we have a slightly easier case when we have the maximum number of children 
    # assigned to a household.
    # we subtract 1 from the number of columns because the first column holds the value of the 
    # unique value we are looking at, so we don't count it 

    if(length(input[2][[1]])==dim(thing)[2]-1){
      # we set the row corresponding to the specific unique value specified in our list_df of aggregated values
      # equal to the de-aggregated values, so that you have a column for each value like in PROC Transpose. 
      thing[which(thing[,1]==input[1]),2:ncol(thing)]= input[2][[1]]

      #new_df[which(new_df[,1]==input[1]),2:ncol(new_df)]= input[2][,1][[1]][[1]]
    }else{
      thing[which(thing[,1]==input[1]),2:(1+length(input[2][[1]]))]= input[2][[1]]
    }
    # if you're wondering why I have to use so many []'s it's because our list_df has 1 column 
    # of unique identifiers and the other column is actually a column of dataframes
    # each of which only has 1 row and 1 column, and that element is a list of the transposed values 
    # that we want to add to our new dataframe 
    # so essentially the first bracket 

    return(thing[which(thing[,1]==input[1]),])
  }

  quarter_final_output <- apply(list_df,1,func2)
  semi_final_output <- data.frame(matrix(unlist(quarter_final_output),nrow = length(quarter_final_output),byrow = T))
  #return(apply(list_df,1,func2))
  # this essentially names the columns according to the column names that a user would typically specify 
  # in a proc transpose. 
  name_trans <- function(trans_var=transposed_column_names,uniq_var = uniq_var,df){
    #print(trans_var)
    colnames(df)[1] = colnames(input[uniq_var])
    colnames(df)[2:length(colnames(df))] = c(paste0(trans_var,seq(1,(length(colnames(df))-1),1)))
    return(df)

  }
  final_output <- name_trans(transposed_column_names,uniq_var,semi_final_output)
  return(final_output)

}

Answer 2: (score: 0)

I agree with Josh that unique() is the right choice, but perhaps consider this approach:

> unique(test$x) 
[1] "a" "b"

Also, if you want the values as a row:

> rbind(unique(test$x))
     [,1] [,2]
[1,] "a"  "b" 

Or as a column:

> cbind(unique(test$x))
     [,1]
[1,] "a" 
[2,] "b" 

Answer 3: (score: 0)

Late to the party, but I understand what you're asking for.

There is no direct answer, but here is a workaround.

 test[,x,by=x][,x]  # Suppress one of the x's

   [1] "a" "b"

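invisible() should also work, as described here:

I'm using j for its side effect only, but I'm still getting data returned. How do I stop it? In this case j can be wrapped with invisible(); e.g., DT[,invisible(hist(colB)),by=colA] http://datatable.r-forge.r-project.org/datatable-faq.pdf

Or perhaps that is a solution as well.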

 test[,invisible(x),by=x]  # Still prints j, just hides its name!

   x V1
1: a  a
2: b  b
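The first workaround above returns a plain character vector. A possible variant that keeps a data.table, and extends to two grouping columns, is to count rows per group with .N and then drop the count. This is only a sketch: it does compute a throwaway aggregate, and the column z is an assumption added here for illustration.

```r
library(data.table)

test <- data.table(x = c(rep("a", 2), rep("b", 3)), y = 1:5)
test[, z := c(1, 1, 1, 2, 2)]  # assumed second grouping column

# Count rows per group, then keep only the grouping column(s).
groups_x  <- test[, .N, by = x][, .(x)]
groups_xz <- test[, .N, by = .(x, z)][, .(x, z)]
print(groups_x)
print(groups_xz)
```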

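However, the following may make you happily give up on the task:

Why is grouping by columns in the key faster than an ad hoc by?

Because each group is contiguous in RAM, thereby minimising page fetches, and memory can be copied in bulk (memcpy in C) rather than looping in C. http://datatable.r-forge.r-project.org/datatable-faq.pdf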
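To make the FAQ point concrete, here is a minimal sketch of setting a key and then extracting the distinct key values. The out-of-order sample data is an assumption chosen so that the reordering done by setkey() is visible. The sketch does not measure speed.

```r
library(data.table)

# Rows deliberately out of order so the effect of setkey() is visible.
test <- data.table(x = c("b", "a", "b", "a", "b"), y = 1:5)

# setkey() sorts the table by x in place and marks x as the key,
# so the rows of each group sit next to each other.
setkey(test, x)

print(key(test))  # name of the key column
print(test$x)     # all "a" rows first, then all "b" rows

# Distinct values of the key column(s), still without aggregating.
key_groups <- unique(test[, key(test), with = FALSE])
print(key_groups)
```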