Convert column values into a single row by id

Date: 2017-03-16 21:10:42

Tags: r tidyr

I want to convert the following dataset:

transaction_id productsku
1              SK0001
1              SK0002
2              AB0001
2              AC0001
2              AC0002
3              BC0001
4              BC0002

The ideal dataset is:

transaction_id x1       x2      x3
1              SK0001   SK0002
2              AB0001   AC0001  AC0002
3              BC0001
4              BC0002

So I tried to convert it with code, but it failed.

3 answers:

Answer 0 (score: 1)

Try splitting by transaction_id and then getting productsku for each group. You can then rbind the list, subsetting each element of the list to the maximum length so that shorter groups are padded with NA.

L = lapply(split(df, df$transaction_id), function(a) a$productsku)
max_length = max(lengths(L))
do.call(rbind, lapply(L, function(a) a[1:max_length]))
#  [,1]     [,2]     [,3]    
#1 "SK0001" "SK0002" NA      
#2 "AB0001" "AC0001" "AC0002"
#3 "BC0001" NA       NA      
#4 "BC0002" NA       NA
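
The snippet above assumes a data frame named df in long format (one row per transaction and product). A minimal sketch of that input, reconstructed by hand from the question's table (the integer and character column types are assumptions), together with the same split-and-rbind approach and column names added to the result:

```r
# Hypothetical reconstruction of the question's long-format data;
# the name `df` matches the variable used in the answers.
df <- data.frame(
  transaction_id = c(1L, 1L, 2L, 2L, 2L, 3L, 4L),
  productsku = c("SK0001", "SK0002", "AB0001", "AC0001",
                 "AC0002", "BC0001", "BC0002"),
  stringsAsFactors = FALSE  # keep SKUs as character, not factor
)

# Split the SKUs by transaction, then pad every group to the longest length
L <- split(df$productsku, df$transaction_id)
max_length <- max(lengths(L))
wide <- do.call(rbind, lapply(L, function(a) a[seq_len(max_length)]))

# Name the columns x1, x2, ... to match the layout asked for in the question
colnames(wide) <- paste0("x", seq_len(max_length))
```

The result is a character matrix whose row names are the transaction ids; wrap it in as.data.frame() if a data frame is needed.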

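Answer 1 (score: 0)

Here is one way. The idea is to paste the values of each group into a single string and then use separate to split them into different columns: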

library(tidyverse)
df %>% 
  group_by(transaction_id) %>%
  summarise(product=paste(productsku, collapse=", ")) %>%
  separate(product, c("x1", "x2", "x3"), sep=", ")

# A tibble: 4 × 4
  transaction_id     x1     x2     x3
*          <int>  <chr>  <chr>  <chr>
1              1 SK0001 SK0002   <NA>
2              2 AB0001 AC0001 AC0002
3              3 BC0001   <NA>   <NA>
4              4 BC0002   <NA>   <NA>
Warning message:
Too few values at 3 locations: 1, 3, 4 
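
With tidyr 1.0.0 or later, the same reshaping can be sketched without pasting and re-splitting strings, which also avoids hard-coding the number of output columns. This is an alternative to the approach above, not part of the original answer; the data frame is rebuilt by hand from the question, and the helper column name col is an arbitrary choice:

```r
library(dplyr)
library(tidyr)

# Hypothetical reconstruction of the question's data
df <- data.frame(
  transaction_id = c(1L, 1L, 2L, 2L, 2L, 3L, 4L),
  productsku = c("SK0001", "SK0002", "AB0001", "AC0001",
                 "AC0002", "BC0001", "BC0002"),
  stringsAsFactors = FALSE
)

wide <- df %>%
  group_by(transaction_id) %>%
  # number the products within each transaction: x1, x2, ...
  mutate(col = paste0("x", row_number())) %>%
  ungroup() %>%
  # spread the numbered products into columns; missing cells become NA
  pivot_wider(names_from = col, values_from = productsku)
```

Because the column names come from the data, a transaction with more than three products would simply produce additional columns (x4, x5, and so on).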

Answer 2 (score: 0)

A simple and fast alternative using data.table in two steps:

library(data.table)

# convert df into a data.table (by reference)
  setDT(df)

# step 1: gather productsku values by transaction id
  temp <- df[, .(product = toString(productsku)), by = list(transaction_id)]

# step 2: separate productsku values into different columns
# toString() joins with ", ", so split on ", " to avoid leading spaces
  temp[, c("x1", "x2", "x3") := tstrsplit(product, ", ", fixed=TRUE, fill="")] # you can also use fill=NA

temp
#>    transaction_id                product     x1     x2     x3
#> 1:              1         SK0001, SK0002 SK0001 SK0002
#> 2:              2 AB0001, AC0001, AC0002 AB0001 AC0001 AC0002
#> 3:              3                 BC0001 BC0001
#> 4:              4                 BC0002 BC0002

Another fast alternative using dcast {data.table}, with a slightly different output:

# Using dcast (value.var is named explicitly so dcast does not have to guess it)
  dcast(df, transaction_id ~ productsku, value.var = "productsku")

#>    transaction_id AB0001 AC0001 AC0002 BC0001 BC0002 SK0001 SK0002
#> 1:              1     NA     NA     NA     NA     NA SK0001 SK0002
#> 2:              2 AB0001 AC0001 AC0002     NA     NA     NA     NA
#> 3:              3     NA     NA     NA BC0001     NA     NA     NA
#> 4:              4     NA     NA     NA     NA BC0002     NA     NA
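
If the x1, x2, x3 layout from the question is wanted rather than one column per SKU, dcast can also be combined with rowid(), which numbers the rows within each transaction. A minimal sketch, assuming the question's data rebuilt by hand under the hypothetical name dt:

```r
library(data.table)

# Hypothetical reconstruction of the question's data
dt <- data.table(
  transaction_id = c(1L, 1L, 2L, 2L, 2L, 3L, 4L),
  productsku = c("SK0001", "SK0002", "AB0001", "AC0001",
                 "AC0002", "BC0001", "BC0002")
)

# rowid() yields x1, x2, ... within each transaction_id;
# dcast then spreads those positions into columns, filling gaps with NA
wide <- dcast(
  dt,
  transaction_id ~ rowid(transaction_id, prefix = "x"),
  value.var = "productsku"
)
```

This does the reshaping in a single step and does not require knowing the maximum number of products per transaction in advance.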