Creating a new data frame in R

Date: 2014-11-04 07:50:40

Tags: r dataframe cluster-analysis reshape

I have data in R in this format:
customer_key    item_key    units
2669699            16865    1.00
2669699            16866    1.00
2669699            46963    2.00
2685256            55271    1.00
2685256            43458    1.00
2685256            54977    1.00
2685256             2533    1.00
2685256            55011    1.00
2685256            44785    2.00

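But I want the unique customer_key values as the first column (one row per customer), with the remaining columns named after the unique values of item_key and filled with the corresponding units, like this: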

customer_key       '16865'   '16866'  '46963'  '55271'   '43458'   '54977'    '2533'
    2669699          1.00     1.00     2.00     0.00      0.00      0.00       0.00
    2685256          0.00     0.00     0.00     1.00      1.00      1.00       1.00

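Please help me transform the data so I can run a cluster analysis.

4 answers:

Answer 0 (score: 3)

This is just a simple dcast task. Assuming df is your data set: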

library(reshape2)
dcast(df, customer_key ~ item_key , value.var = "units", fill = 0)
#   customer_key 2533 16865 16866 43458 44785 46963 54977 55011 55271
# 1      2669699    0     1     1     0     0     2     0     0     0
# 2      2685256    1     0     0     1     2     0     1     1     1
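
One caveat worth sketching: when several rows share the same customer_key/item_key pair, dcast falls back to counting rows unless an aggregation function is supplied. The example below uses a small hypothetical data frame (dup, not part of the question's data) and assumes that summing units is the desired behavior.

```r
library(reshape2)

# Hypothetical data with a repeated customer/item pair (customer 1, item 10)
dup <- data.frame(
  customer_key = c(1L, 1L, 1L, 2L),
  item_key     = c(10L, 10L, 20L, 10L),
  units        = c(1, 2, 1, 3)
)

# Sum units per customer/item pair instead of defaulting to a row count
wide <- dcast(dup, customer_key ~ item_key, value.var = "units",
              fun.aggregate = sum, fill = 0)
wide
```

If each customer/item pair appears only once, as in the question's sample, the call without fun.aggregate is sufficient.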

Answer 1 (score: 3)

Here is one way.

library(tidyr)

# mydf is the data frame from the question
spread(mydf, item_key, units, fill = 0)

#  customer_key 2533 16865 16866 43458 44785 46963 54977 55011 55271
#1      2669699    0     1     1     0     0     2     0     0     0
#2      2685256    1     0     0     1     2     0     1     1     1
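
In current tidyr releases spread() is superseded by pivot_wider(). A minimal sketch of the equivalent call, assuming tidyr 1.1.0 or later (for the scalar values_fill and the names_sort argument) and rebuilding the question's data as df so the block is self-contained:

```r
library(tidyr)

# The question's sample data
df <- data.frame(
  customer_key = c(rep(2669699L, 3), rep(2685256L, 6)),
  item_key     = c(16865L, 16866L, 46963L, 55271L, 43458L,
                   54977L, 2533L, 55011L, 44785L),
  units        = c(1, 1, 2, 1, 1, 1, 1, 1, 2)
)

# One row per customer, one column per item, missing combinations set to 0
wide <- pivot_wider(df,
                    names_from  = item_key,
                    values_from = units,
                    values_fill = 0,
                    names_sort  = TRUE)  # sort item columns, as spread() does
wide
```

The result is a tibble; wrap it in as.data.frame() if a plain data frame is needed.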

Answer 2 (score: 3)

Since the packages have already been covered (+1 to everyone), here are a couple of base R solutions to join the party:

xtabs

xtabs(units ~ customer_key + item_key, df)
#             item_key
# customer_key 2533 16865 16866 43458 44785 46963 54977 55011 55271
#      2669699    0     1     1     0     0     2     0     0     0
#      2685256    1     0     0     1     2     0     1     1     1
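
Note that xtabs returns a contingency table rather than a data frame. If a data frame with customer_key as an ordinary column is needed, one possible conversion is sketched below (the name wide is only illustrative):

```r
# The question's sample data
df <- data.frame(
  customer_key = c(rep(2669699L, 3), rep(2685256L, 6)),
  item_key     = c(16865L, 16866L, 46963L, 55271L, 43458L,
                   54977L, 2533L, 55011L, 44785L),
  units        = c(1, 1, 2, 1, 1, 1, 1, 1, 2)
)

tab <- xtabs(units ~ customer_key + item_key, df)

# Keep the two-way layout; customer keys end up as row names
wide <- as.data.frame.matrix(tab)

# Move the row names into a regular column
wide <- cbind(customer_key = as.integer(rownames(wide)), wide)
rownames(wide) <- NULL
wide
```

Calling as.data.frame(tab) directly would instead return the table in long format, which is why as.data.frame.matrix is used here.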

reshape

reshape(df, direction = "wide", idvar = "customer_key", timevar = "item_key")
#   customer_key units.16865 units.16866 units.46963 units.55271
# 1      2669699           1           1           2          NA
# 4      2685256          NA          NA          NA           1
#   units.43458 units.54977 units.2533 units.55011 units.44785
# 1          NA          NA         NA          NA          NA
# 4           1           1          1           1           2
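
Unlike the other approaches, reshape leaves NA where a customer never bought an item and prefixes the column names with "units.". A short follow-up sketch that fills the gaps with 0 and strips the prefix, assuming that a missing combination really means zero units:

```r
# The question's sample data
df <- data.frame(
  customer_key = c(rep(2669699L, 3), rep(2685256L, 6)),
  item_key     = c(16865L, 16866L, 46963L, 55271L, 43458L,
                   54977L, 2533L, 55011L, 44785L),
  units        = c(1, 1, 2, 1, 1, 1, 1, 1, 2)
)

wide <- reshape(df, direction = "wide",
                idvar = "customer_key", timevar = "item_key")

wide[is.na(wide)] <- 0                             # missing pairs become 0
names(wide) <- sub("^units\\.", "", names(wide))   # "units.16865" -> "16865"
rownames(wide) <- NULL                             # drop the leftover 1, 4 labels
wide
```

The item columns stay in order of first appearance in df; sort them afterwards if a fixed order matters.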

Answer 3 (score: 2)

library(dplyr); library(tidyr)
df2 <- df %>% arrange(item_key) %>% spread(item_key, units, fill=0)
df2
#   customer_key 2533 16865 16866 43458 44785 46963 54977 55011 55271
# 1      2669699    0     1     1     0     0     2     0     0     0
# 2      2685256    1     0     0     1     2     0     1     1     1

Data

df <- structure(list(customer_key = c(2669699L, 2669699L, 2669699L, 
2685256L, 2685256L, 2685256L, 2685256L, 2685256L, 2685256L), 
    item_key = c(16865L, 16866L, 46963L, 55271L, 43458L, 54977L, 
    2533L, 55011L, 44785L), units = c(1, 1, 2, 1, 1, 1, 1, 1, 
    2)), .Names = c("customer_key", "item_key", "units"), class = "data.frame", row.names = c(NA, 
-9L))