Melting a data frame - multiple columns - "Enhanced (new) functionality from data.tables"

Time: 2017-05-31 18:07:12

Tags: r data.table melt

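Update: I should have been clearer. I am trying to check out the enhanced functionality for reshaping with data.tables, described at https://cran.r-project.org/web/packages/data.table/vignettes/datatable-reshape.html . I have updated the title.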

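My dataset contains two sets of variables: Credit_Risk_Capital and Name_concentration. They are calculated under two methodologies, new and old. When I melt them using the data.table package, the variable names default to 1 and 2. How can I change them to Credit_Risk_Capital and Name_Concentration?

Here is the dataset: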

    library(data.table)

    df <- data.table(id = c(1:100),
                     Credit_risk_Capital_old = rnorm(100, mean = 400, sd = 60),
                     NameConcentration_old = rnorm(100, mean = 100, sd = 10),
                     Credit_risk_Capital_New = rnorm(100, mean = 200, sd = 10),
                     NameConcentration_New = rnorm(100, mean = 40, sd = 10))
    old <- c('Credit_risk_Capital_old', 'NameConcentration_old')
    new <- c('Credit_risk_Capital_New', 'NameConcentration_New')
    t1 <- melt(df, measure.vars = list(old, new), variable.name = "CapitalChargeType",
               value.name = c("old", "new"))

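Now, instead of having the elements in the CapitalChargeType column labelled 1 and 2, I would like them to be Credit_risk_Capital and NameConcentration. I can obviously change them in a subsequent step using a "match" function, but is there any way to do it within melt itself?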

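3 Answers:

Answer 0: (score: 2)

The issue here is that melt() doesn't know how to name the variables when there is more than one measure variable. So it resorts to simply numbering them.

David has already pointed out that there is a feature request. However, I will show two workarounds and compare them in terms of speed (plus the tidyr answer).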

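  1. The first approach is to melt() all measure variables (which keeps the variable names), create new variable names, and dcast() the temporary result again to end up with two value columns. This recast approach is also used by austensen.
  2. The second approach is what the OP asked for (melting two value columns simultaneously), but includes a simple way to rename the variables afterwards.

    Recast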

    library(data.table)   # CRAN version 1.10.4 used
    # melt all measure variables
    long <- melt(df, id.vars = "id")
    # split variables names
    long[, c("CapitalChargeType", "age") := 
           tstrsplit(variable, "_(?=(New|old)$)", perl = TRUE)] 
    dcast(long, id + CapitalChargeType ~ age)
    
          id   CapitalChargeType       New       old
      1:   1 Credit_risk_Capital 204.85227 327.57606
      2:   1   NameConcentration  34.20043 104.14524
      3:   2 Credit_risk_Capital 206.96769 416.64575
      4:   2   NameConcentration  30.46721  95.25282
      5:   3 Credit_risk_Capital 201.85514 465.06647
     ---                                            
    196:  98   NameConcentration  45.38833  90.34097
    197:  99 Credit_risk_Capital 203.53625 458.37501
    198:  99   NameConcentration  40.14643 101.62655
    199: 100 Credit_risk_Capital 203.19156 527.26703
    200: 100   NameConcentration  30.83511  79.21762
    

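    Note that each variable name is split at the last _, the one directly before the trailing old or New. This is achieved by using a regular expression with positive look-ahead: "_(?=(New|old)$)"

    Melting two columns and renaming variables

    Here, we pick up David's suggestion to use the patterns() function, which is equivalent to specifying a list of measure variables.

    As a side note: the order of the list (or of the patterns) determines the order of the value columns: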
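As a small illustration of that look-ahead split, the sketch below applies the same regular expression to the four column names from the question using base R's strsplit(). It assumes only that tstrsplit() behaves like a transposed strsplit(); the names `vars`, `parts`, `prefix` and `suffix` are made up for this example.

```r
# Column names from the question
vars <- c("Credit_risk_Capital_old", "NameConcentration_old",
          "Credit_risk_Capital_New", "NameConcentration_New")

# Split only at the underscore that is directly followed by "New" or "old"
# at the end of the string; the look-ahead (?=...) tests the suffix without
# consuming it, so only the underscore itself is removed
parts <- strsplit(vars, "_(?=(New|old)$)", perl = TRUE)

# First piece of every name, second piece of every name
prefix <- vapply(parts, `[`, character(1), 1L)
suffix <- vapply(parts, `[`, character(1), 2L)

print(data.frame(vars, prefix, suffix))
```

The earlier underscores inside Credit_risk_Capital are left alone because none of them is followed by New or old at the end of the string.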

    melt(df, measure.vars = patterns("New$", "old$"))
    
          id variable    value1    value2
      1:   1        1 204.85227 327.57606
      2:   2        1 206.96769 416.64575
      3:   3        1 201.85514 465.06647
      ...
    
    melt(df, measure.vars = patterns("old$", "New$"))
    
          id variable    value1    value2
      1:   1        1 327.57606 204.85227
      2:   2        1 416.64575 206.96769
      3:   3        1 465.06647 201.85514
      ...
    

    As the OP has already pointed out, melting with multiple measure variables

    long <- melt(df, measure.vars = patterns("old$", "New$"), 
         variable.name = "CapitalChargeType",
         value.name = c("old", "New")) 
    

    returns numbers instead of the variable names:

    str(long)
    
    Classes ‘data.table’ and 'data.frame':    200 obs. of  4 variables:
     $ id               : int  1 2 3 4 5 6 7 8 9 10 ...
     $ CapitalChargeType: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
     $ old              : num  328 417 465 259 426 ...
     $ New              : num  205 207 202 207 203 ...
     - attr(*, ".internal.selfref")=<externalptr>
    

    Fortunately, these are factors, so they can easily be changed by replacing the factor levels with the help of the forcats package:

    long[, CapitalChargeType := forcats::lvls_revalue(
      CapitalChargeType, 
      c("Credit_risk_Capital", "NameConcentration"))]
    long[order(id)]
    
          id   CapitalChargeType       old       New
      1:   1 Credit_risk_Capital 327.57606 204.85227
      2:   1   NameConcentration 104.14524  34.20043
      3:   2 Credit_risk_Capital 416.64575 206.96769
      4:   2   NameConcentration  95.25282  30.46721
      5:   3 Credit_risk_Capital 465.06647 201.85514
     ---                                            
    196:  98   NameConcentration  90.34097  45.38833
    197:  99 Credit_risk_Capital 458.37501 203.53625
    198:  99   NameConcentration 101.62655  40.14643
    199: 100 Credit_risk_Capital 527.26703 203.19156
    200: 100   NameConcentration  79.21762  30.83511
    

    Note that melt() numbers the variables in the order in which the columns appear in df.
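If forcats is not available, the same relabelling can probably be done with base R alone. The sketch below is an assumption-laden stand-in rather than the answer's method: `charge_type` imitates the numbered factor that melt() returns, and the new labels are derived from the measure column names in their df order, which is the order melt() numbers them in.

```r
# Stand-in for the numbered factor produced by melt() (illustrative data only)
charge_type <- factor(c("1", "2", "1", "2"))

# Measure columns in the order they appear in df; level "1" corresponds to the
# first one, level "2" to the second one
old_cols <- c("Credit_risk_Capital_old", "NameConcentration_old")

# Derive the labels by dropping the "_old" suffix and assign them as levels
levels(charge_type) <- sub("_old$", "", old_cols)

print(charge_type)
```

Deriving the labels from the column names avoids typing them by hand, but it only works as long as the old and New columns are listed in the same order.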

    reshape()

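    The stats package of base R has a reshape() function. Unfortunately, it doesn't accept regular expressions with positive look-ahead, so its automatic guessing of variable names can't be used. Instead, all relevant parameters have to be specified explicitly: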

    old <- c('Credit_risk_Capital_old', 'NameConcentration_old')
    new <- c('Credit_risk_Capital_New', 'NameConcentration_New')
    reshape(df, varying = list(old, new), direction = "long", 
            timevar = "CapitalChargeType",
            times = c("Credit_risk_Capital", "NameConcentration"),
            v.names = c("old", "New"))
    
          id   CapitalChargeType       old       New
      1:   1 Credit_risk_Capital 367.95567 194.93598
      2:   2 Credit_risk_Capital 467.98061 215.39663
      3:   3 Credit_risk_Capital 363.75586 201.72794
      4:   4 Credit_risk_Capital 433.45070 191.64176
      5:   5 Credit_risk_Capital 408.55776 193.44071
     ---                                            
    196:  96   NameConcentration  93.67931  47.85263
    197:  97   NameConcentration 101.32361  46.94047
    198:  98   NameConcentration 104.80926  33.67270
    199:  99   NameConcentration 101.33178  32.28041
    200: 100   NameConcentration  85.37136  63.57817
    

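    Benchmark

    The benchmark includes all 4 approaches discussed so far:

    • tidyr, modified to use a regular expression with positive look-ahead,
    • recast,
    • melt() of multiple value variables, and
    • reshape().

    The benchmark data consist of 100 K rows: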

    n_rows <- 100000L   # 100 K rows
    set.seed(1234L)
    df <- data.table(
      id = c(1:n_rows),
      Credit_risk_Capital_old = rnorm(n_rows, mean = 400, sd = 60),
      NameConcentration_old = rnorm(n_rows, mean = 100, sd = 10),
      Credit_risk_Capital_New = rnorm(n_rows, mean = 200, sd = 10),
      NameConcentration_New = rnorm(n_rows, mean = 40, sd = 10))
    

    For benchmarking, the microbenchmark package is used:

    library(magrittr)
    old <- c('Credit_risk_Capital_old', 'NameConcentration_old')
    new <- c('Credit_risk_Capital_New', 'NameConcentration_New')
    microbenchmark::microbenchmark(
      tidyr = {
        r_tidyr <- df %>% 
          dplyr::as_data_frame() %>%  
          tidyr::gather("key", "value", -id) %>% 
          tidyr::separate(key, c("CapitalChargeType", "age"), sep = "_(?=(New|old)$)") %>% 
          tidyr::spread(age, value)
      },
      recast = {
        r_recast <- dcast(
          melt(df, id.vars = "id")[
            , c("CapitalChargeType", "age") := 
              tstrsplit(variable, "_(?=(New|old)$)", perl = TRUE)], 
          id + CapitalChargeType ~ age)
      },
      m2col = {
        r_m2col <- melt(df, measure.vars = patterns("New$", "old$"), 
                        variable.name = "CapitalChargeType",
                        value.name = c("New", "old"))[
                          , CapitalChargeType := forcats::lvls_revalue(
                            CapitalChargeType, 
                            c("Credit_risk_Capital", "NameConcentration"))][order(id)]
      },
      reshape = {
        r_reshape <- reshape(df, varying = list(new, old), direction = "long", 
                             timevar = "CapitalChargeType",
                             times = c("Credit_risk_Capital", "NameConcentration"),
                             v.names = c("New", "old")
        )
      },
      times = 10L
    )
    
    Unit: milliseconds
        expr       min        lq      mean    median        uq       max neval
       tidyr 705.20364 789.63010 832.11391 813.08830 825.15259 1091.3188    10
      recast 215.35813 223.60715 287.28034 261.23333 338.36813  477.3355    10
       m2col  10.28721  11.35237  38.72393  14.46307  23.64113  154.3357    10
     reshape 143.75546 171.68592 379.05752 224.13671 269.95301 1730.5892    10
    

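    The timings show that melting two columns simultaneously with melt() is about 15 times faster than the second fastest approach, reshape(). Both recast variants fall behind because they each require two reshaping operations. The tidyr solution is particularly slow.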
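Timings like these are only comparable if the approaches return the same data. The following is a minimal sketch of such a check for the recast and the two-column melt() variants. It uses a small made-up table `dt` rather than the benchmark data, and base factor() in place of forcats to keep the dependencies down; it assumes the data.table package is installed.

```r
library(data.table)

# Small made-up table with the same column layout as in the question
set.seed(42L)
n <- 6L
dt <- data.table(
  id = seq_len(n),
  Credit_risk_Capital_old = rnorm(n, mean = 400, sd = 60),
  NameConcentration_old   = rnorm(n, mean = 100, sd = 10),
  Credit_risk_Capital_New = rnorm(n, mean = 200, sd = 10),
  NameConcentration_New   = rnorm(n, mean = 40, sd = 10)
)

# Recast: melt everything, split the variable names, cast back to two columns
long <- melt(dt, id.vars = "id")
long[, c("CapitalChargeType", "age") :=
       tstrsplit(variable, "_(?=(New|old)$)", perl = TRUE)]
r_recast <- dcast(long, id + CapitalChargeType ~ age, value.var = "value")

# Two-column melt, then relabel the numbered factor and sort like dcast() does
r_m2col <- melt(dt, measure.vars = patterns("New$", "old$"),
                variable.name = "CapitalChargeType",
                value.name = c("New", "old"))
r_m2col[, CapitalChargeType := as.character(factor(
  CapitalChargeType, labels = c("Credit_risk_Capital", "NameConcentration")))]
setorder(r_m2col, id, CapitalChargeType)

# Compare the columns one by one
same_values <- identical(r_recast$CapitalChargeType, r_m2col$CapitalChargeType) &&
  isTRUE(all.equal(r_recast$New, r_m2col$New)) &&
  isTRUE(all.equal(r_recast$old, r_m2col$old))
print(same_values)
```

Comparing column by column sidesteps differences in keys and attributes between the two results, which are irrelevant for this purpose.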

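Answer 1: (score: 1)

I'm not sure about using melt, but here is a way using tidyr.

Note that I changed the variable names to use . instead of _ to separate the old/new part of the name. This makes it easier to split the name into two variables, since there are already many underscores.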

library(tidyr)

df <- dplyr::data_frame(
  id = c(1:100),
  Credit_risk_Capital.old= rnorm(100, mean = 400, sd = 60),
  NameConcentration.old= rnorm(100, mean = 100, sd = 10),
  Credit_risk_Capital.new =rnorm(100, mean = 200, sd = 10),
  NameConcentration.new = rnorm(100, mean = 40, sd = 10)
)

df %>% 
  gather("key", "value", -id) %>% 
  separate(key, c("CapitalChargeType", "new_old"), sep = "\\.") %>% 
  spread(new_old, value)

#> # A tibble: 200 x 4
#>       id   CapitalChargeType       new       old
#> *  <int>               <chr>     <dbl>     <dbl>
#> 1      1 Credit_risk_Capital 182.10955 405.78530
#> 2      1   NameConcentration  42.21037  99.44172
#> 3      2 Credit_risk_Capital 184.28810 370.14308
#> 4      2   NameConcentration  60.92340 120.13933
#> 5      3 Credit_risk_Capital 191.07982 389.50818
#> 6      3   NameConcentration  25.81776  90.91502
#> 7      4 Credit_risk_Capital 193.64247 327.56853
#> 8      4   NameConcentration  32.71050  94.95743
#> 9      5 Credit_risk_Capital 208.63547 286.59351
#> 10     5   NameConcentration  40.76064 116.52747
#> # ... with 190 more rows
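The renaming described in this answer (a . instead of a _ before old/new) does not have to be typed by hand. As a hedged sketch, assuming the original column names from the question, base R's sub() can rewrite just the final separator and lower-case the suffix at the same time; `cols` and `renamed` are illustrative names only.

```r
# Original column names from the question
cols <- c("id", "Credit_risk_Capital_old", "NameConcentration_old",
          "Credit_risk_Capital_New", "NameConcentration_New")

# Replace the underscore before a trailing old/New with a dot;
# \\L (available with perl = TRUE) lower-cases the captured suffix
renamed <- sub("_(old|new)$", ".\\L\\1", cols,
               ignore.case = TRUE, perl = TRUE)

print(renamed)
```

The result could then be assigned back with names(df) <- renamed before calling gather() and separate().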

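Answer 2: (score: 0)

Although this question is old, an updated answer may help those who are directed to it by a search. In the most recent development version of data.table, there is a new measure function for melt, with which you can do: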

df <-data.table(
  id = c(1:100),
  Credit_risk_Capital_old= rnorm(100, mean = 400, sd = 60),
  NameConcentration_old= rnorm(100, mean = 100, sd = 10),
  Credit_risk_Capital_New =rnorm(100, mean = 200, sd = 10),
  NameConcentration_New = rnorm(100, mean = 40, sd = 10)
)

melt(df,
     id.vars = "id",
     measure(CapitalChargeType, value.name,
             pattern = "(.*)_(New|old)"))

Which gives the output:

        id   CapitalChargeType       old       New
     <int>              <char>     <num>     <num>
  1:     1 Credit_risk_Capital 409.89004 210.30058
  2:     2 Credit_risk_Capital 403.15172 197.26172
  3:     3 Credit_risk_Capital 374.90492 192.21152
  4:     4 Credit_risk_Capital 509.17491 195.39095
  5:     5 Credit_risk_Capital 429.48302 197.44441
 ---                                              
196:    96   NameConcentration  80.64747  37.61926
197:    97   NameConcentration 104.39483  13.86576
198:    98   NameConcentration 106.87475  23.15775
199:    99   NameConcentration 112.92373  44.51562
200:   100   NameConcentration 111.80915  38.40075
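To see what the pattern passed to measure() picks out of each column name, the two capture groups can be inspected with base R alone. This is only an illustration of the regular expression, not of measure() itself: the first group supplies the CapitalChargeType value and the second, matched against value.name, supplies the name of the output column. The object names are made up for this sketch.

```r
cols <- c("Credit_risk_Capital_old", "NameConcentration_old",
          "Credit_risk_Capital_New", "NameConcentration_New")
pattern <- "(.*)_(New|old)"

# regexec() returns the full match plus one entry per capture group
groups <- regmatches(cols, regexec(pattern, cols))

charge_type <- vapply(groups, `[`, character(1), 2L)  # first capture group
value_col   <- vapply(groups, `[`, character(1), 3L)  # second capture group

print(data.frame(cols, charge_type, value_col))
```

Because .* is greedy, the first group extends up to the last underscore that is followed by New or old, so the underscores inside Credit_risk_Capital stay part of the name.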

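The new version should arrive on CRAN in a while, but until then you can use the development version. I will try to update this answer when the version moves to CRAN.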