Compute a new column by factor level from two data.frames / data.tables

Time: 2018-06-14 13:26:40

Tags: r data.table

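I am trying to compute the values of a new column for a data.table dt. Part of the calculation comes from a data.frame df (it could also be a data.table; so far I have not needed it to be one).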

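How can I use values from two different objects to compute a new column wherever the factor level (here: sample) matches? I used to merge the two objects and then compute row by row, but that produces a lot of redundant data.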

Here is the data.frame, which has only 10 rows:

df

    sample scaling_factor
A1      A1      111956565
A2      A2       89869320
A3      A3      120925219
A4      A4      111757559
A5      A5       77319341
A6      A6       89403194
A7      A7      150214981
B8      B8      133885925
B9      B9       86536587
B10    B10      123574939


df <- structure(list(sample = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 
9L, 10L, 8L), .Label = c("A1", "A2", "A3", "A4", "A5", "A6", 
"A7", "B10", "B8", "B9"), class = "factor"), scaling_factor = c(111956565.427018, 
89869319.9348599, 120925219.4453, 111757558.886234, 77319340.5841949, 
89403194.1170576, 150214980.784589, 133885925.080984, 86536586.7136393, 
123574939.026597)), .Names = c("sample", "scaling_factor"), class = "data.frame", row.names = c("A1", 
"A2", "A3", "A4", "A5", "A6", "A7", "B8", "B9", "B10"))

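Here is the data.table, which has a few hundred thousand rows per sample (I had trouble with `<` characters in the output, so a reproducible dump is not provided here):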

setDT(dt)
    sample     contig_id product_reads_rpk
 1:     A1     contig_10        2000.00000
 2:     A1    contig_100          24.27184
 3:     A1   contig_1000        1713.90374
 4:     A1  contig_10000        2900.66225
 5:     A1 contig_100003        1713.94231
 6:     A1 contig_100004        8575.23511
 7:     A1 contig_100004       11059.32203
 8:     A2 contig_100009        6923.67400
 9:     A2 contig_100010        1285.30259
10:     A2 contig_100015          84.74576

dt[,product_rpm := product_reads_rpk/(df$scaling_factor/1000000), by = sample]

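I am trying to create the new column product_rpm in dt, using the value in df that corresponds to each sample. How do I do that? I get `longer object length is not a multiple of shorter object length`, but the shorter object should have length 1, e.g. the df entry for A1, right?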

1 answer:

Answer 0: (score: 1)

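I don't know of a way to do this without actually joining the two datasets, but if you join them the data.table way you can avoid creating redundant columns.

So, in your case, it is simply: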

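df <- data.table(df)
# Update join: scaling_factor is looked up in df for each matching sample.
# The parentheses keep the formula from the question: rpk / (scaling_factor / 1e6)
dt[df, product_rpm := product_reads_rpk / (scaling_factor / 1000000), on = "sample"]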
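
To illustrate why the attempt in the question warns and what the join does instead, here is a minimal sketch on a few rows copied from the question (rounded as printed there; it assumes the data.table package is installed). Inside `by = sample`, `df$scaling_factor` is still the full length-10 vector in every group, so R recycles it against that group's rows rather than picking one value per sample. The update join looks up the matching row of `df` for each row of `dt`, using the question's formula of reads per kilobase divided by the scaling factor in millions. The `match()` lookup is only a cross-check added for this sketch.

```r
library(data.table)

# Scaling factors for two samples, taken from the question (rounded)
df <- data.table(sample = c("A1", "A2"),
                 scaling_factor = c(111956565, 89869320))

# A few rows of the large table, also taken from the question
dt <- data.table(sample = c("A1", "A1", "A2"),
                 contig_id = c("contig_10", "contig_100", "contig_100009"),
                 product_reads_rpk = c(2000, 24.27184, 6923.674))

# Update join: for each row of dt, use the scaling_factor of the df row
# with the same sample. dt is modified by reference and scaling_factor
# itself is not added to dt.
dt[df, product_rpm := product_reads_rpk / (scaling_factor / 1e6), on = "sample"]

# Cross-check against an explicit per-row lookup with match()
expected <- dt$product_reads_rpk /
  (df$scaling_factor[match(dt$sample, df$sample)] / 1e6)
stopifnot(isTRUE(all.equal(dt$product_rpm, expected)))

print(dt)
```

Rows of `dt` whose sample has no match in `df` are left as NA in the new column, which can serve as a quick check that every sample has a scaling factor.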

A simple example:

library(data.table)

# Lookup table: one row per id
dt1 <- data.table(id = sample(1000:9999, size = 100),
                  size = sample(10000:99999, size = 100))

# Long table: every id appears 10 times
dt2 <- data.table(id = rep(dt1$id, 10),
                  group = rep(LETTERS[1:5], 200),
                  value = sample(1000:9999, size = 100 * 10, replace = TRUE))

# := adds metric to dt2 by reference; dt3 is just another name for dt2
dt3 <- dt2[dt1, metric := (value / size), on = "id"]
head(dt3)
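
The point about avoiding redundant columns can be checked on a tiny deterministic version of the same example. The sketch below uses made-up values (not from the question) and assumes data.table is installed. It confirms that the update join adds only the new column, and that the object returned by `:=` is the same table as `dt2`.

```r
library(data.table)

# Small hypothetical tables with hand-picked values
dt1 <- data.table(id = 1:3, size = c(10, 20, 40))
dt2 <- data.table(id = rep(1:3, each = 2),
                  value = c(5, 10, 20, 40, 80, 160))

dt3 <- dt2[dt1, metric := value / size, on = "id"]

# := works by reference: dt3 and dt2 refer to the same table
stopifnot(identical(address(dt2), address(dt3)))

# Only the new column was added; `size` from dt1 was not copied into dt2
stopifnot(identical(names(dt2), c("id", "value", "metric")))

# Each value was divided by the size of its own id
stopifnot(isTRUE(all.equal(dt2$metric, c(0.5, 1, 1, 2, 2, 4))))
```

Because the assignment happens in place, the `dt3 <-` step is optional; calling `dt2[dt1, metric := value / size, on = "id"]` on its own has the same effect on `dt2`.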