How do I join data.tables when one of them is a lookup table?

Time: 2014-05-04 19:31:05

Tags: r merge data.table

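I'm having trouble applying a simple data.table join example to a larger (10GB) data set. merge() works just fine on data.frames with the larger data set, but I'd like to take advantage of data.table's speed. Could anyone point out what I'm misunderstanding about data.table (and the error message in particular)?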

Here is a simple example (derived from this thread: Join of two data.tables fails).

library(data.table)

# The data of interest.
(DT <- data.table(id    = c(rep(1154:1155, 2), 1160),
                  price = c(1.99, 2.50, 15.63, 15.00, 0.75),
                  key   = "id"))

     id price
1: 1154  1.99
2: 1154 15.63
3: 1155  2.50
4: 1155 15.00
5: 1160  0.75

# Lookup table.
(lookup <- data.table(id      = 1153:1160, 
                      version = c(1,1,3,4,2,1,1,2), 
                      yr      = rep(2006, 8),
                      key     = "id"))

     id version   yr
1: 1153       1 2006
2: 1154       1 2006
3: 1155       3 2006
4: 1156       4 2006
5: 1157       2 2006
6: 1158       1 2006
7: 1159       1 2006
8: 1160       2 2006

# The desired table.  Note: lookup[DT] works as well.
DT[lookup, allow.cartesian = TRUE, nomatch = 0]

     id price version   yr
1: 1154  1.99       1 2006
2: 1154 15.63       1 2006
3: 1155  2.50       3 2006
4: 1155 15.00       3 2006
5: 1160  0.75       2 2006
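For reference, the same lookup can be sketched without keys by using the `on =` argument. This assumes a data.table version that supports `on =` (1.9.6 or later, which postdates this question); the toy data simply mirrors DT and lookup above.

```r
library(data.table)

# Toy data mirroring the question's DT and lookup (ids stored as integer on both sides)
DT <- data.table(id    = c(1154L, 1155L, 1154L, 1155L, 1160L),
                 price = c(1.99, 2.50, 15.63, 15.00, 0.75))
lookup <- data.table(id      = 1153:1160,
                     version = c(1, 1, 3, 4, 2, 1, 1, 2),
                     yr      = 2006)

# For each row of DT, fetch the matching row of lookup
res <- lookup[DT, on = "id"]

# Because lookup$id is unique here, the result has one row per row of DT
nrow(res) == nrow(DT)
```

The rows of `res` follow the order of DT rather than the key order, since no key is set in this sketch.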

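The larger data set consists of two data.frames: temp.3561 (the data set of interest) and temp.versions (the lookup data set). They have the same structure as DT and lookup (above), respectively. Using merge() works well, but my data.table attempt is clearly flawed: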

# Merge data.frames: works just fine
long.merged         <- merge(temp.versions, temp.3561, by = "id")

# Convert the data.frames to data.tables
DTtemp.3561         <- as.data.table(temp.3561)
DTtemp.versions     <- as.data.table(temp.versions)

# Merge the data.tables: doesn't work
setkey(DTtemp.3561, id)
setkey(DTtemp.versions, id)
DTlong.merged       <- merge(DTtemp.versions, DTtemp.3561, by = "id")

Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x),  : 
  Join results in 11277332 rows; more than 7946667 = max(nrow(x),nrow(i)). Check for duplicate 
key values in i, each of which join to the same group in x over and over again. If that's ok, 
try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the 
large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. 
Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-
help for advice.

DTtemp.versions has the same structure as lookup (in the simple example), and its key "id" consists of 779,473 unique values (no duplicates).

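DTtemp.3561 has the same structure as DT (in the simple example) plus a few other variables, but its key "id" has only 829 unique values despite having 7,946,667 observations (lots of duplicates).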

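Since I just want to add the version number and year from DTtemp.versions to each observation in DTtemp.3561, the merged data.table should have the same number of observations as DTtemp.3561 (7,946,667). Specifically, I don't understand why merge() generates "excess" observations when used on data.tables but not when used on data.frames.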
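One way to test the expectation above is to predict the join size from the key counts: each id contributes (rows on one side) × (rows on the other side). The sketch below uses hypothetical toy tables (`obs`, `lkp`), not the question's data, and also shows two checks that might be worth running on the real tables: whether the lookup key is truly unique, and whether both key columns are stored as the same type.

```r
library(data.table)

# Hypothetical toy tables, not the question's data
obs <- data.table(id = c(1L, 1L, 1L, 2L, 2L), price = c(5, 6, 7, 8, 9))
lkp <- data.table(id = c(1L, 1L, 2L, 3L), version = c(1, 9, 2, 3))  # id 1 appears twice

# 1. Is the lookup key really unique? A result of 0 means no duplicates.
first_dup <- anyDuplicated(lkp$id)

# 2. Are the key columns stored as the same type on both sides?
same_class <- identical(class(obs$id), class(lkp$id))

# 3. Predicted inner-join size: sum over ids of (rows in obs) * (rows in lkp)
per_obs   <- obs[, .(n_obs = .N), by = id]
per_lkp   <- lkp[, .(n_lkp = .N), by = id]
counts    <- merge(per_obs, per_lkp, by = "id")
predicted <- counts[, sum(n_obs * n_lkp)]

# 4. Actual inner-join size
actual <- nrow(merge(obs, lkp, by = "id", allow.cartesian = TRUE))
```

If the predicted size for the real tables exceeds the row count of the data set of interest, at least one id must be matching more than one lookup row.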

Similarly:

# Same error message, but with 12,055,777 observations
altDTlong.merged   <- DTtemp.3561[DTtemp.versions]

# Same error message, but with 11,277,332 observations
alt2DTlong.merged  <- DTtemp.versions[DTtemp.3561]

Including allow.cartesian = TRUE and nomatch = 0 does not drop the "excess" observations.
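Going by the error message, allow.cartesian = TRUE only permits the large result, and nomatch = 0 only drops rows that have no match; neither removes multiple matches. A minimal sketch of an alternative follows, assuming a data.table version with `on =` support: an update join that adds the lookup columns by reference, so the row count of the target table cannot change. The tables `obs` and `lkp` are hypothetical.

```r
library(data.table)

# Hypothetical toy tables, not the question's data
obs <- data.table(id = c(1L, 1L, 2L, 4L), price = c(5, 6, 8, 9))
lkp <- data.table(id = 1:3, version = c(1, 2, 3), yr = 2006)

# nomatch = 0 drops the rows of obs that have no partner in lkp (id 4 here)
inner <- lkp[obs, on = "id", nomatch = 0]

# Update join: add the lookup columns to a copy of obs by reference.
# The number of rows in `out` stays fixed; unmatched ids get NA.
out <- copy(obs)
out[lkp, on = "id", `:=`(version = i.version, yr = i.yr)]
```

The `i.` prefix refers to columns of the table passed in the `i` position (`lkp` here).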

Oddly, if I truncate the data set of interest to 10 observations, merge() works fine on both data.frames and data.tables.

# Merge short DF: works just fine
short.3561         <- temp.3561[-(11:7946667),]
short.merged       <- merge(temp.versions, short.3561, by = "id")

# Merge short DT
DTshort.3561       <- data.table(short.3561, key = "id")
DTshort.merged     <- merge(DTtemp.versions, DTshort.3561, by = "id")
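Note that the error quoted earlier is raised only when the result exceeds max(nrow(x), nrow(i)), so with a 10-row table the threshold is set by the much larger lookup table. A truncated merge that runs without error may therefore still contain repeated matches; comparing row counts is a more direct check. A small sketch with hypothetical toy tables (`short`, `lkp`):

```r
library(data.table)

# Hypothetical toy tables: a 2-row "short" table and a lookup with a repeated id
short <- data.table(id = c(1L, 2L), price = c(5, 8))
lkp   <- data.table(id = c(1L, 1L, 2L, 3L, 4L, 5L),
                    version = c(1, 9, 2, 3, 4, 5))

merged <- merge(lkp, short, by = "id")

# Compare the input and output row counts: id 1 matched twice,
# so the output is longer than the input even though no error was raised
c(input = nrow(short), merged = nrow(merged))
```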

I've been through the FAQ (http://datatable.r-forge.r-project.org/datatable-faq.pdf, 1.12 in particular). How would you suggest thinking about this?

0 answers:

No answers yet