Why does rbindlist not respect column names?

Asked: 2014-02-06 14:18:36

Tags: r data.table rbind

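I just discovered this bug, and then found that some people call it a "feature". It makes rbindlist behave differently from do.call("rbind", l), because rbind respects column names. Moreover, the documentation does not mention this completely unexpected behaviour anywhere. Is this really intentional?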

Code example:

> library(data.table)
> DT1 <- data.table(a=1, b=2)
> DT2 <- data.table(b=3, a=4)
> DT1
a b
1: 1 2
> DT2
b a
1: 3 4

I would expect an rbind of these to produce columns a = 1,4 and b = 2,3. That is what I get with both rbind.data.table and rbind.data.frame, although rbind.data.table generates a warning.

> rbind(DT1, DT2)
a b
1: 1 2
2: 4 3
Warning message:
In data.table::.rbind.data.table(...) :
Argument 2 has names in a different order. Columns will be bound by name for consistency with base. You can drop names (by using an unnamed list) and the columns will then be joined by position, or set use.names=FALSE. Alternatively, explicitly setting use.names to TRUE will remove this warning.
> rbind(as.data.frame(DT1), as.data.frame(DT2))
a b
1 1 2
2 4 3
> do.call('rbind', list(DT1, DT2))
a b
1: 1 2
2: 4 3
Warning message:
In data.table::.rbind.data.table(...) :
Argument 2 has names in a different order. Columns will be bound by name for consistency with base. You can drop names (by using an unnamed list) and the columns will then be joined by position, or set use.names=FALSE. Alternatively, explicitly setting use.names to TRUE will remove this warning.

However, rbindlist happily corrupts the data without any warning:

> rbindlist(list(DT1, DT2))
a b
1: 1 2
2: 3 4

1 Answer:

Answer 0: (score: 6)

This feature is now implemented in commit 1266 of v1.9.3. From NEWS:

o  'rbindlist' gains 'use.names' and 'fill' arguments and is now implemented 
   entirely in C. Closes #5249    
  -> use.names by default is FALSE for backwards compatibility (doesn't bind by 
     names by default)
  -> rbind(...) now just calls rbindlist() internally, except that 'use.names' 
     is TRUE by default, for compatibility with base (and backwards compatibility).
  -> fill by default is FALSE. If fill is TRUE, use.names has to be TRUE.
  -> At least one item of the input list has to have non-null column names.
  -> Duplicate columns are bound in the order of occurrence, like base.
  -> Attributes that might exist in individual items would be lost in the bound result.
  -> Columns are coerced to the highest SEXPTYPE, if they are different, if/when possible.
  -> And incredibly fast ;).
  -> Documentation updated in much detail. Closes DR #5158.

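So you can set use.names=TRUE to bind by name. By default it is FALSE, for backwards compatibility. Alternatively, you can use rbind(...), where use.names is TRUE by default, also for backwards compatibility.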
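
As a quick sketch of the difference, here is the question's data bound both ways. use.names is passed explicitly in both calls, since the default may differ between data.table versions; the variable names by_position and by_name are made up for illustration.

```r
library(data.table)

DT1 <- data.table(a = 1, b = 2)
DT2 <- data.table(b = 3, a = 4)

# Bind by position: DT2's first column (b) ends up under a
by_position <- rbindlist(list(DT1, DT2), use.names = FALSE)
# a = 1, 3 and b = 2, 4

# Bind by name: columns are matched on their names first
by_name <- rbindlist(list(DT1, DT2), use.names = TRUE)
# a = 1, 4 and b = 2, 3
```

Spelling out use.names at every call site is probably the safest habit whenever the input tables might not share a column order.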

See this post for more examples, and this post for benchmarks.

Examples:

1) Just set use.names=TRUE:

DT1 <- data.table(x=1, y=2)
DT2 <- data.table(y=1, x=2)

rbindlist(list(DT1,DT2), use.names=TRUE, fill=FALSE)
#    x y
# 1: 1 2
# 2: 2 1

DT1 <- data.table(x=1, y=2)
DT2 <- data.table(z=2, y=1)

# errors with fill=FALSE: the column names differ, so these can only be bound with fill=TRUE
rbindlist(list(DT1, DT2), use.names=TRUE, fill=FALSE)
# Error in rbindlist(list(DT1, DT2), use.names = TRUE, fill = FALSE) : 
    # Answer requires 3 columns whereas one or more item(s) in the input 
    # list has only 2 columns. ...

2) Duplicate column names are also bound in their order of occurrence:

DT1 <- data.table(x=1, x=2, y=10, y=20, y=30)
DT2 <- data.table(y=-10, x=-2, y=-20, x=-1, y=-30)

rbindlist(list(DT1,DT2), use.names=TRUE)

#     x  x   y   y   y
# 1:  1  2  10  20  30
# 2: -2 -1 -10 -20 -30

3) If you want to bind by name and fill in missing columns, use fill=TRUE:

DT1 <- data.table(x=1, y=2)
DT2 <- data.table(y=2, z=-1)

rbindlist(list(DT1, DT2), use.names=TRUE, fill=TRUE)
#     x y  z
# 1:  1 2 NA
# 2: NA 2 -1
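
The NEWS entry also says that columns of different types are coerced to the highest type. A minimal sketch of what that looks like, assuming a version of data.table that includes the change; the table names ints, dbls and chars are hypothetical:

```r
library(data.table)

ints  <- data.table(x = 1L)
dbls  <- data.table(x = 2.5)
chars <- data.table(x = "a")

# integer + double: the bound column is double
num_result <- rbindlist(list(ints, dbls))
class(num_result$x)   # "numeric"

# integer + character: the bound column is character
chr_result <- rbindlist(list(ints, chars))
chr_result$x          # "1" "a"
```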

HTH