"Illegal column names" error even though the column names are legal

Asked: 2017-10-18 17:29:49

Tags: r r-caret

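I'm wondering why I get this error. I can only reproduce it when the factor levels in my data frame are not legal column names, but then why does the same data work with the RF (randomForest) implementation?

I'm considering switching to ranger because it seems to run faster.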

library(caret)
library(ranger)
library(randomForest)

df <- data.frame(class = c(rep(c('A','B'), 10)), var1 = runif(20, 0,10), var2 = runif(20, 0,20), var3 = c(rep(c(' A','1 B', 'C'), 6), 'D','D'))
df

CTRL <- trainControl(method = "repeatedcv", 
                     number = 2, 
                     repeats = 1, 
                     verboseIter = TRUE,
                     classProbs = TRUE,
                     returnResamp = "final",
                     summaryFunction = twoClassSummary)

ranger_model <- caret::train(class ~ .,
                              df,
                              method = "ranger",
                              trControl = CTRL,
                              preProc = c("center", "scale"),
                              metric="ROC",
                              tuneGrid = expand.grid(.mtry=c(1,2)))

rf_model <- caret::train(class ~ .,
                              df,
                              method = "rf",
                              trControl = CTRL,
                              preProc = c("center", "scale"),
                              metric="ROC",
                              tuneGrid = expand.grid(.mtry=c(1,2)))

ranger_model
rf_model

ranger output:

+ Fold1.Rep1: mtry=1 
model fit failed for Fold1.Rep1: mtry=1 Error in parse.formula(formula, data) : 
Error: Illegal column names in formula interface. Fix column names or use alternative interface in ranger.

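Also, when I look at the ranger source code that raises the error, I don't understand why the condition evaluates to TRUE, because I don't get the same result when I run that code on my df: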

## Error if illegal column name
if (!all(make.names(independent_vars[!interaction_idx]) == independent_vars[!interaction_idx])) {
    stop("Error: Illegal column names in formula interface. Fix column names or use alternative interface in ranger.")
}

https://github.com/cran/ranger/blob/master/R/formula.R

When I run the same steps on my df:

formula <- 'class ~ .'
data <- df

f <- as.formula(formula)
t <- terms(f, data = data)

## Get dependent var(s)
response <- data.frame(eval(f[[2]], envir = data))
colnames(response) <- deparse(f[[2]])

## Get independent vars
independent_vars <- attr(t, "term.labels")
interaction_idx <- grepl(":", independent_vars)

## Error if illegal column name
if (!all(make.names(independent_vars[!interaction_idx]) == independent_vars[!interaction_idx])) {
    print("Error: Illegal column names in formula interface. Fix column names or use alternative interface in ranger.")
}

> !all(make.names(independent_vars[!interaction_idx]) == independent_vars[!interaction_idx])
## [1] FALSE

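Is it because the factor column is converted into a one-hot (dummy) encoded matrix that uses the factor levels as column names? Again, I'm not sure why that would work with RF but not with ranger.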
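One way to probe this hypothesis is to compare the term labels that the check above sees with the column names produced after dummy coding. The sketch below uses only base R and assumes that the formula interface expands factors roughly the way `model.matrix()` does; `var3` is wrapped in `factor()` explicitly because recent R versions no longer convert strings to factors by default.

```r
# Same toy data as in the question; var3 is made a factor explicitly
set.seed(1)
df <- data.frame(class = rep(c('A', 'B'), 10),
                 var1  = runif(20, 0, 10),
                 var2  = runif(20, 0, 20),
                 var3  = factor(c(rep(c(' A', '1 B', 'C'), 6), 'D', 'D')))

# Term labels of the original formula: just the raw column names
term_labels <- attr(terms(class ~ ., data = df), "term.labels")
print(term_labels)
print(all(make.names(term_labels) == term_labels))

# Column names after dummy coding the predictors (intercept dropped)
x <- model.matrix(~ ., data = df[, c("var1", "var2", "var3")])
expanded_names <- colnames(x)[-1]
print(expanded_names)

# Apply the same legality comparison to the expanded names
is_legal <- make.names(expanded_names) == expanded_names
print(data.frame(name = expanded_names, legal = is_legal))
```

If the expanded names are what ends up in the data passed to ranger, a level containing a space would produce a name that fails the `make.names()` comparison, even though `var1`, `var2` and `var3` themselves pass it.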

Any thoughts?

1 Answer:

Answer 0: (score: 1)

This should be fixed in caret 6.0-77. In your example, you also have to add the splitrule parameter to tuneGrid:

library(caret)
library(ranger)
library(randomForest)

df <- data.frame(class = c(rep(c('A','B'), 10)), var1 = runif(20, 0,10), var2 = runif(20, 0,20), var3 = c(rep(c(' A','1 B', 'C'), 6), 'D','D'))
df

CTRL <- trainControl(method = "repeatedcv", 
                     number = 2, 
                     repeats = 1, 
                     verboseIter = TRUE,
                     classProbs = TRUE,
                     returnResamp = "final",
                     summaryFunction = twoClassSummary)

ranger_model <- caret::train(class ~ .,
                             df,
                             method = "ranger",
                             trControl = CTRL,
                             preProc = c("center", "scale"),
                             metric="ROC",
                             tuneGrid = expand.grid(.mtry=c(1,2), .splitrule="gini"))

rf_model <- caret::train(class ~ .,
                         df,
                         method = "rf",
                         trControl = CTRL,
                         preProc = c("center", "scale"),
                         metric="ROC",
                         tuneGrid = expand.grid(.mtry=c(1,2)))

ranger_model
rf_model
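If upgrading caret is not an option, the error message itself suggests another route: "Fix column names". A minimal sketch of that idea, assuming the problem comes from the factor levels of `var3`, is to sanitize the levels with `make.names()` before training. This is an illustrative workaround, not part of the fix described above, and the renamed levels would have to be applied to any new data used for prediction as well.

```r
# Same toy data as in the question; var3 is made a factor explicitly
set.seed(1)
df <- data.frame(class = rep(c('A', 'B'), 10),
                 var1  = runif(20, 0, 10),
                 var2  = runif(20, 0, 20),
                 var3  = factor(c(rep(c(' A', '1 B', 'C'), 6), 'D', 'D')))

# Replace each level with a syntactically valid, unique name
levels(df$var3) <- make.names(levels(df$var3), unique = TRUE)
print(levels(df$var3))

# Check the dummy-coded column names that a formula interface would generate
expanded_names <- colnames(model.matrix(~ ., data = df[, c("var1", "var2", "var3")]))[-1]
all_legal <- all(make.names(expanded_names) == expanded_names)
print(all_legal)
```

With the levels cleaned this way, the expanded column names pass the same `make.names()` comparison that ranger's formula check uses.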