Question

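I have built a predictive model that uses a large number (around 30) of factor variables as independent variables. Because the dataset I am working with is far larger than my machine's RAM, I drew a sample from it for my training and test sets.

I now want to use the model to predict on the entire dataset. I pull in 1 million rows of the dataset at a time, and each time I find new levels of some factor variables that are not in my training and test sets, which prevents the model from making predictions.

Because there are so many independent factor variables (and so many observations overall), manually correcting each case is becoming a real pain.

One more wrinkle: there is no guarantee that the variables are in the same order in the full data frame and in the training/test sets, because my preprocessing changes the column order.

I would therefore like to write a function that:

1. Selects and orders the columns of the new data according to the configuration of my sampled data frame.
2. Loops through the sampled and new data frames and sets every factor level in the new data frame that does not exist in the corresponding column of the sampled data frame to Other.
3. If a factor level exists in my sample but not in the new data frame, creates that level (with no observations assigned) in the corresponding column of the new data frame.

I have got #1 working, but I don't know the best way to do #2 and #3. In any other language I would use for loops, but I understand that is frowned upon in R.

Here is a reproducible example: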

sampleData <- data.frame(abacus=factor(c("a","b","a","a","a")), montreal=factor(c("f","f","f","f","a")), boston=factor(c("z","y","z","z","q")))
dataset <- data.frame(florida=factor(c("e","q","z","d","b", "a")), montreal=factor(c("f","f","f","f","a", "a")), boston=factor(c("m","y","z","z","r", "f")), abacus=factor(c("a","b","z","a","a", "g")))

sampleData
  abacus montreal boston
1      a        f      z
2      b        f      y
3      a        f      z
4      a        f      z
5      a        a      q

dataset
  florida montreal boston abacus
1       e        f      m      a
2       q        f      y      b
3       z        f      z      z
4       d        f      z      a
5       b        a      r      a
6       a        a      f      g

sampleData <- sampleData[, order(names(sampleData))]
dataset <- dataset[, order(names(dataset))]
dataset <- dataset[, colnames(sampleData)]

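Here is how I would like dataset to look once the function has run (I don't really care about the final column order in dataset; I just assumed ordering would be necessary for the loop, or for whatever approach you think is best). Note that the column dataset$florida is dropped: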

dataset
  montreal boston abacus
1   f      Other  a
2   f      y      b
3   f      z      Other
4   f      z      a
5   a      Other  a
6   a      Other  Other

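Also note that the level 'q' of boston does not appear in dataset but does appear in sampleData. If we simply left 'q' out of the factor in dataset, the levels of the two factors would differ, so in dataset we need boston to include the level q, but with no actual observations assigned to it.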
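A minimal sketch of what "a level with no observations" looks like in practice, using the boston values from the example. The vector names sample_levels and x are made up for illustration.

```r
# Levels seen in the sample for the boston column
sample_levels <- c("q", "y", "z")

# Values from a new chunk; none of them is "q"
x <- c("y", "z", "z")

# Passing levels= explicitly keeps "q" as a level with zero observations
f <- factor(x, levels = sample_levels)

levels(f)  # [1] "q" "y" "z"
table(f)   # counts: q = 0, y = 1, z = 2
```

The level set, not the observed values, is what has to match between the sample and the new data.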

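Finally, note that because I am doing this on 30 variables at a time, I need a programmatic solution rather than one that reassigns factors using explicit column names.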

Answer 1

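This seems like it might work.

With this function, the boston column comes back with the new levels Other y z q, even though the level q has no values. Regarding your comment in the original question, the only way I have found to apply new factor levels effectively is with a for loop like yours, and so far it has worked well for me.

The function, findOthers():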

findOthers <- function(newData)  ## might want a second argument for sampleData
{
      ## take only those columns that are in 'sampleData'
    dset <- newData[, names(sampleData)]
      ## change the 'dset' columns to character
    dsetvals <- sapply(dset, as.character)
      ## change the 'sampleData' levels to character
    samplevs <- lapply(sampleData, levels)  ## lapply() always returns a list, so samplevs[[i]] is safe
      ## find the unmatched elements
    others <- sapply(seq(ncol(dset)), function(i){
        !(dsetvals[,i] %in% samplevs[[i]])
    })
      ## change the unmatched elements to 'Other'
    dsetvals[others] <- "Other"
      ## create new data frame
    newDset <- data.frame(dsetvals)
      ## get the new levels for each column
    newLevs <- lapply(seq(newDset), function(i){
        Get <- c(as.character(newDset[[i]]), as.character(samplevs[[i]]))
        ul <- unique(unlist(Get))
    })
      ## set the new levels for each column
    for(i in seq(newDset)) newDset[,i] <- factor(newDset[,i], newLevs[[i]])
      ## result
    newDset
}

Your sample data:

sampleData <- data.frame(abacus=factor(c("a","b","a","a","a")), 
                         montreal=factor(c("f","f","f","f","a")), 
                         boston=factor(c("z","y","z","z","q")))
dataset <- data.frame(florida=factor(c("e","q","z","d","b", "a")), 
                      montreal=factor(c("f","f","f","f","a", "a")), 
                      boston=factor(c("m","y","z","z","r", "f")), 
                      abacus=factor(c("a","b","z","a","a", "g")))

Call findOthers() and look at the result with the new factor levels:

(new <- findOthers(newData = dataset))
#   abacus montreal boston
# 1      a        f  Other
# 2      b        f      y
# 3  Other        f      z
# 4      a        f      z
# 5      a        a  Other
# 6  Other        a  Other

as.list(new)
# $abacus
# [1] a     b     Other a     a     Other
# Levels: a b Other
# 
# $montreal
# [1] f f f f a a
# Levels: f a
# 
# $boston
# [1] Other y     z     z     Other Other     
# Levels: Other y z q     ## note the new level 'q', with no value in the column
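The [[i]] indexing into the per-column levels inside findOthers() relies on those levels being stored as a list. A small sketch of why that matters: sapply() simplifies its result to a matrix when every column happens to have the same number of levels, whereas lapply() always returns a list. The data frame df below is made up for illustration.

```r
# Hypothetical data frame in which both columns have exactly two levels
df <- data.frame(a = factor(c("x", "y")), b = factor(c("p", "q")))

by_sapply <- sapply(df, levels)  # simplified to a 2 x 2 character matrix
by_lapply <- lapply(df, levels)  # always a list with one element per column

is.matrix(by_sapply)  # TRUE
by_sapply[[2]]        # "y": the second cell of the matrix, not column b's levels
by_lapply[[2]]        # "p" "q": the levels of column b
```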

Answer 2

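To answer the question you asked (rather than suggesting what you should do instead): here we have to convert each column to character, do the replacement, and then turn the columns back into factors.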

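# sampleData[] <- lapply(...) keeps the data frame structure; sapply() would return a matrix
sampleData[] = lapply(sampleData, as.character)
sampleData[] = lapply(sampleData, function(x) gsub("q", "other", x))
sampleData[] = lapply(sampleData, as.factor)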

This relies on "q" living in only one column. Otherwise, you just edit each column individually to get only the change you want:

sampleData$boston = as.character(sampleData$boston)
sampleData$boston = gsub("q", "other", sampleData$boston)
sampleData$boston = as.factor(sampleData$boston)

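That said, I think you should filter these rows out of your training and test data: there are so few of them that they will not have any effect on your model. Otherwise you are making things hard for yourself.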

summary(dataset)
dataset <- dataset[dataset$abacus!="z", ]

If the dataset is very large and you are not already doing so, you may want to use the dplyr package and its filter function to do this.
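One caveat with the filtering approach: dropping rows does not by itself remove the dropped values from a factor's levels, so the level mismatch can persist after filtering. A base R sketch (dplyr is not used here), assuming the abacus column from the example; keep and filtered are made-up names.

```r
sampleData <- data.frame(abacus = factor(c("a", "b", "a", "a", "a")))
dataset <- data.frame(abacus = factor(c("a", "b", "z", "a", "a", "g")))

# Keep only rows whose value was seen in the sample
keep <- dataset$abacus %in% levels(sampleData$abacus)
filtered <- dataset[keep, , drop = FALSE]

levels(filtered$abacus)  # "g" and "z" are still listed as levels

# droplevels() removes levels that no longer have any observations
filtered$abacus <- droplevels(filtered$abacus)
levels(filtered$abacus)  # now only "a" and "b"
```

Using %in% against the sample's levels avoids naming each unwanted value explicitly, which matters with 30 columns.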

Answer 3

Does this do what you want?

# Select and sort the columns of dataset as in sampleData
sampleData <- sampleData[, order(names(sampleData))]
dataset <- dataset[, colnames(sampleData)]

f <- function(dataset, sampleData, col) {
    # For a given column col, assign "Other" to all factor levels 
    # in dataset[col] that do not exist in sampleData[col].
    # If a factor level exists in sampleData[col] but not in dataset[col],
    # preserve it as a factor level.
    v <- factor(dataset[, col], levels = c(levels(sampleData[, col]), "Other"))
    v[is.na(v)] <- "Other"
    v
}

# Apply f to all columns of dataset
l <- lapply(colnames(dataset), function(x) f(dataset, sampleData, x))

res <- data.frame(l) # Format into a data frame
colnames(res) <- colnames(dataset) # Assign the names of dataset
dataset <- res # Assign the result to dataset

You can test it as follows:

> dataset[, "boston"]
[1] Other y     z     z     Other Other
Levels: q y z Other
> dataset[, "montreal"]
[1] f f f f a a
Levels: a f Other
> dataset[, "abacus"]
[1] a     b     Other a     a     Other
Levels: a b Other
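Since the question describes processing the full dataset one chunk at a time, the same logic could be wrapped in a single function that is called once per chunk. This is only a sketch: align_factors() is a hypothetical helper name, and it assumes every column of the sample is a factor and is present in the new data.

```r
# align_factors() is a hypothetical helper, not part of any package
align_factors <- function(new_data, sample_data, other = "Other") {
  out <- lapply(names(sample_data), function(col) {
    # levels= keeps every sampled level; values not in the sample become NA
    v <- factor(as.character(new_data[[col]]),
                levels = c(levels(sample_data[[col]]), other))
    v[is.na(v)] <- other  # note: genuine NAs in new_data also become "Other"
    v
  })
  names(out) <- names(sample_data)  # column order follows the sample
  as.data.frame(out)
}

sampleData <- data.frame(abacus = factor(c("a", "b", "a", "a", "a")),
                         montreal = factor(c("f", "f", "f", "f", "a")),
                         boston = factor(c("z", "y", "z", "z", "q")))
dataset <- data.frame(florida = factor(c("e", "q", "z", "d", "b", "a")),
                      montreal = factor(c("f", "f", "f", "f", "a", "a")),
                      boston = factor(c("m", "y", "z", "z", "r", "f")),
                      abacus = factor(c("a", "b", "z", "a", "a", "g")))

res <- align_factors(dataset, sampleData)
levels(res$boston)  # "q" "y" "z" "Other"
```

Because the columns are taken by name from the sample, the column order of each incoming chunk does not matter and extra columns such as florida are dropped.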

Transferring factor attributes between two data frames
