Question

Suppose we have a data.frame like the following:

City
NYC
Boston
NYC
NYC
Providence 
Boston
NYC

I want to write the simplest possible function:

redistribute <- function(data, column, unique_value, decrease_by) {
  #data = dataframe provided by user
  #column = column of the respective dataframe
  #unique_value = fields contained within the respective column of the respective dataframe
  #decrease_by = the desired "portion" or "distribution" of the unique_value within column. 
}

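Edit:

I'll restate the question, since it seems to have been confusing.

I need to compute the frequency of a value (the unique_value argument) in the column. For example, "NYC" in the City column has a frequency of 4/7, or 0.57.
Reduce the number of occurrences of that value so that its frequency matches the one the user supplies in the function argument. For example, for "NYC", go from 0.57 to the value of the decrease_by argument, say from 0.57 to 0.10.
Replace the fields originally occupied by unique_value with different values from the same column, chosen at random. For example, we remove the first "NYC" entry to bring the overall frequency of "NYC" down from 0.57 toward 0.10, and put some random city such as "Boston" in its place.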
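
To make the arithmetic in these steps concrete, here is a minimal sketch in R using the example column. The rounding rule is an assumption: the question does not say how to handle a target that cannot be hit exactly, and with 7 rows a frequency of 0.10 is not reachable, so the sketch rounds to the nearest whole number of rows. The variable names are illustrative only.

```r
# Example column from the question
city <- c("NYC", "Boston", "NYC", "NYC", "Providence", "Boston", "NYC")

# Step 1: current frequency of the value of interest (4 of 7 rows)
current_freq <- mean(city == "NYC")

# Step 2: number of "NYC" rows to keep at the target frequency.
# Assumption: round to the nearest whole row, since 0.10 * 7 is not an integer.
target_freq <- 0.10
n_keep <- round(target_freq * length(city))

# Step 3: number of "NYC" rows that must be overwritten with other cities
n_replace <- sum(city == "NYC") - n_keep
```

Under that rounding assumption one "NYC" row is kept and three are replaced, which agrees with the single "NYC" in the expected result.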

So the expected result would be:

City
NYC 
Boston
Boston
Providence
Boston
Providence
Boston

I'd like to avoid doing a lot of transformations. I'm looking for the most logical and efficient way to do this.

Answer 1

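What I think you're trying to do really just amounts to combining a few things into one function. Taking your example, let new_level be the proportion of the new data that you want that factor level to make up.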

city = c("NYC", "Boston", "NYC", "NYC", "Providence", "Boston", "NYC")
data = data.frame(city=city)

redistribute <- function(data, column, unique_value, new_level){
        ## Names of factors and size of data
        fac_names <- levels(factor(data[,column]))
        size <- nrow(data)

        ## Make new list using rep and sample with desired ratio
        new_col <- c(rep(unique_value,
                        floor(new_level*size)),
                        sample(fac_names[which(fac_names!=unique_value)],
                               size=(size-floor(new_level*size)),
                               replace=TRUE))

        ## Mix up and assign to data frame
        data[,column] <- sample(new_col)
        return(data)
}

redistribute(data, column="city",
                unique_value="NYC",
                new_level=0.3)
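
Note that the function above rebuilds and reshuffles the whole column, so rows that never held unique_value can change too. If only the surplus occurrences should be overwritten, one possible variant is sketched below. The name redistribute_in_place is hypothetical and not part of the original answer, and rounding to the nearest whole row is an assumption.

```r
# Hypothetical variant (not from the original answer): overwrite only the
# surplus occurrences of `unique_value` and leave every other row untouched.
redistribute_in_place <- function(data, column, unique_value, new_level) {
  stopifnot(new_level >= 0, new_level <= 1)

  # Work on a character copy; a factor column comes back as character
  values <- as.character(data[[column]])
  hits <- which(values == unique_value)

  # Assumption: round to the nearest whole number of rows to keep
  n_keep <- round(new_level * length(values))
  n_replace <- length(hits) - n_keep
  if (n_replace <= 0) return(data)  # already at or below the target

  # Candidate replacements: every other distinct value in the column
  others <- setdiff(unique(values), unique_value)
  if (length(others) == 0) stop("no other values available as replacements")

  # Draw positions and replacements by index, which sidesteps sample()'s
  # special handling of length-one numeric vectors
  idx <- hits[sample.int(length(hits), n_replace)]
  values[idx] <- others[sample.int(length(others), n_replace, replace = TRUE)]

  data[[column]] <- values
  data
}

data <- data.frame(city = c("NYC", "Boston", "NYC", "NYC",
                            "Providence", "Boston", "NYC"))
result <- redistribute_in_place(data, column = "city",
                                unique_value = "NYC", new_level = 0.10)
```

With the example data, one "NYC" row remains (a frequency of 1/7), and the rows that originally held "Boston" or "Providence" keep their values. Which three "NYC" rows get replaced, and with what, varies from run to run unless a seed is set with set.seed().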
