R - Unable to change NA in a data frame to numeric

Asked: 2016-12-15 18:05:55

Tags: r

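I have a data frame called "games" with several numeric columns. The original csv file has some missing values, which become NA when I read the file in. I am trying to replace these NAs with the row median (already stored as a column of the data frame). I cannot get the replaced NAs to be coerced from character to numeric.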

First, I found the indices of the missing values.

ng <- which(is.na(games), arr.ind = TRUE)

Then I tried to replace the NAs with the values in the "linemedian" column.

games[ng] <- games[ng[,1], "linemedian"]
games[ng]
[1] " -3.25" "  9.98" " -9.1"  " -9.1"  " 14.0"  " -3.25" "  9.98" " -3.25" "  9.98" "  2.30" " 13.75" "-24.00" "  3.71" " 15.94" " 14.25" " -9.83" " 13.75" " -4.88"

Replacing the NAs with an arbitrary number does not work either.

games[is.na(games)] <- 0
[1] "  0.0"  "  0.0"  "  0"    "  0"    "  0"    "  0.0"  "  0.0"  "  0.0"  "  0.0"  "  0.00" "  0.00" "  0.00" "  0"    "  0"    "  0.00" "  0.00" "  0.00" "  0.00"

I thought that removing the whitespace might change the result, but it did not.

games[ng] <- as.numeric(trimws(games[ng[,1], "linemedian"]))
[1] "-3.25" "9.98"  "-9.1"  "-9.1"  "14"    "-3.25" "9.98"  "-3.25" "9.98"  "2.3"   "13.75" "-24"   "3.71"  "15.94" "14.25" "-9.83" "13.75" "-4.88"

Other attempts that did not work:

games[ng] <- type.convert(games[ng]) # using type.convert()

games[, -c(1,2)] <- as.numeric(games[, -c(1,2)]) # first two columns are metadata
Error: (list) object cannot be coerced to type 'double'

games[, -c(1,2)] <- as.numeric(unlist(games[, -c(1,2)]))    

games[ng] <- as.numeric(as.character(trimws(games[ng[,1], "linemedian"])))

# New Addition from Answer
games[, sapply(games, is.numeric)][ng] <- games[, sapply(games, is.numeric)][ng[,1], "linemedian"]
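
For reference, the error from as.numeric(games[, -c(1,2)]) above arises because a data frame is a list of columns, and as.numeric() does not accept a list. Converting column by column with lapply() is one common way around it. The following is only a sketch on a made-up data frame (`toy` and its values are hypothetical), assuming, as in the question, that the first two columns are metadata:

```r
# Hypothetical toy data: two metadata columns plus values stored as text
toy <- data.frame(
  Home    = c("Alabama", "Arizona"),
  Road    = c("USC", "BYU"),
  line    = c("12", " -2"),
  linesag = c("12.19", "0.97"),
  stringsAsFactors = FALSE
)

# Negative indices drop the first two (metadata) columns
value_cols <- -c(1, 2)

# Convert each remaining column separately; assigning a list of columns
# back into the data frame replaces those columns one by one
toy[value_cols] <- lapply(toy[value_cols], function(col) as.numeric(trimws(col)))

# Check the type of every column rather than printing individual cells
sapply(toy, class)
```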

I am sure that the value I am assigning to games[ng] is numeric.

games[ng[,1], "linemedian"]
[1]  -3.25   9.98  -9.10  -9.10  14.00  -3.25   9.98  -3.25   9.98   2.30  13.75 -24.00   3.71  15.94  14.25  -9.83  13.75  -4.88
typeof(games[ng[,1], "linemedian"])
[1] "double"

Everything I have read on Stack Overflow says the obvious answer should be games[is.na(games)] <- VALUE, but that does not work. Does anyone have any ideas?

If you want to reproduce this, here is the full code:

## Download Raw Files

download.file("http://www.thepredictiontracker.com/ncaa2016.csv",
          "data/ncaa2016.csv")

download.file("http://www.thepredictiontracker.com/ncaapredictions.csv",
          "data/ncaapredictions.csv")

## Create Training and Prediction Data Sets

games <- read.csv("data/ncaa2016.csv", header = TRUE, stringsAsFactors = FALSE, 
              colClasses=c(rep("character",2),rep("numeric",72)))

preds <- read.csv("data/ncaapredictions.csv", header = TRUE, stringsAsFactors = TRUE)
colnames(preds)[colnames(preds) == "linebillings"] <- "linebill"
colnames(preds)[colnames(preds) == "linebillings2"] <- "linebill2"
colnames(preds)[colnames(preds) == "home"] <- "Home"
colnames(preds)[colnames(preds) == "road"] <- "Road"

## Remove Columns with too many missing values

rm <- unique(c(names(games[, sapply(games, function(z) sum(is.na(z))) > 50]), # Games and predictions
           names(preds[, sapply(preds, function(z) sum(is.na(z))) > 10]))) # with missing data

games <- games[, !(names(games) %in% rm)] # Remove games with no prediction data 

preds <- preds[, !(names(preds) %in% rm)] # Remove predictions with no game data 

## Replace NAs with Prediction Median
ng <- which(is.na(games), arr.ind = TRUE)
games[ng] <- games[ng[,1], "linemedian"]

Also, I cannot post the entire dput() output, but here is part of the data set, just to show the structure.

dput(head(games[1:6]))

structure(list(Home = c("Alabama", "Arizona", "Arkansas", "Arkansas St.", 
"Auburn", "Boston College"), Road = c("USC", "BYU", "Louisiana Tech", 
"Toledo", "Clemson", "Georgia Tech"), line = c("12", "-2", "24.5", 
"4", "-8.5", "-3"), linesag = c(12.19, 0.97, 24.26, -2.07, -4.78, 
-2.74), linepayne = c(12, -0.81, 12.53, -0.86, -10.72, -3.87), 
linemassey = c(19.15, -2.1, 21.07, -8.68, -5.45, -6.76)), .Names = c("Home", 
"Road", "line", "linesag", "linepayne", "linemassey"), row.names = c(NA, 
6L), class = "data.frame")

Finally, I am running R version 3.2.1 on x86_64-w64-mingw32.

1 answer:

Answer 0 (score: 1)

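Without a test case, this is untested. It looks like you are getting a global replacement, but because some of your columns are character, the replacement values (including the 0) appear to be coerced to character. I would have tried restricting the process to the numeric columns only: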

games[ , sapply(games, is.numeric) ][ ng ] <- 
                        games[ , sapply(games, is.numeric)][ng[,1], "linemedian"]
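
One caveat with this untested suggestion: ng was computed on the full data frame, so its column numbers may no longer line up once the character columns are dropped. A column-by-column replacement avoids matrix indexing altogether. The following is a minimal sketch on made-up data (`toy` and its values are hypothetical), assuming the median column is called linemedian as in the question:

```r
# Hypothetical toy data with the same kind of layout as `games`
toy <- data.frame(
  Home       = c("Alabama", "Arizona", "Arkansas"),
  Road       = c("USC", "BYU", "Louisiana Tech"),
  line       = c(12, NA, 24.5),
  linesag    = c(12.19, 0.97, NA),
  linemedian = c(12.1, -3.25, 9.98),
  stringsAsFactors = FALSE
)

# Names of the numeric columns only; character columns are never touched
num_cols <- names(toy)[sapply(toy, is.numeric)]

for (col in num_cols) {
  # Rows where this column is missing
  missing_rows <- is.na(toy[[col]])
  # Fill them with the median stored in the same row
  toy[[col]][missing_rows] <- toy$linemedian[missing_rows]
}

# Inspect column types and the filled values
str(toy)
```

Because each column is modified through `[[`, its type cannot change as a side effect of the other columns.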

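After tinkering with your almost-reproducible code, I conclude that your original code succeeded, and that the output you used to check it is the problem area: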

 str( games[ , sapply(games, is.numeric)][ng[,1], "linemedian"] )
#num [1:23] -3.25 9.98 -9.1 -9.1 14 -3.25 9.98 -3.25 9.98 2.3 ...

 games[ ng ] <- 
                         games[ , sapply(games, is.numeric)][ng[,1], "linemedian"]
games[ ng[1:2,] ]
[1] " -3.25" "  9.98"

> ng[1:2,]
     row col
[1,] 619   3
[2,] 678   3

> str(games)
'data.frame':   720 obs. of  58 variables:
 $ Home         : chr  "Alabama" "Arizona" "Arkansas" "Arkansas St." ...
 $ Road         : chr  "USC" "BYU" "Louisiana Tech" "Toledo" ...
 $ line         : num  12 -2 24.5 4 -8.5 -3 8.5 37 -10.5 5 ...
 $ linesag      : num  12.19 0.97 24.26 -2.07 -4.78 ...
 $ linepayne    : num  12 -0.81 12.53 -0.86 -10.72 ...
deleted

 > games[ c(619,678)  , 3]
#[1] -3.25  9.98
> games[ matrix(c(619,678,3,3), ncol=2)]
[1] " -3.25" "  9.98"

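So the third column remains numeric after the assignment. For reasons I do not understand, however, the printed result of an extraction with matrix indexing looks like character, even though the values are actually numeric.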
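
A likely explanation, going by the documentation for data frame extraction (?Extract.data.frame): when `[` is given a matrix index, the data frame is first coerced to a matrix for extraction, whereas replacement is done one column at a time. Since games contains character columns, as.matrix() yields a character matrix, and the numeric columns are passed through format(), which would account for the padded strings. The sketch below illustrates this on made-up data (`toy` and its values are hypothetical):

```r
# Hypothetical toy data: one character column and two numeric columns
toy <- data.frame(
  Home       = c("Alabama", "Arizona", "Arkansas"),
  line       = c(12, NA, 24.5),
  linemedian = c(11.5, -3.25, 20),
  stringsAsFactors = FALSE
)

# (row, col) positions of the missing values, as in the question
ng <- which(is.na(toy), arr.ind = TRUE)

# Replacement with a matrix index is applied column by column
toy[ng] <- toy[ng[, 1], "linemedian"]

# Extraction with a matrix index goes through as.matrix(toy),
# which is a character matrix because of the Home column
extracted <- toy[ng]
class(extracted)

# Ordinary column indexing shows that the stored values are numeric
by_column <- toy$line[ng[, 1]]
class(by_column)
sapply(toy, class)
```

Checking the result with str(games) or sapply(games, class), rather than with games[ng], avoids the misleading character output.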