Replace all outliers in all columns of a data frame with NA

Time: 2017-09-13 09:21:43

Tags: r dataframe na outliers

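I have a data frame containing a mix of numeric and factor variables.

I am trying to recursively replace all outliers (values more than 3 x SD from the mean) with NA, but I run into the following error: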

Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric

The code I am using is:

name = factor(c("A","B","NA","D","E","NA","G","H","H"))
height = c(120,NA,150,170,NA,146,132,210,NA)
age = c(10,20,0,30,40,50,60,NA,130)
mark = c(100,0.5,100,50,90,100,NA,50,210)
data = data.frame(name=name,mark=mark,age=age,height=height)
data
data[is.na(data)] <- 77777 
data.scale <-  scale(data)
data.scale[ abs(data.scale) > 3 ] <- NA
data <- data.scale

Any suggestions on how to get this working?
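The error message points at the cause: `scale()` converts the whole data frame to a matrix, and because `name` is a factor the matrix is no longer numeric, so `colMeans()` refuses it. The following is a minimal sketch that reproduces this on the question's data (without the `77777` replacement step); the variable names `numeric_cols` and `result` are chosen here for illustration only. Note also that `scale()` already omits NA values when computing the mean and SD, so filling them with `77777` beforehand is likely to distort both statistics.

```r
# Same columns as in the question.
data <- data.frame(
  name   = factor(c("A", "B", "NA", "D", "E", "NA", "G", "H", "H")),
  mark   = c(100, 0.5, 100, 50, 90, 100, NA, 50, 210),
  age    = c(10, 20, 0, 30, 40, 50, 60, NA, 130),
  height = c(120, NA, 150, 170, NA, 146, 132, 210, NA)
)

# Check which columns are numeric; scale() needs all of them to be.
numeric_cols <- vapply(data, is.numeric, logical(1))
print(numeric_cols)

# Calling scale() on the full data frame fails because of the factor column.
# Capture the error message instead of stopping the script.
result <- tryCatch(scale(data), error = function(e) conditionMessage(e))
print(result)
```

Restricting the call to the numeric columns, for example `scale(data[numeric_cols])`, avoids the error.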

1 Answer:

Answer 0: (score: 1)

Here is one approach:

library(dplyr)

# take note of order for column names
data.names <- colnames(data)

# scale all numeric columns
data.numeric <- select_if(data, is.numeric) %>% # subset of numeric columns
  mutate_all(scale)                             # perform scale separately for each column
data.numeric[abs(data.numeric) > 3] <- NA       # set values more than 3 SD from the mean, in either direction, to NA (none in this example)

# combine results with subset data frame of non-numeric columns
data <- data.frame(select_if(data, function(x) !is.numeric(x)),
                   data.numeric)

# restore columns to original order
data <- data[, data.names]

> data
  name        mark         age     height
1    A  0.20461856 -0.80009469 -1.0844636
2    B -1.43232992 -0.55391171         NA
3   NA  0.20461856 -1.04627767 -0.1459855
4    D -0.61796862 -0.30772873  0.4796666
5    E  0.04010112 -0.06154575         NA
6   NA  0.20461856  0.18463724 -0.2711159
7    G          NA  0.43082022 -0.7090723
8    H -0.61796862          NA  1.7309707
9    H  2.01431035  2.15410109         NA

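Note: with this approach the non-numeric (character/factor/etc.) variables end up ordered before the numeric variables. Hence the last step, which restores the original column order (if applicable).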
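The result above holds standardized z-scores rather than the original measurements. If the original units should be kept and only the outlying values blanked out, something along the following lines may work. This is a base R sketch, not part of the original answer: the helper `na_outliers`, its parameter `k`, and the names `num_cols` and `cleaned` are hypothetical and introduced here for illustration.

```r
# Rebuild the example data from the question.
data <- data.frame(
  name   = factor(c("A", "B", "NA", "D", "E", "NA", "G", "H", "H")),
  mark   = c(100, 0.5, 100, 50, 90, 100, NA, 50, 210),
  age    = c(10, 20, 0, 30, 40, 50, 60, NA, 130),
  height = c(120, NA, 150, 170, NA, 146, 132, 210, NA)
)

# Hypothetical helper: set values lying more than `k` standard deviations
# from the column mean to NA, leaving all other values in original units.
na_outliers <- function(x, k = 3) {
  z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  x[!is.na(z) & abs(z) > k] <- NA
  x
}

# Apply the helper to numeric columns only; factor columns are untouched
# and the column order is preserved.
num_cols <- vapply(data, is.numeric, logical(1))
cleaned <- data
cleaned[num_cols] <- lapply(data[num_cols], na_outliers)
print(cleaned)
```

On this example data no value lies more than 3 SD from its column mean, so `cleaned` matches the input. With only nine rows, a single extreme value also inflates the SD it is measured against, so a 3 SD cutoff is hard to exceed in such a small sample.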