How to do univariate outlier treatment dynamically

Time: 2017-03-03 08:23:05

Tags: r outliers

Suppose I have the following data:

df <- iris[, 1:2]  # taking only 2 numeric columns

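Now I want to perform a univariate outlier test, where I define an outlier as any value more than 1.5 * IQR below the first quartile or above the third quartile. Once an outlier is identified, I cap it at the 95th percentile for the upper end, or at the 5th percentile for the lower end, as shown below: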

a <- df$Sepal.Length
qnt_a <- quantile(a, probs = c(0.25,0.75))
caps_a <- quantile(a, probs = c(0.05,0.95))
H_a <- 1.5 * IQR(a)
a[a < (qnt_a[1] - H_a)] <- caps_a[1]
a[a > (qnt_a[2] + H_a)] <- caps_a[2]  # upper fence uses the 75th percentile, qnt_a[2]
df$Sepal.Length <- a

Similarly, I did the same for the other remaining numeric variable:

b <- df$Sepal.Width
qnt_b <- quantile(b, probs = c(0.25,0.75))   # quantiles of b, not a
caps_b <- quantile(b, probs = c(0.05,0.95))
H_b <- 1.5 * IQR(b)
b[b < (qnt_b[1] - H_b)] <- caps_b[1]
b[b > (qnt_b[2] + H_b)] <- caps_b[2]  # upper fence uses the 75th percentile, qnt_b[2]
df$Sepal.Width <- b

df

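I would like help formulating a loop that identifies and caps outliers for all numeric variables in a data frame, rather than doing it variable by variable.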

1 Answer:

Answer 0: (score: 1)

The simplest way is to make it a function and apply it, i.e.

f1 <- function(a){
  qnt_a <- quantile(a, probs = c(0.25,0.75))   # 1st and 3rd quartiles
  caps_a <- quantile(a, probs = c(0.05,0.95))  # replacement values for outliers
  H_a <- 1.5 * IQR(a)
  a[a < (qnt_a[1] - H_a)] <- caps_a[1]
  a[a > (qnt_a[2] + H_a)] <- caps_a[2]  # upper fence uses the 3rd quartile, qnt_a[2]
  return(a)
}

df[] <- lapply(df, f1)
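
Note that `lapply(df, f1)` assumes every column is numeric, which holds for `iris[, 1:2]` but not for the full `iris` data frame, where `Species` is a factor. Below is a minimal sketch of one way to restrict the capping to numeric columns. The names `cap_outliers`, `k`, `cap_probs`, and `df_all` are hypothetical and introduced here only for illustration, and the `na.rm = TRUE` handling is an assumption about how missing values should be treated, not something stated in the question.

```r
# Hypothetical helper: cap values outside the 1.5 * IQR fences at the
# 5th / 95th percentiles. Names and defaults are illustrative assumptions.
cap_outliers <- function(x, k = 1.5, cap_probs = c(0.05, 0.95)) {
  qnt  <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)  # quartiles
  caps <- quantile(x, probs = cap_probs, na.rm = TRUE)      # replacement values
  h    <- k * IQR(x, na.rm = TRUE)
  lower <- qnt[[1]] - h   # lower fence: Q1 - k * IQR
  upper <- qnt[[2]] + h   # upper fence: Q3 + k * IQR
  # !is.na(x) keeps missing values out of the logical index (assumed policy:
  # NAs are left as they are)
  x[!is.na(x) & x < lower] <- caps[[1]]
  x[!is.na(x) & x > upper] <- caps[[2]]
  x
}

df_all <- iris                                        # includes the factor Species
num_cols <- vapply(df_all, is.numeric, logical(1))    # TRUE for numeric columns
df_all[num_cols] <- lapply(df_all[num_cols], cap_outliers)
```

Selecting columns with `vapply(df_all, is.numeric, logical(1))` leaves non-numeric columns such as `Species` untouched, so the same call should work on data frames with mixed column types.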