Regression in R after removing outliers

Time: 2016-03-15 21:20:01

Tags: r linear-regression

I have the following data.frame:

time        values    outlier  
20/01/2010   11         no          
20/02/2010   12         no
20/03/2010   11         no
20/04/2010   12         no
20/05/2010   10         no
20/06/2010   20         yes
20/07/2010   11         no
20/02/2010   12         no

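I want to run a regression on this data frame with values as the dependent (response) variable and time as the independent variable, excluding every row marked "yes" in the outlier column.

Here is my attempt: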

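# Keep only the rows that are not flagged as outliers
# (assumes df$time has already been converted to Date)
temp <- subset(df, outlier == "no")
fit  <- lm(values ~ time, data = temp)
intrcpt <- coef(fit)[[1]]
slope   <- coef(fit)[[2]]

# Fitted value for each retained row: intercept + slope * time
temp$regression_points <- intrcpt + slope * as.numeric(temp$time)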

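Now I want to use the fitted regression model to predict the values for the rows in temp and put the results back into the original data frame, like this: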

time        values    outlier      regression_points  
20/01/2010   11         no                11
20/02/2010   12         no                11
20/03/2010   11         no                11
20/04/2010   12         no                11
20/05/2010   10         no                11
20/06/2010   20         yes               
20/07/2010   11         no                11
20/02/2010   12         no                11

How can I do this?

3 answers:

Answer 0 (score: 3)

Have a look at the following code:

# Create example data
set.seed(1)
# An explicit origin keeps as.Date() working on R versions before 4.3
df <- data.frame(time = as.Date(1:100, origin = "1970-01-01"), value = runif(100), outlier = sample(0:1, 100, TRUE))

# Fit model for non-outliers
fit <- lm(value ~ time, df[df$outlier == 0, ] )

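# Predict for every row of df, then set the outlier rows to NA.
# predict() with newdata is used because fitted() ignores a second argument
# and returns values only for the rows that were used in the fit.
df$regression_points <- ifelse(df$outlier == 1, NA, predict(fit, newdata = df))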

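# First rows of df as printed in the original answer
# (the outlier column, and hence the numbers, depend on the R version's sample()):
#         time     value outlier regression_points
# 1 1970-01-02 0.2655087       1                NA
# 2 1970-01-03 0.3721239       0         0.5866995
# 3 1970-01-04 0.5728534       0         0.5834598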
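
The same idea can be tried on the data from the question. This is only a minimal sketch: it assumes the outlier column holds the strings "yes" and "no" and that time is parsed as a Date with the format "%d/%m/%Y". predict() with newdata returns one prediction per row of df, and ifelse() then blanks out the outlier rows.

```r
# Data from the question; the date format is an assumption
df <- data.frame(
  time    = as.Date(c("20/01/2010", "20/02/2010", "20/03/2010", "20/04/2010",
                      "20/05/2010", "20/06/2010", "20/07/2010", "20/02/2010"),
                    format = "%d/%m/%Y"),
  values  = c(11, 12, 11, 12, 10, 20, 11, 12),
  outlier = c("no", "no", "no", "no", "no", "yes", "no", "no"),
  stringsAsFactors = FALSE
)

# Fit on the non-outlier rows only
fit <- lm(values ~ time, data = df[df$outlier == "no", ])

# One prediction per row of df, then set the outlier rows to NA
df$regression_points <- ifelse(df$outlier == "yes", NA, predict(fit, newdata = df))
print(df)
```

Dropping the ifelse() would also give a prediction for the outlier row, which can be handy for seeing how far the flagged point lies from the fitted line.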

Answer 1 (score: 3)

Create a new data frame df2 in which the values of the outlier rows are set to NA, then fit it using na.exclude:

df2 <- transform(df, values = ifelse(outlier == "no", values, NA))
fm <- lm(values ~ time, df2, na.action = na.exclude)
transform(df, fitted = fitted(fm))

which gives:

        time values outlier   fitted
1 2010-01-20     11      no 11.64579
2 2010-02-20     12      no 11.49318
3 2010-03-20     11      no 11.35534
4 2010-04-20     12      no 11.20273
5 2010-05-20     10      no 11.05504
6 2010-06-20     20     yes       NA
7 2010-07-20     11      no 10.75474
8 2010-02-20     12      no 11.49318
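
The reason na.exclude matters here can be seen in a small comparison. The data frame d below is hypothetical and made up purely for illustration. With na.omit, the row with a missing response is dropped from the fitted values, so they no longer line up with the original rows. With na.exclude, fitted() pads that position with NA.

```r
# Hypothetical data with one response set to NA (e.g. a masked outlier)
d <- data.frame(x = 1:6, y = c(2.1, 3.9, NA, 8.2, 9.8, 12.1))

fit_omit    <- lm(y ~ x, data = d, na.action = na.omit)
fit_exclude <- lm(y ~ x, data = d, na.action = na.exclude)

# na.omit drops the NA row from the fitted values; na.exclude pads it with NA
length(fitted(fit_omit))     # 5
length(fitted(fit_exclude))  # 6
is.na(fitted(fit_exclude))   # TRUE only at position 3

# Both fits use the same five complete rows, so the coefficients agree
all.equal(coef(fit_omit), coef(fit_exclude))
```

Because the padded vector has one entry per row of the input, it can be attached to the original data frame directly, which is what transform(df, fitted = fitted(fm)) relies on.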

Note: the input, in reproducible form, is:

Lines <- 
"time        values    outlier  
20/01/2010   11         no          
20/02/2010   12         no
20/03/2010   11         no
20/04/2010   12         no
20/05/2010   10         no
20/06/2010   20         yes
20/07/2010   11         no
20/02/2010   12         no"

df <- read.table(text = Lines, header = TRUE)
df$time <- as.Date(df$time, format = "%d/%m/%Y")

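Answer 2 (score: 2)

# Fit using only the non-outlier rows
fit <- lm(values ~ time, subset=outlier=="no", data=df)
# Start with NA everywhere, then fill in the fitted values for the rows used in the fit
df$regression_points <- NA
df$regression_points[df$outlier=="no"] <- fitted(fit)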
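
As a sanity check, the subset-based assignment can be compared with the result of masking the outlier response and fitting with na.exclude. The sketch below rebuilds the question's data so that it runs on its own. The date format and the names by_subset and by_exclude are assumptions introduced for illustration.

```r
# Input from the question, rebuilt so this block is self-contained
df <- data.frame(
  time    = as.Date(c("20/01/2010", "20/02/2010", "20/03/2010", "20/04/2010",
                      "20/05/2010", "20/06/2010", "20/07/2010", "20/02/2010"),
                    format = "%d/%m/%Y"),
  values  = c(11, 12, 11, 12, 10, 20, 11, 12),
  outlier = c("no", "no", "no", "no", "no", "yes", "no", "no"),
  stringsAsFactors = FALSE
)

# Subset approach: fit on non-outliers, assign back through a logical index
fit <- lm(values ~ time, data = df, subset = outlier == "no")
by_subset <- rep(NA_real_, nrow(df))
by_subset[df$outlier == "no"] <- fitted(fit)

# na.exclude approach: mask the outlier response and let fitted() pad with NA
df2 <- transform(df, values = ifelse(outlier == "no", values, NA))
fm  <- lm(values ~ time, data = df2, na.action = na.exclude)
by_exclude <- unname(fitted(fm))

# Both routes fit the same seven rows, so the results should agree
all.equal(by_subset, by_exclude)
```

The logical-index assignment depends on fitted(fit) having exactly one value per non-outlier row, which holds as long as those rows contain no missing values in values or time.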