R: customised function for VIF

时间:2018-01-23 19:36:22

标签: r regression correlation r-car

I am trying to write a loop to calculate Variance Inflation Factor. I understand there are functions and packages that can do this for me but I need some sort of customisation.

A sample data

  library(MASS)
  library(clusterGeneration)

  set.seed(2)
  num.vars <- 30
  num.obs<-200
  cov.mat<- genPositiveDefMat(num.vars,covMethod="unifcorrmat")$Sigma
  rand.vars<- mvrnorm(num.obs,rep(0,num.vars),Sigma=cov.mat)

  cov.mat <- as.data.frame(cov.mat)
  names(cov.mat) <- rep(paste0("X",1:30))

This dataframe has 30 columns (predictors).

Here's my logic of the loop:

1) Regress each predictor against other predictors and calculate R2. Convert R2 to VIF using VIF = 1/1 - R2. This will give me 30 VIF values.

2) Sort the VIF value. If the top predictor has VIF > 10, remove the predictor from the cov.mat. cov.mat will have 29 predictors now.

3) Repeat Step 1 i.e regress each predictor against other predictors and calculate VIF again (29 VIFs this time). If max VIF > 10, remove the variable with the highest VIF and keep doing till max VIF <= 10.

However, the catch is I want to keep X4 , X6 and X10 even if their VIF > 10 in a given iteration. So in the above process if X4 or X6 or X10 comes out to have the highest VIF (> 10) in an iteration, remove the variable with the second highest VIF (only if the second highest VIF is also > 10 and is not X4 or X6 or X10). I hope this is clear

  mat <- matrix(, ncol = 2, nrow = nrow(cov.mat)) #  this will store the 30 VIFs

for(i in 1:ncol(cov.mat)){
      mdl <- lm(cov.mat[,i] ~ ., data = cov.mat) # this will regress each column against other columns but throws an error when i = 2
      r.squared <- unlist(summary(mdl)[8]) # this gives the r-squared of predictor i
      vif <- 1/(1- r.squared^2) # calcualtion of VIF for predictor i
      mat[i,2]  <- vif
      mat[i,1]  <- names(cov.mat[i])
  }

Let's say the above loop works fine and I have a matrix with first column as variables names and second column with VIF values.

     df <- data.frame(mat)
     names(df) <- c("variable", "vif")
     df <- df[sort(df$vif),]

     ifelse(df[1,2] <= 10, stop, ifelse(df[1,2] > 10 & names(df[1,1]) != "X4" | names(df[1,1]) != "X6" | names(df[1,1]) != "X10", ....

This is where I am lost.

I first need to check if the variables with the highest VIF is > 10 and is not among X4 or x6 and X10, and remove the variable from dataframe cov.mat. If the variable with highest VIF (given VIF > 10) is either X4 or X6 or X10, then go to the second row of df and evaluate whether its VIF > 10 or not and whether it is not among X4, X6 or X10 and if it meets the condition, remove it from the cov.mat and start the iteration again.

EDIT

My original data frame has 51 columns and 1458 rows. When I run the above function, it gives me an error there are aliased coefficients in the model. Why is this happening?

1 个答案:

答案 0 :(得分:1)

在您的示例数据中,无法为整个数据集计算或VIF分数,这很可能是因为完美的共线性。这里的函数应该适用于不是这种情况的数据(例如,数据集的第1:15列)。您可以忽略/删除所有cat代码。那仅仅是为了说明正在发生的事情

此外,我使用包car作为函数vif

library(vif)

vif_fun <- function(df, keep_in) {
             # df: the dataset of interest
             # keep_in: the variables that should be kept in  
             highest <- c()
             while(TRUE) {
                # the rnorm() below is arbitrary as the VIF should not 
                # depend on it
                vifs <- vif(lm(rnorm(nrow(df)) ~. , data = df))
                adj_vifs <- vifs[-which(names(vifs) %in% keep_in)]
                if (max(adj_vifs) < 10) {
                     break
                }
               cat("\n")
               print(vifs)
               highest <- c(highest,names((which(adj_vifs == max(adj_vifs)))))
               cat("\n")
               cat("removed:", highest)
               cat("\n")
               df <- df[,-which(names(df) %in% highest)]

              }
            cat("\n")
            cat("final variables: \n")
            return(names(vifs))
              }

# example with mtcars dataset
vif_fun(mtcars,keep_in = c("cyl"))


# example using part of your data
vif_fun(cov.mat[,1:15], keep_in = c("X15", "X12"))