从回归估计中提取交互项

时间:2018-05-17 14:50:13

标签: r regression lm

这是一个简单的问题,但我无法在任何地方找到明确而有说服力的答案。如果我有一个包含一个或多个交互项的回归模型,例如:

mod1 <- lm(mpg ~ factor(cyl) * factor(am), data = mtcars)
coef(summary(mod1))
##                           Estimate Std. Error   t value              Pr(>|t|)
## (Intercept)              22.900000   1.750674 13.080673 0.0000000000006057324
## factor(cyl)6             -3.775000   2.315925 -1.630018 0.1151545663620229670
## factor(cyl)8             -7.850000   1.957314 -4.010599 0.0004547582690011110
## factor(am)1               5.175000   2.052848  2.520888 0.0181760532676256310
## factor(cyl)6:factor(am)1 -3.733333   3.094784 -1.206331 0.2385525615801434851
## factor(cyl)8:factor(am)1 -4.825000   3.094784 -1.559075 0.1310692573417492068

确定哪些系数估算值适用于互动条件的确定方法是什么?显而易见的方法是grep()为术语名称中的冒号。但是让我们假设一秒钟因为类似的事情而无法实现:

mtcars$cyl2 <- factor(mtcars$cyl, levels = c(4,6,8), labels = paste("Cyl:", unique(mtcars$cyl)))
mod2 <- lm(mpg ~ cyl2 * factor(am), data = mtcars)
##                         Estimate Std. Error   t value              Pr(>|t|)
## (Intercept)            22.900000   1.750674 13.080673 0.0000000000006057324
## cyl2Cyl: 4             -3.775000   2.315925 -1.630018 0.1151545663620229670
## cyl2Cyl: 8             -7.850000   1.957314 -4.010599 0.0004547582690011110
## factor(am)1             5.175000   2.052848  2.520888 0.0181760532676256310
## cyl2Cyl: 4:factor(am)1 -3.733333   3.094784 -1.206331 0.2385525615801434851
## cyl2Cyl: 8:factor(am)1 -4.825000   3.094784 -1.559075 0.1310692573417492068

我想也许terms()对象会有用,但事实并非如此。我也可能会对术语的排序/编号做出一些假设,以获得预期的结果:

coef(summary(mod2))[5:6,]
##                         Estimate Std. Error   t value  Pr(>|t|)
## cyl2Cyl: 4:factor(am)1 -3.733333   3.094784 -1.206331 0.2385526
## cyl2Cyl: 8:factor(am)1 -4.825000   3.094784 -1.559075 0.1310693

但我不知道如何以一般方式这样做。

可以做些什么?

6 个答案:

答案 0 :(得分:4)

这似乎有点复杂,但我们可以只列举所有主要效果,然后采取设定差异吗?

mod2 <- lm(mpg ~ cyl2 * factor(am) + wt * disp, data = mtcars)
variables <- labels(mod2)[attr(terms(mod2), "order") == 1]
factors <- sapply(names(mod2$xlevels), function(x) paste0(x, mod2$xlevels[[x]])[-1])
setdiff(colnames(model.matrix(mod2)), c("(Intercept)", variables, unlist(factors)))
# [1] "cyl2Cyl: 4:factor(am)1" "cyl2Cyl: 8:factor(am)1" "wt:disp" 

答案 1 :(得分:2)

可能的单线解决方案是:

"assign"

这很麻烦*,但基本上它使用了模型矩阵的attr(model.matrix(mod2), "assign") ## [1] 0 1 1 2 3 3 属性:

which(colSums(attr(terms(mod2), "factors")[,attr(model.matrix(mod2), "assign")[-1L]]) == 2L)
## cyl2:factor(am) cyl2:factor(am) 
##               4               5

获取交互项系数的名称索引(off-by-one):

coef()

然后可用于从"assign"中提取相关行。这是因为系数的顺序由设计矩阵中的列的顺序决定,并且其== 2L属性将这些列映射回原始公式中的项。

它不是超级干净但似乎有用。

* PS。在该示例中,仅提取二阶交互项。将>= 2L更改为express-react-view或其他内容可以获得更高阶的互动。

答案 2 :(得分:2)

我构建了一些只做这个in coefplot的东西。

#' @title matchCoefs.default
#' @description Match coefficients to predictors
#' @details Matches coefficients to predictors using information from model matrices
#' @author Jared P. Lander
#' @aliases matchCoefs.default
#' @import reshape2
#' @param model Fitted model
#' @param \dots Further arguments
#' @return a data.frame matching predictors to coefficients
matchCoefs.default <- function(model, ...)
{
    # get the terms
    theTerms <- model$terms
    # get the assignment position
    #thePos <- model$assign
    thePos <- get.assign(model)
    # get intercept indicator
    inter <- attr(theTerms, "intercept")
    # get coef names
    coefNames <- names(stats::coef(model))
    # get pred names
    predNames <- attr(theTerms, "term.labels")
    # expand out pred names to match coefficient names
    predNames <- predNames[thePos]
    # if there's an intercept term add it to the pred names
    if(inter == 1)
    {
        predNames <- c("(Intercept)", predNames)
    }

    # build data.frame linking term to coefficient name
    matching <- data.frame(Term=predNames, Coefficient=coefNames, stringsAsFactors=FALSE)

    ## now match individual predictor to term
    # get matrix as data.frame
    factorMat <- as.data.frame(attr(theTerms, "factor"))
    # add column from rownames as identifier
    factorMat$.Pred <- rownames(factorMat)
    factorMat$.Type <- attr(theTerms, "dataClasses")

    # melt it down for comparison
    factorMelt <- melt(factorMat, id.vars=c(".Pred", ".Type"), variable.name="Term")
    factorMelt$Term <- as.character(factorMelt$Term)

    # only keep rows where there's a match
    factorMelt <- factorMelt[factorMelt$value == 1, ]

    # again, bring in coefficient if needed
    if(inter == 1)
    {
        factorMelt <- rbind(data.frame(.Pred="(Intercept)", .Type="(Intercept)", Term="(Intercept)", value=1, stringsAsFactors=FALSE), factorMelt)
    }

    # join into the matching data.frame
    matching <- join(matching, factorMelt, by="Term")

    return(matching)
}

然后你可以拨打matchCoefs.default(mod1)

答案 3 :(得分:1)

另一种方法(也有点复杂)是通过协变量名称,看看其他协变量的名称出现了多少次:

mtcars$cyl2 <- factor(mtcars$cyl, levels = c(4,6,8), labels = paste("Cyl:", unique(mtcars$cyl)))
mod2 <- lm(mpg ~ cyl2 * factor(am), data = mtcars)
coef(summary(mod2))
#>                         Estimate Std. Error   t value     Pr(>|t|)
#> (Intercept)            22.900000   1.750674 13.080673 6.057324e-13
#> cyl2Cyl: 4             -3.775000   2.315925 -1.630018 1.151546e-01
#> cyl2Cyl: 8             -7.850000   1.957314 -4.010599 4.547583e-04
#> factor(am)1             5.175000   2.052848  2.520888 1.817605e-02
#> cyl2Cyl: 4:factor(am)1 -3.733333   3.094784 -1.206331 2.385526e-01
#> cyl2Cyl: 8:factor(am)1 -4.825000   3.094784 -1.559075 1.310693e-01

# Essentially an ugly double-loop
check.inter <- function(coefs){
  which(lapply(coefs, function(co){
         sum(unlist(lapply(coefs, function(co2){
                    grepl(co2, co, fixed = T)})))})>1)}

check.inter(names(coefficients(mod2)))
#> [1] 5 6

coef(summary(mod2))[check.inter(names(coefficients(mod2))),]
#>                         Estimate Std. Error   t value  Pr(>|t|)
#> cyl2Cyl: 4:factor(am)1 -3.733333   3.094784 -1.206331 0.2385526
#> cyl2Cyl: 8:factor(am)1 -4.825000   3.094784 -1.559075 0.1310693

reprex package(v0.2.0)创建于2018-05-17。

答案 4 :(得分:1)

我会选择整洁的东西:

# Look for factor variables and replace ':' with '_'
clean_colons = function(x){
  x %>% as_tibble %>%
    mutate_if(is.factor, function(x){str_replace_all(x,":","_")}) %>% 
    return()
}

# Define the tricky cyl2 varible
mtcars$cyl2 = factor(mtcars$cyl, levels = c(4,6,8),
                     labels = paste0("Cyl:", unique(mtcars$cyl)))

# Create the model and output parameters
mtcars %>% clean_colons %>% lm(mpg ~ cyl2 * factor(am), data = .) %>% tidy %>%
  as_tibble
# A tibble: 6 x 5
  term                  estimate std.error statistic  p.value
  <chr>                    <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)              19.1       1.52    12.6   1.38e-12
2 cyl2Cyl_6                 3.77      2.32     1.63  1.15e- 1
3 cyl2Cyl_8                -4.08      1.75    -2.33  2.80e- 2
4 factor(am)1               1.44      2.32     0.623 5.39e- 1
5 cyl2Cyl_6:factor(am)1     3.73      3.09     1.21  2.39e- 1
6 cyl2Cyl_8:factor(am)1    -1.09      3.28    -0.333 7.42e- 1

# Now we no longer have the colon-in-variable-name problem, so we can do
mtcars %>% clean_colons %>% lm(mpg ~ cyl2 * factor(am), data = .) %>% tidy %>%
  as_tibble %>% filter(term %>% str_detect("\\w{1}:\\w{1}")) %>% pull(estimate)
[1]  3.733333 -1.091667

<强>更新

以上失败,如果冒号后面没有空格,可以这样解决:

    Set xmlDoc = CreateObject("Msxml2.DOMDocument")

Dim nodeBook
Dim nodeId
xmlDoc.async = False
xmlDoc.Load ("xmlfile url")
If (xmlDoc.parseError.errorCode <> 0) Then
   Dim myErr
   Set myErr = xmlDoc.parseError
   MsgBox ("You have error " & myErr.reason)
Else
   Set nodeBook = xmlDoc.selectSingleNode("//Program")
   Set nodeId = nodeBook.getAttributeNode("description")
   wscript.Echo nodeId.value
End If

答案 5 :(得分:1)

这也很复杂,但至少你可以明确地跟踪你在做什么:

# require package
require(Hmisc)

# load and process data
data(mtcars)
mtcars$cyl <- factor(mtcars$cyl)
mtcars$gear <- factor(mtcars$gear)

# fit model 
m <- lm(mpg ~ hp*wt + disp + cyl*gear, data=mtcars)

# extract all variables used in right hand side of the model 
variables <- as.character(attr(terms(m), "variables"))[-1L] 

# extract model coefficients 
s <- summary(m)$coeff
s

# discard unwanted rows corresponding to continuous predictors   
s1 <- s[rownames(s) %nin% variables[-1], ]
s1

# discard unwanted rows corresponding to categorical predictor gear
s2 <- s1[rownames(s1) %nin% paste0("gear",levels(mtcars$gear)), ]
s2

产出如下:

> s
            Estimate  Std. Error    t value     Pr(>|t|)
(Intercept) 43.80219399 5.977952455  7.3272905 4.404820e-07
hp          -0.11303405 0.040769042 -2.7725462 1.174817e-02
wt          -6.76725414 1.868428113 -3.6218970 1.699630e-03
disp        -0.00332244 0.014012039 -0.2371133 8.149808e-01
cyl6         2.87780626 3.180169670  0.9049222 3.762789e-01
cyl8         2.76416050 3.605957805  0.7665538 4.523007e-01
gear4        3.66877217 2.529082701  1.4506335 1.623857e-01
gear5        4.25400864 2.931068582  1.4513508 1.621881e-01
hp:wt        0.02401629 0.009953734  2.4127921 2.555079e-02
cyl6:gear4  -4.65994590 3.331261825 -1.3988531 1.771739e-01
cyl6:gear5  -3.86789967 4.400309909 -0.8790062 3.898369e-01
cyl8:gear5  -2.08842224 3.980538289 -0.5246582 6.055881e-01

> s1
           Estimate  Std. Error    t value     Pr(>|t|)
(Intercept) 43.80219399 5.977952455  7.3272905 4.404820e-07
cyl6         2.87780626 3.180169670  0.9049222 3.762789e-01
cyl8         2.76416050 3.605957805  0.7665538 4.523007e-01
gear4        3.66877217 2.529082701  1.4506335 1.623857e-01
gear5        4.25400864 2.931068582  1.4513508 1.621881e-01
hp:wt        0.02401629 0.009953734  2.4127921 2.555079e-02
cyl6:gear4  -4.65994590 3.331261825 -1.3988531 1.771739e-01
cyl6:gear5  -3.86789967 4.400309909 -0.8790062 3.898369e-01
cyl8:gear5  -2.08842224 3.980538289 -0.5246582 6.055881e-01

> s2
               Estimate  Std. Error    t value     Pr(>|t|)
(Intercept) 43.80219399 5.977952455  7.3272905 4.404820e-07
cyl6         2.87780626 3.180169670  0.9049222 3.762789e-01
cyl8         2.76416050 3.605957805  0.7665538 4.523007e-01
hp:wt        0.02401629 0.009953734  2.4127921 2.555079e-02
cyl6:gear4  -4.65994590 3.331261825 -1.3988531 1.771739e-01
cyl6:gear5  -3.86789967 4.400309909 -0.8790062 3.898369e-01
cyl8:gear5  -2.08842224 3.980538289 -0.5246582 6.055881e-01

可以循环所有分类预测变量以自动执行最后两个步骤。