优化样条回归中的自由度

时间:2018-08-18 19:18:27

标签: r regression spline gam mgcv

我有两个基因表达时程数据集:

首先,在4个组的14个时间点测量基因表达:

df1 <- structure(list(val = c(-0.1, -0.13, -0.4, -0.3, -0.3, -0.2, -0.24, 
                            0.1, 0.2, 0.13, 0, 0.63, 0.83, 0.85, -0.07, -0.07, -0.27, -0.2, 
                            -0.2, -0.1, 0.2, 0.1, 0.07, 0.17, 0.6, 0.75, 1.1, 1.1, -0.13, 
                            -0.15, -0.26, -0.25, -0.14, 0.04, 0.2, 0.24, 0.23, 0.2, 0.1, 
                            0.73, 1, 1.3, 0, 0.06, -0.24, -0.17, -0.17, -0.04, 0.16, 0.1, 
                            0.14, 0.27, 0.34, 0.9, 0.97, 1.04), 
                    time = c(-1, 0, 1, 1.58,2, 2.58, 3, 3.32, 3.58, 4.17, 4.58, 5.58, 6.17, 7.39, 
                             -1, 0, 1, 1.58, 2, 2.58, 3, 3.32, 3.58, 4.17, 4.58, 5.58, 6.17, 7.39, 
                             -1, 0, 1, 1.58, 2, 2.58, 3, 3.32, 3.58, 4.17, 4.58, 5.58, 6.17,7.39, 
                             -1, 0, 1, 1.58, 2, 2.58, 3, 3.32, 3.58, 4.17, 4.58, 5.58,6.17, 7.39), 
                    group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                                        2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,2L, 2L, 2L, 2L, 2L, 
                                        3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,3L, 3L, 3L, 
                                        4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,4L), 
                                      .Label = c("a", "b", "c", "d"), class = "factor")), .Names = c("val","time", "group"), 
               row.names = c(NA, -56L), class = "data.frame")


df1$group <- factor(df1$group,levels=c("a","b","c","d"))

如下所示(添加了loess平滑趋势线):

library(ggplot2)
ggplot(df1,aes(x=time,y=val,color=group))+geom_point()+theme_minimal()+geom_smooth(se=F)+theme(legend.position="top",legend.title=element_blank())

enter image description here

第二,在相似的14个时间点测量了基因表达,但现在从2个不同的组中分别以两个性别表示:

df2 <- structure(list(val = c(-0.23, -0.01, -0.14, -0.01, -0.21, -0.16, 
                       -0.24, -0.11, 0.02, -0.11, -0.01, -0.25, -0.47, -1.25, 0.02, 
                       -0.3, -0.02, 0.14, 0.25, -0.05, 0.15, 0.11, -0.24, -0.18, -0.39, 
                       -0.49, -0.5, -0.65, -0.06, 0.09, 0.1, 0.15, 0.08, 0.15, 0.4, 
                       0.24, 0.07, 0.08, -0.18, -0.35, -0.19, -0.81, -0.16, 0.29, -0.05, 
                       0.14, 0.14, 0.48, 0.34, 0.11, -0.07, -0.13, -0.41, -0.22, -0.54, 
                       -0.76, 0.35, 0.34, -0.06, 0.21, 0.14, 0.14, 0.25, 0.22, 0.25, 
                       0.16, 0.3, 0.44, 0.08, 0.48, 0.1, 0.16, -0.03, -0.22, 0.2, 0.01, 
                       -0.09, -0.02, -0.01, 0.06, -0.13, 0.19, 0.11, -0.04, -0.39, 0.03, 
                       -0.01, 0.09, 0.1, -0.14, -0.12, -0.1, 0.36, 0.08, 0.09, 0.09, 
                       0.42, 0.37, -0.14, 0.12, 0.09, 0.03, 0.06, -0.25, 0.2, -0.06, 
                       -0.44, 0.23, 0.03, 0.16, 0.81, 0.83),
               time = c(-1, 0, 1, 1.58,2, 2.58, 3, 3.32, 3.58, 4.17, 4.58, 5.58, 6.17, 7.39, 
                        -1, 0,1, 1.58, 2, 2.58, 3, 3.32, 3.58, 4.17, 4.58, 5.58, 6.17, 7.39, 
                        -1, 0, 1, 1.58, 2, 2.58, 3, 3.32, 3.58, 4.17, 4.58, 5.58, 6.17,7.39, 
                        -1, 0, 1, 1.58, 2, 2.58, 3, 3.32, 3.58, 4.17, 4.58, 5.58,6.17, 7.39, 
                        -1, 0, 1, 1.58, 2, 2.58, 3, 3.32, 3.58, 4.17, 4.58,5.58, 6.17, 7.39, 
                        -1, 0, 1, 1.58, 2, 2.58, 3, 3.32, 3.58, 4.17,4.58, 5.58, 6.17, 7.39, 
                        -1, 0, 1, 1.58, 2, 2.58, 3, 3.32, 3.58, 4.17, 4.58, 5.58, 6.17, 7.39, 
                        -1, 0, 1, 1.58, 2, 2.58, 3, 3.32, 3.58, 4.17, 4.58, 5.58, 6.17, 7.39), 
               sex = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                                 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
                                 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), 
                               .Label = c("F", "M"), class = "factor"), group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                                                                                            2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), 
                                                                                          .Label = c("a", "b"), class = "factor")), .Names = c("val", "time", "sex", "group"), row.names = c(NA, -112L), class = "data.frame")
df2$sex <- ordered(df2$sex,levels=c("M","F"))

df2$group <- ordered(df2$group,levels=c("a","b"))

df2$col <- factor(paste0(df2$group,":",df2$sex))

看起来像这样(添加一条黄土平滑趋势线):

ggplot(df2,aes(x=time,y=val,color=col))+geom_point()+theme_minimal()+geom_smooth(se=F)+theme(legend.position="top",legend.title=element_blank())

enter image description here

对于df1,我想估算timeval的影响,调整为group

对于df2,我想估算time:groupval的影响,调整为sex

查看我认为使用spline regression的数据是合适的,所以我使用了gam包中的mgcv函数,据我所知它可以优化符合数据的spline的自由度。

这就是我适合df1的条件:

mgcv1.fit <- mgcv::gam(val ~ group+s(time),data=df1)

哪个给:

Family: gaussian 
Link function: identity 

Formula:
val ~ group + s(time)

Estimated degrees of freedom:
7.18  total = 11.18 

GCV score: 0.01258176     

但是对于这些数据,7.18自由度似乎太多了。

对于df2

mgcv2.fit <- mgcv::gam(val ~ sex+s(time,by=group),data=df2)

给出:

Family: gaussian 
Link function: identity 

Formula:
val ~ sex + s(time, by = group)

Estimated degrees of freedom:
1.72  total = 3.72 

GCV score: 0.08522094     

我猜想在这种情况下,我可以想象自由度会更高。

还有一点。绘制这两个数据集的拟合值:

df1$mgcv <- mgcv1.fit$fitted.values
ggplot(df1,aes(x=time,y=mgcv,color=group))+geom_point()+theme_minimal()+geom_smooth(se=F)+theme(legend.position="top",legend.title=element_blank())

enter image description here

看起来不错。

但是对于df2

df2$mgcv <- mgcv2.fit$fitted.values
ggplot(df2,aes(x=time,y=mgcv,color=col))+geom_point()+theme_minimal()+geom_smooth(se=F)+theme(legend.position="top",legend.title=element_blank())

enter image description here

好像翻转了组标签。

所以我的问题是:

  1. 我是否正确使用mgcv::gam来优化我的问题的样条自由度?
  2. mgcv是否对其fitted.values中的样本重新排序?

1 个答案:

答案 0 :(得分:4)

首先,mgcv在要素水平上做正确的事情。如果您选中str(df2$sex),您将看到“ M”(男性)是第一级,而“ F”(女性)是第二级。但是从str(df2$col)看来,“ F”是第一个,因此在绘制情节时会出现标签错误的情况。

第二,您的第二个模型未正确指定。

  1. 如果没有“ by”变量,或者“ by”是一个因数,则样条线s(time)受居中约束。因此,您可以在模型公式中将“ by”变量group作为单独的术语提供,以捕获其边际效应;
  2. 由于“ by”变量group是有序变量,因此mgcv在其上施加对比度,从而在构造s(time, by = group)时删除了第一级“ a”。因此,您需要提供一个单独的s(time)作为基准平滑度。

您当前的mgcv2.fit是一个相当差的模型(不足为奇),说明的偏差为9%。但是,如果执行以下操作,您将获得64%的收益。

gam(val ~ sex + s(time) + group + s(time, by = group), data = df2, method = "REML")

ggplot现在看起来正确(我没有更改df2$col,所以颜色可能仍然相反)。

gam默认使用“ GCV.Cp”作为平滑参数选择方法。但建议使用“ REML”,因为它不太容易过拟合。


备注1

如果“ by”变量group是一个(无序的)因子,则它不会受到对比度的影响。因此,模型公式应为:

val ~ sex + group + s(time, by = group)

以下引自?gam.models'by'变量部分:

 If a ‘by’ variable is a ‘factor’ then it generates an indicator
 vector for each level of the factor, unless it is an ‘ordered’
 factor. In the non-ordered case, the model matrix for the smooth
 term is then replicated for each factor level, and each copy has
 its rows multiplied by the corresponding rows of its indicator
 variable. The smoothness penalties are also duplicated for each
 factor level.  In short a different smooth is generated for each
 factor level (the ‘id’ argument to ‘s’ and ‘te’ can be used to
 force all such smooths to have the same smoothing parameter).
 ‘ordered’ ‘by’ variables are handled in the same way, except that
 no smooth is generated for the first level of the ordered factor
 (see ‘b3’ example below).  This is useful for setting up
 identifiable models when the same smooth occurs more than once in
 a model, with different factor ‘by’ variables.

备注2

我不是要判断您的模型,但是“ F”和“ M”之间似乎存在明显的组内差异。从您的数据中可以看出,“ b”组中的“ F”和“ M”比“ a”组中的差异更大。目前,sex的效果在两个组中都是相同的,并且只是一个垂直移位。您可以在此答案的上方ggplot中观察到这一点。最终由您决定模型,但是如果您想对此sex-group交互进行建模,则可以

df2$sex_group <- with(df2, interaction(sex, group))  ## the new variable is unordered
test <- gam(val ~ sex + group + s(time, by = sex_group), data = df2, method = "REML")

请注意,我如何向by提供两个因子变量。将创建一个辅助变量sex_group

相关问题