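How do I run a regression analysis on panel data in R?

Time: 2018-08-12 02:10:38

Tags: r dataframe regression analysis panel-data

So I'm an R newbie; it has been over a year since I last used R, and I seem to have forgotten a lot... :(

I have a panel dataset with observations from different countries for 2005, 2010 and 2015, which looks like this: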

  Location Year Health_Spending Total NCD Deaths_male Total NCD Deaths_female
1      CAN 2005        3282.454                 101.4                   102.7
2      CAN 2010        4225.189                 105.5                   107.3
3      CAN 2015        4632.837                 109.2                   110.2
4      ESP 2005        2126.553                 179.9                   170.4
5      ESP 2010        2882.912                 180.6                   170.6
6      ESP 2015        3175.457                 183.1                   180.8

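I am trying to run a regression with Health_Spending as Y and Total NCD Deaths_male and Total NCD Deaths_female as X1 and X2.

I have been looking around, and it seems the plm package is often used to analyse panel data in R, but I am having trouble figuring out how to use it.

Could some kind soul help me and point me to what I should do?

(Here is the dput version of my data, just in case)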

    structure(list(Location = c("CAN", "CAN", "CAN", "ESP", "ESP", 
"ESP", "GBR", "GBR", "GBR", "ISR", "ISR", "ISR", "JPN", "JPN", 
"JPN", "KOR", "KOR", "KOR", "MEX", "MEX", "MEX", "NLD", "NLD", 
"NLD", "NOR", "NOR", "NOR", "POL", "POL", "POL", "TUR", "TUR", 
"TUR", "USA", "USA", "USA"), Year = c(2005L, 2010L, 2015L, 2005L, 
2010L, 2015L, 2005L, 2010L, 2015L, 2005L, 2010L, 2015L, 2005L, 
2010L, 2015L, 2005L, 2010L, 2015L, 2005L, 2010L, 2015L, 2005L, 
2010L, 2015L, 2005L, 2010L, 2015L, 2005L, 2010L, 2015L, 2005L, 
2010L, 2015L, 2005L, 2010L, 2015L), Health_Spending = c(3282.454, 
4225.189, 4632.837, 2126.553, 2882.912, 3175.457, 2331.136, 3040.114, 
4071.806, 1768.952, 2032.725, 2646.915, 2463.725, 3205.216, 4428.349, 
1183.438, 1895.699, 2481.587, 730.816, 911.351, 1037.424, 3454.707, 
4633.738, 5148.399, 3980.768, 5162.669, 6239.435, 806.974, 1352.424, 
1687.009, 582.888, 871.677, 1028.911, 6443.02, 7939.798, 9491.4
), `Total NCD Deaths_male` = c("101.4", "105.5", "109.2", "179.9", 
"180.6", "183.1", "245.8", "242.0", "249.0", "16.7", "16.8", 
"18.0", "460.3", "503.7", "543.2", "105.7", "110.2", "118.3", 
"194.7", "230.7", "257.5", "58.9", "58.6", "63.2", "17.4", "17.5", 
"17.1", "172.7", "175.1", "175.9", "185.3", "197.4", "211.8", 
"1024.9", "1061.6", "1159.5"), `Total NCD Deaths_female` = c("102.7", 
"107.3", "110.2", "170.4", "170.6", "180.8", "268.2", "259.0", 
"264.1", "17.5", "17.4", "18.7", "405.0", "458.9", "528.4", "92.9", 
"93.3", "102.2", "181.4", "214.2", "235.5", "62.1", "62.6", "67.7", 
"18.4", "18.8", "18.2", "163.1", "168.6", "174.6", "150.3", "162.6", 
"181.0", "1111.6", "1115.5", "1183.4")), .Names = c("Location", 
"Year", "Health_Spending", "Total NCD Deaths_male", "Total NCD Deaths_female"
), class = "data.frame", row.names = c(NA, -36L))

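1 Answer:

Answer 0 (score: 0)

I assume you want a standard multiple regression approach. You can do that easily with lm(Health_Spending~Total.NCD.Deaths_male + Total.NCD.Deaths_female + Location, data = df). Just make sure that the variables Total NCD Deaths_male and Total NCD Deaths_female are of type numeric and that the variable Location is categorical (a factor).

The following snippet shows how to change the data types, build the model and report the model results.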


# Data
df <- data.frame(structure(list(Location = c("CAN", "CAN", "CAN", "ESP", "ESP", 
                "ESP", "GBR", "GBR", "GBR", "ISR", "ISR", "ISR", "JPN", "JPN", 
                "JPN", "KOR", "KOR", "KOR", "MEX", "MEX", "MEX", "NLD", "NLD", 
                "NLD", "NOR", "NOR", "NOR", "POL", "POL", "POL", "TUR", "TUR", 
                "TUR", "USA", "USA", "USA"), Year = c(2005L, 2010L, 2015L, 2005L, 
                2010L, 2015L, 2005L, 2010L, 2015L, 2005L, 2010L, 2015L, 2005L, 
                2010L, 2015L, 2005L, 2010L, 2015L, 2005L, 2010L, 2015L, 2005L, 
                2010L, 2015L, 2005L, 2010L, 2015L, 2005L, 2010L, 2015L, 2005L, 
                2010L, 2015L, 2005L, 2010L, 2015L), Health_Spending = c(3282.454, 
                4225.189, 4632.837, 2126.553, 2882.912, 3175.457, 2331.136, 3040.114, 
                4071.806, 1768.952, 2032.725, 2646.915, 2463.725, 3205.216, 4428.349, 
                1183.438, 1895.699, 2481.587, 730.816, 911.351, 1037.424, 3454.707, 
                4633.738, 5148.399, 3980.768, 5162.669, 6239.435, 806.974, 1352.424, 
                1687.009, 582.888, 871.677, 1028.911, 6443.02, 7939.798, 9491.4
                ), `Total NCD Deaths_male` = c("101.4", "105.5", "109.2", "179.9", 
                "180.6", "183.1", "245.8", "242.0", "249.0", "16.7", "16.8", 
                "18.0", "460.3", "503.7", "543.2", "105.7", "110.2", "118.3", 
                "194.7", "230.7", "257.5", "58.9", "58.6", "63.2", "17.4", "17.5", 
                "17.1", "172.7", "175.1", "175.9", "185.3", "197.4", "211.8", 
                "1024.9", "1061.6", "1159.5"), `Total NCD Deaths_female` = c("102.7", 
                "107.3", "110.2", "170.4", "170.6", "180.8", "268.2", "259.0", 
                "264.1", "17.5", "17.4", "18.7", "405.0", "458.9", "528.4", "92.9", 
                "93.3", "102.2", "181.4", "214.2", "235.5", "62.1", "62.6", "67.7", 
                "18.4", "18.8", "18.2", "163.1", "168.6", "174.6", "150.3", "162.6", 
                "181.0", "1111.6", "1115.5", "1183.4")), .Names = c("Location", 
                "Year", "Health_Spending", "Total NCD Deaths_male", "Total NCD Deaths_female"
                ), class = "data.frame", row.names = c(NA, -36L)))


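# Data transformation
# Note: data.frame() above converts the column names to syntactic ones
# (spaces become dots), hence Total.NCD.Deaths_male etc. below.
df$Health_Spending <- as.numeric(df$Health_Spending)
df$Location <- as.factor(df$Location)
# as.character() guards against a factor column, where as.numeric() alone
# would return the internal level codes instead of the values.
df$Total.NCD.Deaths_male <- as.numeric(as.character(df$Total.NCD.Deaths_male))
df$Total.NCD.Deaths_female <- as.numeric(as.character(df$Total.NCD.Deaths_female))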

# Model and model summary
m <- lm(Health_Spending~Total.NCD.Deaths_male + Total.NCD.Deaths_female + Location, data = df)
summary(m)

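In the summary you will find every location except CAN (Canada) listed as an explanatory factor variable. This is because CAN, the first factor level in alphabetical order, has been automatically selected as the reference level against which all other locations are compared. In the model summary you can see that Total.NCD.Deaths_female is not significant, while Total.NCD.Deaths_male is marginally significant at the 10% level (indicated by '.').

A few words of caution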
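If the reference level matters for how you want to read the coefficients, you can change it. Here is a minimal sketch, assuming a small subset of the question's data (CAN, ESP and GBR only, to keep it short) and the choice of ESP as the new reference, which is just for illustration:

```r
# Subset of the question's data (three countries), with syntactic column names
df <- data.frame(
  Location = factor(rep(c("CAN", "ESP", "GBR"), each = 3)),
  Year = rep(c(2005L, 2010L, 2015L), times = 3),
  Health_Spending = c(3282.454, 4225.189, 4632.837,
                      2126.553, 2882.912, 3175.457,
                      2331.136, 3040.114, 4071.806),
  Total.NCD.Deaths_male = c(101.4, 105.5, 109.2, 179.9, 180.6, 183.1,
                            245.8, 242.0, 249.0),
  Total.NCD.Deaths_female = c(102.7, 107.3, 110.2, 170.4, 170.6, 180.8,
                              268.2, 259.0, 264.1)
)

# The first factor level is used as the reference by default
ref_default <- levels(df$Location)[1]
m <- lm(Health_Spending ~ Total.NCD.Deaths_male + Total.NCD.Deaths_female + Location,
        data = df)
names(coef(m))   # dummies for every level except the reference

# Make ESP the reference instead (illustrative choice) and refit
df2 <- df
df2$Location <- relevel(df2$Location, ref = "ESP")
m2 <- lm(Health_Spending ~ Total.NCD.Deaths_male + Total.NCD.Deaths_female + Location,
         data = df2)
names(coef(m2))
```

Changing the reference only changes which country the dummies are compared against; the fitted values of the model stay the same.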


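Before building a model you should always look at the structure of your data. If you decide to drop the Location variable from the model, you get very different results and might even conclude that both Total.NCD.Deaths_male and Total.NCD.Deaths_female are highly significant: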

Call:
lm(formula = Health_Spending ~ Total.NCD.Deaths_male + Total.NCD.Deaths_female, 
    data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-2319.8 -1176.5    56.6   943.0  3535.2 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)              2615.17     362.78   7.209 2.89e-08 ***
Total.NCD.Deaths_male     -36.39      11.83  -3.077  0.00418 ** 
Total.NCD.Deaths_female    39.08      11.34   3.447  0.00156 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1509 on 33 degrees of freedom
Multiple R-squared:  0.5136,    Adjusted R-squared:  0.4841 
F-statistic: 17.42 on 2 and 33 DF,  p-value: 6.849e-06

However, because of the structure of the dataset, that would be a misleading conclusion.
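One way to check whether the country effects can be dropped is to compare the two nested models with an F-test. The sketch below is only an illustration: it rebuilds the question's data with numeric columns and syntactic names, and the object names m_pooled, m_fe and cmp are my own.

```r
# Data from the question, rebuilt with numeric columns and syntactic names
df <- data.frame(
  Location = factor(rep(c("CAN", "ESP", "GBR", "ISR", "JPN", "KOR",
                          "MEX", "NLD", "NOR", "POL", "TUR", "USA"), each = 3)),
  Year = rep(c(2005L, 2010L, 2015L), times = 12),
  Health_Spending = c(3282.454, 4225.189, 4632.837, 2126.553, 2882.912, 3175.457,
                      2331.136, 3040.114, 4071.806, 1768.952, 2032.725, 2646.915,
                      2463.725, 3205.216, 4428.349, 1183.438, 1895.699, 2481.587,
                      730.816, 911.351, 1037.424, 3454.707, 4633.738, 5148.399,
                      3980.768, 5162.669, 6239.435, 806.974, 1352.424, 1687.009,
                      582.888, 871.677, 1028.911, 6443.02, 7939.798, 9491.4),
  Total.NCD.Deaths_male = c(101.4, 105.5, 109.2, 179.9, 180.6, 183.1,
                            245.8, 242.0, 249.0, 16.7, 16.8, 18.0,
                            460.3, 503.7, 543.2, 105.7, 110.2, 118.3,
                            194.7, 230.7, 257.5, 58.9, 58.6, 63.2,
                            17.4, 17.5, 17.1, 172.7, 175.1, 175.9,
                            185.3, 197.4, 211.8, 1024.9, 1061.6, 1159.5),
  Total.NCD.Deaths_female = c(102.7, 107.3, 110.2, 170.4, 170.6, 180.8,
                              268.2, 259.0, 264.1, 17.5, 17.4, 18.7,
                              405.0, 458.9, 528.4, 92.9, 93.3, 102.2,
                              181.4, 214.2, 235.5, 62.1, 62.6, 67.7,
                              18.4, 18.8, 18.2, 163.1, 168.6, 174.6,
                              150.3, 162.6, 181.0, 1111.6, 1115.5, 1183.4)
)

# Pooled model: ignores which country each row belongs to
m_pooled <- lm(Health_Spending ~ Total.NCD.Deaths_male + Total.NCD.Deaths_female,
               data = df)

# Model with one dummy per country (Location as a factor)
m_fe <- lm(Health_Spending ~ Total.NCD.Deaths_male + Total.NCD.Deaths_female + Location,
           data = df)

# F-test of the 11 country dummies jointly
cmp <- anova(m_pooled, m_fe)
print(cmp)
```

A small p-value in the second row would suggest that the country dummies explain a lot of the variation and that the pooled model leaves out something important.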

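As the data show, every location occurs multiple times, and so does every Year. Unless you subset the data, the simpler model m <- lm(Health_Spending~Total.NCD.Deaths_male + Total.NCD.Deaths_female, data = df) will not take this into account. Using Location as a factor variable remedies this to some extent, but you should also consider including Year as an explanatory variable of type numeric or categorical, or accounting for the time element in some other way, perhaps by splitting the dataset into different time periods.

I hope this is what you were looking for. Don't hesitate to let me know if not.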
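To illustrate the Year suggestion, here is a rough sketch that adds Year as a categorical variable next to Location. The data are rebuilt from the question with numeric columns and syntactic names; the names m_time and m_plm are my own. The last part is optional and assumes the plm package mentioned in the question is installed; it fits what plm calls a two-way "within" model on the same data.

```r
# Data from the question, rebuilt with numeric columns and syntactic names
df <- data.frame(
  Location = factor(rep(c("CAN", "ESP", "GBR", "ISR", "JPN", "KOR",
                          "MEX", "NLD", "NOR", "POL", "TUR", "USA"), each = 3)),
  Year = rep(c(2005L, 2010L, 2015L), times = 12),
  Health_Spending = c(3282.454, 4225.189, 4632.837, 2126.553, 2882.912, 3175.457,
                      2331.136, 3040.114, 4071.806, 1768.952, 2032.725, 2646.915,
                      2463.725, 3205.216, 4428.349, 1183.438, 1895.699, 2481.587,
                      730.816, 911.351, 1037.424, 3454.707, 4633.738, 5148.399,
                      3980.768, 5162.669, 6239.435, 806.974, 1352.424, 1687.009,
                      582.888, 871.677, 1028.911, 6443.02, 7939.798, 9491.4),
  Total.NCD.Deaths_male = c(101.4, 105.5, 109.2, 179.9, 180.6, 183.1,
                            245.8, 242.0, 249.0, 16.7, 16.8, 18.0,
                            460.3, 503.7, 543.2, 105.7, 110.2, 118.3,
                            194.7, 230.7, 257.5, 58.9, 58.6, 63.2,
                            17.4, 17.5, 17.1, 172.7, 175.1, 175.9,
                            185.3, 197.4, 211.8, 1024.9, 1061.6, 1159.5),
  Total.NCD.Deaths_female = c(102.7, 107.3, 110.2, 170.4, 170.6, 180.8,
                              268.2, 259.0, 264.1, 17.5, 17.4, 18.7,
                              405.0, 458.9, 528.4, 92.9, 93.3, 102.2,
                              181.4, 214.2, 235.5, 62.1, 62.6, 67.7,
                              18.4, 18.8, 18.2, 163.1, 168.6, 174.6,
                              150.3, 162.6, 181.0, 1111.6, 1115.5, 1183.4)
)

# Country dummies plus year dummies (Year treated as categorical)
m_time <- lm(Health_Spending ~ Total.NCD.Deaths_male + Total.NCD.Deaths_female +
               Location + factor(Year), data = df)
print(summary(m_time))

# Optional: the same idea with plm, only run if the package is installed
if (requireNamespace("plm", quietly = TRUE)) {
  m_plm <- plm::plm(Health_Spending ~ Total.NCD.Deaths_male + Total.NCD.Deaths_female,
                    data = df, index = c("Location", "Year"),
                    model = "within", effect = "twoways")
  print(summary(m_plm))
}
```

With only three time points per country, treating Year as a factor costs just two extra coefficients. On a balanced panel like this one, I would expect the plm coefficients for the two death variables to agree with those from m_time, so the choice between the two is mostly about convenience and the reporting you prefer.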