Different lm results for the same dataset written in two different languages (English and Korean)

Time: 2015-01-30 02:11:52

Tags: r lm

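I applied the lm function to two versions of the same dataset (numeric variables plus categorical variables), one with the categorical labels written in English and the other with them written in Korean, and the results differ. Apart from the labels of the categorical variables, the data are exactly the same. What could explain the difference in the results?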

# data
df3 <- repmis::source_DropboxData("df3_v0.1.csv","gg30a74n4ew3zzg",header = TRUE)

# model fitted with the Korean labels
out1<-lm(YD~SANJI+TAmin8+TMINup18do6+typ_rain6+DTD9,data=df3)
summary(out1)

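# model fitted with the English labels: recode the Korean labels first
# (this in-place recoding works only if SANJI and SANJI2 are character columns, not factors)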
df3$SANJI[df3$SANJI=="전북"]<-"JB"
df3$SANJI[df3$SANJI=="충북"]<-"CHB"
df3$SANJI[df3$SANJI=="경북"]<-"KB"
df3$SANJI[df3$SANJI=="전남"]<-"JN"
df3$SANJI2[df3$SANJI2=="고창"]<-"Gochang"
df3$SANJI2[df3$SANJI2=="괴산"]<-"Goesan"
df3$SANJI2[df3$SANJI2=="단양"]<-"Danyang"
df3$SANJI2[df3$SANJI2=="봉화"]<-"Fenghua"
df3$SANJI2[df3$SANJI2=="신안"]<-"Sinan"
df3$SANJI2[df3$SANJI2=="안동"]<-"Andong"
df3$SANJI2[df3$SANJI2=="영광"]<-"younggang"
df3$SANJI2[df3$SANJI2=="영양"]<-"youngyang"
df3$SANJI2[df3$SANJI2=="영주"]<-"youngju"
df3$SANJI2[df3$SANJI2=="예천"]<-"Yecheon"
df3$SANJI2[df3$SANJI2=="의성"]<-"Yusaeng"
df3$SANJI2[df3$SANJI2=="제천"]<-"Jechon"
df3$SANJI2[df3$SANJI2=="진안"]<-"Jinan"
df3$SANJI2[df3$SANJI2=="청송"]<-"Changsong"
df3$SANJI2[df3$SANJI2=="해남"]<-"Haenam"
out2<-lm(YD~SANJI+TAmin8+TMINup18do6+typ_rain6+DTD9,data=df3)
summary(out2)

# output of summary(out1), the model with the Korean labels
#Call:
#lm(formula = YD ~ SANJI + TAmin8 + TMINup18do6 + typ_rain6 + 
#    DTD9, data = df3)

#Residuals:
#    Min      1Q  Median      3Q     Max 
#-98.836 -23.173  -2.261  22.626 111.367 

#Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
#(Intercept) 970.33251   84.12479  11.534  < 2e-16 ***
#SANJI전남   -33.75664   12.53277  -2.693 0.008158 ** 
#SANJI전북   -44.17939   11.22274  -3.937 0.000144 ***
#SANJI충북   -44.09285    9.16736  -4.810 4.74e-06 ***
#TAmin8      -25.56618    3.36053  -7.608 9.37e-12 ***
#TMINup18do6   4.58052    0.96528   4.745 6.19e-06 ***
#typ_rain6    -0.19754    0.02862  -6.903 3.23e-10 ***
#DTD9        -16.15975    2.65128  -6.095 1.59e-08 ***
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

#Residual standard error: 37.2 on 112 degrees of freedom
#Multiple R-squared:   0.58,    Adjusted R-squared:  0.5538 
#F-statistic:  22.1 on 7 and 112 DF,  p-value: < 2.2e-16


# output of summary(out2), the model with the English labels
#Call:
#lm(formula = YD ~ SANJI + TAmin8 + TMINup18do6 + typ_rain6 + 
#    DTD9, data = df3)

#Residuals:
#    Min      1Q  Median      3Q     Max 
#-98.836 -23.173  -2.261  22.626 111.367 

#Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
#(Intercept) 926.23966   84.32621  10.984  < 2e-16 ***
#SANJIJB      -0.08654   12.32752  -0.007    0.994    
#SANJIJN      10.33620   13.09434   0.789    0.432    
#SANJIKB      44.09285    9.16736   4.810 4.74e-06 ***
#TAmin8      -25.56618    3.36053  -7.608 9.37e-12 ***
#TMINup18do6   4.58052    0.96528   4.745 6.19e-06 ***
#typ_rain6    -0.19754    0.02862  -6.903 3.23e-10 ***
#DTD9        -16.15975    2.65128  -6.095 1.59e-08 ***
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

#Residual standard error: 37.2 on 112 degrees of freedom
#Multiple R-squared:   0.58,    Adjusted R-squared:  0.5538 
#F-statistic:  22.1 on 7 and 112 DF,  p-value: < 2.2e-16

1 answer:

Answer 0 (score: 7)

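Your overall model fit is the same in both cases (identical residuals, residual standard error, R-squared and F-statistic); you just have a different reference class for your factor (SANJI). With the Korean labels the reference level is 경북 (KB), while with the English labels it is CHB (충북). A different reference level also changes your intercept, but it does not change the estimates for your continuous covariates.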
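As a quick check, the English-label coefficients can be recovered from the Korean-label ones by moving the baseline from 경북 to 충북. The sketch below only does arithmetic on the numbers printed in the question; the vector names (`korean`, `english`, `rebased`) are made up for illustration.

```r
# Coefficients copied from the two summaries in the question.
korean  <- c(intercept = 970.33251, jeonnam = -33.75664,
             jeonbuk = -44.17939, chungbuk = -44.09285)
english <- c(intercept = 926.23966, JB = -0.08654, JN = 10.33620, KB = 44.09285)

# Korean fit: reference level is 경북 (KB). English fit: reference level is CHB (충북).
# Changing the baseline shifts the intercept by the 충북 coefficient and
# subtracts that same coefficient from every other level's contrast.
shift <- korean[["chungbuk"]]
rebased <- c(
  intercept = korean[["intercept"]] + shift,
  JB = korean[["jeonbuk"]] - shift,
  JN = korean[["jeonnam"]] - shift,
  KB = 0 - shift                      # 경북 was the old baseline (coefficient 0)
)

# Agreement up to the rounding of the printed output.
max(abs(rebased - english)) < 1e-4
```

The two coefficient tables therefore describe the same fitted model, just measured against a different baseline region.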

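You can use relevel() to force a particular reference class (assuming SANJI is already a factor), or create the factor explicitly with factor() and its levels= argument. Otherwise the levels are sorted alphabetically by default, and they may not sort the same way in different languages.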
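A minimal sketch of both options, run on simulated data rather than the asker's df3 (the data frame `toy`, its column `x`, and the columns `SANJI_a` and `SANJI_b` are hypothetical stand-ins):

```r
set.seed(1)

# Hypothetical stand-in for df3: one region label and one numeric covariate.
toy <- data.frame(
  SANJI = rep(c("CHB", "JB", "JN", "KB"), each = 10),
  x = rnorm(40),
  stringsAsFactors = FALSE
)
toy$YD <- 5 + 2 * toy$x + rep(c(0, 1, 2, 3), each = 10) + rnorm(40)

# Option 1: spell out the level order; the first level becomes the reference.
toy$SANJI_a <- factor(toy$SANJI, levels = c("KB", "JN", "JB", "CHB"))

# Option 2: keep the default order but move one level to the front.
toy$SANJI_b <- relevel(factor(toy$SANJI), ref = "KB")

fit_default <- lm(YD ~ SANJI + x, data = toy)    # reference: CHB (sorted first)
fit_a       <- lm(YD ~ SANJI_a + x, data = toy)  # reference: KB
fit_b       <- lm(YD ~ SANJI_b + x, data = toy)  # reference: KB

# Same fitted values and same slope for x, whatever the reference level is.
all.equal(fitted(fit_default), fitted(fit_a))
all.equal(coef(fit_default)[["x"]], coef(fit_a)[["x"]])

# Both options give the same intercept because they share the reference level.
all.equal(coef(fit_a)[["(Intercept)"]], coef(fit_b)[["(Intercept)"]])
```

Fixing the levels explicitly, as in option 1, also makes the result independent of the locale's sort order, which is what differs between the Korean and English labels.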