文章标题
2016-08-18 19:14
573 查看
回归分析
逻辑回归
案例一:婚外情研究
利用AER包中的自带数据集Affairs
install.packages(“AER”)library(AER)
data(Affairs,package=”AER”)
summary(Affairs)
table(Affairsgender)table(Affairschildren)
table(Affairs$affairs)
将目标变量转化为二值型结果
Affairsynaffairs[Affairsaffairs>0]<-1Affairsynaffairs[Affairsaffairs==0]<-0
Affairsynaffair<−factor(Affairsynaffair,levels = c(0,1),labels = c(“No”,”Yes”))
table(Affairs$ynaffair)
逻辑回归
fit.full<-glm(ynaffair~gender+age+yearsmarried+children+religiousness+education+occupation+rating,data = Affairs,family = binomial())summary(fit.full)
Call:
glm(formula = ynaffair ~ gender + age + yearsmarried + children +
religiousness + education + occupation + rating, family = binomial(),
data = Affairs)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5713 -0.7499 -0.5690 -0.2539 2.5191
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.37726 0.88776 1.551 0.120807
gendermale 0.28029 0.23909 1.172 0.241083
age -0.04426 0.01825 -2.425 0.015301 *
yearsmarried 0.09477 0.03221 2.942 0.003262 **
childrenyes 0.39767 0.29151 1.364 0.172508
religiousness -0.32472 0.08975 -3.618 0.000297 *
education 0.02105 0.05051 0.417 0.676851
occupation 0.03092 0.07178 0.431 0.666630
rating -0.46845 0.09091 -5.153 2.56e-07 *
Signif. codes: 0 ‘’ 0.001 ‘’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1(Dispersion parameter for binomial family taken to be 1)
Null deviance: 675.38 on 600 degrees of freedom
Residual deviance: 609.51 on 592 degrees of freedom
AIC: 627.51
Number of Fisher Scoring iterations: 4
剔除无关变量,重新拟合
fit.reduced<-glm(ynaffair~age+yearsmarried+religiousness+rating,data = Affairs,family = binomial())summary(fit.reduced)
Call:
glm(formula = ynaffair ~ age + yearsmarried + religiousness +
rating, family = binomial(), data = Affairs)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6278 -0.7550 -0.5701 -0.2624 2.3998
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.93083 0.61032 3.164 0.001558 **
age -0.03527 0.01736 -2.032 0.042127 *
yearsmarried 0.10062 0.02921 3.445 0.000571 *
religiousness -0.32902 0.08945 -3.678 0.000235 *
rating -0.46136 0.08884 -5.193 2.06e-07 *
Signif. codes: 0 ‘’ 0.001 ‘’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1(Dispersion parameter for binomial family taken to be 1)
Null deviance: 675.38 on 600 degrees of freedom
Residual deviance: 615.36 on 596 degrees of freedom
AIC: 625.36
Number of Fisher Scoring iterations: 4
用卡方检验比较两个嵌套的模型
anova(fit.reduced,fit.full,test = “Chisq”)Analysis of Deviance Table
Model 1: ynaffair ~ age + yearsmarried + religiousness + rating
Model 2: ynaffair ~ gender + age + yearsmarried + children + religiousness +
education + occupation + rating
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 596 615.36
2 592 609.51 4 5.8474 0.2108
解析卡方检验的结果
结果显示卡方检验的结果为不显著(p=0.21),表名四个变量的模型与原来的九个完整变量的模型拟合得一样好。可以用更简单的模型进行解释。
解释模型参数
提取模型系数
coef(fit.reduced)(Intercept) age yearsmarried religiousness rating
1.93083017 -0.03527112 0.10062274 -0.32902386 -0.46136144
对数优势比解释性差,用指数
exp(coef(fit.reduced))(Intercept) age yearsmarried religiousness rating
6.8952321 0.9653437 1.1058594 0.7196258 0.6304248
随着婚龄的增加,年龄、宗教信仰、婚姻评分的降低,婚外情的优势比将上升。
评价预测变量对结果概率的影响
用控制变量法
例如
testdata<-data.frame(rating=c(1:5),age=mean(Affairsage),yearsmarried=mean(Affairsyearsmarried),religiousness=mean(Affairs$religiousness))testdata
使用测试数据集预测相应的概率
testdata$prob<-predict(fit.reduced,newdata = testdata,type = “response”)testdata
rating age yearsmarried religiousness prob
1 1 32.48752 8.177696 3.116473 0.5302296
2 2 32.48752 8.177696 3.116473 0.4157377
3 3 32.48752 8.177696 3.116473 0.3096712
4 4 32.48752 8.177696 3.116473 0.2204547
5 5 32.48752 8.177696 3.116473 0.1513079
可以看到随着婚姻评分的由1到5增加,婚外情的发生率由0.53降低到0.15
看年龄的影响
testdata<-data.frame(rating=mean(Affairsrating),age=seq(17,57,10),yearsmarried=mean(Affairsyearsmarried),religiousness=mean(Affairs$religiousness))testdata
使用测试数据集预测相应的概率
testdata$prob<-predict(fit.reduced,newdata = testdata,type = “response”)testdata
rating age yearsmarried religiousness prob
1 3.93178 17 8.177696 3.116473 0.3350834
2 3.93178 27 8.177696 3.116473 0.2615373
3 3.93178 37 8.177696 3.116473 0.1992953
4 3.93178 47 8.177696 3.116473 0.1488796
5 3.93178 57 8.177696 3.116473 0.1094738
检测过渡离势
所谓过度离势,即观测到的响应变量的方差大于期望的二项分布的方差。过度离势会导致奇异的标准误检验和不精确的显著性检验,当出现过度离势时,仍可使用glm()函数拟合Logistic回归,但此时需要将二项分布必为类二项分布(quasibinomial distribution)
检测过度离势的一种方法是比较二项分布模型的残差偏差与自由度,若比值:r=残差偏差/残差自由度 ,比1大很多,便认为是存在过度离势。
还可对过度离势进行检验:需拟合模型两次,第一次使用family=”binomial”,第二次使用family=”quasibinomial”
fit<-glm(ynaffair~age+yearsmarried+religiousness+rating,data = Affairs,family = binomial())fit.od<-glm(ynaffair~age+yearsmarried+religiousness+rating,data = Affairs,family = quasibinomial())
pchisq(summary(fit.od)dispersion∗fitdf.residual,fit$df.residual,lower=F)
[1] 0.340122
卡方检验的P值P=0.34>0.05,不显著,则接受源假设,不存在过离势现象。
扩展
R中扩展的Logistic回归和变种如下所示:
稳健Logistic回归:robust包中的glmRob()函数可用来拟合稳健的广义纯属模型,包括稳健Logistic回归;当拟合回归模型数据出现离群点和强影响点时,稳健Logistic思贤如渴便可泒上用场。
多项式回归:若响应变量包含两个以上的无序类别,便可使用mlogit包中的mlogit()函数拟合多项Logistic回归;
序数Logistic回归:若响应变量是一有序的类别,便可使用rms包中的lrm()函数拟合Logistic回归
目标变量是计数型数据,用泊松回归
install.packages(“robust”)library(robust)
利用包中自带的癫痫病发病率数据
data(breslow.dat,package=”robust”)summary(breslow.dat)
看目标变量的分布
opar<-par(no.readonly = TRUE)par(mfrow=c(1,2))
attach(breslow.dat)
hist(sumY,breaks=20,xlab = “Seizure Count”,main=”Distribution of Seizures”)
boxplot(sumY~Trt,xlab=”Treatment”,main=”Group Comparisons”)
par(opar)
从图中可以清楚地看到因变量的偏倚特性以及可能的离群点。泊松分布中,较小的方差伴随着较小的均值,则可以看到药物治疗下癫痫发病数似乎变小了,且方差也变小了
与标准最小二乘法不同,泊松回归不关注方差异质性
建立回归模型
fit<-glm(sumY~Base+Age+Trt,data = breslow.dat,family = poisson())summary(fit)
Call:
glm(formula = sumY ~ Base + Age + Trt, family = poisson(), data = breslow.dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-6.0569 -2.0433 -0.9397 0.7929 11.0061
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.9488259 0.1356191 14.370 < 2e-16 *
Base 0.0226517 0.0005093 44.476 < 2e-16 *
Age 0.0227401 0.0040240 5.651 1.59e-08 *
Trtprogabide -0.1527009 0.0478051 -3.194 0.0014 **
Signif. codes: 0 ‘’ 0.001 ‘’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1(Dispersion parameter for poisson family taken to be 1)
Null deviance: 2122.73 on 58 degrees of freedom
Residual deviance: 559.44 on 55 degrees of freedom
AIC: 850.71
Number of Fisher Scoring iterations: 5
泊松回归的过度离势问题
泊松分布的方差和均值相等。当响应变量观测的方差比依据泊松分布预测的方差大时,泊松回归可能造成负面影响
可能发生过度离势的原因有:1、遗漏了某个重要的预测变量;2、事件相关;3、纵向数据分析中,内部的群聚性
过度离势的现象:很小的标准误和置信区间,并且显著性检验也过于轻松(即:你会发现并不真实存在的效应)
检验该案例的过度离势问题
与Logistic回归类似,此处如果残差偏差与残差自由度的比例远远大于1,则表明存在过离势
559.44/55=10.17>>1qcc包检验过离势现象
library(qcc)qcc.overdispersion.test(breslow.dat$sumY,type = “poisson”)
Overdispersion test Obs.Var/Theor.Var Statistic p-value
poisson data 62.87013 3646.468 0
P=0<0.05 拒绝原假设,即存在过离势现象
用类泊松进行泊松回归,以此处理过度离势
fit.od<-glm(sumY~Base+Age+Trt,data = breslow.dat,family = quasipoisson())summary(fit.od)
Call:
glm(formula = sumY ~ Base + Age + Trt, family = quasipoisson(),
data = breslow.dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-6.0569 -2.0433 -0.9397 0.7929 11.0061
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.948826 0.465091 4.190 0.000102 *
Base 0.022652 0.001747 12.969 < 2e-16 *
Age 0.022740 0.013800 1.648 0.105085
Trtprogabide -0.152701 0.163943 -0.931 0.355702
Signif. codes: 0 ‘’ 0.001 ‘’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1(Dispersion parameter for quasipoisson family taken to be 11.76075)
Null deviance: 2122.73 on 58 degrees of freedom
Residual deviance: 559.44 on 55 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 5