您的位置:首页 > 其它

文章标题

2016-08-18 19:14 573 查看

回归分析

逻辑回归

案例一:婚外情研究

利用AER包中的自带数据集Affairs

install.packages(“AER”)

library(AER)

data(Affairs,package=”AER”)

summary(Affairs)

table(Affairsgender)table(Affairschildren)

table(Affairs$affairs)

将目标变量转化为二值型结果

Affairsynaffairs[Affairsaffairs>0]<-1

Affairsynaffairs[Affairsaffairs==0]<-0

Affairsynaffair<−factor(Affairsynaffair,levels = c(0,1),labels = c(“No”,”Yes”))

table(Affairs$ynaffair)

逻辑回归

fit.full<-glm(ynaffair~gender+age+yearsmarried+children+religiousness+education+occupation+rating,data = Affairs,family = binomial())

summary(fit.full)

Call:

glm(formula = ynaffair ~ gender + age + yearsmarried + children +

religiousness + education + occupation + rating, family = binomial(),

data = Affairs)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.5713 -0.7499 -0.5690 -0.2539 2.5191

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 1.37726 0.88776 1.551 0.120807

gendermale 0.28029 0.23909 1.172 0.241083

age -0.04426 0.01825 -2.425 0.015301 *

yearsmarried 0.09477 0.03221 2.942 0.003262 **

childrenyes 0.39767 0.29151 1.364 0.172508

religiousness -0.32472 0.08975 -3.618 0.000297 *

education 0.02105 0.05051 0.417 0.676851

occupation 0.03092 0.07178 0.431 0.666630

rating -0.46845 0.09091 -5.153 2.56e-07 *

Signif. codes: 0 ‘’ 0.001 ‘’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 675.38  on 600  degrees of freedom


Residual deviance: 609.51 on 592 degrees of freedom

AIC: 627.51

Number of Fisher Scoring iterations: 4

剔除无关变量,重新拟合

fit.reduced<-glm(ynaffair~age+yearsmarried+religiousness+rating,data = Affairs,family = binomial())

summary(fit.reduced)

Call:

glm(formula = ynaffair ~ age + yearsmarried + religiousness +

rating, family = binomial(), data = Affairs)

Deviance Residuals:

Min 1Q Median 3Q Max

-1.6278 -0.7550 -0.5701 -0.2624 2.3998

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 1.93083 0.61032 3.164 0.001558 **

age -0.03527 0.01736 -2.032 0.042127 *

yearsmarried 0.10062 0.02921 3.445 0.000571 *

religiousness -0.32902 0.08945 -3.678 0.000235 *

rating -0.46136 0.08884 -5.193 2.06e-07 *

Signif. codes: 0 ‘’ 0.001 ‘’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 675.38  on 600  degrees of freedom


Residual deviance: 615.36 on 596 degrees of freedom

AIC: 625.36

Number of Fisher Scoring iterations: 4

用卡方检验比较两个嵌套的模型

anova(fit.reduced,fit.full,test = “Chisq”)

Analysis of Deviance Table

Model 1: ynaffair ~ age + yearsmarried + religiousness + rating

Model 2: ynaffair ~ gender + age + yearsmarried + children + religiousness +

education + occupation + rating

Resid. Df Resid. Dev Df Deviance Pr(>Chi)

1 596 615.36

2 592 609.51 4 5.8474 0.2108

解析卡方检验的结果

结果显示卡方检验的结果为不显著(p=0.21),表名四个变量的模型与原来的九个完整变量的模型拟合得一样好。可以用更简单的模型进行解释。

解释模型参数

提取模型系数

coef(fit.reduced)

(Intercept) age yearsmarried religiousness rating

1.93083017 -0.03527112 0.10062274 -0.32902386 -0.46136144

对数优势比解释性差,用指数

exp(coef(fit.reduced))

(Intercept) age yearsmarried religiousness rating

6.8952321 0.9653437 1.1058594 0.7196258 0.6304248

随着婚龄的增加,年龄、宗教信仰、婚姻评分的降低,婚外情的优势比将上升。

评价预测变量对结果概率的影响

用控制变量法

例如

testdata<-data.frame(rating=c(1:5),age=mean(Affairsage),yearsmarried=mean(Affairsyearsmarried),religiousness=mean(Affairs$religiousness))

testdata

使用测试数据集预测相应的概率

testdata$prob<-predict(fit.reduced,newdata = testdata,type = “response”)

testdata

rating age yearsmarried religiousness prob

1 1 32.48752 8.177696 3.116473 0.5302296

2 2 32.48752 8.177696 3.116473 0.4157377

3 3 32.48752 8.177696 3.116473 0.3096712

4 4 32.48752 8.177696 3.116473 0.2204547

5 5 32.48752 8.177696 3.116473 0.1513079

可以看到随着婚姻评分的由1到5增加,婚外情的发生率由0.53降低到0.15

看年龄的影响

testdata<-data.frame(rating=mean(Affairsrating),age=seq(17,57,10),yearsmarried=mean(Affairsyearsmarried),religiousness=mean(Affairs$religiousness))

testdata

使用测试数据集预测相应的概率

testdata$prob<-predict(fit.reduced,newdata = testdata,type = “response”)

testdata

rating age yearsmarried religiousness prob

1 3.93178 17 8.177696 3.116473 0.3350834

2 3.93178 27 8.177696 3.116473 0.2615373

3 3.93178 37 8.177696 3.116473 0.1992953

4 3.93178 47 8.177696 3.116473 0.1488796

5 3.93178 57 8.177696 3.116473 0.1094738

检测过渡离势

所谓过度离势,即观测到的响应变量的方差大于期望的二项分布的方差。过度离势会导致奇异的标准误检验和不精确的显著性检验,当出现过度离势时,仍可使用glm()函数拟合Logistic回归,但此时需要将二项分布必为类二项分布(quasibinomial distribution)

检测过度离势的一种方法是比较二项分布模型的残差偏差与自由度,若比值:r=残差偏差/残差自由度 ,比1大很多,便认为是存在过度离势。

还可对过度离势进行检验:需拟合模型两次,第一次使用family=”binomial”,第二次使用family=”quasibinomial”

fit<-glm(ynaffair~age+yearsmarried+religiousness+rating,data = Affairs,family = binomial())

fit.od<-glm(ynaffair~age+yearsmarried+religiousness+rating,data = Affairs,family = quasibinomial())

pchisq(summary(fit.od)dispersion∗fitdf.residual,fit$df.residual,lower=F)

[1] 0.340122

卡方检验的P值P=0.34>0.05,不显著,则接受源假设,不存在过离势现象。

扩展

R中扩展的Logistic回归和变种如下所示:

稳健Logistic回归:robust包中的glmRob()函数可用来拟合稳健的广义纯属模型,包括稳健Logistic回归;当拟合回归模型数据出现离群点和强影响点时,稳健Logistic思贤如渴便可泒上用场。

多项式回归:若响应变量包含两个以上的无序类别,便可使用mlogit包中的mlogit()函数拟合多项Logistic回归;

序数Logistic回归:若响应变量是一有序的类别,便可使用rms包中的lrm()函数拟合Logistic回归

目标变量是计数型数据,用泊松回归

install.packages(“robust”)

library(robust)

利用包中自带的癫痫病发病率数据

data(breslow.dat,package=”robust”)

summary(breslow.dat)

看目标变量的分布

opar<-par(no.readonly = TRUE)

par(mfrow=c(1,2))

attach(breslow.dat)

hist(sumY,breaks=20,xlab = “Seizure Count”,main=”Distribution of Seizures”)

boxplot(sumY~Trt,xlab=”Treatment”,main=”Group Comparisons”)

par(opar)

从图中可以清楚地看到因变量的偏倚特性以及可能的离群点。泊松分布中,较小的方差伴随着较小的均值,则可以看到药物治疗下癫痫发病数似乎变小了,且方差也变小了

与标准最小二乘法不同,泊松回归不关注方差异质性

建立回归模型

fit<-glm(sumY~Base+Age+Trt,data = breslow.dat,family = poisson())

summary(fit)

Call:

glm(formula = sumY ~ Base + Age + Trt, family = poisson(), data = breslow.dat)

Deviance Residuals:

Min 1Q Median 3Q Max

-6.0569 -2.0433 -0.9397 0.7929 11.0061

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 1.9488259 0.1356191 14.370 < 2e-16 *

Base 0.0226517 0.0005093 44.476 < 2e-16 *

Age 0.0227401 0.0040240 5.651 1.59e-08 *

Trtprogabide -0.1527009 0.0478051 -3.194 0.0014 **

Signif. codes: 0 ‘’ 0.001 ‘’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 2122.73  on 58  degrees of freedom


Residual deviance: 559.44 on 55 degrees of freedom

AIC: 850.71

Number of Fisher Scoring iterations: 5

泊松回归的过度离势问题

泊松分布的方差和均值相等。当响应变量观测的方差比依据泊松分布预测的方差大时,泊松回归可能造成负面影响

可能发生过度离势的原因有:1、遗漏了某个重要的预测变量;2、事件相关;3、纵向数据分析中,内部的群聚性

过度离势的现象:很小的标准误和置信区间,并且显著性检验也过于轻松(即:你会发现并不真实存在的效应)

检验该案例的过度离势问题

与Logistic回归类似,此处如果残差偏差与残差自由度的比例远远大于1,则表明存在过离势

559.44/55=10.17>>1

qcc包检验过离势现象

library(qcc)

qcc.overdispersion.test(breslow.dat$sumY,type = “poisson”)

Overdispersion test Obs.Var/Theor.Var Statistic p-value

poisson data 62.87013 3646.468 0

P=0<0.05 拒绝原假设,即存在过离势现象

用类泊松进行泊松回归,以此处理过度离势

fit.od<-glm(sumY~Base+Age+Trt,data = breslow.dat,family = quasipoisson())

summary(fit.od)

Call:

glm(formula = sumY ~ Base + Age + Trt, family = quasipoisson(),

data = breslow.dat)

Deviance Residuals:

Min 1Q Median 3Q Max

-6.0569 -2.0433 -0.9397 0.7929 11.0061

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.948826 0.465091 4.190 0.000102 *

Base 0.022652 0.001747 12.969 < 2e-16 *

Age 0.022740 0.013800 1.648 0.105085

Trtprogabide -0.152701 0.163943 -0.931 0.355702

Signif. codes: 0 ‘’ 0.001 ‘’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 11.76075)

Null deviance: 2122.73  on 58  degrees of freedom


Residual deviance: 559.44 on 55 degrees of freedom

AIC: NA

Number of Fisher Scoring iterations: 5

R提供了基本泊松回归模型的一些有用扩展,包括允许时间段变化、存在过多0时会自动修正的模型,以及当数据存在离群点和强影响点时有用的稳健模型。

1、时间段变化的泊松回归

fit<-glm(sumY~Base+Age+Trt,data=breslow.dat,offset=log(time),family=poisson())

2、零膨胀的泊松回归

在一个数据集中,0计数的数目时常比用泊松模型预测的数目多。当总体的一个子群体无任何被计数的行为时,就可能发生这种情况。

用pscl包中的zeroinfl()函数做零膨胀泊松回归

3、稳健泊松回归

robust包中的glmRob()函数可拟合稳健广义线性模型,包含稳健泊松回归。当存在离群点和强影响点时,该方法很有效

内容来自用户分享和网络整理,不保证内容的准确性,如有侵权内容,可联系管理员处理 点击这里给我发消息
标签:  回归分析