ISLR 4.6 Lab: Logistic Regression, LDA, QDA, and KNN
2015-07-29 04:24
4.6.1 The Stock Market Data
> library(ISLR)
> names(Smarket)
[1] "Year"      "Lag1"      "Lag2"      "Lag3"      "Lag4"
[6] "Lag5"      "Volume"    "Today"     "Direction"
> dim(Smarket)
[1] 1250    9
The cor() function produces a matrix that contains all of the pairwise correlations among the predictors in a data set. The first command below gives an error message because the Direction variable is qualitative. (This one is rather interesting.)
> cor(Smarket)
Error in cor(Smarket) : 'x' must be numeric
> cor(Smarket[,-9])
             Year         Lag1         Lag2         Lag3         Lag4
Year   1.00000000  0.029699649  0.030596422  0.033194581  0.035688718
Lag1   0.02969965  1.000000000 -0.026294328 -0.010803402 -0.002985911
Lag2   0.03059642 -0.026294328  1.000000000 -0.025896670 -0.010853533
Lag3   0.03319458 -0.010803402 -0.025896670  1.000000000 -0.024051036
Lag4   0.03568872 -0.002985911 -0.010853533 -0.024051036  1.000000000
Lag5   0.02978799 -0.005674606 -0.003557949 -0.018808338 -0.027083641
Volume 0.53900647  0.040909908 -0.043383215 -0.041823686 -0.048414246
Today  0.03009523 -0.026155045 -0.010250033 -0.002447647 -0.006899527
               Lag5      Volume        Today
Year    0.029787995  0.53900647  0.030095229
Lag1   -0.005674606  0.04090991 -0.026155045
Lag2   -0.003557949 -0.04338321 -0.010250033
Lag3   -0.018808338 -0.04182369 -0.002447647
Lag4   -0.027083641 -0.04841425 -0.006899527
Lag5    1.000000000 -0.02200231 -0.034860083
Volume -0.022002315  1.00000000  0.014591823
Today  -0.034860083  0.01459182  1.000000000
4.6.2 Logistic Regression
The glm() function fits generalized linear models, a class of models that includes logistic regression. The syntax of the glm() function is similar to that of lm(), except that we must pass in the argument family=binomial in order to tell R to run a logistic regression rather than some other type of generalized linear model.
> glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,data=Smarket,family=binomial)
> summary(glm.fit)

Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
    Volume, family = binomial, data = Smarket)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-1.446  -1.203   1.065   1.145   1.326

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.126000   0.240736  -0.523    0.601
Lag1        -0.073074   0.050167  -1.457    0.145
Lag2        -0.042301   0.050086  -0.845    0.398
Lag3         0.011085   0.049939   0.222    0.824
Lag4         0.009359   0.049974   0.187    0.851
Lag5         0.010313   0.049511   0.208    0.835
Volume       0.135441   0.158360   0.855    0.392

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1731.2  on 1249  degrees of freedom
Residual deviance: 1727.6  on 1243  degrees of freedom
AIC: 1741.6

Number of Fisher Scoring iterations: 3
Analysis:
"The smallest p-value here is associated with Lag1. The negative coefficient for this predictor suggests that if the market had a positive return yesterday, then it is less likely to go up today. However, at a value of 0.15, the p-value is still relatively large, and so there is no clear evidence of a real association between Lag1 and Direction."
Looking at the specific parameters
We use the coef() function in order to access just the coefficients for this fitted model. We can also use the summary() function to access particular aspects of the fitted model, such as the p-values for the coefficients.
> coef(glm.fit)
 (Intercept)         Lag1         Lag2         Lag3         Lag4
-0.126000257 -0.073073746 -0.042301344  0.011085108  0.009358938
        Lag5       Volume
 0.010313068  0.135440659
> summary(glm.fit)$coef
                Estimate Std. Error    z value  Pr(>|z|)
(Intercept) -0.126000257 0.24073574 -0.5233966 0.6006983
Lag1        -0.073073746 0.05016739 -1.4565986 0.1452272
Lag2        -0.042301344 0.05008605 -0.8445733 0.3983491
Lag3         0.011085108 0.04993854  0.2219750 0.8243333
Lag4         0.009358938 0.04997413  0.1872757 0.8514445
Lag5         0.010313068 0.04951146  0.2082966 0.8349974
Volume       0.135440659 0.15835970  0.8552723 0.3924004
Predicting the outcome
The predict() function can be used to predict the probability that the market will go up, given values of the predictors.
The type="response" option tells R to output probabilities of the form P(Y = 1|X), as opposed to other information such as the logit.
> attach(Smarket)
> glm.probs=predict(glm.fit,type="response")
In order to make a prediction as to whether the market will go up or down on a particular day, we must convert these predicted probabilities into class labels, Up or Down. The contrasts() output shows that R codes Up as 1, so the probabilities refer to the market going up.
> contrasts(Direction)
     Up
Down  0
Up    1
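The difference between the logit and the probability scale can be checked directly. The following is a minimal sketch on synthetic data (the variables x, y and fit are made up for illustration and are not the Smarket values): it asks predict() for both scales and confirms that the response scale is the inverse logit of the link scale.

```r
# Synthetic stand-in for the lab's data (assumption: not the Smarket values).
set.seed(1)
n <- 200
x <- rnorm(n)
y <- factor(ifelse(runif(n) < plogis(0.5 * x), "Up", "Down"),
            levels = c("Down", "Up"))
fit <- glm(y ~ x, family = binomial)

logit <- predict(fit, type = "link")      # log-odds, the default scale
prob  <- predict(fit, type = "response")  # P(Y = "Up" | x)

# The response scale is the inverse logit (plogis) of the link scale.
same <- isTRUE(all.equal(unname(prob), unname(plogis(logit))))
print(same)
```

Because "Down" is the first factor level, glm() treats "Up" as the event being modelled, which matches what contrasts() reports in the lab.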
Next
The first command creates a vector of 1,250 Down elements. The second line transforms to Up all of the elements for which the predicted probability of a market increase exceeds 0.5. Given these predictions, the table() function can be used to produce a confusion matrix in order to determine how many observations were correctly or incorrectly classified.
> glm.pred=rep("Down",1250)
> glm.pred[glm.probs>.5]="Up"
> table(glm.pred,Direction)
        Direction
glm.pred Down  Up
    Up    457 507
    Down  145 141
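The overall training accuracy follows from the two cells where the prediction and the true direction agree. A minimal sketch, with the four counts copied by hand from the table above rather than recomputed from Smarket:

```r
# Counts copied from the confusion matrix above
# (rows = predicted label, columns = actual Direction).
conf <- matrix(c(457, 507,
                 145, 141),
               nrow = 2, byrow = TRUE,
               dimnames = list(glm.pred = c("Up", "Down"),
                               Direction = c("Down", "Up")))

# Correct predictions are Up/Up and Down/Down.
correct <- conf["Up", "Up"] + conf["Down", "Down"]
accuracy <- correct / sum(conf)
print(accuracy)        # (507 + 145) / 1250 = 0.5216
print(1 - accuracy)    # training error rate, 0.4784
```

This is a training-set figure, so it tends to be optimistic about how the model would do on new days.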
Validation on held-out data: we create a held-out data set of observations from 2005.
> train=(Year<2005)
> Smarket.2005=Smarket[!train,]
> Direction.2005=Direction[!train]
We now fit a logistic regression model using only the subset of the observations that correspond to dates before 2005, using the subset argument. We then obtain predicted probabilities of the stock market going up for each of the days in our test set, that is, for the days in 2005.
> glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,data=Smarket,family=binomial,subset=train)
This gets messy, so I won't continue with this part.
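For readers who want to see where this step leads, here is a minimal sketch of the fit-on-train, predict-on-test pattern. It runs on a synthetic data frame d whose columns only mimic the roles of Year, Lag1, Lag2 and Direction (an assumption for illustration; the numbers say nothing about Smarket).

```r
# Synthetic stand-in with the same column roles as Smarket (assumption).
set.seed(1)
n <- 500
d <- data.frame(Year = rep(2001:2005, each = n / 5),
                Lag1 = rnorm(n), Lag2 = rnorm(n))
d$Direction <- factor(ifelse(runif(n) < 0.5, "Up", "Down"),
                      levels = c("Down", "Up"))

# Fit on years before 2005 only.
train <- d$Year < 2005
fit <- glm(Direction ~ Lag1 + Lag2, data = d, family = binomial,
           subset = train)

# Predict only for the held-out year, then threshold at 0.5.
test.d <- d[!train, ]
probs <- predict(fit, newdata = test.d, type = "response")
pred <- ifelse(probs > 0.5, "Up", "Down")

print(table(pred, test.d$Direction))
test.accuracy <- mean(pred == test.d$Direction)
print(test.accuracy)
```

The key point is that the rows passed as newdata were never seen during fitting, so the resulting accuracy is an estimate of out-of-sample performance.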
4.6.3 Linear Discriminant Analysis
Now we will perform LDA on the Smarket data. In R, we fit an LDA model using the lda() function, which is part of the MASS library.
> library(MASS)
> lda.fit=lda(Direction~Lag1+Lag2,data=Smarket,subset=train)
> lda.fit
Call:
lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)

Prior probabilities of groups:
    Down       Up
0.491984 0.508016

Group means:
            Lag1        Lag2
Down  0.04279022  0.03389409
Up   -0.03954635 -0.03132544

Coefficients of linear discriminants:
            LD1
Lag1 -0.6420190
Lag2 -0.5135293
The LDA output indicates that π̂_1 = 0.492 and π̂_2 = 0.508; in other words, 49.2% of the training observations correspond to days during which the market went down. It also provides the group means; these are the average of each predictor within each class, and are used by LDA as estimates of μ_k. These suggest that there is a tendency for the previous 2 days' returns to be negative on days when the market increases, and a tendency for the previous days' returns to be positive on days when the market declines. The coefficients of linear discriminants output provides the linear combination of Lag1 and Lag2 that is used to form the LDA decision rule.
If −0.642 × Lag1 − 0.514 × Lag2 is large, then the LDA classifier will predict a market increase, and if it is small, then the LDA classifier will predict a market decline. The plot() function produces plots of the linear discriminants, obtained by computing −0.642 × Lag1 − 0.514 × Lag2 for each of the training observations.
> lda.pred=predict(lda.fit,Smarket.2005)
> names(lda.pred)
[1] "class"     "posterior" "x"
The first element, class, contains LDA's predictions about the movement of the market. The second element, posterior, is a matrix whose kth column contains the posterior probability that the corresponding observation belongs to the kth class, computed from (4.10). Finally, x contains the linear discriminants, described earlier.
> lda.class=lda.pred$class
> table(lda.class,Direction.2005)
         Direction.2005
lda.class Down  Up
     Down   35  35
     Up     76 106
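The relationship between the class and posterior elements can be illustrated with a small sketch. It uses synthetic two-class data (the data frame d below is an assumption for illustration, not Smarket) and checks that applying a 50% threshold to the posterior probability of "Up" gives the same labels that predict() stores in class.

```r
library(MASS)

# Synthetic two-class data (assumption: not the Smarket values).
set.seed(1)
n <- 300
cls <- factor(sample(c("Down", "Up"), n, replace = TRUE),
              levels = c("Down", "Up"))
shift <- ifelse(cls == "Up", -0.5, 0.5)
d <- data.frame(Direction = cls,
                Lag1 = rnorm(n) + shift,
                Lag2 = rnorm(n) + shift)

lda.fit <- lda(Direction ~ Lag1 + Lag2, data = d)
lda.pred <- predict(lda.fit, d)

# Thresholding the posterior probability of "Up" at 50% should
# reproduce the labels stored in lda.pred$class.
by.threshold <- ifelse(lda.pred$posterior[, "Up"] > 0.5, "Up", "Down")
agree <- all(by.threshold == as.character(lda.pred$class))
print(agree)
print(names(lda.pred))
```

A threshold other than 50% can be applied to the posterior column in the same way when one class should only be predicted with higher confidence.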
4.6.4 Quadratic Discriminant Analysis
We will now fit a QDA model to the Smarket data. QDA is implemented in R using the qda() function, which is also part of the MASS library.
> qda.fit=qda(Direction~Lag1+Lag2,data=Smarket,subset=train)
> qda.fit
Call:
qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)

Prior probabilities of groups:
    Down       Up
0.491984 0.508016

Group means:
            Lag1        Lag2
Down  0.04279022  0.03389409
Up   -0.03954635 -0.03132544
The output contains the group means. But it does not contain the coefficients of the linear discriminants, because the QDA classifier involves a quadratic, rather than a linear, function of the predictors. The predict() function works in exactly the same fashion as for LDA.
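Since predict() is used the same way as for LDA, the evaluation step looks almost identical. A minimal sketch on synthetic data (the data frame d and the 200/100 split are assumptions for illustration, not the Smarket split):

```r
library(MASS)

# Synthetic two-class data (assumption: not the Smarket values).
set.seed(2)
n <- 300
cls <- factor(sample(c("Down", "Up"), n, replace = TRUE),
              levels = c("Down", "Up"))
# The classes differ in spread, a case a quadratic boundary can capture.
spread <- ifelse(cls == "Up", 2, 1)
d <- data.frame(Direction = cls,
                Lag1 = rnorm(n, sd = spread),
                Lag2 = rnorm(n, sd = spread))
train <- seq_len(n) <= 200

qda.fit <- qda(Direction ~ Lag1 + Lag2, data = d, subset = train)
qda.pred <- predict(qda.fit, d[!train, ])
qda.class <- qda.pred$class

# Confusion matrix and accuracy on the held-out rows.
print(table(qda.class, d$Direction[!train]))
qda.accuracy <- mean(qda.class == d$Direction[!train])
print(qda.accuracy)
```

Unlike the LDA prediction, the list returned for QDA has only the class and posterior elements, since there is no linear discriminant to report.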
4.6.5 K-Nearest Neighbors
We perform KNN using the knn() function, which is part of the class library.
The function requires four inputs.
1. A matrix containing the predictors associated with the training data, labeled train.X below.
2. A matrix containing the predictors associated with the data for which we wish to make predictions, labeled test.X below.
3. A vector containing the class labels for the training observations, labeled train.Direction (train.Y) below.
4. A value for K, the number of nearest neighbors to be used by the classifier.
We use the cbind() function, short for column bind, to bind the Lag1 and Lag2 variables together into two matrices, one for the training set and the other for the test set.
Seed
Now the knn() function can be used to predict the market's movement for the dates in 2005. We set a random seed before we apply knn() because if several observations are tied as nearest neighbors, then R will randomly break the tie. Therefore, a seed must be set in order to ensure reproducibility of results.
> library(class)
> train.X=cbind(Lag1,Lag2)[train,]
> test.X=cbind(Lag1,Lag2)[!train,]
> train.Direction=Direction[train]
> set.seed(1)
> knn.pred=knn(train.X,test.X,train.Direction,k=3)
> table(knn.pred,Direction.2005)
        Direction.2005
knn.pred Down Up
    Down   48 54
    Up     63 87
> mean(knn.pred==Direction.2005)
[1] 0.5357143
The KNN results are bad (only about 53.6% of the 2005 observations are classified correctly); QDA is the best for this type of data.
4.6.6 An Application to Caravan Insurance Data
The Caravan data set includes 85 predictors that measure demographic characteristics for 5,822 individuals. The response variable is Purchase, which indicates whether or not a given individual purchases a caravan insurance policy. In this data set, only 6% of people purchased caravan insurance.
Limitations of KNN
Because the KNN classifier predicts the class of a given test observation by identifying the observations that are nearest to it, the scale of the variables matters. Any variables that are on a large scale will have a much larger effect on the distance between the observations, and hence on the KNN classifier, than variables that are on a small scale.
As far as KNN is concerned, a difference of $1,000 in salary is enormous compared to a difference of 50 years in age. Consequently, salary will drive the KNN classification results, and age will have almost no effect.
A good way to handle this problem is to standardize the data so that all variables are given a mean of zero and a standard deviation of one. We exclude column 86, because that is the qualitative Purchase variable.
> standardized.X=scale(Caravan[,-86])
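The salary and age argument can be made concrete with a tiny sketch. The four rows of X below are hypothetical values chosen for illustration (they are not taken from Caravan); the code compares Euclidean distances from the first person before and after scale().

```r
# Hypothetical two-column example: salary in dollars, age in years.
X <- cbind(salary = c(30000, 31000, 30000, 52000),
           age    = c(25, 25, 75, 40))

# Raw Euclidean distances from person 1:
# person 2 differs by $1,000 in salary, person 3 by 50 years in age.
raw <- as.matrix(dist(X))
print(raw[1, 2:3])

# After standardizing, every column has mean 0 and standard deviation 1.
Z <- scale(X)
std <- as.matrix(dist(Z))
print(std[1, 2:3])

print(colMeans(Z))
print(apply(Z, 2, sd))
```

On the raw scale person 3 looks like the closer neighbor despite the 50-year age gap; after standardizing, person 2 is closer, which is the behavior one would usually want from KNN.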
We now split the observations into a test set, containing the first 1,000 observations, and a training set, containing the remaining observations. We fit a KNN model on the training data using K = 1, and evaluate its performance on the test data.
> test=1:1000
> train.X=standardized.X[-test,]
> test.X=standardized.X[test,]
> train.Y=Purchase[-test]
> test.Y=Purchase[test]
> set.seed(1)
> knn.pred=knn(train.X,test.X,train.Y,k=1)
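With a response as imbalanced as Purchase, the raw error rate is only meaningful next to a trivial baseline. The sketch below shows one way to make that comparison; it uses a synthetic, roughly 6%-positive data set (X and Y are made up for illustration and are not the Caravan data), and a smaller split than the lab's.

```r
library(class)

set.seed(1)
n <- 1500
# Synthetic, imbalanced stand-in for Caravan (assumption): about 6% "Yes".
X <- scale(matrix(rnorm(n * 5), ncol = 5))
Y <- factor(ifelse(runif(n) < 0.06, "Yes", "No"), levels = c("No", "Yes"))

# First 500 rows are the test set, the rest are training data.
test <- 1:500
knn.pred <- knn(X[-test, ], X[test, ], Y[-test], k = 1)

# Overall error rate versus the trivial rule "always predict No".
knn.error <- mean(Y[test] != knn.pred)
baseline.error <- mean(Y[test] != "No")
print(c(knn = knn.error, baseline = baseline.error))
```

If the KNN error is not clearly below the baseline, the overall error rate alone says little, and it is more informative to look at the confusion matrix for the rare class.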