ISLR 4.6 Lab: Logistic Regression, LDA, QDA, and KNN

2015-07-29 04:24

4.6.1 The Stock Market Data

> library(ISLR)
> names(Smarket)
[1] "Year"      "Lag1"      "Lag2"      "Lag3"      "Lag4"
[6] "Lag5"      "Volume"    "Today"     "Direction"
> dim(Smarket)
[1] 1250    9


The cor() function produces a matrix that contains all of the pairwise correlations among the predictors in a data set. The first command below gives an error message because the Direction variable is qualitative. I found this rather interesting.

> cor(Smarket)
Error in cor(Smarket) : 'x' must be numeric
> cor(Smarket[,-9])
             Year         Lag1         Lag2         Lag3         Lag4
Year   1.00000000  0.029699649  0.030596422  0.033194581  0.035688718
Lag1   0.02969965  1.000000000 -0.026294328 -0.010803402 -0.002985911
Lag2   0.03059642 -0.026294328  1.000000000 -0.025896670 -0.010853533
Lag3   0.03319458 -0.010803402 -0.025896670  1.000000000 -0.024051036
Lag4   0.03568872 -0.002985911 -0.010853533 -0.024051036  1.000000000
Lag5   0.02978799 -0.005674606 -0.003557949 -0.018808338 -0.027083641
Volume 0.53900647  0.040909908 -0.043383215 -0.041823686 -0.048414246
Today  0.03009523 -0.026155045 -0.010250033 -0.002447647 -0.006899527
               Lag5      Volume        Today
Year    0.029787995  0.53900647  0.030095229
Lag1   -0.005674606  0.04090991 -0.026155045
Lag2   -0.003557949 -0.04338321 -0.010250033
Lag3   -0.018808338 -0.04182369 -0.002447647
Lag4   -0.027083641 -0.04841425 -0.006899527
Lag5    1.000000000 -0.02200231 -0.034860083
Volume -0.022002315  1.00000000  0.014591823
Today  -0.034860083  0.01459182  1.000000000
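
Instead of hard-coding the column index (-9), the numeric columns can also be picked out programmatically. Here is a minimal sketch on a small made-up data frame; the name toy and its columns are my own illustration, not part of Smarket.

```r
# Toy data frame with two numeric columns and one qualitative column.
toy <- data.frame(
  ret    = c(0.4, -0.2, 1.1, -0.7, 0.3),
  volume = c(1.2, 1.3, 1.1, 1.6, 1.4),
  dir    = factor(c("Up", "Down", "Up", "Down", "Up"))
)

# cor(toy) would fail because 'dir' is a factor, so keep numeric columns only.
num.cols <- sapply(toy, is.numeric)
toy.cor  <- cor(toy[, num.cols])
print(toy.cor)
```

The same pattern should carry over to Smarket, where it would drop Direction without needing to know its position.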


4.6.2 Logistic Regression

The glm() function fits generalized linear models, a class of models that includes logistic regression. The syntax of the glm() function is similar to that of lm(), except that we must pass in the argument family=binomial in order to tell R to run a logistic regression rather than some other type of generalized linear model.

> glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,data=Smarket,family=binomial)
> summary(glm.fit)

Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
    Volume, family = binomial, data = Smarket)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-1.446  -1.203   1.065   1.145   1.326

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.126000   0.240736  -0.523    0.601
Lag1        -0.073074   0.050167  -1.457    0.145
Lag2        -0.042301   0.050086  -0.845    0.398
Lag3         0.011085   0.049939   0.222    0.824
Lag4         0.009359   0.049974   0.187    0.851
Lag5         0.010313   0.049511   0.208    0.835
Volume       0.135441   0.158360   0.855    0.392

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1731.2  on 1249  degrees of freedom
Residual deviance: 1727.6  on 1243  degrees of freedom
AIC: 1741.6

Number of Fisher Scoring iterations: 3


Analysis:

The smallest p-value here is associated with Lag1. The negative coefficient for this predictor suggests that if the market had a positive return yesterday, then it is less likely to go up today. However, at a value of 0.15, the p-value is still relatively large, and so there is no clear evidence of a real association between Lag1 and Direction.



Looking at specific parameters

We use the coef() function in order to access just the coefficients for this fitted model. We can also use the summary() function to access particular aspects of the fitted model, such as the p-values for the coefficients.

> coef(glm.fit)
 (Intercept)         Lag1         Lag2         Lag3         Lag4
-0.126000257 -0.073073746 -0.042301344  0.011085108  0.009358938
        Lag5       Volume
 0.010313068  0.135440659
> summary(glm.fit)$coef
                Estimate Std. Error    z value  Pr(>|z|)
(Intercept) -0.126000257 0.24073574 -0.5233966 0.6006983
Lag1        -0.073073746 0.05016739 -1.4565986 0.1452272
Lag2        -0.042301344 0.05008605 -0.8445733 0.3983491
Lag3         0.011085108 0.04993854  0.2219750 0.8243333
Lag4         0.009358938 0.04997413  0.1872757 0.8514445
Lag5         0.010313068 0.04951146  0.2082966 0.8349974
Volume       0.135440659 0.15835970  0.8552723 0.3924004


Predicting outcomes

The predict() function can be used to predict the probability that the market will go up, given values of the predictors.

The type="response" option tells R to output probabilities of the form P(Y=1|X), as opposed to other information such as the logit.

> attach(Smarket)
> glm.probs=predict(glm.fit,type="response")
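
To see what type="response" changes, the two scales can be compared directly. This is a rough sketch on simulated data (x, y and toy.fit are made up for the illustration, not the Smarket fit): the logit from type="link" passed through the inverse logit, plogis(), should reproduce the probabilities from type="response".

```r
# Simulated binary data (made up for illustration).
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(0.5 * x))
toy.fit <- glm(y ~ x, family = binomial)

# type = "link" returns the logit; type = "response" returns P(Y = 1 | X).
toy.logit <- predict(toy.fit, type = "link")
toy.probs <- predict(toy.fit, type = "response")

# The inverse logit (plogis) maps one scale onto the other.
print(all.equal(plogis(toy.logit), toy.probs))
```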

In order to make a prediction as to whether the market will go up or down on a particular day, we must convert these predicted probabilities into class labels, Up or Down.

> contrasts(Direction)
     Up
Down  0
Up    1


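Next

The contrasts() output shows that R codes Up as 1, so the predicted probabilities are for the market going up. The first command below creates a vector of 1,250 Down elements. The second line transforms to Up all of the elements for which the predicted probability of a market increase exceeds 0.5. Given these predictions, the table() function can be used to produce a confusion matrix in order to determine how many observations were correctly or incorrectly classified.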

> glm.pred=rep("Down",1250)
> glm.pred[glm.probs>.5]="Up"

> table(glm.pred,Direction)
        Direction
glm.pred Down  Up
    Up    457 507
    Down  145 141
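
The confusion matrix above can be turned into an accuracy figure. The sketch below copies the four counts by hand into a matrix named conf (my own name) and puts rows and columns in the same order so the diagonal holds the correct predictions.

```r
# Counts copied from the confusion matrix above, with rows and columns
# both ordered (Down, Up) so that the diagonal holds the correct predictions.
conf <- matrix(c(145, 457, 141, 507), nrow = 2,
               dimnames = list(glm.pred  = c("Down", "Up"),
                               Direction = c("Down", "Up")))

# Fraction of days classified correctly on the training data.
accuracy <- sum(diag(conf)) / sum(conf)
print(accuracy)        # 0.5216
print(1 - accuracy)    # training error rate, 0.4784
```

Note that this is measured on the same 1,250 observations used to fit the model, so it is likely to be optimistic.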



Validation: to assess the model on data it was not fit to, we create a held-out data set of the observations from 2005.

> train=(Year<2005)
> Smarket.2005=Smarket[!train,]
> Direction.2005=Direction[!train]


We now fit a logistic regression model using only the subset of the observations that correspond to dates before 2005, using the subset argument. We then obtain predicted probabilities of the stock market going up for each of the days in our test set, that is, for the days in 2005.

> glm.fit=glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,data=Smarket,family=binomial,subset=train)
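
The fit-on-train, predict-on-test pattern described above might look like the following. This is only a sketch on simulated data: the data frame toy, its columns and the split are invented for illustration, and no Smarket results are implied.

```r
set.seed(1)
n   <- 300
toy <- data.frame(year = rep(c(2003, 2004, 2005), each = 100),
                  x    = rnorm(n))
toy$y <- factor(ifelse(runif(n) < plogis(toy$x), "Up", "Down"))

# Fit on the earlier years only, mirroring subset = train in the text.
toy.train <- toy$year < 2005
toy.fit   <- glm(y ~ x, data = toy, family = binomial, subset = toy.train)

# Predict probabilities for the held-out year and convert to class labels.
toy.probs <- predict(toy.fit, newdata = toy[!toy.train, ], type = "response")
toy.pred  <- ifelse(toy.probs > 0.5, "Up", "Down")

# Confusion matrix and accuracy on the held-out observations.
print(table(toy.pred, toy$y[!toy.train]))
print(mean(toy.pred == toy$y[!toy.train]))
```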


This part gets confusing, so I am not continuing with it.

4.6.3 Linear Discriminant Analysis

Now we will perform LDA on the Smarket data. In R, we fit an LDA model using the lda() function, which is part of the MASS library.

> library(MASS)
> lda.fit=lda(Direction~Lag1+Lag2,data=Smarket,subset=train)
> lda.fit
Call:
lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)

Prior probabilities of groups:
    Down       Up
0.491984 0.508016

Group means:
            Lag1        Lag2
Down  0.04279022  0.03389409
Up   -0.03954635 -0.03132544

Coefficients of linear discriminants:
            LD1
Lag1 -0.6420190
Lag2 -0.5135293


The LDA output indicates that π̂1 = 0.492 and π̂2 = 0.508; in other words, 49.2% of the training observations correspond to days during which the market went down. It also provides the group means; these are the average of each predictor within each class, and are used by LDA as estimates of μk. These suggest that there is a tendency for the previous 2 days' returns to be negative on days when the market increases, and a tendency for the previous days' returns to be positive on days when the market declines. The coefficients of linear discriminants output provides the linear combination of Lag1 and Lag2 that is used to form the LDA decision rule.

If −0.642 × Lag1 − 0.514 × Lag2 is large, then the LDA classifier will predict a market increase, and if it is small, then the LDA classifier will predict a market decline. The plot() function produces plots of the linear discriminants, obtained by computing −0.642 × Lag1 − 0.514 × Lag2 for each of the training observations.
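
The linear discriminant can be reproduced by hand from the fitted coefficients. The sketch below uses simulated predictors (the objects X, grp and toy.lda are made up, and the column names Lag1 and Lag2 are borrowed only for readability). As far as I can tell, predict() also subtracts the prior-weighted mean of the predictors before applying the coefficients; that constant shift does not change the ordering of the scores.

```r
library(MASS)

set.seed(1)
n   <- 200
grp <- factor(rep(c("Down", "Up"), each = n / 2))
X   <- cbind(Lag1 = rnorm(n, mean = ifelse(grp == "Up", -0.5, 0.5)),
             Lag2 = rnorm(n, mean = ifelse(grp == "Up", -0.3, 0.3)))
toy.lda <- lda(X, grouping = grp)

# Prior-weighted overall mean of the predictors, used for centering.
center <- colSums(toy.lda$prior * toy.lda$means)

# Linear discriminant: centered predictors times the LD1 coefficients.
ld1.manual <- scale(X, center = center, scale = FALSE) %*% toy.lda$scaling
ld1.pred   <- predict(toy.lda, X)$x

print(all.equal(as.vector(ld1.manual), as.vector(ld1.pred)))
```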

> lda.pred=predict(lda.fit,Smarket.2005)
> names(lda.pred)
[1] "class"     "posterior" "x"


The predict() function returns a list with three elements. The first element, class, contains LDA's predictions about the movement of the market. The second element, posterior, is a matrix whose kth column contains the posterior probability that the corresponding observation belongs to the kth class, computed from (4.10). Finally, x contains the linear discriminants, described earlier.

> lda.class=lda.pred$class
> table(lda.class,Direction.2005)
         Direction.2005
lda.class Down  Up
     Down   35  35
     Up     76 106

4.6.4 Quadratic Discriminant Analysis

We will now fit a QDA model to the Smarket data. QDA is implemented in R using the qda() function, which is also part of the MASS library.

> qda.fit=qda(Direction~Lag1+Lag2,data=Smarket,subset=train)
> qda.fit
Call:
qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)

Prior probabilities of groups:
    Down       Up
0.491984 0.508016

Group means:
            Lag1        Lag2
Down  0.04279022  0.03389409
Up   -0.03954635 -0.03132544


The output contains the group means. But it does not contain the coefficients of the linear discriminants, because the QDA classifier involves a quadratic, rather than a linear, function of the predictors. The predict() function works in exactly the same fashion as for LDA.
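
Since the QDA prediction step is not shown here, this is roughly how it could go. The sketch uses simulated data whose two classes differ in spread as well as in mean; toy, x1, x2 and the alternating split are all assumptions made for the example.

```r
library(MASS)

set.seed(3)
n   <- 200
toy <- data.frame(grp = factor(rep(c("Down", "Up"), each = n / 2)))
# The two classes differ in spread as well as in mean.
toy$x1 <- rnorm(n, mean = ifelse(toy$grp == "Up", -0.5, 0.5),
                sd = ifelse(toy$grp == "Up", 2, 1))
toy$x2 <- rnorm(n)

toy.train <- rep(c(TRUE, FALSE), length.out = n)   # alternate rows as a crude split
toy.qda   <- qda(grp ~ x1 + x2, data = toy, subset = toy.train)

# predict() returns class labels and posterior probabilities, but no 'x' element.
toy.qpred <- predict(toy.qda, toy[!toy.train, ])
print(names(toy.qpred))
print(mean(toy.qpred$class == toy$grp[!toy.train]))
```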

4.6.5 K-Nearest Neighbors

We will now perform KNN using the knn() function, which is part of the class library.

The function requires four inputs.

1. A matrix containing the predictors associated with the training data, labeled train.X below.
2. A matrix containing the predictors associated with the data for which we wish to make predictions, labeled test.X below.
3. A vector containing the class labels for the training observations, labeled train.Direction (or train.Y) below.
4. A value for K, the number of nearest neighbors to be used by the classifier.

We use the cbind() function, short for column bind, to bind the Lag1 and Lag2 variables together into two matrices, one for the training set and the other for the test set.

Seed

Now the knn() function can be used to predict the market's movement for the dates in 2005. We set a random seed before we apply knn() because if several observations are tied as nearest neighbors, then R will randomly break the tie. Therefore, a seed must be set in order to ensure reproducibility of results.

> library(class)
> train.X=cbind(Lag1,Lag2)[train,]
> test.X=cbind(Lag1,Lag2)[!train,]
> train.Direction=Direction[train]
> set.seed(1)

> knn.pred=knn(train.X,test.X,train.Direction,k=3)
> table(knn.pred,Direction.2005)
        Direction.2005
knn.pred Down Up
    Down   48 54
    Up     63 87

> mean(knn.pred==Direction.2005)
[1] 0.5357143
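
Only K=3 is shown above, but trying several values of K is a small loop. This sketch runs on simulated data (X, grp, is.train and the list of K values are my own choices), so the accuracies it prints say nothing about Smarket.

```r
library(class)

set.seed(1)
n   <- 300
grp <- factor(rep(c("Down", "Up"), each = n / 2))
X   <- cbind(x1 = rnorm(n, mean = ifelse(grp == "Up", 0.5, -0.5)),
             x2 = rnorm(n))
is.train <- rep(c(TRUE, TRUE, FALSE), length.out = n)   # two thirds for training

# Test accuracy of KNN for several values of K.
ks  <- c(1, 3, 5, 10)
acc <- sapply(ks, function(k) {
  pred <- knn(X[is.train, ], X[!is.train, ], grp[is.train], k = k)
  mean(pred == grp[!is.train])
})
print(data.frame(k = ks, accuracy = acc))
```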

The KNN results are poor. Of the methods in this lab, QDA does best on this type of data.

4.6.6 An Application to Caravan Insurance Data

The Caravan data set includes 85 predictors that measure demographic characteristics for 5,822 individuals. The response variable is Purchase, which indicates whether or not a given individual purchases a caravan insurance policy. In this data set, only 6% of people purchased caravan insurance.

Limitations of KNN

Because the KNN classifier predicts the class of a given test observation by identifying the observations that are nearest to it, the scale of the variables matters. Any variables that are on a large scale will have a much larger effect on the distance between the observations, and hence on the KNN classifier, than variables that are on a small scale.

For example, consider a data set with two variables, salary (in dollars) and age (in years). As far as KNN is concerned, a difference of $1,000 in salary is enormous compared to a difference of 50 years in age. Consequently, salary will drive the KNN classification results, and age will have almost no effect.

A good way to handle this problem is to standardize the data so that all variables are given a mean of zero and a standard deviation of one. We exclude column 86, because that is the qualitative Purchase variable.

> standardized.X=scale(Caravan[,-86])
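
A quick way to convince yourself of what scale() does is to check the column means and standard deviations afterwards. The salary and age values below are invented for the illustration.

```r
# Made-up salary (dollars) and age (years) columns on very different scales.
toy <- cbind(salary = c(30000, 45000, 52000, 61000, 80000),
             age    = c(25, 47, 33, 58, 41))

toy.std <- scale(toy)

# Each standardized column has mean 0 and standard deviation 1.
print(round(colMeans(toy.std), 10))
print(apply(toy.std, 2, sd))
```

After standardizing, a one-unit difference means one standard deviation in either column, so neither variable dominates the distance.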


We now split the observations into a test set, containing the first 1,000 observations, and a training set, containing the remaining observations. We fit a KNN model on the training data using K=1, and evaluate its performance on the test data.

> attach(Caravan)
> test=1:1000
> train.X=standardized.X[-test,]
> test.X=standardized.X[test,]
> train.Y=Purchase[-test]
> test.Y=Purchase[test]
> set.seed(1)
> knn.pred=knn(train.X,test.X,train.Y,k=1)
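
The evaluation step is not shown here. With a rare outcome like Purchase, the overall error rate should be compared against the trivial rule that always predicts "No". The sketch below uses ten hand-written labels (toy.truth and toy.pred are made up, not Caravan output) to show the three quantities one would typically look at.

```r
# Made-up labels for ten test individuals (not the Caravan results).
toy.truth <- c("No", "No", "Yes", "No", "No", "No", "Yes", "No", "No", "No")
toy.pred  <- c("No", "Yes", "Yes", "No", "No", "No", "No", "No", "Yes", "No")

# Overall test error rate of the classifier.
print(mean(toy.truth != toy.pred))      # 0.3

# Error rate of the trivial rule that always predicts "No".
print(mean(toy.truth != "No"))          # 0.2

# Among individuals predicted "Yes", the fraction who really purchased.
print(mean(toy.truth[toy.pred == "Yes"] == "Yes"))
```

In this toy case the classifier has a higher error rate than always saying "No", yet a third of its "Yes" predictions are correct, which may matter more if the goal is to find likely buyers.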