您的位置:首页 > 其它

R语言绘制heatmap热图

2013-03-16 11:39 246 查看
From:http://bbsunchen.iteye.com/blog/1271580

介绍如何使用R绘制heatmap的文章。
今天无意间在Flowingdata看到一篇关于如何使用R来做heatmap的文章(请移步到这里)。虽然
heatmap只是R中一个很普通的图形函数,但这个例子使用了2008-2009赛季NBA50个顶级球员数据做了一个极佳的演示,效果非常不错。对R大致了解的童鞋可以直接在Rconsole上敲
?heatmap
直接查看帮助即可。
没有接触过R的童鞋继续围观,下面会仔细介绍如何使用R实现NBA50位顶级球员指标表现热图:
关于heatmap,中文一般翻译为“热图”,其统计意义wiki上解释的很清楚:

Aheatmapisagraphicalrepresentationofdatawherethevaluestakenbya

variableinatwo-dimensionalmaparerepresentedascolors.Heatmapsoriginatedin2Ddisplaysofthevaluesinadatamatrix.Largervalueswererepresentedbysmalldarkgrayorblacksquares(pixels)andsmallervalues
bylightersquares.

下面这个图即是Flowingdata用一些
R函数对2008-2009赛季NBA50名顶级球员指标做的一个热图(点击参看大图):




先解释一下数据:
这里共列举了50位球员,估计爱好篮球的童鞋对上图右边的每个名字都会耳熟能详。这些球员每个人会有19个指标,包括打了几场球(G)、上场几分钟(MIN)、得分(PTS)……这样就行成了一个50行×19列的矩阵。但问题是,数据有些多,需要使用一种比较好的办法来展示,Soitcomes,heatmap!
简单的说明:
比如从上面的热图上观察得分前3名(Wade、James、Bryant)PTS、FGM、FGA比较高,但Bryant的FTM、FTA和前两者就差一些;Wade在这三人中STL是佼佼者;而James的DRB和TRB又比其他两人好一些……
姚明的3PP(3PointsPercentage)这条数据很有意思,非常出色!仔细查了一下这个数值,居然是100%。仔细回想一下,似乎那个赛季姚明好像投过一个3分,并且中了,然后再也没有3p。这样本可真够小的!
最后是如何做这个热图(做了些许修改):
Step0.DownloadR
R官网:http://www.r-project.org,它是免费的。官网上面提供了Windows,Mac,Linux版本(或源代码)的R程序。
Step1.Loadthedata
R可以支持网络路径,使用读取csv文件的函数read.csv。
读取数据就这么简单:

nba<-read.csv("http://datasets.flowingdata.com/ppg2008.csv",sep=",")


Step2.Sortdata
按照球员得分,将球员从小到大排序:

nba<-nba[order(nba$PTS),]


当然也可以选择MIN,BLK,STL之类指标
Step3.Preparedata
把行号换成行名(球员名称):

row.names(nba)<-nba$Name


去掉第一列行号:

nba<-nba[,2:20]#ornba<-nba[,-1]


Step4.Preparedata,again
把dataframe转化为我们需要的矩阵格式:

nba_matrix<-data.matrix(nba)


Step5.Makeaheatmap
#R的默认还会在图的左边和上边绘制dendrogram,使用Rowv=NA,Colv=NA去掉

heatmap(nba_matrix,Rowv=NA,Colv=NA,col=cm.colors(256),revC=FALSE,scale='column')


这样就得到了上面的那张热图。
Step6.Colorselection
或者想把热图中的颜色换一下:

heatmap(nba_matrix,Rowv=NA,Colv=NA,col=heat.colors(256),revC=FALSE,scale="column",margins=c(5,10))


BioinformaticsandComputationalBiologySolutionsUsingRandBioconductor第10章的

例子:

Heatmaps,orfalsecolorimageshaveareasonablylonghistory,ashasthe
notionofrearrangingthecolumnsandrowstoshowstructureinthedata.
TheywereappliedtomicroarraydatabyEisenetal.(1998)andhave
becomeastandardvisualizationmethodforthistypeofdata.
Aheatmapisatwo-dimensional,rectangular,coloredgrid.Itdisplays
datathatthemselvescomeintheformofarectangularmatrix.Thecolor
ofeachrectangleisdeterminedbythevalueofthecorrespondingentry
inthematrix.Therowsandcolumnsofthematrixcanberearranged
independently.Usuallytheyarereorderedsothatsimilarrowsareplaced
nexttoeachother,andthesameforcolumns.Amongtheorderingsthat
arewidelyusedarethosederivedfromahierarchicalclustering,butmany
otherorderingsarepossible.Ifhierarchicalclusteringisused,thenitis
customarythatthedendrogramsareprovidedaswell.Inmanycasesthe
resultingimagehasrectangularregionsthatarerelativelyhomogeneous
andhencethegraphiccanaidindeterminingwhichrows(generallythe
genes)havesimilarexpressionvalueswithinwhichsubgroupsofsamples
(generallythecolumns).
Thefunctionheatmapisanimplementationwithmanyoptions.Inparticular,
userscancontroltheorderingofrowsandcolumnsindependently
fromeachother.Theycanuserowandcolumnlabelsoftheirownchoosing
orselecttheirowncolorscheme.


>library("ALL")

>data("ALL")

>selSamples<-ALL$mol.biol%in%c("ALL1/AF4",

+"E2A/PBX1")

>ALLs<-ALL[,selSamples]

>ALLs$mol.biol<-factor(ALLs$mol.biol)

>colnames(exprs(ALLs))<-paste(ALLs$mol.biol,

+colnames(exprs(ALLs)))

>library("genefilter")

>meanThr<-log2(100)

>g<-ALLs$mol.biol

>s1<-rowMeans(exprs(ALLs)[,g==levels(g)[1]])>

+meanThr

>s2<-rowMeans(exprs(ALLs)[,g==levels(g)[2]])>

+meanThr

>s3<-rowttests(ALLs,g)$p.value<2e-04

>selProbes<-(s1|s2)&s3

>ALLhm<-ALLs[selProbes,]

>library(RColorBrewer)

>hmcol<-colorRampPalette(brewer.pal(10,"RdBu"))(256)

>spcol<-ifelse(ALLhm$mol.biol=="ALL1/AF4",

+"goldenrod","skyblue")

>heatmap(exprs(ALLhm),col=hmcol,ColSideColors=spcol)

最后,可以

>help(heatmap)查找帮助,看看帮助给提供的例子

也可以看这的例子:

http://www2.warwick.ac.uk/fac/sci/moac/students/peter_cock/r/heatmap/

UsingRtodrawaHeatmapfromMicroarrayData
[c]
The
firstsectionofthispageusesRtoanalyseanAcutelymphocyticleukemia(ALL)microarraydataset,producingaheatmap(withdendrograms)ofgenesdifferentiallyexpressedbetweentwotypesofleukemia.

Thereisafollowonpagedealingwithhowtodothis
fromPythonusingRPy.
Theoriginalcitationfortherawdatais"GeneexpressionprofileofadultT-cellacutelymphocyticleukemiaidentifiesdistinctsubsetsofpatientswithdifferentresponsetotherapyandsurvival"byChiarettietal.Blood2004.
(PMID:14684422)

Theanalysisisa"stepbystep"recipebasedonthispaper,
Bioconductor:opensoftwaredevelopmentforcomputationalbiologyandbioinformatics,Gentlemanetal.2004.Their
Figure2Heatmap,whichwerecreate,isreproducedhere:



HeatmapsfromR

AssumingyouhavearecentversionofR(from
TheRProject)and
BioConductor(see
WindowsXPinstallationinstructions),theexampledatasetcanbedownloadedastheBioConductorALLpackage.

YoushouldbeabletoinstallthisfromwithinRasfollows:

>source("http://www.bioconductor.org/biocLite.R")
>biocLite("ALL")

RunningbioCLiteversion0.1withRversion2.1.1
...

Alternatively,youcandownloadthepackagebyhandfrom
hereor

here.
IfyouareusingWindows,downloadALL_1.0.2.zip(orsimilar)andsaveit.ThenfromwithintheRprogram,usethemenuoption"Packages","Installpackage(s)fromlocalzipfiles..."andselecttheZIPfile.

OnLinux,downloadALL_1.0.2.tar.gz(orsimilar)anduse
sudoRCMDINSTALLALL_1.0.2.tar.gzatthecommandprompt.
Withthatoutoftheway,youshouldbeabletostartRandloadthispackagewiththe
libraryanddatacommands:
>library("ALL")
Loadingrequiredpackage:Biobase
Loadingrequiredpackage:tools
WelcometoBioconductor
Vignettescontainintroductorymaterial.Toview,
simplytype:openVignette()
Fordetailsonreadingvignettes,see
theopenVignettehelppage.
>data("ALL")

IfyouinspecttheresultingALLvariable,itcontains128sampleswith12625genes,andassociatedphenotypicdata.

>ALL
ExpressionSet(exprSet)with
12625genes
128samples
phenoDataobjectwith21variablesand128cases
varLabels
cod:PatientID
diagnosis:Dateofdiagnosis
sex:Genderofthepatient
age:Ageofthepatientatentry
BT:doesthepatienthaveB-cellorT-cellALL
remission:Completeremission(CR),refractory(REF)orNA.DerivedfromCR
CR:Originalremissondata
date.cr:Datecompleteremissionifachieved
t(4;11):didthepatienthavet(4;11)translocation.Derivedfromcitog
t(9;22):didthepatienthavet(9;22)translocation.Derivedfromcitog
cyto.normal:Wascytogenetictestnormal?Derivedfromcitog
citog:originalcitogeneticsdata,deletionsort(4;11),t(9;22)status
mol.biol:molecularbiology
fusionprotein:whichofp190,p210orp190/210forbcr/able
mdr:multi-drugresistant
kinet:ploidy:eitherdiploidorhyperd.
ccr:Continuouscompleteremission?Derivedfromf.u
relapse:Relapse?Derivedfromf.u
transplant:didthepatientreceiveabonemarrowtransplant?Derivedfromf.u
f.u:followupdataavailable
datelastseen:datepatientwaslastseen

Wecanlooksattheresultsofmolecularbiologytestingforthe128samples:

>ALL$mol.biol
[1]BCR/ABLNEGBCR/ABLALL1/AF4NEGNEGNEGNEGNEG
[10]BCR/ABLBCR/ABLNEGE2A/PBX1NEGBCR/ABLNEGBCR/ABLBCR/ABL
[19]BCR/ABLBCR/ABLNEGBCR/ABLBCR/ABLNEGALL1/AF4BCR/ABLALL1/AF4
...
[127]NEGNEG
Levels:ALL1/AF4BCR/ABLE2A/PBX1NEGNUP-98p15/p16

Ignoringthesampleswhichcamebacknegativeonthistest(NEG),mosthavebeenclassifiedashavingatranslocationbetweenchromosomes9and22(BCR/ABL),oratranslocationbetweenchromosomes4and11(ALL1/AF4).

Forthepurposesofthisexample,weareonlyinterestedinthesetwosubgroups,sowewillcreateafilteredversionofthedatasetusingthisasaselectioncriteria:

>eset<-ALL[,ALL$mol.biol%in%c("BCR/ABL","ALL1/AF4")]

Theresultingvariable,eset,containsjust47samples-eachwiththefull12,625geneexpressionlevels.

Thisisfartoomuchdatatodrawaheatmapwith,butwecandooneforthefirst100genesasfollows:

>heatmap(exprs(eset[1:100,]))

AccordingtotheBioConductorpaperwearefollowing,thenextstepintheanalysiswastousethelmFitfunction(fromthelimmapackage)tolookforgenesdifferentiallyexpressedbetweenthetwogroups.Thefittedmodelobject
isfurtherprocessedbytheeBayesfunctiontoproduceempiricalBayesteststatisticsforeachgene,includingmoderatedt-statistics,p-valuesandlog-oddsofdifferentialexpression.

>library(limma)
>f<-factor(as.character(eset$mol.biol))
>design<-model.matrix(~f)
>fit<-eBayes(lmFit(eset,design))

Ifthelimmapackageisn'tinstalled,you'llneedtoinstallitfirst:
>source("http://www.bioconductor.org/biocLite.R")
>biocLite("limma")

RunningbioCLiteversion0.1withRversion2.1.1
...

Wecannowreproduce
Figure1fromthepaper.
>topTable(fit,coef=2)
IDMAtP.ValueB
10161914_at-3.0762314.611284-27.498605.887581e-2756.32653
788437809_at-3.9719064.864721-19.754781.304570e-2044.23832
693936873_at-3.3916624.284529-19.614971.768670e-2043.97298
1086540763_at-3.0869923.474092-17.007397.188381e-1838.64615
425034210_at3.6181948.43848215.456553.545401e-1635.10692
1155641448_at-2.5004883.733012-14.839241.802456e-1533.61391
338933358_at-2.2697305.191015-12.963983.329289e-1328.76471
805437978_at-1.0360516.937965-10.487776.468996e-1021.60216
1057940480_s_at1.8449987.82690010.382149.092033e-1021.27732
3301307_at1.5839044.63888510.257311.361875e-0920.89145

Theleftmostnumbersarerowindices,IDistheAffymetrixHGU95av2accessionnumber,Misthelogratioofexpression,Aisthelogaverageexpression,tthemoderatedt-statistic,andBisthelogoddsofdifferentialexpression.

Next,weselectthosegenesthathaveadjustedp-valuesbelow0.05,usingaverystringentHolmmethodtoselectasmallnumber(165)ofgenes.

>selected<-p.adjust(fit$p.value[,2])<0.05
>esetSel<-eset[selected,]

ThevariableesetSelhasdataon(only)165genesforall47samples.Wecaneasilyproduceaheatmapasfollows(inR-2.1.1thisdefaultstoayellow/red"heat"colourscheme):

>heatmap(exprs(esetSel))




Ifyouhavethetopographicalcoloursinstalled(yellow-green-blue),youcandothis:

>heatmap(exprs(esetSel),col=topo.colors(100))




Thisisgettingverycloseto
Gentlemanetal.'sFigure2,excepttheyhaveaddedared/bluebanneracrossthetoptoreallyemphasizehowthehierarchicalclusteringhascorrectlysplitthedataintothetwogroups(10and37patients).

Todothat,wecanusetheheatmapfunction'soptionalargumentofColSideColors.IcreatedasmallfunctiontomaptheeselSet$mol.biolvaluestored(#FF0000)andblue(#0000FF),whichwecanapplytoeachofthemolecularbiology
resultstogetamatchinglistofcoloursforourcolumns:
>color.map<-function(mol.biol){if(mol.biol=="ALL1/AF4")"#FF0000"else"#0000FF"}
>patientcolors<-unlist(lapply(esetSel$mol.bio,color.map))
>heatmap(exprs(esetSel),col=topo.colors(100),ColSideColors=patientcolors)




Looksprettyclosenow,doesn'tit:



Torecap,thisis"all"weneededtotypeintoRtoachievethis:

library("ALL")
data("ALL")
eset<-ALL[,ALL$mol.biol%in%c("BCR/ABL","ALL1/AF4")]
library("limma")
f<-factor(as.character(eset$mol.biol))
design<-model.matrix(~f)
fit<-eBayes(lmFit(eset,design))
selected<-p.adjust(fit$p.value[,2])<0.05
esetSel<-eset[selected,]
color.map<-function(mol.biol){if(mol.biol=="ALL1/AF4")"#FF0000"else"#0000FF"}
patientcolors<-unlist(lapply(esetSel$mol.bio,color.map))
heatmap(exprs(esetSel),col=topo.colors(100),ColSideColors=patientcolors)


HeatmapsinR-MoreOptions

Onesubtlepointinthepreviousexamplesisthattheheatmapfunctionhasautomaticallyscaledthecoloursforeachrow(i.e.eachgenehasbeenindividuallynormalisedacrosspatients).Thiscanbedisabledusingscale="none",which
youmightwanttodoifyouhavealreadydoneyourownnormalisation(orthismaynotbeappropriateforyourdata):

heatmap(exprs(esetSel),col=topo.colors(75),scale="none",ColSideColors=patientcolors,cexRow=0.5)



Youmightalsohavenoticedintheabovesnippet,thatIhaveshrunktherowcaptionswhichweresobigtheyoverlappedeachother.TherelevantoptionsarecexRowandcexCol.

Sofarsogood-butwhatifyouwantedakeytothecoloursshown?Theheatmapfunctiondoesn'tofferthis,butthegoodnewsisthatheatmap.2fromthegplotslibrarydoes.Infact,itoffersalotofotherfeatures,manyofwhich
Ideliberatelyturnoffinthefollowingexample:
library("gplots")
heatmap.2(exprs(esetSel),col=topo.colors(75),scale="none",ColSideColors=patientcolors,
key=TRUE,symkey=FALSE,density.info="none",trace="none",cexRow=0.5)




Bydefault,heatmap.2willalsoshowatraceoneachdatapoint(removedthiswithtrace="none").Ifyouaskforakey(usingkey=TRUE)thisfunctionwillactuallygiveyouacombined"colorkeyandhistogram",butthatcanbeoverridden
(withdensity.info="none").
Don'tlikethecolourscheme?Tryusingthefunctionsbluered/redblueforared-white-bluespread,orredgreen/greenredforthered-black-greencolourschemeoftenusedwithtwo-colourmicroarrays:

library("gplots")
heatmap.2(exprs(esetSel),col=redgreen(75),scale="row",ColSideColors=patientcolors,
key=TRUE,symkey=FALSE,density.info="none",trace="none",cexRow=0.5)




HeatmapsfromPython

So,howcanwedothatfromwithinPython?Onewayisusing
RPy(RfromPython),andthisisdiscussedon

thispage.
P.S.Ifyouwanttouseheatmap.2fromwithinpythonusingRPy,usethesyntaxheatmap_2duetothedifferencesinhowRandPythonhandlefullstopsandunderscores.

Whataboutothermicroarraydata?

Well,Ihavealsodocumentedhowyoucan
loadNCBIGEOSOFTfilesintoRasaBioConductorexpressionsetobject.AslongasyoucangetyourdataintoRasamatrixordataframe,convertingitintoanexprSetshouldn'tbetoohard.

内容来自用户分享和网络整理,不保证内容的准确性,如有侵权内容,可联系管理员处理 点击这里给我发消息
标签: