glmnetUtils: quality of life enhancements for elastic net regression with glmnet
2016-11-03 13:33
The glmnetUtils package provides a collection of tools to streamline the process of fitting elastic net models with glmnet. I wrote the package after a couple of projects where I found myself writing the same boilerplate code to convert a data frame into a predictor matrix and a response vector. In addition to providing a formula interface, it also has a function (cvAlpha.glmnet) to do crossvalidation for both elastic net parameters α and λ, as well as some utility functions.
The formula interface
The interface that glmnetUtils provides is very much the same as for most modelling functions in R. To fit a model, you provide a formula and data frame. You can also provide any arguments that glmnet will accept. Here is a simple example:

    library(glmnetUtils)
    mtcarsMod <- glmnet(mpg ~ cyl + disp + hp, data=mtcars)
    mtcarsMod
    ## Call:
    ## glmnet.formula(formula = mpg ~ cyl + disp + hp, data = mtcars)
    ##
    ## Model fitting options:
    ##     Sparse model matrix: FALSE
    ##     Use model.frame: FALSE
    ##     Alpha: 1
    ##     Lambda summary:
    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    ## 0.03326 0.11690 0.41000 1.02800 1.44100 5.05500
Under the hood, glmnetUtils creates a model matrix and response vector, and passes them to the glmnet package to do the actual model fitting. Prediction also works as you'd expect: just pass a data frame containing the new observations, along with any arguments that predict.glmnet needs.

    # least squares regression: get predictions for lambda=1
    predict(mtcarsMod, newdata=mtcars, s=1)
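To illustrate what the formula interface saves you from writing, here is a rough sketch of the manual equivalent using base R's model.matrix. This is not the package's internal code (which by default avoids model.frame, as described in the next section). The names f, x, y and manualMod are illustrative only, and the glmnet call is skipped if that package is not installed.

```r
# Manual equivalent of the formula interface (illustrative sketch only)
f <- mpg ~ cyl + disp + hp

# Predictor matrix: drop the intercept column, since glmnet fits its own
x <- model.matrix(f, data = mtcars)[, -1]

# Response vector
y <- mtcars$mpg

if (requireNamespace("glmnet", quietly = TRUE)) {
    manualMod <- glmnet::glmnet(x, y)
    # Predictions at lambda = 1; plain glmnet wants a matrix, not a data frame
    head(predict(manualMod, newx = x, s = 1))
}
```

Every new dataset passed to predict would need the same matrix conversion, which is the boilerplate the formula interface removes.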
Building the model matrix
You may have noticed the options "use model.frame" and "sparse model matrix" in the printed output above. glmnetUtils includes a couple of options to improve performance, especially on datasets that are wide and/or have many categorical (factor) variables.

The standard R method for creating a model matrix out of a data frame uses the model.frame function, which has a major disadvantage when it comes to wide data. It generates a terms object, which specifies how the original columns of data relate to the columns in the model matrix. This involves creating and storing a (roughly) square matrix of size p × p, where p is the number of variables in the model. When p > 10000, which isn't uncommon these days, the terms object can exceed a gigabyte in size. Even if there is enough memory to store the object, processing it can be very slow.
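The quadratic growth can be seen on a small scale with base R alone. The sketch below builds a terms object for p dummy predictors and inspects its "factors" attribute, which is the variables-by-terms matrix in question. The helper termsFactors and the column names are made up for illustration, and the values of p are kept small so it runs quickly.

```r
# Return the "factors" attribute of a terms object with p predictors
termsFactors <- function(p) {
    df <- as.data.frame(matrix(0, nrow = 2, ncol = p + 1))
    names(df) <- c("y", paste0("x", seq_len(p)))
    tt <- terms(y ~ ., data = df)   # the dot expands to all p predictors
    attr(tt, "factors")
}

f200 <- termsFactors(200)
f400 <- termsFactors(400)

dim(f200)   # (p + 1) rows (response plus predictors) by p columns (terms)
dim(f400)

# Doubling p roughly quadruples the storage needed
sizeRatio <- as.numeric(object.size(f400)) / as.numeric(object.size(f200))
sizeRatio
```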
Another issue with the standard approach is the treatment of factors. Normally, model.matrix will turn an N-level factor into an indicator matrix with N−1 columns, with one column being dropped. This is necessary for unregularised models as fit with lm and glm, since the full set of N columns is linearly dependent. However, this may not be appropriate for a regularised model as fit with glmnet. The regularisation procedure shrinks the coefficients towards zero, which forces the estimated differences from the baseline to be smaller. But this only makes sense if the baseline level was chosen beforehand, or is otherwise meaningful as a default; otherwise it is effectively making the levels more similar to an arbitrarily chosen level.
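A tiny example of the default behaviour, using base R only. The data frame df and factor f are invented for illustration.

```r
# A 3-level factor under R's default treatment contrasts
df <- data.frame(f = factor(c("a", "b", "c", "a")))

mm <- model.matrix(~ f, data = df)

# Level "a" has no column of its own: it is absorbed into the intercept
# and becomes the baseline that the other levels are compared against
colnames(mm)
```

Here the choice of "a" as baseline is purely alphabetical, which is the kind of arbitrary choice the paragraph above warns about.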
To deal with these problems, glmnetUtils by default will avoid using model.frame, instead building up the model matrix term-by-term. This avoids the memory cost of creating a terms object, and can be much faster than the standard approach. It will also include one column in the model matrix for all levels in a factor; that is, no baseline level is assumed. In this situation, the coefficients represent differences from the overall mean response, and shrinking them to zero is meaningful (usually). Machine learners may also recognise this as one-hot encoding.
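For comparison, a full set of indicator columns can be requested from base R's model.matrix by supplying an identity contrast matrix. This only shows what the resulting encoding looks like. It still goes through model.frame, so it is not how glmnetUtils builds the matrix internally. The data frame df is invented for illustration.

```r
df <- data.frame(f = factor(c("a", "b", "c", "a")))

# contrasts(..., contrasts = FALSE) returns an identity matrix,
# giving one indicator column per level with no baseline dropped
full <- model.matrix(~ f, data = df,
                     contrasts.arg = list(f = contrasts(df$f, contrasts = FALSE)))

colnames(full)

# Each row has exactly one indicator switched on (one-hot encoding)
rowSums(full[, -1])
```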
glmnetUtils can also generate a sparse model matrix, using the sparse.model.matrix function provided in the Matrix package. This works exactly the same as a regular model matrix, but takes up significantly less memory if many of its entries are zero. A scenario where this is the case would be where many of the predictors are factors, each with a large number of levels.
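The memory difference is easy to check directly with the Matrix package. The sketch below uses simulated data with a single 50-level factor; the sizes and names are arbitrary choices for illustration.

```r
library(Matrix)

set.seed(1)
# One factor with 50 levels, 5000 observations (simulated)
df <- data.frame(f = factor(sample(sprintf("lev%02d", 1:50), 5000, replace = TRUE)))

dense  <- model.matrix(~ f, data = df)
sparse <- sparse.model.matrix(~ f, data = df)

# Same dimensions and same entries...
dim(dense)
dim(sparse)
maxDiff <- max(abs(as.matrix(sparse) - dense))

# ...but the sparse version only stores the nonzero entries
object.size(dense)
object.size(sparse)
```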
Crossvalidation for α
One piece missing from the standard glmnet package is a way of choosing α, the elastic net mixing parameter, similar to how cv.glmnet chooses λ, the shrinkage parameter. To fix this, glmnetUtils provides the cvAlpha.glmnet function, which uses crossvalidation to examine the impact on the model of changing α and λ. The interface is the same as for the other functions:

    # Leukemia dataset from Trevor Hastie's website:
    # http://web.stanford.edu/~hastie/glmnet/glmnetData/Leukemia.RData
    load("~/Leukemia.rdata")
    leuk <- do.call(data.frame, Leukemia)
    cvAlpha.glmnet(y ~ ., data=leuk, family="binomial")
    ## Call:
    ## cvAlpha.glmnet.formula(formula = y ~ ., data = leuk, family = "binomial")
    ##
    ## Model fitting options:
    ##     Sparse model matrix: FALSE
    ##     Use model.frame: FALSE
    ##     Alpha values: 0 0.001 0.008 0.027 0.064 0.125 0.216 0.343 0.512 0.729 1
    ##     Number of crossvalidation folds for lambda: 10

cvAlpha.glmnet uses the algorithm described in the help for cv.glmnet, which is to fix the distribution of observations across folds and then call cv.glmnet in a loop with different values of α. Optionally, you can parallelise this outer loop by setting the outerParallel argument to a non-NULL value. Currently, glmnetUtils supports the following methods of parallelisation:
- Via parLapply in the parallel package. To use this, set outerParallel to a valid cluster object created by makeCluster.
- Via rxExec as supplied by Microsoft R Server's RevoScaleR package. To use this, set outerParallel to a valid compute context created by RxComputeContext, or a character string specifying such a context.
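The fixed-folds loop itself can be sketched in a few lines with plain glmnet. This is a simplified illustration under stated assumptions, not the cvAlpha.glmnet source: the data are simulated, the names foldid, alphas, cvFits and minErr are chosen for this example, and the α grid is seq(0, 1, length.out = 11)^3, which reproduces the alpha values printed above. The glmnet part is skipped if the package is not installed.

```r
set.seed(42)
n <- 100
x <- matrix(rnorm(n * 20), n, 20)   # simulated predictors
y <- rnorm(n)                       # simulated response

# Assign every observation to one of 10 folds once, up front,
# so that each value of alpha is assessed on identical splits
foldid <- sample(rep(seq_len(10), length.out = n))

# Alpha grid: cubes of an evenly spaced sequence, denser near zero
alphas <- seq(0, 1, length.out = 11)^3

if (requireNamespace("glmnet", quietly = TRUE)) {
    # Outer loop over alpha; cv.glmnet handles the lambda path internally
    cvFits <- lapply(alphas, function(a) {
        glmnet::cv.glmnet(x, y, alpha = a, foldid = foldid)
    })
    # Lowest crossvalidated error reached for each alpha
    minErr <- sapply(cvFits, function(fit) min(fit$cvm))
    data.frame(alpha = alphas, minCvError = minErr)
}
```

Because the iterations are independent, the lapply call is the natural place to substitute parallel::parLapply with a cluster from makeCluster.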
Conclusion
The glmnetUtils package is a way to improve quality of life for users of glmnet. As with many R packages, it's always under development; you can get the latest version from my GitHub repo. The easiest way to install it is via devtools:

    library(devtools)
    install_github("hong-revo/glmnetUtils")

A more detailed version of this post can also be found at the package vignette. If you find a bug, or if you want to suggest improvements to the package, please feel free to contact me at hongooi@microsoft.com.
Reposted from: http://blog.revolutionanalytics.com/2016/11/glmnetutils.html