Caffe中learning rate 和 weight decay 的理解
2015-11-26 14:59
caffe.proto documents in detail each of the parameters that appear in a Caffe network.

1. About learning rate
optional float base_lr = 5; // The base learning rate
// The learning rate decay policy. The currently implemented learning rate
// policies are as follows:
//  - fixed: always return base_lr.
//  - step: return base_lr * gamma ^ (floor(iter / step))
//  - exp: return base_lr * gamma ^ iter
//  - inv: return base_lr * (1 + gamma * iter) ^ (- power)
//  - multistep: similar to step but it allows non uniform steps defined by
//    stepvalue
//  - poly: the effective learning rate follows a polynomial decay, to be
//    zero by the max_iter. return base_lr (1 - iter/max_iter) ^ (power)
//  - sigmoid: the effective learning rate follows a sigmoid decay
//    return base_lr ( 1/(1 + exp(-gamma * (iter - stepsize))))
//
// where base_lr, max_iter, gamma, step, stepvalue and power are defined
// in the solver parameter protocol buffer, and iter is the current iteration.
optional string lr_policy = 8;
optional float gamma = 9;         // The parameter to compute the learning rate.
optional float power = 10;        // The parameter to compute the learning rate.
optional float momentum = 11;     // The momentum value.
optional float weight_decay = 12; // The weight decay.
// regularization types supported: L1 and L2
// controlled by weight_decay
optional string regularization_type = 29 [default = "L2"];
// the stepsize for learning rate policy "step"
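The comments above fully specify each policy as a formula. As a minimal sketch (this is my own Python, not Caffe's C++ implementation), the mapping from iteration to effective learning rate could look like this:

```python
import math

def effective_lr(policy, iter, base_lr, gamma=0.1, power=0.75,
                 stepsize=1000, max_iter=10000, stepvalues=()):
    """Return the learning rate at iteration `iter` for the given lr_policy."""
    if policy == "fixed":
        return base_lr
    if policy == "step":
        return base_lr * gamma ** (iter // stepsize)
    if policy == "exp":
        return base_lr * gamma ** iter
    if policy == "inv":
        return base_lr * (1 + gamma * iter) ** (-power)
    if policy == "multistep":
        # like "step", but the drop points are listed explicitly in stepvalues
        current_step = sum(1 for sv in stepvalues if iter >= sv)
        return base_lr * gamma ** current_step
    if policy == "poly":
        return base_lr * (1 - iter / max_iter) ** power
    if policy == "sigmoid":
        return base_lr * (1.0 / (1.0 + math.exp(-gamma * (iter - stepsize))))
    raise ValueError("unknown lr_policy: " + policy)

print(effective_lr("fixed", 500, 0.01))                 # always base_lr
print(effective_lr("step", 2500, 0.01, stepsize=1000))  # base_lr * gamma**2
```

With `step`, the rate drops by a factor of `gamma` every `stepsize` iterations; `multistep` does the same but at the iterations you list.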
2. About weight decay
In machine learning and pattern recognition, overfitting can occur: as a network gradually overfits, its weights tend to grow large. To avoid this, a penalty term is added to the error function; a common choice is the sum of the squares of all the weights multiplied by a decay constant, which penalizes large weights.

regularization controlled by weight_decay

The weight-decay penalty drives the weights toward small absolute values and punishes large ones, because large weights cause the system to overfit and degrade its generalization performance.
The weight_decay parameter governs the regularization term of the neural net. During training, a regularization term is added to the network's loss to compute the backprop gradient. The weight_decay value determines how dominant this regularization term will be in the gradient computation.
As a rule of thumb, the more training examples you have, the weaker this term should be. The more parameters you have (e.g., deeper net, larger filters, large InnerProduct layers), the higher this term should be.
Caffe also allows you to choose between L2 regularization (the default) and L1 regularization, by setting regularization_type: "L1".
While learning rate may (and usually does) change during training, the regularization weight is fixed throughout.
4.1.1 SGD
Stochastic gradient descent (solver_type: SGD) updates the weights W by a linear combination of the negative gradient ∇L(W) and the previous weight update V_t. The learning rate α is the weight of the negative gradient. The momentum μ is the weight of the previous update.

Formally, we have the following formulas to compute the update value V_{t+1} and the updated weights W_{t+1} at iteration t+1, given the previous weight update V_t and the current weights W_t:

V_{t+1} = μ V_t − α ∇L(W_t)
W_{t+1} = W_t + V_{t+1}
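The two formulas above translate directly into code. A small numeric sketch (names and values are my own, not Caffe's):

```python
def sgd_momentum_step(W, V, grad, alpha=0.01, mu=0.9):
    """One SGD step: V_{t+1} = mu*V_t - alpha*grad; W_{t+1} = W_t + V_{t+1}."""
    V_next = [mu * v - alpha * g for v, g in zip(V, grad)]
    W_next = [w + v for w, v in zip(W, V_next)]
    return W_next, V_next

W, V = [1.0, -2.0], [0.0, 0.0]
grad = [0.5, -1.0]
W, V = sgd_momentum_step(W, V, grad)
# first step (V starts at zero): V ≈ [-0.005, 0.01], W ≈ [0.995, -1.99]
```

Note that on the first iteration the momentum term contributes nothing; its smoothing effect builds up as past updates accumulate in V.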
The learning "hyperparameters" (α and μ) might require a bit of tuning for best results. If you're not sure where to start, take a look at the "Rules of thumb" below, and for further information you might refer to Léon Bottou's Stochastic Gradient Descent Tricks [1].

[1] L. Bottou. Stochastic Gradient Descent Tricks. Neural Networks: Tricks of the Trade. Springer, 2012.
4.1.1.1 Rules of thumb for setting the learning rate α and momentum μ
A good strategy for deep learning with SGD is to initialize the learning rate α to a value around α ≈ 0.01 = 10^(-2), and drop it by a constant factor (e.g., 10) throughout training when the loss begins to reach an apparent "plateau", repeating this several times. Generally, you probably want to use a momentum μ = 0.9 or a similar value. By smoothing the weight updates across iterations, momentum tends to make deep learning with SGD both more stable and faster.
Here, μ = momentum and α = base_lr.
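Putting the pieces together, a typical SGD section of a solver.prototxt might look like the following (the values are illustrative, chosen to follow the rules of thumb above, not copied from any particular model):

```
base_lr: 0.01        # alpha: start around 10^-2
lr_policy: "step"    # drop the learning rate ...
gamma: 0.1           # ... by a factor of 10 ...
stepsize: 100000     # ... every 100000 iterations
momentum: 0.9        # mu
weight_decay: 0.0005 # strength of the (L2 by default) regularization term
```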
Difference between neural net weight decay and learning rate
The learning rate is a parameter that determines how much an update step influences the current value of the weights, while weight decay is an additional term in the weight-update rule that causes the weights to exponentially decay to zero if no other update is scheduled.
So let's say that we have a cost or error function E(w) that we want to minimize. Gradient descent tells us to modify the weights w in the direction of steepest descent in E:

w_i ← w_i − η ∂E/∂w_i,

where η is the learning rate; if it's large you will have a correspondingly large modification of the weights w_i (in general it shouldn't be too large, otherwise you'll overshoot the local minimum in your cost function).
In order to effectively limit the number of free parameters in your model so as to avoid over-fitting, it is possible to regularize the cost function. An easy way to do that is by introducing a zero-mean Gaussian prior over the weights, which is equivalent to changing the cost function to

Ẽ(w) = E(w) + (λ/2) w².

In practice this penalizes large weights and effectively limits the freedom in your model. The regularization parameter λ determines how you trade off the original cost E against the large-weights penalty.
Applying gradient descent to this new cost function we obtain:

w_i ← w_i − η ∂E/∂w_i − η λ w_i.

The new term −η λ w_i coming from the regularization causes the weight to decay in proportion to its size.
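This decay is easy to see numerically. A one-variable sketch of the regularized update above, using a toy quadratic cost E(w) = (w − 3)² of my own choosing (not from the original post):

```python
def decayed_step(w, eta=0.1, lam=0.5):
    """One gradient step on E~(w) = (w - 3)^2 + (lam/2) * w^2."""
    dE_dw = 2.0 * (w - 3.0)                 # gradient of the original cost E
    return w - eta * dE_dw - eta * lam * w  # the extra -eta*lam*w term is the decay

w = 10.0
w = decayed_step(w)
# gradient part pulls w toward the minimum at 3; the decay term
# additionally shrinks w toward 0, giving w ≈ 8.1 after one step
```

With λ = 0 the weight would move only toward 3; the −η λ w term biases every step toward zero, and the larger the weight, the stronger the pull.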