A Summary of Deep Learning Optimization Methods
2015-08-19 13:10
Excerpted from this blog: http://blog.csdn.net/lien0906/article/details/47399823
In August 2015, the Adam method was also added to Caffe.
Stochastic Gradient Descent (SGD)
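SGD parameters
When training with stochastic gradient descent (SGD), the following parameters can generally be tuned:
Learning Rate
Weight Decay
Momentum
Learning Rate Decay
Of these, only the first (the learning rate) is required. The others exist to make training more adaptive, so the last three can be set to 0 when they are not needed.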
Learning Rate
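The learning rate determines how fast the weights are updated. If it is too large, the update overshoots the optimum; if it is too small, descent is too slow. Tuning by hand alone means repeatedly adjusting the learning rate, which is why the other three parameters were proposed as adaptive remedies.
$$w_i \leftarrow w_i - \eta\frac{\partial E}{\partial w_i}$$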
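Weight Decay
In practice, a regularization term is added to the cost function to keep the model from over-fitting. In SGD this shows up as the extra term $-\eta\lambda w_i$ in the update:
$$w_i \leftarrow w_i - \eta\frac{\partial E}{\partial w_i} - \eta\lambda w_i$$
The basic idea of this formula is to reduce the influence of unimportant parameters on the result: weights that receive little gradient shrink toward zero, while useful weights are held in place by their error gradient and are largely unaffected by weight decay. In spirit this is very similar to Dropout.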
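The plain update and the weight-decay update can be sketched in a few lines of Python. This is a minimal illustration and not code from any library: the function name sgd_step and the toy error function are assumptions made for the example.

```python
def sgd_step(w, grad, lr, weight_decay=0.0):
    """One SGD update: w_i <- w_i - lr * dE/dw_i - lr * weight_decay * w_i."""
    return [wi - lr * gi - lr * weight_decay * wi for wi, gi in zip(w, grad)]


# Toy error E(w) = 0.5 * (w_0 - 3)^2; w_1 does not influence E at all.
def toy_grad(w):
    return [w[0] - 3.0, 0.0]


w = [0.0, 1.0]
for _ in range(2000):
    w = sgd_step(w, toy_grad(w), lr=0.1, weight_decay=0.01)

print(w)  # w[0] settles just below 3; the unused w[1] shrinks steadily toward 0
```

With weight_decay=0.0 the same function reduces to the plain learning-rate update, and the unused weight would simply stay at 1.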
Learning Rate Decay
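A way to improve how well SGD finds an optimum: the learning rate is reduced at every iteration.
initial learning rate $\eta = \eta_0$
learning rate decay $\eta_d$
At each iteration $s$:
$$\eta(s) = \frac{\eta_0}{1 + s\cdot\eta_d}$$
In many papers, another common approach is to scale the learning rate directly after every 30-50 iterations or so ($\eta \leftarrow 0.5\cdot\eta$).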
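Both schedules are easy to tabulate. The sketch below is only illustrative: the function names are made up for this example, and the halving interval of 30 is one arbitrary choice from the 30-50 range mentioned above.

```python
def inverse_decay(eta0, eta_d, s):
    """eta(s) = eta0 / (1 + s * eta_d)."""
    return eta0 / (1.0 + s * eta_d)


def halving_decay(eta0, s, every=30):
    """Multiply the rate by 0.5 once every `every` iterations."""
    return eta0 * 0.5 ** (s // every)


for s in (0, 30, 100, 1000):
    print(s, inverse_decay(0.1, 0.01, s), halving_decay(0.1, s))
```

The inverse schedule shrinks smoothly at every iteration, while the halving schedule stays constant between drops.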
Momentum
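Inspired by Newton's first law, the basic idea is to add "inertia" to the search, so that SGD can learn faster where the error surface has flat regions.
$$w_i \leftarrow m\cdot w_i - \eta\frac{\partial E}{\partial w_i}$$
Note: there is no single agreed notation here; this is just one of the forms in use. More commonly the momentum coefficient multiplies the previous update rather than the weight itself, as in the Caffe formulas $V_{t+1} = \mu V_t - \alpha\nabla L(W_t)$, $W_{t+1} = W_t + V_{t+1}$ later in this post.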
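To see the "inertia" effect on a flat region, here is a small sketch. It assumes the velocity form of momentum (the coefficient scales the previous update, as in the Caffe SGD formulas quoted later in this post), and the nearly flat quadratic error is a toy choice for illustration.

```python
def momentum_step(w, v, grad, lr, m):
    """v <- m * v - lr * grad;  w <- w + v (elementwise on lists)."""
    v = [m * vi - lr * gi for vi, gi in zip(v, grad)]
    w = [wi + vi for wi, vi in zip(w, v)]
    return w, v


def run(m, steps=50, lr=0.01):
    # Nearly flat error surface: E(w) = 0.5 * 0.01 * w^2, so the gradient is 0.01 * w.
    w, v = [10.0], [0.0]
    for _ in range(steps):
        w, v = momentum_step(w, v, [0.01 * w[0]], lr, m)
    return w[0]


print(run(m=0.0))  # plain SGD barely moves away from 10
print(run(m=0.9))  # with momentum the weight travels further toward the minimum at 0
```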
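SGD pros and cons
Pros: simple to implement, and optimization is very fast when there are enough training samples.
Cons: many parameters have to be tuned by hand, such as the learning rate and the convergence criterion.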
Averaged Stochastic Gradient Descent (ASGD)
ASGD computes the average of the weights on top of SGD.
$$\bar{w}_t = \frac{1}{t-t_0}\sum_{i=t_0+1}^{t} w_i$$
ASGD parameters
ASGD adds the parameter $t_0$ to those of SGD:
learning rate $\eta$
parameter $t_0$
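The averaging formula does not require storing every iterate; a running mean is enough. The following sketch assumes a scalar weight and a noisy toy gradient, and the names asgd and noisy_grad are invented for this example.

```python
import random


def asgd(grad_fn, w0, lr, t0, steps, seed=0):
    """Run scalar SGD and average the iterates w_{t0+1}, ..., w_t."""
    rng = random.Random(seed)
    w, w_bar, count = w0, 0.0, 0
    for t in range(1, steps + 1):
        w = w - lr * grad_fn(w, rng)
        if t > t0:
            count += 1                    # count == t - t0
            w_bar += (w - w_bar) / count  # running mean of the iterates
    return w, w_bar


# Noisy gradient of the toy error E(w) = 0.5 * (w - 2)^2.
def noisy_grad(w, rng):
    return (w - 2.0) + rng.gauss(0.0, 1.0)


last, averaged = asgd(noisy_grad, w0=0.0, lr=0.1, t0=100, steps=2000)
print(last, averaged)  # the averaged iterate fluctuates much less around 2
```

Choosing $t_0$ too small lets the early, far-from-optimal iterates pollute the average, which is one way to see why the parameter is hard to set.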
ASGD pros and cons
Pros: the computational cost is as low as that of second order stochastic gradient descent (2SGD).
Cons: training is slower than with SGD, and $t_0$ is very hard to set.
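Conjugate Gradient (CG)
A method that sits between steepest descent and Newton's method. It needs only first-derivative information, yet overcomes the slow convergence of gradient descent.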
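Limited-memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS) (a quasi-Newton method)
L-BFGS is well suited to large-scale numerical computation. It converges quickly like Newton's method but does not need to store the Hessian matrix as Newton's method does, which saves a great deal of memory and computation.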
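Application analysis
Different optimization algorithms have different strengths and weaknesses and suit different settings:
When the parameter dimension is fairly low (generally meaning fewer than 10,000 dimensions), LBFGS works better than SGD (stochastic gradient descent) and CG (conjugate gradient), especially for models with convolution.
For high-dimensional parameter problems, CG works better than the other two. In other words, SGD generally does somewhat worse. The same holds with GPU acceleration: on a GPU, LBFGS and CG speed up markedly, while SGD gains very little.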
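On a single-core processor, the advantage of LBFGS comes mainly from using approximate second-order information about the parameters to speed up optimization, while CG benefits from conjugacy information between the parameters without having to compute the Hessian matrix.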
The following is the introduction from the official Caffe website.
Solver
The solver orchestrates model optimization by coordinating the network’s forward inference and backward gradients to form parameter updates that attempt to improve the loss. The responsibilities of learning are divided between the Solver for overseeing the optimization and generating parameter updates and the Net for yielding loss and gradients.
The Caffe solvers are Stochastic Gradient Descent (SGD), Adaptive Gradient (ADAGRAD), and Nesterov’s Accelerated Gradient (NESTEROV).
The solver
1. scaffolds the optimization bookkeeping and creates the training network for learning and test network(s) for evaluation.
2. iteratively optimizes by calling forward / backward and updating parameters
3. (periodically) evaluates the test networks
4. snapshots the model and solver state throughout the optimization
where each iteration
1. calls network forward to compute the output and loss
2. calls network backward to compute the gradients
3. incorporates the gradients into parameter updates according to the solver method
4. updates the solver state according to learning rate, history, and method
to take the weights all the way from initialization to learned model.
Like Caffe models, Caffe solvers run in CPU / GPU modes.
Methods
The solver methods address the general optimization problem of loss minimization. For dataset $D$, the optimization objective is the average loss over all $|D|$ data instances throughout the dataset
$$L(W) = \frac{1}{|D|}\sum_{i}^{|D|} f_W\left(X^{(i)}\right) + \lambda r(W)$$
where $f_W\left(X^{(i)}\right)$ is the loss on data instance $X^{(i)}$ and $r(W)$ is a regularization term with weight $\lambda$. $|D|$ can be very large, so in practice, in each solver iteration we use a stochastic approximation of this objective, drawing a mini-batch of $N \ll |D|$ instances:
$$L(W) \approx \frac{1}{N}\sum_{i}^{N} f_W\left(X^{(i)}\right) + \lambda r(W)$$
The model computes $f_W$ in the forward pass and the gradient $\nabla f_W$ in the backward pass.
The parameter update $\Delta W$ is formed by the solver from the error gradient $\nabla f_W$, the regularization gradient $\nabla r(W)$, and other particulars to each method.
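As a rough illustration of the stochastic approximation, the sketch below evaluates the same objective once over a whole dataset and once over a mini-batch. Everything specific in it is an assumption for the example: a one-parameter linear model, a squared-error loss for $f_W$, and $r(W) = W^2$.

```python
import random


def objective(w, data, lam):
    """Average squared-error loss over `data` plus lam * r(w), with r(w) = w^2."""
    loss = sum((w * x - y) ** 2 for x, y in data) / len(data)
    return loss + lam * w ** 2


rng = random.Random(0)
# Hypothetical dataset D in which y is roughly 3 * x.
D = []
for _ in range(10000):
    x = rng.uniform(-1.0, 1.0)
    D.append((x, 3.0 * x + rng.gauss(0.0, 0.1)))

full = objective(2.5, D, lam=0.01)                   # average over all |D| instances
batch = objective(2.5, rng.sample(D, 64), lam=0.01)  # mini-batch of N = 64 << |D|
print(full, batch)  # the mini-batch value is a noisy estimate of the full one
```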
SGD
Stochastic gradient descent (solver_type: SGD) updates the weights $W$ by a linear combination of the negative gradient $\nabla L(W)$ and the previous weight update $V_t$. The learning rate $\alpha$ is the weight of the negative gradient. The momentum $\mu$ is the weight of the previous update.
Formally, we have the following formulas to compute the update value $V_{t+1}$ and the updated weights $W_{t+1}$ at iteration $t+1$, given the previous weight update $V_t$ and current weights $W_t$:
$$V_{t+1} = \mu V_t - \alpha\nabla L(W_t)$$
$$W_{t+1} = W_t + V_{t+1}$$
The learning “hyperparameters” ($\alpha$ and $\mu$) might require a bit of tuning for best results. If you’re not sure where to start, take a look at the “Rules of thumb” below, and for further information you might refer to Leon Bottou’s Stochastic Gradient Descent Tricks [1].
[1] L. Bottou. Stochastic Gradient Descent Tricks. Neural Networks: Tricks of the Trade: Springer, 2012.
Rules of thumb for setting the learning rate $\alpha$ and momentum $\mu$
A good strategy for deep learning with SGD is to initialize the learning rate $\alpha$ to a value around $\alpha \approx 0.01 = 10^{-2}$, and to drop it by a constant factor (e.g., 10) throughout training when the loss begins to reach an apparent “plateau”, repeating this several times. Generally, you probably want to use a momentum $\mu = 0.9$ or similar value. By smoothing the weight updates across iterations, momentum tends to make deep learning with SGD both stabler and faster.
This was the strategy used by Krizhevsky et al. [1] in their famously winning CNN entry to the ILSVRC-2012 competition, and Caffe makes this strategy easy to implement in a SolverParameter, as in our reproduction of [1] at ./examples/imagenet/alexnet_solver.prototxt.
To use a learning rate policy like this, you can put the following lines somewhere in your solver prototxt file:
base_lr: 0.01     # begin training at a learning rate of 0.01 = 1e-2
lr_policy: "step" # learning rate policy: drop the learning rate in "steps"
                  # by a factor of gamma every stepsize iterations
gamma: 0.1        # drop the learning rate by a factor of 10
                  # (i.e., multiply it by a factor of gamma = 0.1)
stepsize: 100000  # drop the learning rate every 100K iterations
max_iter: 350000  # train for 350K iterations total
momentum: 0.9
Under the above settings, we’ll always use momentum $\mu = 0.9$. We’ll begin training at a base_lr of $\alpha = 0.01 = 10^{-2}$ for the first 100,000 iterations, then multiply the learning rate by gamma ($\gamma$) and train at $\alpha' = \alpha\gamma = (0.01)(0.1) = 0.001 = 10^{-3}$ for iterations 100K-200K, then at $\alpha'' = 10^{-4}$ for iterations 200K-300K, and finally train until iteration 350K (since we have max_iter: 350000) at $\alpha''' = 10^{-5}$.
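The schedule just described can be reproduced with a one-line formula, base_lr * gamma^floor(iter / stepsize). The Python helper below is a sketch for checking the numbers and is not part of Caffe; the function name step_lr is made up, and its defaults mirror the solver settings above.

```python
def step_lr(iteration, base_lr=0.01, gamma=0.1, stepsize=100000):
    """Learning rate under the "step" policy: base_lr * gamma^floor(iter / stepsize)."""
    return base_lr * gamma ** (iteration // stepsize)


for it in (0, 99999, 100000, 250000, 349999):
    print(it, step_lr(it))
```

The printed rates stay at 1e-2 up to iteration 99,999, then fall to about 1e-3, 1e-4 and 1e-5 in the later ranges (up to floating-point rounding).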
Note that the momentum setting $\mu$ effectively multiplies the size of your updates by a factor of $\frac{1}{1-\mu}$ after many iterations of training, so if you increase $\mu$, it may be a good idea to decrease $\alpha$ accordingly (and vice versa).
For example, with $\mu = 0.9$, we have an effective update size multiplier of $\frac{1}{1-0.9} = 10$. If we increased the momentum to $\mu = 0.99$, we’ve increased our update size multiplier to 100, so we should drop $\alpha$ (base_lr) by a factor of 10.
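The $\frac{1}{1-\mu}$ factor is the limit of the geometric series $1 + \mu + \mu^2 + \dots$ that arises when the gradient stays roughly constant. A quick numerical check, assuming a constant gradient and arbitrary toy values for $\alpha$ and the gradient:

```python
def steady_state_multiplier(mu, iterations=5000):
    """Iterate V <- mu * V - alpha * g with constant g; return |V| / (alpha * g)."""
    alpha, g, v = 0.01, 1.0, 0.0
    for _ in range(iterations):
        v = mu * v - alpha * g
    return abs(v) / (alpha * g)


print(steady_state_multiplier(0.9))   # approaches 1 / (1 - 0.9) = 10
print(steady_state_multiplier(0.99))  # approaches 1 / (1 - 0.99) = 100
```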
Note also that the above settings are merely guidelines, and they’re definitely not guaranteed to be optimal (or even work at all!) in every situation. If learning diverges (e.g., you start to see very large or NaN or inf loss values or outputs), try dropping the base_lr (e.g., base_lr: 0.001) and re-training, repeating this until you find a base_lr value that works.
[1] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems, 2012.
AdaGrad
The adaptive gradient (solver_type: ADAGRAD) method (Duchi et al. [1]) is a gradient-based optimization method (like SGD) that attempts to “find needles in haystacks in the form of very predictive but rarely seen features,” in Duchi et al.’s words. Given the update information from all previous iterations $(\nabla L(W))_{t'}$ for $t' \in \{1, 2, ..., t\}$, the update formulas proposed by [1] are as follows, specified for each component $i$ of the weights $W$:
$$(W_{t+1})_i = (W_t)_i - \alpha\frac{(\nabla L(W_t))_i}{\sqrt{\sum_{t'=1}^{t}(\nabla L(W_{t'}))_i^2}}$$
Note that in practice, for weights $W \in \mathbb{R}^d$, AdaGrad implementations (including the one in Caffe) use only $O(d)$ extra storage for the historical gradient information (rather than the $O(dt)$ storage that would be necessary to store each historical gradient individually).
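The $O(d)$ storage comes from keeping only a running sum of squared gradients per weight. The sketch below is an illustrative pure-Python version and not Caffe's implementation; the small eps added to the denominator is an assumption made here to avoid division by zero, and the toy gradients are invented.

```python
import math


def adagrad_step(w, grad, hist, lr, eps=1e-8):
    """One AdaGrad update. `hist` holds the running sum of squared gradients,
    one number per weight, so the extra storage is O(d)."""
    for i, g in enumerate(grad):
        hist[i] += g * g
        w[i] -= lr * g / (math.sqrt(hist[i]) + eps)
    return w, hist


w, hist = [1.0, 1.0], [0.0, 0.0]
# Toy gradients: the first feature fires every step, the second only rarely.
for t in range(100):
    grad = [1.0, 1.0 if t % 25 == 0 else 0.0]
    w, hist = adagrad_step(w, grad, hist, lr=0.1)

print(w, hist)
```

Because the rarely active weight accumulates a much smaller history, each of its few updates is larger than a typical update of the frequently active weight, which is the "needles in haystacks" behaviour described above.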
[1] J. Duchi, E. Hazan, and Y. Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. The Journal of Machine Learning Research, 2011.
NAG
Nesterov’s accelerated gradient (solver_type: NESTEROV) was proposed by Nesterov [1] as an “optimal” method of convex optimization, achieving a convergence rate of $O(1/t^2)$ rather than the $O(1/t)$. Though the required assumptions to achieve the $O(1/t^2)$ convergence typically will not hold for deep networks trained with Caffe (e.g., due to non-smoothness and non-convexity), in practice NAG can be a very effective method for optimizing certain types of deep learning architectures, as demonstrated for deep MNIST autoencoders by Sutskever et al. [2].
The weight update formulas look very similar to the SGD updates given above:
$$V_{t+1} = \mu V_t - \alpha\nabla L(W_t + \mu V_t)$$
$$W_{t+1} = W_t + V_{t+1}$$
What distinguishes the method from SGD is the weight setting $W$ on which we compute the error gradient $\nabla L(W)$: in NAG we take the gradient on weights with added momentum $\nabla L(W_t + \mu V_t)$; in SGD we simply take the gradient $\nabla L(W_t)$ on the current weights themselves.
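The only difference between the two updates is where the gradient is evaluated, which a side-by-side sketch makes explicit. This is a scalar toy version written for illustration; the function names and the quadratic objective are assumptions, not Caffe code.

```python
def nag_step(w, v, grad_fn, lr, mu):
    """V <- mu * V - lr * grad(W + mu * V);  W <- W + V (scalar case)."""
    v = mu * v - lr * grad_fn(w + mu * v)  # gradient at the look-ahead point
    return w + v, v


def sgd_momentum_step(w, v, grad_fn, lr, mu):
    """Same update, but with the gradient taken at the current weights."""
    v = mu * v - lr * grad_fn(w)
    return w + v, v


def grad(w):  # toy objective L(w) = 0.5 * w^2
    return w


w_nag, v_nag = 5.0, 0.0
w_sgd, v_sgd = 5.0, 0.0
for _ in range(30):
    w_nag, v_nag = nag_step(w_nag, v_nag, grad, lr=0.1, mu=0.9)
    w_sgd, v_sgd = sgd_momentum_step(w_sgd, v_sgd, grad, lr=0.1, mu=0.9)

print(w_nag, w_sgd)
```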
[1] Y. Nesterov. A Method of Solving a Convex Programming Problem with Convergence Rate $O(1/\sqrt{k})$. Soviet Mathematics Doklady, 1983.
[2] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the Importance of Initialization and Momentum in Deep Learning. Proceedings of the 30th International Conference on Machine Learning, 2013.
Scaffolding
The solver scaffolding prepares the optimization method and initializes the model to be learned in Solver::Presolve().
> caffe train -solver examples/mnist/lenet_solver.prototxt
I0902 13:35:56.474978 16020 caffe.cpp:90] Starting Optimization
I0902 13:35:56.475190 16020 solver.cpp:32] Initializing solver from parameters:
test_iter: 100
test_interval: 500
base_lr: 0.01
display: 100
max_iter: 10000
lr_policy: "inv"
gamma: 0.0001
power: 0.75
momentum: 0.9
weight_decay: 0.0005
snapshot: 5000
snapshot_prefix: "examples/mnist/lenet"
solver_mode: GPU
net: "examples/mnist/lenet_train_test.prototxt"
Net initialization
I0902 13:35:56.655681 16020 solver.cpp:72] Creating training net from net file: examples/mnist/lenet_train_test.prototxt
[...]
I0902 13:35:56.656740 16020 net.cpp:56] Memory required for data: 0
I0902 13:35:56.656791 16020 net.cpp:67] Creating Layer mnist
I0902 13:35:56.656811 16020 net.cpp:356] mnist -> data
I0902 13:35:56.656846 16020 net.cpp:356] mnist -> label
I0902 13:35:56.656874 16020 net.cpp:96] Setting up mnist
I0902 13:35:56.694052 16020 data_layer.cpp:135] Opening lmdb examples/mnist/mnist_train_lmdb
I0902 13:35:56.701062 16020 data_layer.cpp:195] output data size: 64,1,28,28
I0902 13:35:56.701146 16020 data_layer.cpp:236] Initializing prefetch
I0902 13:35:56.701196 16020 data_layer.cpp:238] Prefetch initialized.
I0902 13:35:56.701212 16020 net.cpp:103] Top shape: 64 1 28 28 (50176)
I0902 13:35:56.701230 16020 net.cpp:103] Top shape: 64 1 1 1 (64)
[...]
I0902 13:35:56.703737 16020 net.cpp:67] Creating Layer ip1
I0902 13:35:56.703753 16020 net.cpp:394] ip1 <- pool2
I0902 13:35:56.703778 16020 net.cpp:356] ip1 -> ip1
I0902 13:35:56.703797 16020 net.cpp:96] Setting up ip1
I0902 13:35:56.728127 16020 net.cpp:103] Top shape: 64 500 1 1 (32000)
I0902 13:35:56.728142 16020 net.cpp:113] Memory required for data: 5039360
I0902 13:35:56.728175 16020 net.cpp:67] Creating Layer relu1
I0902 13:35:56.728194 16020 net.cpp:394] relu1 <- ip1
I0902 13:35:56.728219 16020 net.cpp:345] relu1 -> ip1 (in-place)
I0902 13:35:56.728240 16020 net.cpp:96] Setting up relu1
I0902 13:35:56.728256 16020 net.cpp:103] Top shape: 64 500 1 1 (32000)
I0902 13:35:56.728270 16020 net.cpp:113] Memory required for data: 5167360
I0902 13:35:56.728287 16020 net.cpp:67] Creating Layer ip2
I0902 13:35:56.728304 16020 net.cpp:394] ip2 <- ip1
I0902 13:35:56.728333 16020 net.cpp:356] ip2 -> ip2
I0902 13:35:56.728356 16020 net.cpp:96] Setting up ip2
I0902 13:35:56.728690 16020 net.cpp:103] Top shape: 64 10 1 1 (640)
I0902 13:35:56.728705 16020 net.cpp:113] Memory required for data: 5169920
I0902 13:35:56.728734 16020 net.cpp:67] Creating Layer loss
I0902 13:35:56.728747 16020 net.cpp:394] loss <- ip2
I0902 13:35:56.728767 16020 net.cpp:394] loss <- label
I0902 13:35:56.728786 16020 net.cpp:356] loss -> loss
I0902 13:35:56.728811 16020 net.cpp:96] Setting up loss
I0902 13:35:56.728837 16020 net.cpp:103] Top shape: 1 1 1 1 (1)
I0902 13:35:56.728849 16020 net.cpp:109] with loss weight 1
I0902 13:35:56.728878 16020 net.cpp:113] Memory required for data: 5169924
Loss
I0902 13:35:56.728893 16020 net.cpp:170] loss needs backward computation.
I0902 13:35:56.728909 16020 net.cpp:170] ip2 needs backward computation.
I0902 13:35:56.728924 16020 net.cpp:170] relu1 needs backward computation.
I0902 13:35:56.728938 16020 net.cpp:170] ip1 needs backward computation.
I0902 13:35:56.728953 16020 net.cpp:170] pool2 needs backward computation.
I0902 13:35:56.728970 16020 net.cpp:170] conv2 needs backward computation.
I0902 13:35:56.728984 16020 net.cpp:170] pool1 needs backward computation.
I0902 13:35:56.728998 16020 net.cpp:170] conv1 needs backward computation.
I0902 13:35:56.729014 16020 net.cpp:172] mnist does not need backward computation.
I0902 13:35:56.729027 16020 net.cpp:208] This network produces output loss
I0902 13:35:56.729053 16020 net.cpp:467] Collecting Learning Rate and Weight Decay.
I0902 13:35:56.729071 16020 net.cpp:219] Network initialization done.
I0902 13:35:56.729085 16020 net.cpp:220] Memory required for data: 5169924
I0902 13:35:56.729277 16020 solver.cpp:156] Creating test net (#0) specified by net file: examples/mnist/lenet_train_test.prototxt
Completion
I0902 13:35:56.806970 16020 solver.cpp:46] Solver scaffolding done.
I0902 13:35:56.806984 16020 solver.cpp:165] Solving LeNet
Updating Parameters
The actual weight update is made by the solver then applied to the net parameters in Solver::ComputeUpdateValue(). The ComputeUpdateValue method incorporates any weight decay $r(W)$ into the weight gradients (which currently just contain the error gradients) to get the final gradient with respect to each network weight. Then these gradients are scaled by the learning rate $\alpha$ and the update to subtract is stored in each parameter Blob’s diff field. Finally, the Blob::Update method is called on each parameter blob, which performs the final update (subtracting the Blob’s diff from its data).
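The data flow described here (fold weight decay into the gradient, scale by the learning rate, store the result in diff, then subtract diff from data) can be mimicked with a toy class. This is a hypothetical Python stand-in, not Caffe's C++ code: ToyBlob and compute_update_value are invented names, the sketch assumes L2 weight decay so that the decay gradient is simply the weight, and it leaves out momentum history.

```python
class ToyBlob:
    """Stand-in for a parameter blob: `data` holds weights, `diff` holds gradients."""

    def __init__(self, data):
        self.data = list(data)
        self.diff = [0.0] * len(self.data)

    def update(self):
        # Final step: subtract diff from data.
        self.data = [d - g for d, g in zip(self.data, self.diff)]


def compute_update_value(blob, lr, weight_decay):
    # Fold the L2 weight-decay gradient into the error gradient, then scale by lr.
    blob.diff = [lr * (g + weight_decay * d) for d, g in zip(blob.data, blob.diff)]


blob = ToyBlob([1.0, -2.0])
blob.diff = [0.5, 0.5]  # pretend the backward pass wrote these error gradients
compute_update_value(blob, lr=0.1, weight_decay=0.01)
blob.update()
print(blob.data)
```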
Snapshotting and Resuming
The solver snapshots the weights and its own state during training in Solver::Snapshot() and Solver::SnapshotSolverState(). The weight snapshots export the learned model while the solver snapshots allow training to be resumed from a given point. Training is resumed by Solver::Restore() and Solver::RestoreSolverState().
Weights are saved without extension while solver states are saved with .solverstate extension. Both files will have an _iter_N suffix for the snapshot iteration number.
Snapshotting is configured by:
# The snapshot interval in iterations.
snapshot: 5000
# File path prefix for snapshotting model weights and solver state.
# Note: this is relative to the invocation of the `caffe` utility, not the
# solver definition file.
snapshot_prefix: "/path/to/model"
# Snapshot the diff along with the weights. This can help debugging training
# but takes more storage.
snapshot_diff: false
# A final snapshot is saved at the end of training unless
# this flag is set to false. The default is true.
snapshot_after_train: true
in the solver definition prototxt.
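To make the naming rule concrete, the helper below builds the two file names for a given iteration exactly as described in the text (prefix plus _iter_N, with .solverstate added for the solver state). It is a hypothetical illustration, not a Caffe function, and the prefix and interval are taken from the LeNet solver log shown earlier.

```python
def snapshot_paths(snapshot_prefix, iteration):
    """Return (weights_path, solver_state_path) following the naming described above."""
    base = f"{snapshot_prefix}_iter_{iteration}"
    return base, base + ".solverstate"


# snapshot: 5000 and max_iter: 10000, as in the LeNet solver settings.
for it in range(5000, 10001, 5000):
    print(snapshot_paths("examples/mnist/lenet", it))
```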