Study Notes of Neural Network Optimization(1)

2016-01-21 09:52
1/ Things to do before training the network: pre-train the network

Common network pre-training methods include stacks of RBMs (Restricted Boltzmann Machines), autoencoders, and Deep Boltzmann Machines.

What network pre-training does is move the initialization to a better starting point, one that is more likely to reach a better minimum.

Here is a good picture to illustrate this:

(Reference: http://stackoverflow.com/questions/34514687/how-does-pre-training-improve-classification-in-neural-networks)


Two interesting images come from this website:

Now I will introduce what an autoencoder does:

(Reference: http://www.cnblogs.com/rong86/p/3555290.html)
Unlike normal supervised training, where the training dataset has its own labels, an autoencoder uses the input itself as the target. Its error is computed between the input and the reconstruction of that input:

To pre-train the network, we let one particular layer act as the encoder and manually attach a decoder to it. If the encoder's output does not represent the input well, that is, the decoder cannot reconstruct the input from it, we adjust the parameters of this layer.
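The encoder/decoder idea above can be sketched in NumPy. This is a minimal illustration, not the method of any particular library. The toy data, the layer sizes, the sigmoid encoder, the linear decoder, the squared reconstruction error and the learning rate are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 samples, 8 features generated from 3 hidden factors.
latent = rng.normal(size=(200, 3))
X = np.tanh(latent @ rng.normal(size=(3, 8)))

n_in, n_hidden = X.shape[1], 3
W_enc = rng.normal(scale=0.1, size=(n_in, n_hidden))  # the layer being pre-trained
b_enc = np.zeros(n_hidden)
W_dec = rng.normal(scale=0.1, size=(n_hidden, n_in))  # temporary decoder, dropped after pre-training
b_dec = np.zeros(n_in)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reconstruction_error(X):
    # No labels are needed: the target is the input itself.
    H = sigmoid(X @ W_enc + b_enc)
    X_hat = H @ W_dec + b_dec
    return np.mean(np.sum((X_hat - X) ** 2, axis=1))

initial_error = reconstruction_error(X)

lr = 0.05  # assumed learning rate
for _ in range(2000):
    # Forward pass: encode, then decode.
    H = sigmoid(X @ W_enc + b_enc)
    X_hat = H @ W_dec + b_dec
    # Backward pass for the mean squared reconstruction error.
    d_out = 2.0 * (X_hat - X) / len(X)
    d_hidden = (d_out @ W_dec.T) * H * (1.0 - H)
    W_dec -= lr * (H.T @ d_out)
    b_dec -= lr * d_out.sum(axis=0)
    W_enc -= lr * (X.T @ d_hidden)
    b_enc -= lr * d_hidden.sum(axis=0)

final_error = reconstruction_error(X)
print(initial_error, final_error)
```

After this loop, `W_enc` and `b_enc` would serve as the initialization of the corresponding layer in the full network, and the decoder would be discarded.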

Many autoencoders have more complex architectures, such as the sparse autoencoder and the denoising autoencoder.

2/ How to make the parameter matrix (θ) sparse

First, I would like to talk about why we should make θ sparse. By sparse, we mean that the matrix contains many zeros. If a matrix has more zeros, the layer it belongs to will be more robust to disturbances in the input, because the layer will focus on the 'important' features of the input.

Another way to understand this is that a sparser parameter matrix gives a smoother decision boundary, since some terms of the polynomial are zero. This makes over-fitting less likely.


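Here is a good explanation from http://blog.csdn.net/zouxy09/article/details/24971995, which I quote (translated from Chinese):

Supervised machine learning problems boil down to "minimize your error while regularizing your parameters", that is, minimizing the error while regularizing the parameters. Minimizing the error makes our model fit the training data, while regularizing the parameters prevents the model from over-fitting the training data. What a simple philosophy! Too many parameters raise the model's complexity and make it prone to over-fitting, in which case the training error is very small. But a small training error is not our final goal; our goal is a small test error, meaning the model predicts new samples accurately. So we need to minimize the training error while keeping the model "simple", so that the resulting parameters generalize well (that is, the test error is also small).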


How can we make the parameter matrix sparse?

(1) Choose the right activation function:

(Reference: http://www.cnblogs.com/neopenx/p/4453161.html)
Sigmoid functions:

The traditional sigmoid-family functions are Logistic-Sigmoid and Tanh-Sigmoid.

Their expressions are:

Logistic-Sigmoid(x) = 1 / (1 + e^(-x))
Tanh-Sigmoid(x) = (e^x - e^(-x)) / (e^x + e^(-x))
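As a quick check of the two sigmoid-family functions, here is a small NumPy sketch. The function names and the sample grid are chosen for this example only.

```python
import numpy as np

def logistic_sigmoid(x):
    # 1 / (1 + e^(-x)), output in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh_sigmoid(x):
    # (e^x - e^(-x)) / (e^x + e^(-x)), output in (-1, 1)
    return np.tanh(x)

x = np.linspace(-5.0, 5.0, 101)
logistic_values = logistic_sigmoid(x)
tanh_values = tanh_sigmoid(x)
print(logistic_values.min(), logistic_values.max())
print(tanh_values.min(), tanh_values.max())
```

Both curves flatten out at both ends of the input range, which is the two-sided saturation that matters for the vanishing gradient discussion.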




From http://www.cnblogs.com/neopenx/p/4453161.html:




Softplus and Rectified Linear Unit (ReLU)


Softplus(x) = log(1 + e^x). ReLU(x) = max(0, x).
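To see why ReLU is associated with sparsity, the sketch below compares how many exact zeros ReLU and Softplus produce. The standard-normal pre-activations are an assumption made for illustration, and what becomes sparse here is the layer's output, not the weights themselves.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softplus(x):
    # log(1 + e^x), computed in a numerically stable way
    return np.logaddexp(0.0, x)

rng = np.random.default_rng(0)
pre_activations = rng.normal(size=10_000)  # hypothetical inputs to the activation

relu_out = relu(pre_activations)
softplus_out = softplus(pre_activations)

# Fraction of outputs that are exactly zero.
relu_zero_fraction = np.mean(relu_out == 0.0)
softplus_zero_fraction = np.mean(softplus_out == 0.0)
print(relu_zero_fraction, softplus_zero_fraction)
```

ReLU switches off roughly half of the units for zero-centered inputs, while Softplus only approaches zero without reaching it.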

Actually, in neuroscience research:


From http://www.cnblogs.com/neopenx/p/4453161.html













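Also, max(0,x) can address the vanishing gradient problem of sigmoid:

Another reason to prefer linear neural activation functions is that they alleviate the Vanishing Gradient Problem when deep networks are trained with gradient methods.



① Sigmoid'(x) ∈ (0,1): scaling by the derivative

② x ∈ (0,1) or x ∈ (-1,1): scaling by the saturated output value



The Softplus function is slightly slower, since Softplus'(x) = Sigmoid(x) ∈ (0,1), but it also saturates on one side only, so it is still faster than the Sigmoid-family functions.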
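The derivative scaling can be made concrete with a small sketch. It multiplies the activation derivatives of a hypothetical 10-layer chain, ignoring the weight matrices and assuming the same positive pre-activation at every layer, both of which are simplifications made for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # never larger than 0.25

def softplus_grad(x):
    return sigmoid(x)             # Softplus'(x) = Sigmoid(x)

def relu_grad(x):
    return (x > 0).astype(float)  # exactly 1 for positive inputs

depth = 10
x = np.full(depth, 2.0)  # assumed pre-activation value at each layer

# Product of per-layer derivative factors seen by a back-propagated gradient.
sigmoid_factor = np.prod(sigmoid_grad(x))
softplus_factor = np.prod(softplus_grad(x))
relu_factor = np.prod(relu_grad(x))
print(sigmoid_factor, softplus_factor, relu_factor)
```

Under these assumptions the sigmoid factor shrinks by many orders of magnitude, the Softplus factor shrinks mildly, and the ReLU factor stays at 1.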

(2) Regularization

2.1 L0, L1 and L2 norm

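First, let me introduce the L0, L1 and L2 norms.

The L0 norm is the number of non-zero elements in the matrix w. The L1 norm is the sum of the absolute values of the elements, and the L2 norm is the square root of the sum of their squares. There is a good picture that shows L1 and L2: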

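The objective function of supervised learning, written in its usual regularized form, is:

w* = argmin_w Σ_i L(y_i, f(x_i; w)) + λ Ω(w)

Here L is the loss on training sample (x_i, y_i), f is the model, Ω(w) is the regularization term and λ sets its strength.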

Normally, we use the L0, L1 or L2 norm as Ω(w). In particular, L2 regularization has another name, 'weight decay'. Unlike L0 and L1, L2 only makes the parameters small, not exactly zero.
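The sketch below computes the three norms for a small matrix, taking L0 as the count of non-zero entries, and then contrasts how L2 and L1 penalties shrink weights. The L1 update uses a soft-thresholding (proximal) step, which is a common choice assumed here and not something the notes above prescribe. The weights, learning rate and λ are made-up values, and the data-fitting gradient is left out so that only the effect of the regularizer is visible.

```python
import numpy as np

w = np.array([[0.0, -3.0, 0.0],
              [4.0,  0.0, 0.0]])

l0 = np.count_nonzero(w)       # number of non-zero entries
l1 = np.sum(np.abs(w))         # sum of absolute values
l2 = np.sqrt(np.sum(w ** 2))   # square root of the sum of squares
print(l0, l1, l2)

def l2_step(w, lr, lam):
    # Gradient step on (lam / 2) * ||w||^2: every weight is scaled by (1 - lr * lam).
    return (1.0 - lr * lam) * w

def l1_step(w, lr, lam):
    # Soft-thresholding step for lam * ||w||_1: small weights are set to exactly zero.
    return np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)

w0 = np.array([0.05, -0.3, 1.0])  # hypothetical weights
lr, lam = 0.1, 1.0                # hypothetical learning rate and penalty strength

w_l2, w_l1 = w0.copy(), w0.copy()
for _ in range(5):
    w_l2 = l2_step(w_l2, lr, lam)
    w_l1 = l1_step(w_l1, lr, lam)
print(w_l2, w_l1)
```

The L2 steps decay every weight toward zero without reaching it, which is where the name 'weight decay' comes from, while the L1 steps drive the two smaller weights to exactly zero.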

A detailed explanation of those is: http://blog.csdn.net/zouxy09/article/details/24971995

(To be continued)