deep learning study notes
2015-11-26 16:27
link: http://ufldl.stanford.edu/tutorial/
1. linear regression
use MLE to understand the loss function
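For example, assuming Gaussian noise y = \theta^T x + \epsilon with \epsilon \sim N(0, \sigma^2) (a standard sketch, not the tutorial's exact derivation), the log-likelihood of the training set is
\log p(y_1, ..., y_m \mid x_1, ..., x_m; \theta) = \sum_{i=1}^m \log \mathcal{N}(y_i; \theta^T x_i, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{i=1}^m (y_i - \theta^T x_i)^2 + \text{const},
so maximizing the likelihood is the same as minimizing the squared-error loss.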
2. logistic regression — binary classification
use MLE to understand the loss function
3. softmax regression — multi-class classification
4. Neural Network
activation function: sigmoid (0,1), tanh [-1,1], rectified linear (0,+inf)
forward propagation
The process of passing the input feature x through the activation functions to compute the final output prediction.
backpropagation algorithm
The whole NN also boils down to defining a loss function and then using batch gradient descent to solve for W and b. The only difference is that the partial derivatives with respect to W^{(l)}_{ij} and b^{(l)}_j are computed with the backpropagation algorithm. The overall idea is to first compute the difference between the final prediction and the true value, then work backwards to compute each layer's contribution to that difference; from these contributions every partial derivative can be obtained. Computing the partial derivatives this way is faster.
Here each partial derivative is the sum of the per-sample partial derivatives, i.e. every time we compute a partial derivative we have to scan all the training samples.
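As a rough sketch of this (not the UFLDL code; the one-hidden-layer architecture, sigmoid activations, and squared-error loss are my assumptions), the full-batch gradients can be computed like this:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(X, y, W1, b1, W2, b2):
    # forward propagation
    z1 = X @ W1 + b1            # hidden pre-activations
    a1 = sigmoid(z1)            # hidden activations
    z2 = a1 @ W2 + b2
    a2 = sigmoid(z2)            # predictions
    # backward pass: delta = d(loss)/d(z) at each layer
    delta2 = (a2 - y) * a2 * (1 - a2)
    delta1 = (delta2 @ W2.T) * a1 * (1 - a1)
    # the matrix products below sum the per-sample gradients over all training samples
    return {"W2": a1.T @ delta2, "b2": delta2.sum(axis=0),
            "W1": X.T @ delta1,  "b1": delta1.sum(axis=0)}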
Supervised CNN
feature extraction by convolution
When the dimensionality of the input samples is extremely large, we can first apply convolution to reduce the dimensionality, and then use pooling to reduce it further.
SGD
Compared with batch GD, it just uses a single training example or a small number of examples called a “minibatch”, usually 256.
note the terms “minibatch”, “epoch” (one iteration over the whole data set), and “shuffle”
One final but important point regarding SGD is the order in which we present the data to the algorithm. If the data is given in some meaningful order, this can bias the gradient and lead to poor convergence. Generally a good method to avoid this is to randomly shuffle the data prior to each epoch of training.
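A minimal sketch of that loop (grad_fn, the params dict, and the hyperparameter defaults here are placeholders of my own, not the tutorial's code):

import numpy as np

def sgd(grad_fn, params, X, y, lr=0.01, batch_size=256, epochs=10):
    # minibatch SGD; the data is reshuffled before every epoch
    n = X.shape[0]
    for epoch in range(epochs):
        perm = np.random.permutation(n)           # shuffle before each epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]  # one minibatch
            grads = grad_fn(params, X[idx], y[idx])
            for k in params:
                params[k] -= lr * grads[k]        # gradient step
    return params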
Momentum
If the objective has the form of a long shallow ravine leading to the optimum and steep walls on the sides, standard SGD will tend to oscillate across the narrow ravine since the negative gradient will point down one of the steep sides rather than along the ravine towards the optimum. The objectives of deep architectures have this form near local optima and thus standard SGD can lead to very slow convergence particularly after the initial steep gains. Momentum is one method for pushing the objective more quickly along the shallow ravine.
In other words, if the objective function has a steep ravine, each update easily jumps from one side of the ravine to the other, i.e. the iterates oscillate across the ravine while slowly moving downhill.
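The momentum update described there is v = γ v + α ∇J(θ), θ = θ − v; a small sketch (the dict layout and γ = 0.9 are assumptions):

def momentum_step(params, grads, velocity, lr=0.01, gamma=0.9):
    # v = gamma * v + lr * grad ; theta = theta - v
    for k in params:
        velocity[k] = gamma * velocity[k] + lr * grads[k]
        params[k] -= velocity[k]
    return params, velocity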
Note
filter / kernel: for example, the 8×8 patch used for convolution
after convolution, we get a feature map.
CNN consists of three parts:
normal fully connected NN layers,
subsampling layers, i.e. the pooling layers,
and convolutional layers.
Comparison
Compared with an ordinary NN, a CNN can use convolution and pooling to handle high-dimensional input data, as the sketch below shows.
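A minimal numpy sketch of the convolution-then-pooling idea (this computes cross-correlation without kernel flipping, as CNNs usually implement it; the 96x96 image, random 8x8 kernel, and max pooling are my assumptions — mean pooling works the same way):

import numpy as np

def convolve_valid(image, kernel):
    # 'valid' 2-D convolution of a single-channel image with a small kernel
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out                       # the feature map

def max_pool(feature_map, size=2):
    # non-overlapping pooling: shrink each dimension of the feature map by `size`
    h, w = feature_map.shape[0] // size, feature_map.shape[1] // size
    fm = feature_map[:h * size, :w * size]
    return fm.reshape(h, size, w, size).max(axis=(1, 3))

image = np.random.rand(96, 96)
fmap = convolve_valid(image, np.random.rand(8, 8))   # 89x89 feature map
pooled = max_pool(fmap)                              # 44x44 after 2x2 pooling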
Sparse coding && PCA
ICA && RICA
unsupervised learning
autoencoder
feature extraction for unsupervised learning when we don’t have training labels.
Statistical Language Modeling (SLM)
ref: http://homepages.inf.ed.ac.uk/lzhang10/slm.html
definition
p(w_1, ..., w_T): the probability of a word sequence
probabilistic chain rule
p(w_1, ..., w_T) = p(w_1) \prod_{i=2}^T p(w_i | w_1, ..., w_{i-1}) = p(w_1) \prod_{i=2}^T p(w_i | h_i), where h_i denotes the history of the i-th word w_i
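As an illustration of the chain rule with a simple n-gram approximation p(w_i | h_i) ≈ p(w_i | w_{i-1}) (the toy corpus, function name, and lack of smoothing are mine):

import math
from collections import Counter

def bigram_log_prob(sentence, corpus):
    # chain rule: log p(w_1) + sum_i log p(w_i | w_{i-1}), no smoothing
    unigrams = Counter(w for s in corpus for w in s)
    bigrams = Counter((s[i - 1], s[i]) for s in corpus for i in range(1, len(s)))
    total = sum(unigrams.values())
    logp = math.log(unigrams[sentence[0]] / total)
    for i in range(1, len(sentence)):
        prev, cur = sentence[i - 1], sentence[i]
        logp += math.log(bigrams[(prev, cur)] / unigrams[prev])
    return logp

corpus = [["i", "like", "deep", "learning"], ["i", "like", "nlp"]]
print(bigram_log_prob(["i", "like", "deep", "learning"], corpus))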
RNN
ref: http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
definition
The idea behind RNNs is to make use of sequential information. In a traditional neural network we assume that all inputs (and outputs) are independent of each other. But for many tasks that’s a very bad idea.
i.e. an RNN takes the dependency between training samples into account; in other words, the training samples have some dependency or sequential relationship among them.
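The vanilla recurrence in the wildml tutorial is s_t = tanh(U x_t + W s_{t-1}), o_t = softmax(V s_t); a minimal numpy sketch (x is a sequence of word indices, i.e. one-hot inputs, which is how the tutorial stores them; the helper name is mine):

import numpy as np

def rnn_forward(x, U, V, W):
    # x: sequence of word indices; U, V, W: input, output, recurrent weight matrices
    T, hidden = len(x), U.shape[0]
    s = np.zeros((T + 1, hidden))     # s[-1] serves as the zero initial state
    o = np.zeros((T, V.shape[0]))
    for t in range(T):
        s[t] = np.tanh(U[:, x[t]] + W @ s[t - 1])                 # new hidden state s_t
        z = V @ s[t]
        o[t] = np.exp(z - z.max()) / np.exp(z - z.max()).sum()    # softmax output o_t
    return o, s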
note
the wildml ref (http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-2-implementing-a-language-model-rnn-with-python-numpy-and-theano/) provides a concrete implementation example.
several issues should be noted:
1. In the text-generation code
next_word_probs = model.forward_propagation(new_sentence)
the forward_propagation used here is actually the Theano one, not the numpy forward_propagation the author wrote himself.
while sampled_word == word_to_index[unknown_token]:
    samples = np.random.multinomial(1, next_word_probs[-1])
    sampled_word = np.argmax(samples)
So next_word_probs[-1] here actually denotes o[-1], the output for the last word of the input x; samples is then a one-hot vector drawn from it, and np.argmax is used to get the index.
LSTM
a type of RNN that can capture long-range dependencies; it exists precisely to overcome the problem that a plain RNN cannot capture long-range dependencies.
the only difference is how s_t is computed.
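A rough sketch of one LSTM step with the standard gate equations (bias terms omitted for brevity; the variable names and concatenated-input form are my own, not the tutorial's code):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, s_prev, c_prev, Wi, Wf, Wo, Wg):
    h = np.concatenate([x_t, s_prev])   # current input and previous hidden state
    i = sigmoid(Wi @ h)                 # input gate: what to write into memory
    f = sigmoid(Wf @ h)                 # forget gate: what to keep from old memory
    o = sigmoid(Wo @ h)                 # output gate: what to expose as s_t
    g = np.tanh(Wg @ h)                 # candidate update
    c_t = f * c_prev + i * g            # new cell (memory) state
    s_t = o * np.tanh(c_t)              # new hidden state s_t
    return s_t, c_t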
loss function
least mean square
cross-entropy loss, i.e. log loss or logistic loss
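For one prediction, the two losses look like this (toy numbers of my own):

import numpy as np

y_true = np.array([0.0, 0.0, 1.0])                  # one-hot target
y_pred = np.array([0.1, 0.2, 0.7])                  # predicted class probabilities
mse = np.mean((y_true - y_pred) ** 2)               # least mean square
cross_entropy = -np.sum(y_true * np.log(y_pred))    # cross-entropy / log loss
print(mse, cross_entropy)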
Backpropagation Alg.
http://cs231n.github.io/optimization-1/
http://colah.github.io/posts/2015-08-Backprop/
http://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/
contains a full derivation of the BP formulas
https://theclevermachine.wordpress.com/2014/09/06/derivation-error-backpropagation-gradient-descent-for-neural-networks/
note that \delta^l_i is the partial derivative of the total error with respect to z^l_i
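Roughly, in the UFLDL-style notation (assuming a squared-error output layer; f is the activation function):
\delta^{L}_i = -(y_i - a^{L}_i) \, f'(z^{L}_i)
\delta^{l}_i = \Big(\sum_j W^{(l)}_{ji} \, \delta^{l+1}_j\Big) f'(z^{l}_i)
\frac{\partial J}{\partial W^{(l)}_{ij}} = a^{l}_j \, \delta^{l+1}_i, \qquad \frac{\partial J}{\partial b^{(l)}_i} = \delta^{l+1}_i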
computational graph
forward-mode differentiation can only compute the partial derivative of the output with respect to one input at a time, so when the computational graph has many inputs this way of getting all the partial derivatives is slow. For a single input b it pushes \partial(\text{node})/\partial b forward along every path from b to the output.
reverse-mode differentiation is faster: one pass computes the partial derivative of the output Z with respect to every node, \partial Z / \partial(\text{node}); this is exactly backpropagation.
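A toy reverse-mode sketch on the small graph e = (a + b) * (b + 1) from colah's post (the Node class and its API are mine; a real implementation would traverse nodes in topological order, which this naive version gets away without because only leaf nodes are shared here):

class Node:
    # toy scalar node in a computational graph, supporting + and *
    def __init__(self, value, parents=()):
        self.value, self.parents, self.grad = value, parents, 0.0

    def __add__(self, other):
        return Node(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Node(self.value * other.value, [(self, other.value), (other, self.value)])

    def backward(self):
        self.grad = 1.0                          # dZ/dZ = 1
        stack = [self]
        while stack:
            node = stack.pop()
            for parent, local_grad in node.parents:
                parent.grad += node.grad * local_grad   # chain rule, accumulated per path
                stack.append(parent)

a, b = Node(2.0), Node(1.0)
e = (a + b) * (b + Node(1.0))
e.backward()
print(a.grad, b.grad)   # de/da = 2, de/db = 5, both from a single backward pass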