RNN (2): Forward Pass and BPTT
2015-11-28 23:26
Tags: RNN BPTT

basic definition
To simplify notation, the RNN here contains only one input layer, one hidden layer and one output layer. Notations are listed below:

| neural layer | node | index | number |
| --- | --- | --- | --- |
| input layer | x(t) | i | N |
| previous hidden layer | s(t-1) | h | M |
| hidden layer | s(t) | j | M |
| output layer | y(t) | k | O |
| input->hidden | V(t) | i,j | N->M |
| previous hidden->hidden | U(t) | h,j | M->M |
| hidden->output | W(t) | j,k | M->O |
forward
1. input->hidden
$$net_j(t)=\sum_i^N x_i(t)v_{ji}+\sum_h^M s_h(t-1)u_{jh}+\theta_j$$
$$s_j(t)=f(net_j(t))$$
2. hidden->output
$$net_k(t)=\sum_j^M s_j(t)w_{kj}+\theta_k$$
$$y_k(t)=g(net_k(t))$$
f and g are the activation functions of the hidden layer and the output layer, respectively.
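To make the forward step concrete, here is a minimal numpy sketch, assuming f is the sigmoid and g the softmax defined later in this post; the function and parameter names (`forward_step`, `theta_h`, `theta_o`) and the matrix layouts (V as M×N, U as M×M, W as O×M) are my own conventions:

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def softmax(net):
    e = np.exp(net - net.max())              # shift by the max for stability
    return e / e.sum()

def forward_step(x_t, s_prev, V, U, W, theta_h, theta_o):
    """x_t: (N,) input; s_prev: (M,) previous hidden state.
    V: (M, N), U: (M, M), W: (O, M); returns s(t) and y(t)."""
    net_h = V @ x_t + U @ s_prev + theta_h   # net_j(t)
    s_t = sigmoid(net_h)                     # s_j(t) = f(net_j(t))
    net_o = W @ s_t + theta_o                # net_k(t)
    y_t = softmax(net_o)                     # y_k(t) = g(net_k(t))
    return s_t, y_t
```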
backpropagation
prerequisite
Any network structure can be trained with backpropagation as long as desired output patterns exist and every function used to calculate the actual output patterns is differentiable.

cost function
1. summed squared error (SSE)

The cost function can be any differentiable function that measures the loss of the predicted values against the gold answers. The SSE is frequently used and works well in training conventional feed-forward neural networks.
$$C=\frac{1}{2}\sum_l^P\sum_k^O(d_{lk}-y_{lk})^2$$
2. cross entropy (CE)

The cross-entropy loss is used in Recurrent Neural Network Language Models (RNNLM) and performs well.
$$C=-\sum_l^P\sum_k^O\left[d_{lk}\ln y_{lk}+(1-d_{lk})\ln(1-y_{lk})\right]$$
Discussion below is based on SSE.
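Both losses are one-liners in numpy; a minimal sketch, assuming `d` and `y` are arrays of gold and predicted values (the `eps` clip in the cross entropy is my addition to avoid log(0)):

```python
import numpy as np

def sse(d, y):
    """C = 1/2 * sum over patterns and outputs of (d - y)^2."""
    return 0.5 * np.sum((d - y) ** 2)

def cross_entropy(d, y, eps=1e-12):
    """C = -sum of d ln(y) + (1 - d) ln(1 - y); eps guards against log(0)."""
    y = np.clip(y, eps, 1.0 - eps)
    return -np.sum(d * np.log(y) + (1.0 - d) * np.log(1.0 - y))
```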
error component
error for output nodes

$$\delta_{lk}=-\frac{\partial C}{\partial net_{lk}}=-\frac{\partial C}{\partial y_{lk}}\frac{\partial y_{lk}}{\partial net_{lk}}=(d_{lk}-y_{lk})g'(net_{lk})$$
error for hidden nodes
$$\delta_{lj}=-\left(\sum_k^O\frac{\partial C}{\partial y_{lk}}\frac{\partial y_{lk}}{\partial net_{lk}}\frac{\partial net_{lk}}{\partial s_{lj}}\right)\frac{\partial s_{lj}}{\partial net_{lj}}=\sum_k^O\delta_{lk}w_{kj}f'(net_{lj})$$
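For a single pattern l, the two deltas map directly to numpy; a sketch assuming the sigmoid hidden layer below, so that f'(net_j) can be written from the stored activation as s_j(1 - s_j); the helper names are mine:

```python
import numpy as np

def output_deltas(d, y, g_prime):
    """delta_k = (d_k - y_k) g'(net_k) for one pattern."""
    return (d - y) * g_prime

def hidden_deltas(delta_o, W, s):
    """delta_j = sum_k delta_k w_kj f'(net_j), with W of shape (O, M)
    and f'(net_j) written via the sigmoid activation as s_j (1 - s_j)."""
    return (W.T @ delta_o) * s * (1.0 - s)
```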
activation function
sigmoid

$$f(net)=\frac{1}{1+e^{-net}}$$
$$f'(net)=f(net)\{1-f(net)\}$$
softmax
$$g(net_k)=\frac{e^{net_k}}{\sum_j^O e^{net_j}}$$
$$g'(net_k)=\frac{e^{net_k}\left(\sum_j^O e^{net_j}-e^{net_k}\right)}{\left(\sum_j^O e^{net_j}\right)^2}=g(net_k)\{1-g(net_k)\}$$
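Both derivative formulas are easy to verify numerically; a small sketch comparing them against finite differences on arbitrary test values:

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def softmax(net):
    e = np.exp(net - net.max())
    return e / e.sum()

h = 1e-6

# f'(net) = f(net){1 - f(net)}
net = 0.7
analytic = sigmoid(net) * (1.0 - sigmoid(net))
numeric = (sigmoid(net + h) - sigmoid(net - h)) / (2.0 * h)
assert abs(analytic - numeric) < 1e-8

# diagonal softmax derivative g'(net_k) = g_k (1 - g_k)
nets = np.array([0.2, -1.0, 0.5])
g = softmax(nets)
analytic = g * (1.0 - g)
numeric = np.empty_like(nets)
for k in range(len(nets)):
    bumped = nets.copy()
    bumped[k] += h
    numeric[k] = (softmax(bumped)[k] - g[k]) / h
assert np.allclose(analytic, numeric, atol=1e-5)
```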
gradient descent
According to gradient descent, each weight change in the network should be proportional to the negative gradient of the cost function with respect to the specific weight:

$$\Delta w=-\eta \frac{\partial C}{\partial w}$$
where $\eta$ is the learning rate. The three concrete updates are listed below, followed by a short code sketch.
1. hidden->output
$$\Delta w_{kj}=-\eta \frac{\partial C}{\partial w_{kj}}=\eta \sum_l^P\left(-\frac{\partial C}{\partial net_{lk}}\right)\frac{\partial net_{lk}}{\partial w_{kj}}=\eta \sum_l^P\delta_{lk}\frac{\partial net_{lk}}{\partial w_{kj}}=\eta \sum_l^P\delta_{lk}s_{lj}$$
2. input->hidden
$$\Delta v_{ji}=-\eta \frac{\partial C}{\partial v_{ji}}=\eta \sum_l^P\delta_{lj}x_{li}$$
3. previous hidden->hidden
$$\Delta u_{jh}=-\eta \frac{\partial C}{\partial u_{jh}}=\eta \sum_l^P\delta_{lj}s_{(l-1)h}$$
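For one pattern, each of the three changes is the outer product of a delta vector and an activation vector (the sum over P just accumulates them); a sketch reusing the layouts assumed earlier:

```python
import numpy as np

def weight_updates(delta_o, delta_h, s, s_prev, x, eta=0.1):
    """Per-pattern weight changes; the sum over P in the formulas above
    is an accumulation of these across patterns."""
    dW = eta * np.outer(delta_o, s)        # hidden->output:          delta_k s_j
    dV = eta * np.outer(delta_h, x)        # input->hidden:           delta_j x_i
    dU = eta * np.outer(delta_h, s_prev)   # previous hidden->hidden: delta_j s_h(t-1)
    return dW, dV, dU
```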
unfolding
In a recurrent neural network, errors can be propagated further back, i.e. through more than 2 layers, in order to capture longer history information. This process is usually called unfolding. In an unfolded RNN, the recurrent weight is duplicated spatially for an arbitrary number of time steps, here referred to as T.
$$net_{lj}(t)=\sum_i^N x_{li}(t)v_{ji}+\sum_h^M s_{(l-1)h}u_{jh}+\theta_j$$
$$s_{(l-1)h}=f(net_{(l-1)h})$$
The error for hidden nodes is propagated back through time as:
$$\delta_{lj}(t-1)=-\frac{\partial C}{\partial net_{(l-1)j}}=-\sum_h^M\frac{\partial C}{\partial net_{lh}}\frac{\partial net_{lh}}{\partial net_{(l-1)j}}$$
$$=\left(-\sum_h^M\frac{\partial C}{\partial net_{lh}}\right)\left(\frac{\partial net_{lh}}{\partial s_{(l-1)j}}\right)\left(\frac{\partial s_{(l-1)j}}{\partial net_{(l-1)j}}\right)$$
$$=\sum_h^M\delta_{lh}(t)u_{hj}f'(net_{(l-1)j})$$
where h is the index for the hidden node at time step t, and j for the hidden node at time step t-1.
Note: the original paper uses $s_{lj}(t-1)$ here. I feel it should be $net_{lj}(t-1)$, but that notation is hard to explain, because the subscript for time t is l and the subscript for time t-1 would then also be l. So I changed it to $net_{(l-1)j}$, taking l to correspond to time t and l-1 to time t-1.
After all error deltas have been obtained, the weights are folded back, adding up to one big change for each unfolded weight (a code sketch follows the list below).
1. input->hidden
$$\Delta v_{ji}(t)=\eta \sum_z^T\sum_l^P\delta_{lj}(t-z)x_{(l-z)i}$$
2. previous hidden->hidden
$$\Delta u_{jh}(t)=\eta \sum_z^T\sum_l^P\delta_{lj}(t-z)s_{(l-1-z)h}$$
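A sketch of this backward sweep, assuming the sigmoid hidden layer as before: the hidden delta is pushed back one step at a time via delta(t-z) = (U^T delta(t-z+1)) f'(net(t-z)), and each step's outer products are summed onto the shared V and U. The history-list convention (index z meaning "z steps back") is mine:

```python
import numpy as np

def bptt(delta_t, U, x_hist, s_hist, T, eta=0.1):
    """Unfolded gradient accumulation for one pattern.
    delta_t: hidden delta at time t. x_hist[z] = x(t-z), length T+1;
    s_hist[z] = s(t-z), length T+2 (s_hist[T+1] is the state before
    the window). U is (M, M), V is (M, N)."""
    dV = eta * np.outer(delta_t, x_hist[0])                # delta(t) x(t)
    dU = eta * np.outer(delta_t, s_hist[1])                # delta(t) s(t-1)
    delta = delta_t
    for z in range(1, T + 1):
        s_prev = s_hist[z]                                 # s(t-z)
        delta = (U.T @ delta) * s_prev * (1.0 - s_prev)    # delta(t-z)
        dV += eta * np.outer(delta, x_hist[z])             # times x(t-z)
        dU += eta * np.outer(delta, s_hist[z + 1])         # times s(t-z-1)
    return dV, dU
```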
summary
input->hidden

$$v_{ji}(t+1)=v_{ji}(t)+\eta \sum_z^T\sum_l^P\delta_{lj}(t-z)x_{(l-z)i}$$
previous hidden->hidden
$$u_{jh}(t+1)=u_{jh}(t)+\eta \sum_z^T\sum_l^P\delta_{lj}(t-z)s_{(l-1-z)h}$$
hidden->output
$$w_{kj}(t+1)=w_{kj}(t)+\eta \sum_l^P\delta_{lk}s_{lj}$$
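Putting the whole post together, a self-contained single-pattern training step unfolded over T time steps; the sizes, random initialization, and one-hot target are placeholder assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, O, T, eta = 4, 5, 3, 3, 0.1            # sizes and learning rate (placeholders)

V = rng.normal(0.0, 0.1, (M, N))             # input->hidden
U = rng.normal(0.0, 0.1, (M, M))             # previous hidden->hidden
W = rng.normal(0.0, 0.1, (O, M))             # hidden->output
th_h, th_o = np.zeros(M), np.zeros(O)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

xs = [rng.normal(size=N) for _ in range(T + 1)]   # inputs x(t-T) .. x(t)
d = np.eye(O)[0]                                   # one-hot target at time t

# forward pass, keeping every hidden state for the backward sweep
ss = [np.zeros(M)]                                 # initial hidden state
for x in xs:
    ss.append(sigmoid(V @ x + U @ ss[-1] + th_h))
y = softmax(W @ ss[-1] + th_o)

# output delta (diagonal g'), then unfold the hidden delta through time
delta_o = (d - y) * y * (1.0 - y)                  # (d_k - y_k) g'(net_k)
delta_h = (W.T @ delta_o) * ss[-1] * (1.0 - ss[-1])
dV, dU = np.zeros_like(V), np.zeros_like(U)
for z in range(T + 1):
    dV += np.outer(delta_h, xs[-1 - z])            # delta(t-z) x(t-z)
    dU += np.outer(delta_h, ss[-2 - z])            # delta(t-z) s(t-z-1)
    s_prev = ss[-2 - z]
    delta_h = (U.T @ delta_h) * s_prev * (1.0 - s_prev)

# fold the accumulated changes back onto the shared weights
W += eta * np.outer(delta_o, ss[-1])               # w_kj += eta delta_k s_j
V += eta * dV
U += eta * dU
```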
references
- Backpropagation Through Time
- A guide to recurrent neural networks and backpropagation