
RNN (2): Forward Pass and BPTT



Tags: RNN, BPTT

basic definition

To simplify notation, the RNN here contains only one input layer, one hidden layer and one output layer. The notation is listed below:

| neural layer | node | index | number |
| --- | --- | --- | --- |
| input layer | x(t) | i | N |
| previous hidden layer | s(t-1) | h | M |
| hidden layer | s(t) | j | M |
| output layer | y(t) | k | O |
| input->hidden | V(t) | i,j | N->M |
| previous hidden->hidden | U(t) | h,j | M->M |
| hidden->output | W(t) | j,k | M->O |
Besides, P is the total number of available training samples, which are indexed by l.
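To make the notation concrete, here is a minimal NumPy sketch of the layers and weight matrices; the sizes, variable names and random initialization are illustrative assumptions, not from the article:

```python
import numpy as np

# Illustrative sizes and initialization (assumptions, not from the article)
N, M, O = 4, 5, 3                     # input, hidden, output layer sizes

rng = np.random.default_rng(0)
x_t     = rng.normal(size=N)                      # x(t): input layer
s_prev  = np.zeros(M)                             # s(t-1): previous hidden layer
V       = rng.normal(scale=0.1, size=(M, N))      # input -> hidden,           v_{ji}
U       = rng.normal(scale=0.1, size=(M, M))      # previous hidden -> hidden, u_{jh}
W       = rng.normal(scale=0.1, size=(O, M))      # hidden -> output,          w_{kj}
theta_j = np.zeros(M)                             # hidden bias
theta_k = np.zeros(O)                             # output bias
```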

forward



1. input->hidden

net_j(t)=\sum_{i}^N x_i(t)v_{ji}+\sum_{h}^M s_h(t-1)u_{jh}+\theta_j

s_j(t)=f(net_j(t))

2. hidden->output

net_k(t)=\sum_{j}^M s_j(t)w_{kj}+\theta_k

y_k(t)=g(net_k(t))

f and g are the activation functions of the hidden layer and the output layer respectively.
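A minimal NumPy sketch of one forward step, assuming a sigmoid hidden activation f and a softmax output activation g as introduced later in the article; the helper names are my own:

```python
import numpy as np

def sigmoid(z):                           # f: hidden activation
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):                           # g: output activation
    e = np.exp(z - z.max())
    return e / e.sum()

def forward_step(x_t, s_prev, V, U, W, theta_j, theta_k):
    """One forward step: net_j(t), s_j(t), net_k(t), y_k(t)."""
    net_j = V @ x_t + U @ s_prev + theta_j    # input->hidden + previous hidden->hidden
    s_t   = sigmoid(net_j)                    # s_j(t) = f(net_j(t))
    net_k = W @ s_t + theta_k                 # hidden->output
    y_t   = softmax(net_k)                    # y_k(t) = g(net_k(t))
    return net_j, s_t, net_k, y_t
```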

backpropagation

prerequisite

Any network structure can be trained with backpropagation, as long as desired output patterns exist and every function used to compute the actual outputs is differentiable.

cost function

1. summed squared error (SSE)

The cost function can be any differentiable function that measures the deviation of the predicted values from the gold answers. SSE is frequently used and works well for training conventional feed-forward neural networks.

C=\frac{1}{2}\sum_l^P\sum_k^O(d_{lk}-y_{lk})^2

2. cross entropy (CE)

The cross-entropy loss is used in Recurrent Neural Network Language Models (RNNLM) and performs well.

C=-\sum_l^P\sum_k^O\left[d_{lk}\ln y_{lk}+(1-d_{lk})\ln(1-y_{lk})\right]

Discussion below is based on SSE.
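A small sketch of both cost functions over a batch of P samples, assuming d and y are (P, O) arrays of targets and predictions; the names and example values are illustrative:

```python
import numpy as np

def sse(d, y):
    """Summed squared error over all samples l and output nodes k."""
    return 0.5 * np.sum((d - y) ** 2)

def cross_entropy(d, y, eps=1e-12):
    """Cross entropy summed over all samples and output nodes."""
    y = np.clip(y, eps, 1.0 - eps)            # avoid log(0)
    return -np.sum(d * np.log(y) + (1.0 - d) * np.log(1.0 - y))

# d and y are (P, O) arrays of targets and predictions, e.g.:
d = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([[0.8, 0.2], [0.3, 0.7]])
print(sse(d, y), cross_entropy(d, y))
```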

error component

error for output nodes



\delta_{lk}=-\frac{\partial C}{\partial net_{lk}}=-\frac{\partial C}{\partial y_{lk}}\frac{\partial y_{lk}}{\partial net_{lk}}=(d_{lk}-y_{lk})g'(net_{lk})
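As a sketch, the output deltas for one sample can be computed as below, assuming g'(net_k) = y_k(1 - y_k), the element-wise derivative given later for both the sigmoid and the diagonal softmax term:

```python
import numpy as np

def output_delta(d_l, y_l):
    """delta_{lk} = (d_{lk} - y_{lk}) * g'(net_{lk}) for one sample l.

    Uses g'(net_k) = y_k * (1 - y_k), the element-wise derivative given
    below for both the sigmoid and the diagonal softmax term.
    """
    return (d_l - y_l) * y_l * (1.0 - y_l)
```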

error for hidden nodes



\delta_{lj}=-\left(\sum_k^O\frac{\partial C}{\partial y_{lk}}\frac{\partial y_{lk}}{\partial net_{lk}}\frac{\partial net_{lk}}{\partial s_{lj}}\right)\frac{\partial s_{lj}}{\partial net_{lj}}=\sum_k^O\delta_{lk}w_{kj}f'(net_{lj})
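A corresponding sketch for the hidden deltas, assuming a sigmoid hidden layer so that f'(net_j) = s_j(1 - s_j); the helper name is my own:

```python
import numpy as np

def hidden_delta(delta_k, W, s_l):
    """delta_{lj} = (sum_k delta_{lk} * w_{kj}) * f'(net_{lj}) for one sample l.

    W has shape (O, M); with a sigmoid hidden layer f'(net_j) = s_j * (1 - s_j).
    """
    return (W.T @ delta_k) * s_l * (1.0 - s_l)
```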

activation functions

sigmoid

f(net)=\frac{1}{1+e^{-net}}

f'(net)=f(net)\{1-f(net)\}
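A small sketch of the sigmoid and its derivative:

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def sigmoid_prime(net):
    f = sigmoid(net)
    return f * (1.0 - f)                  # f'(net) = f(net) * (1 - f(net))
```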

softmax

g(net_k)=\frac{e^{net_k}}{\sum_j^O e^{net_j}}

g'(net_k)=\frac{e^{net_k}\left(\sum_j^O e^{net_j}-e^{net_k}\right)}{\left(\sum_j^O e^{net_j}\right)^2}=g(net_k)\left(1-g(net_k)\right)
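A corresponding sketch of the softmax and its diagonal derivative term; the max-shift is a standard numerical-stability trick, not part of the formula above:

```python
import numpy as np

def softmax(net):
    e = np.exp(net - net.max())           # shift by the max for numerical stability
    return e / e.sum()

def softmax_prime_diag(net):
    """Diagonal term dg_k / dnet_k, which equals g_k * (1 - g_k)."""
    g = softmax(net)
    return g * (1.0 - g)
```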

gradient descent

According to gradient descent, each weight change in the network should be proportional to the negative gradient of the cost function with respect to the specific weight:

\Delta w=-\eta \frac{\partial C}{\partial w}

where \eta is the learning rate.

1. hidden->output

\Delta w_{kj}=-\eta \frac{\partial C}{\partial w_{kj}}=\eta \sum_l^P\left(-\frac{\partial C}{\partial net_{lk}}\right)\frac{\partial net_{lk}}{\partial w_{kj}}=\eta \sum_l^P\delta_{lk}\frac{\partial net_{lk}}{\partial w_{kj}}=\eta \sum_l^P\delta_{lk}s_{lj}

2. input->hidden

\Delta v_{ji}=-\eta \frac{\partial C}{\partial v_{ji}}=\eta \sum_l^P\delta_{lj}x_{li}

3. previous hidden->hidden

\Delta u_{jh}=-\eta \frac{\partial C}{\partial u_{jh}}=\eta \sum_l^P\delta_{lj}s_{(l-1)h}
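A sketch of the three per-sample weight changes, to be summed over l = 1..P; the function and the learning-rate value are illustrative assumptions:

```python
import numpy as np

def weight_changes(eta, delta_k, delta_j, s_l, x_l, s_prev):
    """Per-sample gradient-descent changes; sum these over l = 1..P.

    delta_k: (O,) output deltas, delta_j: (M,) hidden deltas,
    s_l: (M,) hidden state, x_l: (N,) input, s_prev: (M,) previous hidden state.
    """
    dW = eta * np.outer(delta_k, s_l)       # hidden -> output
    dV = eta * np.outer(delta_j, x_l)       # input -> hidden
    dU = eta * np.outer(delta_j, s_prev)    # previous hidden -> hidden
    return dW, dV, dU
```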

unfolding

In a recurrent neural network, errors can be propagated further back, i.e. through more than 2 layers, in order to capture longer history information. This process is usually called unfolding.

In an unfolded RNN, the recurrent weight is duplicated spatially for an arbitrary number of time steps, here referred to as T.

net_{lj}(t)=\sum_{i}^N x_{li}(t)v_{ji}+\sum_{h}^M s_{(l-1)h}u_{jh}+\theta_j

s_{(l-1)h}=f(net_{(l-1)h})
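A sketch of the unfolded forward pass, assuming a sigmoid hidden activation; every hidden state and net input is kept because the backward pass below needs them. The function name and signature are my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_unfolded(xs, V, U, theta_j):
    """Run the hidden recurrence over T steps, keeping every state for BPTT.

    xs: (T, N) inputs; returns (T, M) hidden states and (T, M) net inputs.
    """
    M = U.shape[0]
    s = np.zeros(M)                          # hidden state before the first step
    states, nets = [], []
    for x_t in xs:
        net_j = V @ x_t + U @ s + theta_j
        s = sigmoid(net_j)
        nets.append(net_j)
        states.append(s)
    return np.array(states), np.array(nets)
```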



The error for hidden nodes is propagated through time as:

\delta_{lj}(t-1)=-\frac{\partial C}{\partial net_{(l-1)j}}=-\sum_h^M\frac{\partial C}{\partial net_{lh}}\frac{\partial net_{lh}}{\partial net_{(l-1)j}}

=\left(-\sum_h^M\frac{\partial C}{\partial net_{lh}}\frac{\partial net_{lh}}{\partial s_{(l-1)j}}\right)\frac{\partial s_{(l-1)j}}{\partial net_{(l-1)j}}

=\sum_h^M\delta_{lh}(t)u_{hj}f'(net_{(l-1)j})

where h is the index for the hidden node at time step t, and j for the hidden node at time step t-1.

Note: the original paper writes s_{lj}(t-1) here, but I think it should be net_{lj}(t-1). That notation is hard to explain, though, because the subscript for time t is l and the subscript for time t-1 would then also be l. So I write net_{(l-1)j} instead, letting l correspond to time t and l-1 to time t-1.
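A sketch of this backward recurrence, assuming a sigmoid hidden layer so that f'(net) = s(1 - s); the function name and signature are my own:

```python
import numpy as np

def deltas_through_time(delta_t, U, states, steps):
    """Propagate the hidden delta back `steps` time steps.

    delta_t: (M,) hidden delta at the current step; states: (T, M) hidden
    states from the forward pass (sigmoid assumed, so f'(net) = s * (1 - s));
    U: (M, M) recurrent weights u_{jh}. Returns [delta(t), delta(t-1), ...].
    """
    deltas = [delta_t]
    for z in range(1, steps):
        s_earlier = states[-1 - z]                       # s at step t - z
        deltas.append((U.T @ deltas[-1]) * s_earlier * (1.0 - s_earlier))
    return deltas
```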

After all error deltas have been obtained, the weight changes are folded back, adding up to one big change for each unfolded weight.

1. input->hidden

\Delta v_{ji}(t)=\eta \sum_z^T\sum_l^P\delta_{lj}(t-z)x_{(l-z)i}

2. previous hidden->hidden

\Delta u_{jh}(t)=\eta \sum_z^T\sum_l^P\delta_{lj}(t-z)s_{(l-1-z)h}
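A sketch of the folding step for V and U, assuming the per-step deltas come from the recurrence above; states that fall before the start of the sequence are treated as zero, and the function name is my own:

```python
import numpy as np

def fold_back(eta, deltas, xs, states):
    """Fold the per-step deltas into one change for V and one for U.

    deltas[z] is the hidden delta z steps back (z = 0..T_bptt-1);
    xs: (T, N) inputs; states: (T, M) hidden states.
    """
    T, N = xs.shape
    M = states.shape[1]
    dV, dU = np.zeros((M, N)), np.zeros((M, M))
    for z, delta in enumerate(deltas):
        dV += np.outer(delta, xs[T - 1 - z])                       # delta(t-z) x(t-z)
        s_before = states[T - 2 - z] if T - 2 - z >= 0 else np.zeros(M)
        dU += np.outer(delta, s_before)                            # delta(t-z) s(t-1-z)
    return eta * dV, eta * dU
```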

summary

input->hidden

v_{ji}(t+1)=v_{ji}(t)+\eta \sum_z^T\sum_l^P\delta_{lj}(t-z)x_{(l-z)i}

previous hidden->hidden

u_{jh}(t+1)=u_{jh}(t)+\eta \sum_z^T\sum_l^P\delta_{lj}(t-z)s_{(l-1-z)h}

hidden->output

w_{kj}(t+1)=w_{kj}(t)+\eta \sum_l^P\delta_{lk}s_{lj}
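Putting the pieces together, here is an illustrative single-sequence training step that runs the forward pass, computes the deltas, unfolds for a few steps, and applies all three updates above. It assumes a sigmoid hidden layer, a softmax output, and a target only at the last time step, purely to keep the sketch short; all names are my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def bptt_train_step(xs, d, V, U, W, theta_j, theta_k, eta=0.1, T_bptt=3):
    """One illustrative training step on a single sequence.

    xs: (T, N) inputs; d: (O,) target for the last step only (a simplifying
    assumption to keep the sketch short). Returns updated V, U, W.
    """
    T = xs.shape[0]
    M = U.shape[0]
    # --- forward, keeping all hidden states for unfolding ---
    s, states = np.zeros(M), []
    for x_t in xs:
        s = sigmoid(V @ x_t + U @ s + theta_j)
        states.append(s)
    y = softmax(W @ s + theta_k)

    # --- error deltas at the last step ---
    delta_k = (d - y) * y * (1.0 - y)                 # output delta
    delta_j = (W.T @ delta_k) * s * (1.0 - s)         # hidden delta at step t

    # --- hidden -> output update ---
    W = W + eta * np.outer(delta_k, s)

    # --- unfold T_bptt steps and fold the changes back into V and U ---
    dV, dU = np.zeros_like(V), np.zeros_like(U)
    delta = delta_j
    for z in range(min(T_bptt, T)):
        x_z = xs[T - 1 - z]
        s_before = states[T - 2 - z] if T - 2 - z >= 0 else np.zeros(M)
        dV += np.outer(delta, x_z)
        dU += np.outer(delta, s_before)
        delta = (U.T @ delta) * s_before * (1.0 - s_before)   # propagate back one step
    V = V + eta * dV
    U = U + eta * dU
    return V, U, W
```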

references

BackPropagation Through Time

A guide to recurrent neural networks and backpropagation