
RNN Tutorial, Part 3 – The BPTT Algorithm and the Vanishing Gradient Problem


Reposted from: Recurrent Neural Networks Tutorial, Part 3 – Backpropagation Through Time and Vanishing Gradients

This is the third part of the RNN Tutorial.

In the previous part of the tutorial we implemented an RNN from scratch, but didn’t go into detail on how the Backpropagation Through Time (BPTT) algorithm calculates the gradients. In this part we’ll give a brief overview of BPTT and explain how it differs from traditional backpropagation. We will then try to understand the vanishing gradient problem, which has led to the development of LSTMs and GRUs, two of the currently most popular and powerful models used in NLP (and other areas). The vanishing gradient problem was originally discovered by Sepp Hochreiter in 1991 and has been receiving attention again recently due to the increased application of deep architectures.

To fully understand this part of the tutorial I recommend being familiar with how partial differentiation and basic backpropagation work. If you are not, you can find excellent tutorials here and here and here, in order of increasing difficulty.

Backpropagation Through Time (BPTT)
Let’s quickly recap the basic equations of our RNN. Note that there’s a slight change in notation from $o$ to $\hat{y}$. That’s only to stay consistent with some of the literature out there that I am referencing.

\[\begin{aligned} s_t&= \tanh(Ux_t+ Ws_{t-1}) \\ \hat{y}_t&= \mathrm{softmax}(Vs_t) \end{aligned} \]
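As a minimal sketch of these two equations (my own illustration, roughly following the conventions of the Part 2 implementation, where each $x_t$ is a word index so that $Ux_t$ reduces to selecting a column of $U$), the forward pass can be written as:

import numpy as np

def softmax(z):
    # Numerically stable softmax over a vector
    e = np.exp(z - np.max(z))
    return e / e.sum()

def forward(x, U, V, W):
    # x: list of word indices
    # U: (hidden_dim, word_dim), V: (word_dim, hidden_dim), W: (hidden_dim, hidden_dim)
    T = len(x)
    s = np.zeros((T + 1, U.shape[0]))                  # s[-1] is the initial state, all zeros
    y_hat = np.zeros((T, V.shape[0]))
    for t in range(T):
        s[t] = np.tanh(U[:, x[t]] + W.dot(s[t - 1]))   # s_t = tanh(U x_t + W s_{t-1})
        y_hat[t] = softmax(V.dot(s[t]))                # y_hat_t = softmax(V s_t)
    return y_hat, s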

We also defined our loss, or error, to be the cross entropy loss, given by:

\[\begin{aligned} E_t(y_t, \hat{y}_t) &= - y_{t} \log \hat{y}_{t} \\ E(y, \hat{y}) &=\sum\limits_{t} E_t(y_t,\hat{y}_t) \\ & = -\sum\limits_{t} y_{t} \log \hat{y}_{t} \end{aligned} \]

Here, $y_t$ is the correct word at time step $t$, and $\hat{y}_t$ is our prediction. We typically treat the full sequence (sentence) as one training example, so the total error is just the sum of the errors at each time step (word).
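Reusing the forward sketch above, the total loss for one training sequence is then just a couple of lines (again only a sketch; y is assumed to hold the index of the correct word at each time step):

def sequence_loss(x, y, U, V, W):
    # E(y, y_hat) = - sum_t log y_hat_t[correct word at t]
    y_hat, _ = forward(x, U, V, W)
    correct_word_probs = y_hat[np.arange(len(y)), y]
    return -np.sum(np.log(correct_word_probs))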



Remember that our goal is to calculate the gradients of the error with respect to our parameters $U, V$ and $W$ and then learn good parameters using Stochastic Gradient Descent. Just like we sum up the errors, we also sum up the gradients at each time step for one training example: $\frac{\partial E}{\partial W} = \sum\limits_{t} \frac{\partial E_t}{\partial W}$.

To calculate these gradients we use the chain rule of differentiation. That’s just the backpropagation algorithm applied backwards, starting from the error. For the rest of this post we’ll use $E_3$ as an example, just to have concrete numbers to work with.

\[\begin{aligned} \frac{\partial E_3}{\partial V} &=\frac{\partial E_3}{\partial \hat{y}_3}\frac{\partial\hat{y}_3}{\partial V}\\ &=\frac{\partial E_3}{\partial \hat{y}_3}\frac{\partial\hat{y}_3}{\partial z_3}\frac{\partial z_3}{\partial V}\\ &=(\hat{y}_3 - y_3) \otimes s_3 \\ \end{aligned} \]

In the above, $z_3 = Vs_3$, and $\otimes$ is the outer product of two vectors. Don’t worry if you don’t follow the above; I skipped several steps and you can try calculating these derivatives yourself (a good exercise!). The point I’m trying to get across is that $\frac{\partial E_3}{\partial V}$ only depends on the values at the current time step, $\hat{y}_3, y_3, s_3$. If you have these, calculating the gradient for $V$ is a simple matrix multiplication.
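If you want to convince yourself, here is a small sanity check (my own, reusing the forward sketch above, with arbitrary toy sizes) that compares $(\hat{y}_3 - y_3) \otimes s_3$ against a finite-difference estimate of $\frac{\partial E_3}{\partial V}$:

import numpy as np

np.random.seed(0)
word_dim, hidden_dim = 5, 4                    # toy sizes, chosen arbitrarily
U = np.random.randn(hidden_dim, word_dim) * 0.1
V = np.random.randn(word_dim, hidden_dim) * 0.1
W = np.random.randn(hidden_dim, hidden_dim) * 0.1
x = [1, 3, 0, 2]                               # toy input word indices
y = [3, 0, 2, 4]                               # toy "correct next word" indices

def E3(Vmat):
    y_hat, s = forward(x, U, Vmat, W)          # forward sketch from above
    return -np.log(y_hat[3, y[3]])             # cross-entropy loss at t = 3 only

# Analytic gradient: (y_hat_3 - y_3) outer s_3
y_hat, s = forward(x, U, V, W)
delta_o3 = y_hat[3].copy()
delta_o3[y[3]] -= 1.0
analytic = np.outer(delta_o3, s[3])

# Numerical gradient by central differences
numeric = np.zeros_like(V)
eps = 1e-6
for i in range(V.shape[0]):
    for j in range(V.shape[1]):
        Vp, Vm = V.copy(), V.copy()
        Vp[i, j] += eps
        Vm[i, j] -= eps
        numeric[i, j] = (E3(Vp) - E3(Vm)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))      # should be vanishingly small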

But the story is different for $\frac{\partial E_3}{\partial W}$ (and for $U$). To see why, we write out the chain rule, just as above:

\[\begin{aligned} \frac{\partial E_3}{\partial W} &= \frac{\partial E_3}{\partial \hat{y}_3}\frac{\partial\hat{y}_3}{\partial s_3}\frac{\partial s_3}{\partial W}\\ \end{aligned} \]

Now, note that $s_3 = \tanh(Ux_3 + Ws_2)$ depends on $s_2$, which depends on $W$ and $s_1$, and so on. So if we take the derivative with respect to $W$ we can’t simply treat $s_2$ as a constant! We need to apply the chain rule again, and what we really have is this:

\[\begin{aligned} \frac{\partial E_3}{\partial W} &= \sum\limits_{k=0}^{3} \frac{\partial E_3}{\partial \hat{y}_3}\frac{\partial\hat{y}_3}{\partial s_3}\frac{\partial s_3}{\partial s_k}\frac{\partial s_k}{\partial W}\\ \end{aligned} \]
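Written out for $E_3$, with $\frac{\partial s_k}{\partial W}$ understood as the “immediate” partial derivative that treats $s_{k-1}$ as a constant, this sum expands to:

\[\frac{\partial E_3}{\partial W} = \frac{\partial E_3}{\partial \hat{y}_3}\frac{\partial\hat{y}_3}{\partial s_3}\left(\frac{\partial s_3}{\partial W} + \frac{\partial s_3}{\partial s_2}\frac{\partial s_2}{\partial W} + \frac{\partial s_3}{\partial s_2}\frac{\partial s_2}{\partial s_1}\frac{\partial s_1}{\partial W} + \frac{\partial s_3}{\partial s_2}\frac{\partial s_2}{\partial s_1}\frac{\partial s_1}{\partial s_0}\frac{\partial s_0}{\partial W}\right)\]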

We sum up the contributions of each time step to the gradient. In other words, because $W$ is used in every step up to the output we care about, we need to backpropagate gradients from $t=3$ through the network all the way to $t=0$:

[Figure: the unrolled network, with gradients flowing backwards from $E_3$ to $t=0$.]
Note that this is exactly the same as the standard backpropagation algorithm that we use in deep Feedforward Neural Networks. The key difference is that we sum up the gradients for $W$ at each time step. In a traditional NN we don’t share parameters across layers, so we don’t need to sum anything. But in my opinion BPTT is just a fancy name for standard backpropagation on an unrolled RNN. Just like with backpropagation, you could define a delta vector that you pass backwards, e.g.: $\delta_2^{(3)} = \frac{\partial E_3}{\partial z_2} = \frac{\partial E_3}{\partial s_3}\frac{\partial s_3}{\partial s_2}\frac{\partial s_2}{\partial z_2}$ with $z_2 = Ux_2 + Ws_1$. Then the same equations will apply.

In code, a naive implementation of BPTT looks something like this:

def bptt(self, x, y):
    T = len(y)
    # Perform forward propagation
    o, s = self.forward_propagation(x)
    # We accumulate the gradients in these variables
    dLdU = np.zeros(self.U.shape)
    dLdV = np.zeros(self.V.shape)
    dLdW = np.zeros(self.W.shape)
    delta_o = o
    delta_o[np.arange(len(y)), y] -= 1.
    # For each output backwards...
    for t in np.arange(T)[::-1]:
        dLdV += np.outer(delta_o[t], s[t].T)
        # Initial delta calculation: dL/dz
        delta_t = self.V.T.dot(delta_o[t]) * (1 - (s[t] ** 2))
        # Backpropagation through time (for at most self.bptt_truncate steps)
        for bptt_step in np.arange(max(0, t - self.bptt_truncate), t + 1)[::-1]:
            # print "Backpropagation step t=%d bptt step=%d" % (t, bptt_step)
            # Add to gradients at each previous step
            dLdW += np.outer(delta_t, s[bptt_step - 1])
            dLdU[:, x[bptt_step]] += delta_t
            # Update delta for next step: dL/dz at t-1
            delta_t = self.W.T.dot(delta_t) * (1 - s[bptt_step - 1] ** 2)
    return [dLdU, dLdV, dLdW]


This should also give you an idea of why standard RNNs are hard to train: Sequences (sentences) can be quite long, perhaps 20 words or more, and thus you need to back-propagate through many layers. In practice many people truncate the backpropagation to a few steps.
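To get a feel for what truncation changes, here is a throwaway illustration (not part of the tutorial code) of the window that the inner loop above walks over for a 20-word sentence with bptt_truncate = 4:

import numpy as np

T, bptt_truncate = 20, 4   # a 20-word sentence, truncated to a 4-step window
for t in np.arange(T)[::-1]:
    window = np.arange(max(0, t - bptt_truncate), t + 1)[::-1]
    print(t, list(window))  # at most bptt_truncate + 1 steps per output, instead of t + 1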

The Vanishing Gradient Problem

In previous parts of the tutorial I mentioned that RNNs have difficulties learning long-range dependencies – interactions between words that are several steps apart. That’s problematic because the meaning of an English sentence is often determined by words that aren’t very close: “The man who wore a wig on his head went inside”. The sentence is really about a man going inside, not about the wig. But it’s unlikely that a plain RNN would be able to capture such information. To understand why, let’s take a closer look at the gradient we calculated above:

\[\begin{aligned} \frac{\partial E_3}{\partial W} &= \sum\limits_{k=0}^{3} \frac{\partial E_3}{\partial \hat{y}_3}\frac{\partial\hat{y}_3}{\partial s_3}\frac{\partial s_3}{\partial s_k}\frac{\partial s_k}{\partial W}\\ \end{aligned} \]

Note that $\frac{\partial s_3}{\partial s_k}$ is a chain rule in itself! For example, $\frac{\partial s_3}{\partial s_1} = \frac{\partial s_3}{\partial s_2}\frac{\partial s_2}{\partial s_1}$. Also note that because we are taking the derivative of a vector function with respect to a vector, the result is a matrix (called the Jacobian matrix) whose elements are all the pointwise derivatives. We can rewrite the above gradient:

\[\begin{aligned} \frac{\partial E_3}{\partial W} &= \sum\limits_{k=0}^{3} \frac{\partial E_3}{\partial \hat{y}_3}\frac{\partial\hat{y}_3}{\partial s_3} \left(\prod\limits_{j=k+1}^{3} \frac{\partial s_j}{\partial s_{j-1}}\right) \frac{\partial s_k}{\partial W}\\ \end{aligned} \]

It turns out (I won’t prove it here, but this paper goes into detail) that the 2-norm, which you can think of as an absolute value, of the above Jacobian matrix has an upper bound of 1. This makes intuitive sense because our $\tanh$ (or sigmoid) activation function maps all values into a range between -1 and 1, and the derivative is bounded by 1 (1/4 in the case of sigmoid) as well:



[Figure: tanh and its derivative.]

You can see that the $\tanh$ and sigmoid functions have derivatives of 0 at both ends. They approach a flat line. When this happens we say the corresponding neurons are saturated. They have a zero gradient and drive other gradients in previous layers towards 0. Thus, with small values in the matrix and multiple matrix multiplications ($t-k$ in particular) the gradient values are shrinking exponentially fast, eventually vanishing completely after a few time steps. Gradient contributions from “far away” steps become zero, and the state at those steps doesn’t contribute to what you are learning: you end up not learning long-range dependencies. Vanishing gradients aren’t exclusive to RNNs. They also happen in deep Feedforward Neural Networks. It’s just that RNNs tend to be very deep (as deep as the sentence length in our case), which makes the problem a lot more common.
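Here is a small self-contained experiment (my own illustration, not from the original post) that first checks the derivative bounds and then multiplies Jacobians of the form $\frac{\partial s_j}{\partial s_{j-1}} = \mathrm{diag}(1 - s_j^2)\,W$ together; the sizes and the scale of $W$ are arbitrary choices:

import numpy as np

# Derivative bounds: tanh' is at most 1, sigmoid' is at most 1/4
x = np.linspace(-10, 10, 10001)
sigmoid = 1 / (1 + np.exp(-x))
print((1 - np.tanh(x) ** 2).max())       # ~1.0, attained at x = 0
print((sigmoid * (1 - sigmoid)).max())   # ~0.25, attained at x = 0

# Multiplying many Jacobians with small entries makes the 2-norm shrink exponentially
np.random.seed(0)
hidden_dim, steps = 20, 15
W = np.random.randn(hidden_dim, hidden_dim) * 0.1    # recurrent weights with small values
product = np.eye(hidden_dim)
for j in range(steps):
    s_j = np.tanh(np.random.randn(hidden_dim))       # some hidden state, entries in (-1, 1)
    product = np.diag(1 - s_j ** 2).dot(W).dot(product)
    print(j, np.linalg.norm(product, 2))             # heads rapidly towards zero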

It is easy to imagine that, depending on our activation functions and network parameters, we could get exploding instead of vanishing gradients if the values of the Jacobian matrix are large. Indeed, that’s called the exploding gradient problem. The reason that vanishing gradients have received more attention than exploding gradients is two-fold. For one, exploding gradients are obvious: your gradients will become NaN (not a number) and your program will crash. Secondly, clipping the gradients at a pre-defined threshold (as discussed in this paper) is a very simple and effective solution to exploding gradients. Vanishing gradients are more problematic because it’s not obvious when they occur or how to deal with them.
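A minimal sketch of such clipping, in the spirit of the approach referenced above (the global-norm variant and the threshold value are my own choices, not taken from the paper):

import numpy as np

def clip_gradients(grads, max_norm=5.0):
    # Rescale the whole set of gradients if their combined norm exceeds max_norm
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads

# Example: clip the gradients returned by bptt before the parameter update
# dLdU, dLdV, dLdW = clip_gradients([dLdU, dLdV, dLdW])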

Fortunately, there are a few ways to combat the vanishing gradient problem. Proper initialization of the $W$ matrix can reduce the effect of vanishing gradients. So can regularization. A more preferred solution is to use ReLU instead of $\tanh$ or sigmoid activation functions: the ReLU derivative is a constant of either 0 or 1, so it isn’t as likely to suffer from vanishing gradients. An even more popular solution is to use Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) architectures. LSTMs were first proposed in 1997 and are perhaps the most widely used models in NLP today. GRUs, first proposed in 2014, are simplified versions of LSTMs. Both of these RNN architectures were explicitly designed to deal with vanishing gradients and efficiently learn long-range dependencies. We’ll cover them in the next part of this tutorial.
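To make the ReLU point concrete, here is a toy comparison (my own illustration): the tanh derivative decays towards zero for large inputs, while the ReLU derivative is exactly 0 or 1 and never saturates on its active side:

import numpy as np

x = np.array([-4.0, -1.0, 0.5, 2.0, 4.0])
print(1 - np.tanh(x) ** 2)        # shrinks towards 0 as |x| grows
print((x > 0).astype(float))      # exactly 0 or 1, no gradual saturation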