RNN Tutorial Part 3 – Backpropagation Through Time (BPTT) and the Vanishing Gradient Problem
2016-03-02 12:41
Reposted from: Recurrent Neural Networks Tutorial, Part 3 – Backpropagation Through Time and Vanishing Gradients
This is the third part of the RNN tutorial. In the previous part of the tutorial we implemented an RNN from scratch, but didn't go into detail on how the Backpropagation Through Time (BPTT) algorithm calculates the gradients. In this part we'll give a brief overview of BPTT and explain how it differs from traditional backpropagation. We will then try to understand the vanishing gradient problem, which has led to the development of LSTMs and GRUs, two of the currently most popular and powerful models used in NLP (and other areas). The vanishing gradient problem was originally discovered by Sepp Hochreiter in 1991 and has been receiving attention again recently due to the increased application of deep architectures.
To fully understand this part of the tutorial I recommend being familiar with how partial differentiation and basic backpropagation work. If you are not, you can find excellent tutorials here and here and here, in order of increasing difficulty.
Backpropagation Through Time (BPTT)
Let's quickly recap the basic equations of our RNN. Note that there's a slight change in notation from $o$ to $\hat{y}$. That's only to stay consistent with some of the literature out there that I am referencing.
\[\begin{aligned} s_t&= \tanh(Ux_t+ Ws_{t-1}) \\ \hat{y}_t&= \mathrm{softmax}(Vs_t) \end{aligned} \]
We also defined our loss, or error, to be the cross entropy loss, given by:
\[\begin{aligned} E_t(y_t, \hat{y}_t) &= - y_{t} \log \hat{y}_{t} \\ E(y, \hat{y}) &=\sum\limits_{t} E_t(y_t,\hat{y}_t) \\ & = -\sum\limits_{t} y_{t} \log \hat{y}_{t} \end{aligned} \]
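The summed cross-entropy loss above is straightforward to compute directly. Below is a minimal sketch; the sequence length, vocabulary size, and probability values are made-up illustrations, not the tutorial's data:

```python
import numpy as np

# A sequence of T=3 steps over a 4-word vocabulary: y holds one-hot
# target rows, y_hat holds the softmax outputs at each step.
y = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 0, 1]], dtype=float)
y_hat = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.5, 0.2, 0.1],
                  [0.1, 0.2, 0.3, 0.4]])

# E_t = -y_t . log(y_hat_t) at each step; total loss is the sum over t
E_t = -np.sum(y * np.log(y_hat), axis=1)
E = E_t.sum()
```

Because each $y_t$ is one-hot, each $E_t$ reduces to the negative log-probability the model assigned to the correct word.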
Here, $y_t$ is the correct word at time step $t$, and $\hat{y}_t$ is our prediction. We typically treat the full sequence (sentence) as one training example, so the total error is just the sum of the errors at each time step (word).
Remember that our goal is to calculate the gradients of the error with respect to our parameters $U, V$ and $W$ and then learn good parameters using Stochastic Gradient Descent. Just like we sum up the errors, we also sum up the gradients at each time step for one training example: $\frac{\partial E}{\partial W} = \sum\limits_{t} \frac{\partial E_t}{\partial W}$.
To calculate these gradients we use the chain rule of differentiation. That's the backpropagation algorithm when applied backwards starting from the error. For the rest of this post we'll use $E_3$ as an example, just to have concrete numbers to work with.
\[\begin{aligned} \frac{\partial E_3}{\partial V} &=\frac{\partial E_3}{\partial \hat{y}_3}\frac{\partial\hat{y}_3}{\partial V}\\ &=\frac{\partial E_3}{\partial \hat{y}_3}\frac{\partial\hat{y}_3}{\partial z_3}\frac{\partial z_3}{\partial V}\\ &=(\hat{y}_3 - y_3) \otimes s_3 \\ \end{aligned} \]
In the above, $z_3 = Vs_3$, and $\otimes$ is the outer product of two vectors. Don't worry if you don't follow the above, I skipped several steps and you can try calculating these derivatives yourself (good exercise!). The point I'm trying to get across is that $\frac{\partial E_3}{\partial V}$ only depends on the values at the current time step, $\hat{y}_3, y_3, s_3$. If you have these, calculating the gradient for $V$ is a simple matrix multiplication.
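To make the outer product concrete, here is a small sketch of $\frac{\partial E_3}{\partial V} = (\hat{y}_3 - y_3) \otimes s_3$ with made-up sizes (vocabulary size 4, hidden size 3) and illustrative values:

```python
import numpy as np

y3 = np.array([0., 1., 0., 0.])          # one-hot target at t=3
y_hat3 = np.array([0.1, 0.6, 0.2, 0.1])  # softmax output at t=3
s3 = np.array([0.5, -0.2, 0.8])          # hidden state at t=3

# The gradient has the same shape as V: (vocab size, hidden size)
dEdV = np.outer(y_hat3 - y3, s3)
```

Note that everything needed here is already available after the forward pass at step 3; no backpropagation through earlier steps is required for $V$.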
But the story is different for $\frac{\partial E_3}{\partial W}$ (and for $U$). To see why, we write out the chain rule, just as above:
\[\begin{aligned} \frac{\partial E_3}{\partial W} &= \frac{\partial E_3}{\partial \hat{y}_3}\frac{\partial\hat{y}_3}{\partial s_3}\frac{\partial s_3}{\partial W}\\ \end{aligned} \]
Now, note that $s_3 = \tanh(Ux_3 + Ws_2)$ depends on $s_2$, which depends on $W$ and $s_1$, and so on. So if we take the derivative with respect to $W$ we can't simply treat $s_2$ as a constant! We need to apply the chain rule again and what we really have is this:
\[\begin{aligned} \frac{\partial E_3}{\partial W} &= \sum\limits_{k=0}^{3} \frac{\partial E_3}{\partial \hat{y}_3}\frac{\partial\hat{y}_3}{\partial s_3}\frac{\partial s_3}{\partial s_k}\frac{\partial s_k}{\partial W}\\ \end{aligned} \]
We sum up the contributions of each time step to the gradient. In other words, because $W$ is used in every step up to the output we care about, we need to backpropagate gradients from $t=3$ through the network all the way to $t=0$.
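The summed chain rule can be sanity-checked numerically with a tiny scalar version of the recurrence. The loss here is simplified to $E = s_3$ (rather than the cross-entropy above) and the values of $u$, $w$, and $x$ are arbitrary illustrations:

```python
import numpy as np

u, w = 0.5, 0.9
x = [1.0, -0.5, 0.25, 0.8]  # x_0 .. x_3

def states(w):
    # Scalar recurrence s_t = tanh(u*x_t + w*s_{t-1}), with s_{-1} = 0
    s, out = 0.0, []
    for x_t in x:
        s = np.tanh(u * x_t + w * s)
        out.append(s)
    return out

s = states(w)
# BPTT sum: dE/dw = sum_k (ds_3/ds_k) * (1 - s_k^2) * s_{k-1}
grad, chain = 0.0, 1.0  # chain = ds_3/ds_k, starting with k=3
for k in range(3, -1, -1):
    s_prev = s[k - 1] if k > 0 else 0.0
    grad += chain * (1 - s[k] ** 2) * s_prev
    chain *= (1 - s[k] ** 2) * w  # extend the product back to ds_3/ds_{k-1}

# Compare against a centered finite difference in w
eps = 1e-6
numeric = (states(w + eps)[-1] - states(w - eps)[-1]) / (2 * eps)
```

The analytic sum over $k=0,\dots,3$ and the numerical derivative agree to high precision, which is exactly the statement of the summed chain rule above.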
Note that this is exactly the same as the standard backpropagation algorithm that we use in deep Feedforward Neural Networks. The key difference is that we sum up the gradients for $W$ at each time step. In a traditional NN we don't share parameters across layers, so we don't need to sum anything. But in my opinion BPTT is just a fancy name for standard backpropagation on an unrolled RNN. Just like with backpropagation you could define a delta vector that you pass backwards, e.g.: $\delta_2^{(3)} = \frac{\partial E_3}{\partial z_2} = \frac{\partial E_3}{\partial s_3}\frac{\partial s_3}{\partial s_2}\frac{\partial s_2}{\partial z_2}$ with $z_2 = Ux_2 + Ws_1$. Then the same equations will apply.
In code, a naive implementation of BPTT looks something like this:
```python
def bptt(self, x, y):
    T = len(y)
    # Perform forward propagation
    o, s = self.forward_propagation(x)
    # We accumulate the gradients in these variables
    dLdU = np.zeros(self.U.shape)
    dLdV = np.zeros(self.V.shape)
    dLdW = np.zeros(self.W.shape)
    delta_o = o
    delta_o[np.arange(len(y)), y] -= 1.
    # For each output backwards...
    for t in np.arange(T)[::-1]:
        dLdV += np.outer(delta_o[t], s[t].T)
        # Initial delta calculation: dL/dz
        delta_t = self.V.T.dot(delta_o[t]) * (1 - (s[t] ** 2))
        # Backpropagation through time (for at most self.bptt_truncate steps)
        for bptt_step in np.arange(max(0, t - self.bptt_truncate), t + 1)[::-1]:
            # print "Backpropagation step t=%d bptt step=%d" % (t, bptt_step)
            # Add to gradients at each previous step
            dLdW += np.outer(delta_t, s[bptt_step - 1])
            dLdU[:, x[bptt_step]] += delta_t
            # Update delta for next step dL/dz at t-1
            delta_t = self.W.T.dot(delta_t) * (1 - s[bptt_step - 1] ** 2)
    return [dLdU, dLdV, dLdW]
```
This should also give you an idea of why standard RNNs are hard to train: Sequences (sentences) can be quite long, perhaps 20 words or more, and thus you need to back-propagate through many layers. In practice many people truncate the backpropagation to a few steps.
The Vanishing Gradient Problem
In previous parts of the tutorial I mentioned that RNNs have difficulties learning long-range dependencies – interactions between words that are several steps apart. That's problematic because the meaning of an English sentence is often determined by words that aren't very close: "The man who wore a wig on his head went inside". The sentence is really about a man going inside, not about the wig. But it's unlikely that a plain RNN would be able to capture such information. To understand why, let's take a closer look at the gradient we calculated above:
\[\begin{aligned} \frac{\partial E_3}{\partial W} &= \sum\limits_{k=0}^{3} \frac{\partial E_3}{\partial \hat{y}_3}\frac{\partial\hat{y}_3}{\partial s_3}\frac{\partial s_3}{\partial s_k}\frac{\partial s_k}{\partial W}\\ \end{aligned} \]
Note that $\frac{\partial s_3}{\partial s_k}$ is a chain rule in itself! For example, $\frac{\partial s_3}{\partial s_1} = \frac{\partial s_3}{\partial s_2}\frac{\partial s_2}{\partial s_1}$. Also note that because we are taking the derivative of a vector function with respect to a vector, the result is a matrix (called the Jacobian matrix) whose elements are all the pointwise derivatives. We can rewrite the above gradient:
\[\begin{aligned} \frac{\partial E_3}{\partial W} &= \sum\limits_{k=0}^{3} \frac{\partial E_3}{\partial \hat{y}_3}\frac{\partial\hat{y}_3}{\partial s_3} \left(\prod\limits_{j=k+1}^{3} \frac{\partial s_j}{\partial s_{j-1}}\right) \frac{\partial s_k}{\partial W}\\ \end{aligned} \]
It turns out (I won't prove it here but this paper goes into detail) that the 2-norm, which you can think of as an absolute value, of the above Jacobian matrix has an upper bound of 1. This makes intuitive sense because our $\tanh$ (or sigmoid) activation function maps all values into a range between -1 and 1, and the derivative is bounded by 1 (1/4 in the case of sigmoid) as well:
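These derivative bounds are easy to verify numerically. A quick check over a dense grid of inputs:

```python
import numpy as np

x = np.linspace(-10, 10, 10001)

# tanh'(x) = 1 - tanh(x)^2, maximized at x = 0 where it equals 1
tanh_grad = 1 - np.tanh(x) ** 2

# sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)), maximized at x = 0
# where it equals 1/4
sigmoid = 1 / (1 + np.exp(-x))
sigmoid_grad = sigmoid * (1 - sigmoid)
```

Both derivatives peak at the origin and decay toward 0 in either direction, which is the saturation behavior discussed below.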
[Figure: $\tanh$ and its derivative.]
You can see that the $\tanh$ and sigmoid functions have derivatives of 0 at both ends. They approach a flat line. When this happens we say the corresponding neurons are saturated. They have a zero gradient and drive other gradients in previous layers towards 0. Thus, with small values in the matrix and multiple matrix multiplications ($t-k$ in particular) the gradient values are shrinking exponentially fast, eventually vanishing completely after a few time steps. Gradient contributions from "far away" steps become zero, and the state at those steps doesn't contribute to what you are learning: You end up not learning long-range dependencies. Vanishing gradients aren't exclusive to RNNs. They also happen in deep Feedforward Neural Networks. It's just that RNNs tend to be very deep (as deep as the sentence length in our case), which makes the problem a lot more common.
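The exponential shrinking of the Jacobian product $\prod_j \frac{\partial s_j}{\partial s_{j-1}}$ can be illustrated directly. For $s_j = \tanh(Ux_j + Ws_{j-1})$ each factor is $\mathrm{diag}(1 - s_j^2)\,W$; the sketch below uses a small random $W$ and stand-in hidden states (toy values, not a trained network):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
W = rng.normal(scale=0.1, size=(n, n))  # small weights -> contracting Jacobians

norms = []
J = np.eye(n)
for _ in range(20):
    s = np.tanh(rng.normal(size=n))       # stand-in hidden state at step j
    J = (np.diag(1 - s ** 2) @ W) @ J     # extend the product one step further back
    norms.append(np.linalg.norm(J, 2))    # 2-norm of the accumulated Jacobian
```

The recorded norms decay roughly geometrically with the number of steps, which is exactly why gradient contributions from far-away time steps vanish.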
It is easy to imagine that, depending on our activation functions and network parameters, we could get exploding instead of vanishing gradients if the values of the Jacobian matrix are large. Indeed, that's called the exploding gradient problem. The reason that vanishing gradients have received more attention than exploding gradients is two-fold. For one, exploding gradients are obvious. Your gradients will become NaN (not a number) and your program will crash. Secondly, clipping the gradients at a pre-defined threshold (as discussed in this paper) is a very simple and effective solution to exploding gradients. Vanishing gradients are more problematic because it's not obvious when they occur or how to deal with them.
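Norm-based gradient clipping is simple enough to sketch in a few lines; the threshold value here is an arbitrary illustration:

```python
import numpy as np

def clip_gradient(grad, threshold=5.0):
    # If the gradient's norm exceeds the threshold, rescale it so its
    # norm equals the threshold while keeping its direction.
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad
```

For example, an exploding gradient `[30, 40]` (norm 50) would be rescaled to `[3, 4]` (norm 5), while gradients already under the threshold pass through unchanged.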
Fortunately, there are a few ways to combat the vanishing gradient problem. Proper initialization of the $W$ matrix can reduce the effect of vanishing gradients. So can regularization. A more preferred solution is to use ReLU instead of $\tanh$ or sigmoid activation functions. The ReLU derivative is a constant of either 0 or 1, so it isn't as likely to suffer from vanishing gradients. An even more popular solution is to use Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) architectures. LSTMs were first proposed in 1997 and are perhaps the most widely used models in NLP today. GRUs, first proposed in 2014, are simplified versions of LSTMs. Both of these RNN architectures were explicitly designed to deal with vanishing gradients and efficiently learn long-range dependencies. We'll cover them in the next part of this tutorial.