Exploding Gradients

Original: A Gentle Introduction to Exploding Gradients in Neural Networks

Translation: Introduction | Understanding Exploding Gradients in Neural Networks (translated by 机器之心)


A Gentle Introduction to Exploding Gradients in Neural Networks

by Jason Brownlee on December 18, 2017 in Long Short-Term Memory Networks

Exploding gradients are a problem where large error gradients accumulate and result in very large updates to neural network model weights during training.

This makes your model unstable and unable to learn from your training data.

In this post, you will discover the problem of exploding gradients with deep artificial neural networks.

After completing this post, you will know:

What exploding gradients are and the problems they cause during training.

How to know whether you may have exploding gradients with your network model.

How you can fix the exploding gradient problem with your network.

Let’s get started.

A Gentle Introduction to Exploding Gradients in Recurrent Neural Networks

Photo by Taro Taylor, some rights reserved.

What Are Exploding Gradients?

An error gradient is the direction and magnitude calculated during the training of a neural network that is used to update the network weights in the right direction and by the right amount.

In deep networks or recurrent neural networks, error gradients can accumulate during an update and result in very large gradients. These in turn result in large updates to the network weights, and in turn, an unstable network. At an extreme, the values of weights can become so large as to overflow and result in NaN values.

The explosion occurs through exponential growth by repeatedly multiplying gradients through the network layers that have values larger than 1.0.
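The exponential growth can be illustrated with a small sketch. It assumes, purely for illustration, that every layer scales the gradient by the same constant factor; the function name backpropagated_scale and the choice of 50 layers are hypothetical and do not come from the original article.

```python
def backpropagated_scale(factor, num_layers):
    """Scale applied to a gradient after passing back through `num_layers`
    layers, assuming each layer multiplies it by the same `factor`."""
    return factor ** num_layers


# Compare a factor below 1.0, exactly 1.0, and above 1.0 over 50 layers.
for factor in (0.9, 1.0, 1.5):
    scale = backpropagated_scale(factor, 50)
    print(f"factor={factor}: gradient scaled by {scale:.3e}")
```

A factor only slightly above 1.0 already produces an enormous scale after a few dozen layers, while a factor below 1.0 shrinks the gradient toward zero.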

What Is the Problem with Exploding Gradients?

In deep multilayer Perceptron networks, exploding gradients can result in an unstable network that at best cannot learn from the training data and at worst results in NaN weight values that can no longer be updated.

… exploding gradients can make learning unstable.

— Page 282, Deep Learning, 2016.

In recurrent neural networks, exploding gradients can result in an unstable network that at worst is unable to learn from the training data and at best cannot learn over long input sequences.

… the exploding gradients problem refers to the large increase in the norm of the gradient during training. Such events are due to the explosion of the long term components

— On the difficulty of training recurrent neural networks, 2013.

How Do You Know if You Have Exploding Gradients?

There are some subtle signs that you may be suffering from exploding gradients during the training of your network, such as:

The model is unable to get traction on your training data (e.g. poor loss).

The model is unstable, resulting in large changes in loss from update to update.

The model loss goes to NaN during training.

If you have these types of problems, you can dig deeper to see if you have a problem with exploding gradients.

There are some less subtle signs that you can use to confirm that you have exploding gradients.

The model weights quickly become very large during training.

The model weights go to NaN values during training.

The error gradient values are consistently above 1.0 for each node and layer during training.
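The checks above could be automated roughly as follows. This is a minimal sketch that assumes the gradients have already been extracted from the framework as plain lists of floats per layer; the function name gradient_report, the layer names, and the sample values are hypothetical. The 1.0 threshold follows the rule of thumb in the text.

```python
import math


def gradient_report(gradients_by_layer, threshold=1.0):
    """Return {layer_name: (l2_norm, has_nan, suspicious)} for each layer.

    A layer is flagged as suspicious if its gradient contains NaN or if
    its L2 norm exceeds `threshold`.
    """
    report = {}
    for name, values in gradients_by_layer.items():
        has_nan = any(math.isnan(v) for v in values)
        norm = math.sqrt(sum(v * v for v in values))
        report[name] = (norm, has_nan, has_nan or norm > threshold)
    return report


# Hypothetical gradients for three layers.
gradients = {
    "dense_1": [0.3, -0.4],
    "dense_2": [30.0, 40.0],
    "dense_3": [float("nan"), 1.0],
}

for layer, (norm, has_nan, suspicious) in gradient_report(gradients).items():
    print(layer, "norm:", norm, "NaN:", has_nan, "suspicious:", suspicious)
```

Running such a report every few updates, rather than once, is what would show whether the values are consistently large.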

How to Fix Exploding Gradients?

There are many approaches to addressing exploding gradients; this section lists some best practice approaches that you can use.

1. Re-Design the Network Model

In deep neural networks, exploding gradients may be addressed by redesigning the network to have fewer layers.

There may also be some benefit in using a smaller batch size while training the network.

In recurrent neural networks, updating across fewer prior time steps during training, called truncated backpropagation through time, may reduce the exploding gradient problem.
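One simple way to limit how many time steps an update spans is to cut long sequences into shorter subsequences before training. The sketch below shows only that data-preparation step, not a full truncated backpropagation through time implementation; the function name split_into_windows and the window length of 4 are hypothetical choices.

```python
def split_into_windows(sequence, window):
    """Split a sequence into consecutive chunks of at most `window` steps."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [sequence[i:i + window] for i in range(0, len(sequence), window)]


# A hypothetical sequence of 10 time steps, cut into windows of 4 steps.
long_sequence = list(range(10))
windows = split_into_windows(long_sequence, window=4)
print(windows)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Each subsequence is then treated as its own training sample, so gradients are propagated back over at most 4 time steps in this example.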

2. Use Rectified Linear Activation

In deep multilayer Perceptron neural networks, exploding gradients can occur given the choice of activation function, such as the historically popular sigmoid and tanh functions.

Exploding gradients can be reduced by using the rectified linear (ReLU) activation function.

Adopting the ReLU activation function is a new best practice for hidden layers.

3. Use Long Short-Term Memory Networks

In recurrent neural networks, exploding gradients can occur given the inherent instability in the training of this type of network, e.g. via backpropagation through time, which essentially transforms the recurrent network into a deep multilayer Perceptron neural network.

Exploding gradients can be reduced by using Long Short-Term Memory (LSTM) memory units and perhaps related gated-type neuron structures.

Adopting LSTM memory units is a new best practice for recurrent neural networks for sequence prediction.

4. Use Gradient Clipping

Exploding gradients can still occur in very deep Multilayer Perceptron networks with a large batch size and LSTMs with very long input sequence lengths.

If exploding gradients are still occurring, you can check for and limit the size of gradients during the training of your network.

This is called gradient clipping.

Dealing with the exploding gradients has a simple but very effective solution: clipping gradients if their norm exceeds a given threshold.

— Section 5.2.4, Vanishing and Exploding Gradients, Neural Network Methods in Natural Language Processing, 2017.

Specifically, the values of the error gradient are checked against a threshold value and clipped or set to that threshold value if the error gradient exceeds the threshold.

To some extent, the exploding gradient problem can be mitigated by gradient clipping (thresholding the values of the gradients before performing a gradient descent step).

— Page 294, Deep Learning, 2016.

In the Keras deep learning library, you can use gradient clipping by setting the clipnorm or clipvalue arguments on your optimizer before training.

Good default values are clipnorm=1.0 and clipvalue=0.5.

Usage of optimizers in the Keras API
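To make the two clipping styles concrete, here is a conceptual sketch in plain Python. It is not Keras's own implementation; the function names clip_by_value and clip_by_norm and the sample gradient are hypothetical, and the thresholds reuse the 0.5 and 1.0 values suggested above.

```python
import math


def clip_by_value(gradient, clip_value):
    """Clamp every gradient element into [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, g)) for g in gradient]


def clip_by_norm(gradient, clip_norm):
    """Rescale the gradient so its L2 norm is at most `clip_norm`."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= clip_norm:
        return list(gradient)  # already small enough, leave unchanged
    scale = clip_norm / norm
    return [g * scale for g in gradient]


# Hypothetical gradient with L2 norm 5.0.
gradient = [3.0, -4.0]
print("by value:", clip_by_value(gradient, 0.5))
print("by norm: ", clip_by_norm(gradient, 1.0))
```

Clipping by value treats each element independently and can change the direction of the gradient, whereas clipping by norm shrinks the whole vector and keeps its direction.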

5. Use Weight Regularization

Another approach, if exploding gradients are still occurring, is to check the size of network weights and apply a penalty to the network's loss function for large weight values.

This is called weight regularization and often an L1 (absolute weights) or an L2 (squared weights) penalty can be used.

Using an L1 or L2 penalty on the recurrent weights can help with exploding gradients

— On the difficulty of training recurrent neural networks, 2013.

In the Keras deep learning library, you can use weight regularization by setting the kernel_regularizer argument on your layer and using an L1 or L2 regularizer.

Usage of regularizers in the Keras API
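The penalty terms themselves are simple to compute. The following sketch shows, in plain Python, how an L1 or L2 penalty would be added to a loss value; it illustrates the idea rather than Keras's internals, and the names l1_penalty, l2_penalty, and strength, as well as all numbers, are hypothetical.

```python
def l1_penalty(weights, strength):
    """L1 penalty: strength times the sum of absolute weight values."""
    return strength * sum(abs(w) for w in weights)


def l2_penalty(weights, strength):
    """L2 penalty: strength times the sum of squared weight values."""
    return strength * sum(w * w for w in weights)


# Hypothetical values for illustration only.
weights = [1.0, -2.0, 3.0]
data_loss = 0.25
strength = 0.01

print("loss with L1 penalty:", data_loss + l1_penalty(weights, strength))
print("loss with L2 penalty:", data_loss + l2_penalty(weights, strength))
```

Because the penalty grows with the size of the weights, minimizing the combined loss discourages the large weight values that accompany exploding gradients.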

Further Reading

This section provides more resources on the topic if you are looking to go deeper.


Deep Learning, 2016.

Neural Network Methods in Natural Language Processing, 2017.


On the difficulty of training recurrent neural networks, 2013.

Learning long-term dependencies with gradient descent is difficult, 1994.

Understanding the exploding gradient problem, 2012.


Why is it a problem to have exploding gradients in a neural net (especially in an RNN)?

How does LSTM help prevent the vanishing (and exploding) gradient problem in a recurrent neural network?

Rectifier (neural networks)

Keras API

Usage of optimizers in the Keras API

Usage of regularizers in the Keras API


In this post, you discovered the problem of exploding gradients when training deep neural network models.

Specifically, you learned:

What exploding gradients are and the problems they cause during training.

How to know whether you may have exploding gradients with your network model.

How you can fix the exploding gradient problem with your network.

Introduction | Understanding Exploding Gradients in Neural Networks








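In deep networks or recurrent neural networks, error gradients can accumulate during an update and become very large, which in turn causes large updates to the network weights and makes the network unstable. In extreme cases, the weight values become so large that they overflow and produce NaN values.

Exploding gradients arise from the exponential growth caused by repeatedly multiplying gradients with values greater than 1.0 across the network layers.


In deep multilayer perceptron networks, exploding gradients make the network unstable: at best it cannot learn from the training data, and at worst it ends up with NaN weight values that can no longer be updated.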








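The model loss goes to NaN during training.




The model weights go to NaN values during training.

The error gradient values are consistently above 1.0 for each node and layer during training.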



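1. Redesign the Network Model



In recurrent neural networks, updating across fewer prior time steps during training (truncated backpropagation through time) can mitigate the exploding gradient problem.

2. Use the ReLU Activation Function

In deep multilayer perceptron neural networks, exploding gradients can occur because of the activation function, such as the previously popular sigmoid and tanh functions.

Using the ReLU activation function can reduce exploding gradients. Adopting the ReLU activation function is a new best practice for hidden layers.

3. Use Long Short-Term Memory Networks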



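Adopting LSTM units is a new best practice for sequence prediction with recurrent neural networks.

4. Use Gradient Clipping

Exploding gradients can still occur in very deep multilayer perceptron networks with a large batch size and in LSTMs with long input sequences. If exploding gradients are still occurring, you can check and limit the size of the gradients during training. This is called gradient clipping.


— Neural Network Methods in Natural Language Processing, 2017.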




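In the Keras deep learning library, you can use gradient clipping by setting the clipnorm or clipvalue arguments on your optimizer before training.

Good default values are clipnorm=1.0 and clipvalue=0.5. See: https://keras.io/optimizers/.

5. Use Weight Regularization

If exploding gradients persist, another approach is to check the size of the network weights and apply a penalty to the loss function for large weight values. This is called weight regularization, and it typically uses an L1 penalty (absolute weights) or an L2 penalty (squared weights).

Using an L1 or L2 penalty on the recurrent weights can help with exploding gradients.

— On the difficulty of training recurrent neural networks, 2013.

In the Keras deep learning library, you can use weight regularization by setting the kernel_regularizer argument on your layer and using an L1 or L2 regularizer.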




Deep Learning, 2016.(http://amzn.to/2fwdoKR)

Neural Network Methods in Natural Language Processing, 2017.(http://amzn.to/2fwTPCn)


On the difficulty of training recurrent neural networks, 2013.(http://proceedings.mlr.press/v28/pascanu13.pdf)

Learning long-term dependencies with gradient descent is difficult, 1994.(http://www.dsi.unifi.it/~paolo/ps/tnn-94-gradient.pdf)

Understanding the exploding gradient problem, 2012.(https://pdfs.semanticscholar.org/728d/814b92a9d2c6118159bb7d9a4b3dc5eeaaeb.pdf)


Why is it a problem to have exploding gradients in a neural net (especially in an RNN)?(https://www.quora.com/Why-is-it-a-problem-to-have-exploding-gradients-in-a-neural-net-especially-in-an-RNN)

How does LSTM help prevent the vanishing (and exploding) gradient problem in a recurrent neural network?(https://www.quora.com/How-does-LSTM-help-prevent-the-vanishing-and-exploding-gradient-problem-in-a-recurrent-neural-network)

Rectifier (neural networks)(https://en.wikipedia.org/wiki/Rectifier_(neural_networks))

Keras API

Usage of optimizers in the Keras API(https://keras.io/optimizers/)

Usage of regularizers in the Keras API(https://keras.io/regularizers/)
