Deep Belief Networks
2016-04-02 19:08
Note
This section assumes the reader has already read through Classifying MNIST digits using Logistic Regression and Multilayer Perceptron and Restricted Boltzmann Machines (RBM). Additionally it uses the following Theano functions and concepts: T.tanh, shared variables, basic arithmetic ops, T.grad, Random numbers, floatX. If you intend to run the code on GPU also read GPU.
Note
The code for this section is available for download here.
Deep Belief Networks
[Hinton06] showed that RBMs can be stacked and trained in a greedy manner to form so-called Deep Belief Networks (DBN). DBNs are graphical models which learn to extract a deep hierarchical representation of the training data. They model the joint distribution between the observed vector $x$ and the $\ell$ hidden layers $h^1, \ldots, h^{\ell}$ as follows:

(1)

$$P(x, h^1, \ldots, h^{\ell}) = \left(\prod_{k=0}^{\ell-2} P(h^k \mid h^{k+1})\right) P(h^{\ell-1}, h^{\ell})$$
where $x = h^0$, $P(h^{k-1} \mid h^k)$ is a conditional distribution for the visible units conditioned on the hidden units of the RBM at level $k$, and $P(h^{\ell-1}, h^{\ell})$ is the visible-hidden joint distribution in the top-level RBM. This is illustrated in the figure below.
![](http://deeplearning.net/tutorial/_images/DBN3.png)
The principle of greedy layer-wise unsupervised training can be applied to DBNs with RBMs as the building blocks for each layer [Hinton06], [Bengio07].
The process is as follows:
1. Train the first layer as an RBM that models the raw input $x = h^0$ as its visible layer.
2. Use that first layer to obtain a representation of the input that will be used as data for the second layer. Two common solutions exist. This representation can be chosen as being the mean activations $p(h^1 = 1 \mid h^0)$ or samples of $p(h^1 \mid h^0)$ (a minimal sketch of this choice is shown after the list).
3. Train the second layer as an RBM, taking the transformed data (samples or mean activations) as training examples (for the visible layer of that RBM).
4. Iterate (2 and 3) for the desired number of layers, each time propagating upward either samples or mean values.
5. Fine-tune all the parameters of this deep architecture with respect to a proxy for the DBN log-likelihood, or with respect to a supervised training criterion (after adding extra learning machinery to convert the learned representation into supervised predictions, e.g. a linear classifier).
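The sketch below illustrates the choice in step 2 with plain NumPy. It is not part of the tutorial's code: `W` and `hbias` stand for a trained first-layer RBM's weight matrix and hidden biases, and `rng` for a `numpy.random.RandomState`.

```python
import numpy

def sigmoid(a):
    return 1.0 / (1.0 + numpy.exp(-a))

def propagate_up(x, W, hbias, rng, use_samples=False):
    """Return either mean activations p(h^1=1|h^0=x) or binary samples."""
    mean_h = sigmoid(numpy.dot(x, W) + hbias)  # p(h^1 = 1 | h^0 = x)
    if use_samples:
        # sample h^1 ~ p(h^1 | h^0 = x) by thresholding uniform noise
        return (rng.uniform(size=mean_h.shape) < mean_h).astype('float32')
    return mean_h
```

Either output can then serve as the training data for the next RBM in step 3.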
In this tutorial, we focus on fine-tuning via supervised gradient descent. Specifically, we use a logistic regression classifier to classify the input $x$ based on the output of the last hidden layer $h^{\ell}$ of the DBN. Fine-tuning is then performed via supervised gradient descent of the negative log-likelihood cost function. Since the supervised gradient is only non-null for the weights and hidden layer biases of each layer (i.e. null for the visible biases of each RBM), this procedure is equivalent to initializing the parameters of a deep MLP with the weights and hidden layer biases obtained with the unsupervised training strategy.
Justifying Greedy-Layer Wise Pre-Training
Why does such an algorithm work? Taking as example a 2-layer DBN with hidden layers $h^{(1)}$ and $h^{(2)}$ (with respective weight parameters $W^{(1)}$ and $W^{(2)}$), [Hinton06] established (see also [Bengio09] for a detailed derivation) that $\log p(x)$ can be rewritten as,

(2)

$$\log p(x) = KL\left(Q(h^{(1)}|x) \,\|\, p(h^{(1)}|x)\right) + H_{Q(h^{(1)}|x)} + \sum_{h^{(1)}} Q(h^{(1)}|x)\left(\log p(h^{(1)}) + \log p(x|h^{(1)})\right)$$
$KL\left(Q(h^{(1)}|x) \,\|\, p(h^{(1)}|x)\right)$ represents the KL divergence between the posterior $Q(h^{(1)}|x)$ of the first RBM if it were standalone, and the probability $p(h^{(1)}|x)$ for the same layer but defined by the entire DBN (i.e. taking into account the prior $p(h^{(1)})$ defined by the top-level RBM). $H_{Q(h^{(1)}|x)}$ is the entropy of the distribution $Q(h^{(1)}|x)$.
It can be shown that if we initialize both hidden layers such that $W^{(2)} = {W^{(1)}}^T$, $Q(h^{(1)}|x) = p(h^{(1)}|x)$ and the KL divergence term is null. If we learn the first level RBM and then keep its parameters $W^{(1)}$ fixed, optimizing Eq. (2) with respect to $W^{(2)}$ can thus only increase the likelihood $p(x)$.
Also, notice that if we isolate the terms which depend only on $W^{(2)}$, we get:

$$\sum_{h^{(1)}} Q(h^{(1)}|x) \log p(h^{(1)})$$

Optimizing this with respect to $W^{(2)}$ amounts to training a second-stage RBM, using the output of $Q(h^{(1)}|x)$ as the training distribution, when $x$ is sampled from the training distribution for the first RBM.
Implementation
To implement DBNs in Theano, we will use the class defined in the Restricted Boltzmann Machines (RBM) tutorial. One can also observe that the code for the DBN is very similar to the one for the SdA, because both involve the principle of unsupervised layer-wise pre-training followed by supervised fine-tuning as a deep MLP. The main difference is that we use the RBM class instead of the dA class.
We start off by defining the DBN class which will store the layers of the MLP, along with their associated RBMs. Since we take the viewpoint of using the RBMs to initialize an MLP, the code will reflect this by separating as much as possible the RBMs used to initialize the network and the MLP used for classification.
```python
# Imports assumed from Theano and the earlier tutorials' code files
# (logistic_sgd.py, mlp.py, rbm.py from the Deep Learning Tutorials).
import numpy
import theano
import theano.tensor as T
from theano.sandbox.rng_mrg import MRG_RandomStreams

from logistic_sgd import LogisticRegression
from mlp import HiddenLayer
from rbm import RBM


class DBN(object):
    """Deep Belief Network

    A deep belief network is obtained by stacking several RBMs on top of
    each other. The hidden layer of the RBM at layer `i` becomes the input
    of the RBM at layer `i+1`. The first layer RBM gets as input the input
    of the network, and the hidden layer of the last RBM represents the
    output. When used for classification, the DBN is treated as a MLP, by
    adding a logistic regression layer on top.
    """

    def __init__(self, numpy_rng, theano_rng=None, n_ins=784,
                 hidden_layers_sizes=[500, 500], n_outs=10):
        """This class is made to support a variable number of layers.

        :type numpy_rng: numpy.random.RandomState
        :param numpy_rng: numpy random number generator used to draw initial
                          weights

        :type theano_rng: theano.tensor.shared_randomstreams.RandomStreams
        :param theano_rng: Theano random generator; if None is given one is
                           generated based on a seed drawn from `rng`

        :type n_ins: int
        :param n_ins: dimension of the input to the DBN

        :type hidden_layers_sizes: list of ints
        :param hidden_layers_sizes: intermediate layers size, must contain
                                    at least one value

        :type n_outs: int
        :param n_outs: dimension of the output of the network
        """

        self.sigmoid_layers = []
        self.rbm_layers = []
        self.params = []
        self.n_layers = len(hidden_layers_sizes)

        assert self.n_layers > 0

        if not theano_rng:
            theano_rng = MRG_RandomStreams(numpy_rng.randint(2 ** 30))

        # allocate symbolic variables for the data
        self.x = T.matrix('x')   # the data is presented as rasterized images
        self.y = T.ivector('y')  # the labels are presented as 1D vector
                                 # of [int] labels
```
`self.sigmoid_layers` will store the feed-forward graphs which together form the MLP, while `self.rbm_layers` will store the RBMs used to pretrain each layer of the MLP.
Next step, we construct `n_layers` sigmoid layers (we use the `HiddenLayer` class introduced in Multilayer Perceptron, with the only modification that we replaced the non-linearity from `tanh` to the logistic function $s(x) = \frac{1}{1+e^{-x}}$) and `n_layers` RBMs, where `n_layers` is the depth of our model. We link the sigmoid layers such that they form an MLP, and construct each RBM such that they share the weight matrix and the hidden bias with its corresponding sigmoid layer.
```python
        for i in range(self.n_layers):
            # construct the sigmoidal layer

            # the size of the input is either the number of hidden
            # units of the layer below or the input size if we are on
            # the first layer
            if i == 0:
                input_size = n_ins
            else:
                input_size = hidden_layers_sizes[i - 1]

            # the input to this layer is either the activation of the
            # hidden layer below or the input of the DBN if you are on
            # the first layer
            if i == 0:
                layer_input = self.x
            else:
                layer_input = self.sigmoid_layers[-1].output

            sigmoid_layer = HiddenLayer(rng=numpy_rng,
                                        input=layer_input,
                                        n_in=input_size,
                                        n_out=hidden_layers_sizes[i],
                                        activation=T.nnet.sigmoid)

            # add the layer to our list of layers
            self.sigmoid_layers.append(sigmoid_layer)

            # its arguably a philosophical question... but we are
            # going to only declare that the parameters of the
            # sigmoid_layers are parameters of the DBN. The visible
            # biases in the RBM are parameters of those RBMs, but not
            # of the DBN.
            self.params.extend(sigmoid_layer.params)

            # Construct an RBM that shared weights with this layer
            rbm_layer = RBM(numpy_rng=numpy_rng,
                            theano_rng=theano_rng,
                            input=layer_input,
                            n_visible=input_size,
                            n_hidden=hidden_layers_sizes[i],
                            W=sigmoid_layer.W,
                            hbias=sigmoid_layer.b)
            self.rbm_layers.append(rbm_layer)
```
All that is left is to stack one last logistic regression layer in order to form an MLP. We will use the `LogisticRegression` class introduced in Classifying MNIST digits using Logistic Regression.
```python
        self.logLayer = LogisticRegression(
            input=self.sigmoid_layers[-1].output,
            n_in=hidden_layers_sizes[-1],
            n_out=n_outs)
        self.params.extend(self.logLayer.params)

        # compute the cost for second phase of training, defined as the
        # negative log likelihood of the logistic regression (output) layer
        self.finetune_cost = self.logLayer.negative_log_likelihood(self.y)

        # compute the gradients with respect to the model parameters
        # symbolic variable that points to the number of errors made on the
        # minibatch given by self.x and self.y
        self.errors = self.logLayer.errors(self.y)
```
The class also provides a method which generates training functions for each of the RBMs. They are returned as a list, where element $i$ is a function which implements one step of training for the RBM at layer $i$.
```python
    def pretraining_functions(self, train_set_x, batch_size, k):
        '''Generates a list of functions, for performing one step of
        gradient descent at a given layer. The function will require
        as input the minibatch index, and to train an RBM you just
        need to iterate, calling the corresponding function on all
        minibatch indexes.

        :type train_set_x: theano.tensor.TensorType
        :param train_set_x: Shared var. that contains all datapoints used
                            for training the RBM
        :type batch_size: int
        :param batch_size: size of a [mini]batch
        :param k: number of Gibbs steps to do in CD-k / PCD-k

        '''

        # index to a [mini]batch
        index = T.lscalar('index')  # index to a minibatch
```
In order to be able to change the learning rate during training, we associate a Theano variable to it that has a default value.
```python
        learning_rate = T.scalar('lr')  # learning rate to use

        # number of batches
        n_batches = train_set_x.get_value(borrow=True).shape[0] / batch_size
        # begining of a batch, given `index`
        batch_begin = index * batch_size
        # ending of a batch given `index`
        batch_end = batch_begin + batch_size

        pretrain_fns = []
        for rbm in self.rbm_layers:

            # get the cost and the updates list
            # using CD-k here (persisent=None) for training each RBM.
            # TODO: change cost function to reconstruction error
            cost, updates = rbm.get_cost_updates(learning_rate,
                                                 persistent=None, k=k)

            # compile the theano function
            fn = theano.function(
                inputs=[index, theano.In(learning_rate, value=0.1)],
                outputs=cost,
                updates=updates,
                givens={
                    self.x: train_set_x[batch_begin:batch_end]
                }
            )
            # append `fn` to the list of functions
            pretrain_fns.append(fn)

        return pretrain_fns
```
Now any function `pretrain_fns[i]` takes as arguments `index` and optionally `lr` – the learning rate. Note that the names of the parameters are the names given to the Theano variables (e.g. `lr`) when they are constructed and not the name of the Python variables (e.g. `learning_rate`). Keep this in mind when working with Theano. Optionally, if you provide `k` (the number of Gibbs steps to perform in CD or PCD) this will also become an argument of your function.
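As a concrete (hypothetical) usage sketch, assuming `dbn` is a constructed DBN instance and `train_set_x` is a shared dataset as in the earlier tutorials, calls would look like this:

```python
pretrain_fns = dbn.pretraining_functions(train_set_x=train_set_x,
                                         batch_size=10, k=1)

# one CD-1 step on the first-layer RBM, minibatch 0, learning rate 0.01
cost = pretrain_fns[0](index=0, lr=0.01)

# omitting `lr` falls back on the default value of 0.1 given to theano.In
cost = pretrain_fns[0](index=0)
```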
In the same fashion, the DBN class includes a method for building the functions required for finetuning (a `train_model`, a `validate_model` and a `test_model` function).
```python
    def build_finetune_functions(self, datasets, batch_size, learning_rate):
        '''Generates a function `train` that implements one step of
        finetuning, a function `validate` that computes the error on a
        batch from the validation set, and a function `test` that
        computes the error on a batch from the testing set

        :type datasets: list of pairs of theano.tensor.TensorType
        :param datasets: It is a list that contain all the datasets;
                         the list has to contain three pairs, `train`,
                         `valid`, `test` in this order, where each pair
                         is formed of two Theano variables, one for the
                         datapoints, the other for the labels
        :type batch_size: int
        :param batch_size: size of a minibatch
        :type learning_rate: float
        :param learning_rate: learning rate used during finetune stage

        '''

        (train_set_x, train_set_y) = datasets[0]
        (valid_set_x, valid_set_y) = datasets[1]
        (test_set_x, test_set_y) = datasets[2]

        # compute number of minibatches for training, validation and testing
        n_valid_batches = valid_set_x.get_value(borrow=True).shape[0]
        n_valid_batches /= batch_size
        n_test_batches = test_set_x.get_value(borrow=True).shape[0]
        n_test_batches /= batch_size

        index = T.lscalar('index')  # index to a [mini]batch

        # compute the gradients with respect to the model parameters
        gparams = T.grad(self.finetune_cost, self.params)

        # compute list of fine-tuning updates
        updates = []
        for param, gparam in zip(self.params, gparams):
            updates.append((param, param - gparam * learning_rate))

        train_fn = theano.function(
            inputs=[index],
            outputs=self.finetune_cost,
            updates=updates,
            givens={
                self.x: train_set_x[
                    index * batch_size: (index + 1) * batch_size
                ],
                self.y: train_set_y[
                    index * batch_size: (index + 1) * batch_size
                ]
            }
        )

        test_score_i = theano.function(
            [index],
            self.errors,
            givens={
                self.x: test_set_x[
                    index * batch_size: (index + 1) * batch_size
                ],
                self.y: test_set_y[
                    index * batch_size: (index + 1) * batch_size
                ]
            }
        )

        valid_score_i = theano.function(
            [index],
            self.errors,
            givens={
                self.x: valid_set_x[
                    index * batch_size: (index + 1) * batch_size
                ],
                self.y: valid_set_y[
                    index * batch_size: (index + 1) * batch_size
                ]
            }
        )

        # Create a function that scans the entire validation set
        def valid_score():
            return [valid_score_i(i) for i in range(n_valid_batches)]

        # Create a function that scans the entire test set
        def test_score():
            return [test_score_i(i) for i in range(n_test_batches)]

        return train_fn, valid_score, test_score
```
Note that the returned `valid_score` and `test_score` are not Theano functions, but rather Python functions. These loop over the entire validation set and the entire test set to produce a list of the losses obtained over these sets.
Putting it all together
The few lines of code below construct the deep belief network:

```python
numpy_rng = numpy.random.RandomState(123)
print '... building the model'
# construct the Deep Belief Network
dbn = DBN(numpy_rng=numpy_rng, n_ins=28 * 28,
          hidden_layers_sizes=[1000, 1000, 1000],
          n_outs=10)
```
There are two stages in training this network: (1) a layer-wise pre-training and (2) a fine-tuning stage.
For the pre-training stage, we loop over all the layers of the network. For each layer, we use the compiled Theano function which determines the input to the $i$-th level RBM and performs one step of CD-k within this RBM. This function is applied to the training set for a fixed number of epochs given by `pretraining_epochs`.
```python
#########################
# PRETRAINING THE MODEL #
#########################
print '... getting the pretraining functions'
pretraining_fns = dbn.pretraining_functions(train_set_x=train_set_x,
                                            batch_size=batch_size,
                                            k=k)

print '... pre-training the model'
start_time = timeit.default_timer()
## Pre-train layer-wise
for i in range(dbn.n_layers):
    # go through pretraining epochs
    for epoch in range(pretraining_epochs):
        # go through the training set
        c = []
        for batch_index in range(n_train_batches):
            c.append(pretraining_fns[i](index=batch_index,
                                        lr=pretrain_lr))
        print 'Pre-training layer %i, epoch %d, cost ' % (i, epoch),
        print numpy.mean(c)

end_time = timeit.default_timer()
```
The fine-tuning loop is very similar to the one in the Multilayer Perceptron tutorial, the only difference being that we now use the functions given by `build_finetune_functions`.
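A minimal, hedged sketch of that loop is given below; it omits the early-stopping machinery of the Multilayer Perceptron tutorial, and assumes `datasets`, `batch_size`, `finetune_lr`, `training_epochs`, `n_train_batches` and the `numpy` import are defined as in the earlier tutorials:

```python
train_fn, validate_model, test_model = dbn.build_finetune_functions(
    datasets=datasets,
    batch_size=batch_size,
    learning_rate=finetune_lr
)

for epoch in range(training_epochs):
    for minibatch_index in range(n_train_batches):
        train_fn(minibatch_index)                    # one supervised update
    validation_loss = numpy.mean(validate_model())   # mean error over the validation set
    print 'epoch %i, validation error %f %%' % (epoch, validation_loss * 100.)
```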
Running the Code
The user can run the code by calling:

```
python code/DBN.py
```
With the default parameters, the code runs for 100 pre-training epochs with mini-batches of size 10. This corresponds to performing 500,000 unsupervised parameter updates. We use an unsupervised learning rate of 0.01, with a supervised learning rate of 0.1. The DBN itself consists of three hidden layers with 1000 units per layer. With early-stopping, this configuration achieved a minimal validation error of 1.27 percent with corresponding test error of 1.34 percent after 46 supervised epochs.
On an Intel(R) Xeon(R) CPU X5560 running at 2.80GHz, using a multi-threaded MKL library (running on 4 cores), pretraining took 615 minutes with an average of 2.05 mins/(layer * epoch). Fine-tuning took only 101
minutes or approximately 2.20 mins/epoch.
Hyper-parameters were selected by optimizing on the validation error. We tested unsupervised learning rates in $\{10^{-1}, \ldots, 10^{-5}\}$ and supervised learning rates in $\{10^{-1}, \ldots, 10^{-4}\}$. We did not use any form of regularization besides early-stopping, nor did we optimize over the number of pretraining updates.
Tips and Tricks
One way to improve the running time of your code (given that you have sufficient memory available) is to compute the representation of the entire dataset at layer $i$ in a single pass, once the weights of the $i-1$-th layers have been fixed. Namely, start by training your first layer RBM. Once it is trained, you can compute the hidden unit values for every example in the dataset and store this as a new dataset which is used to train the 2nd layer RBM. Once you trained the RBM for layer 2, you compute, in a similar fashion, the dataset for layer 3 and so on. This avoids calculating the intermediate (hidden layer) representations `pretraining_epochs` times, at the expense of increased memory usage.
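The loop below sketches this idea in plain Python. It is an illustration rather than the tutorial's code, and assumes hypothetical helpers `train_rbm(data, n_hidden)` (runs all pre-training epochs and returns a trained RBM) and `rbm.mean_hidden(data)` (returns the mean hidden activations for a whole dataset):

```python
def pretrain_with_cached_representations(dataset, layer_sizes, train_rbm):
    rbms = []
    data = dataset                       # layer-0 representation = raw input
    for n_hidden in layer_sizes:
        rbm = train_rbm(data, n_hidden)  # all pretraining epochs run on `data`
        data = rbm.mean_hidden(data)     # computed once, then reused as the next dataset
        rbms.append(rbm)
    return rbms
```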
from: http://deeplearning.net/tutorial/DBN.html