
Feedforward Deep Networks(要点)

2015-10-16 20:25
Contents:

Feedforward Deep Networks
MLPs from the 1980s
Shallow Multi-Layer Neural Network for Regression

Estimating Conditional Statistics

Parametrizing a Learned Predictor
Family of Functions
Non-linearities

Loss Function and Conditional Log-Likelihood
KL divergence

Learning a Conditional Probability Model

Softmax
The Squared Error Applied to Softmax

Feedforward Deep Networks

Source:

@unpublished{Bengio-et-al-2015-Book,
  title  = {Deep Learning},
  author = {Yoshua Bengio and Ian J. Goodfellow and Aaron Courville},
  note   = {Book in preparation for MIT Press},
  url    = {http://www.iro.umontreal.ca/~bengioy/dlbook},
  year   = {2015}
}


Feedforward deep networks, also known as multilayer perceptrons (MLPs), are the quintessential deep networks.

In neural network terminology, we refer to each sub-function as a layer of the network, and each scalar output of one of these functions as a unit or sometimes as a feature.

We can think of the number of units in each layer as being the width of a machine learning model, and the number of layers as its depth.

MLPs from the 1980s

The layers of the network that correspond to features rather than outputs are called hidden layers. This is because the correct values of the features are unknown.

Shallow Multi-Layer Neural Network for Regression

The family of input-output functions:

$f_\theta(x) = b + V\,\mathrm{sigmoid}(c + Wx)$

$\mathrm{sigmoid}(a) = 1/(1 + e^{-a})$

The hidden layer outputs:

$h = \mathrm{sigmoid}(c + Wx)$

The parameters:

$\theta = (b, c, V, W)$

The loss function (here, the squared error):

$L(\hat{y}, y) = \|\hat{y} - y\|^2$

The regularizer (here, L2 weight decay):

$\|\omega\|^2 = \sum_{ij} W_{ij}^2 + \sum_{ki} V_{ki}^2$

Cost function obtained by adding together the squared loss and the regularization term:

$J(\theta) = \lambda\|\omega\|^2 + \frac{1}{n}\sum_{t=1}^{n} \left\|y^{(t)} - \left(b + V\,\mathrm{sigmoid}(c + Wx^{(t)})\right)\right\|^2$

where $(x^{(t)}, y^{(t)})$ is the $t$-th training example, an (input, target) pair.

Stochastic gradient descent:

$\omega \leftarrow \omega - \epsilon\left(2\lambda\omega + \nabla_\omega L(f_\theta(x^{(t)}), y^{(t)})\right)$

$\beta \leftarrow \beta - \epsilon\,\nabla_\beta L(f_\theta(x^{(t)}), y^{(t)})$

where $\beta = (b, c)$, $\omega = (W, V)$, and $\epsilon$ is the learning rate.
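As a concrete illustration, here is a minimal NumPy sketch (my own, not from the book) of this shallow regression network and a single stochastic gradient step, with the gradients written out by hand and weight decay applied only to $W$ and $V$, as in the update rule above:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, b, c, V, W):
    """Shallow MLP: f(x) = b + V sigmoid(c + Wx)."""
    h = sigmoid(c + W @ x)                      # hidden layer outputs
    return b + V @ h, h

def sgd_step(x, y, b, c, V, W, lr=0.01, lam=1e-4):
    """One SGD step on the squared error with L2 weight decay on W and V."""
    y_hat, h = forward(x, b, c, V, W)
    d_yhat = 2.0 * (y_hat - y)                  # dL/d y_hat
    dV = np.outer(d_yhat, h) + 2.0 * lam * V
    db = d_yhat
    d_h = V.T @ d_yhat
    d_a = d_h * h * (1.0 - h)                   # sigmoid derivative
    dW = np.outer(d_a, x) + 2.0 * lam * W
    dc = d_a
    return b - lr * db, c - lr * dc, V - lr * dV, W - lr * dW
```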

MLPs can learn powerful non-linear transformations: in fact, with enough hidden units they can represent arbitrarily complex but smooth functions.

By transforming the data non-linearly into a new space, a classification problem that was not linearly separable (not solvable by a linear classifier) can become separable.
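For instance, the XOR problem is not linearly separable in the input space, but a single hidden layer makes it so. The sketch below uses the well-known hand-picked ReLU construction for XOR (variable names are my own) rather than the sigmoid network above:

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
y = np.array([0, 1, 1, 0])                     # XOR: not linearly separable in X

# Hand-picked hidden layer (the classic ReLU solution for XOR).
W = np.array([[1.0, 1.0], [1.0, 1.0]])
c = np.array([0.0, -1.0])
H = np.maximum(0.0, X @ W.T + c)               # hidden representation

v = np.array([1.0, -2.0])                      # a linear readout now suffices
print(H @ v)                                   # [0. 1. 1. 0.] -- matches XOR
```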

Estimating Conditional Statistics

We can generalize linear regression to regression via any function f by defining the mean squared error of:

$\mathbb{E}\left[\|y - f(x)\|^2\right]$

Minimizing it yields an estimator of the conditional expectation of the output variable y given the input variable x:

$\operatorname*{arg\,min}_{f \in \mathcal{H}} \; \mathbb{E}_{p(x,y)}\!\left[\|y - f(x)\|^2\right] = \mathbb{E}_{p(x,y)}[y \mid x]$
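A quick numerical sanity check of this fact (a synthetic example of my own, not from the book): for a fixed value of $x$, the constant that minimizes the squared error is the conditional mean.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=100_000)          # a binary input
y = 3.0 * x + rng.normal(size=x.size)         # y | x has mean 3x

# Among all constants c, the mean squared error E[(y - c)^2 | x = 1]
# is minimized by the conditional mean E[y | x = 1] ~ 3.
candidates = np.linspace(0.0, 6.0, 601)
mse = [np.mean((y[x == 1] - c) ** 2) for c in candidates]
print(candidates[int(np.argmin(mse))])        # ~ 3.0
print(y[x == 1].mean())                       # ~ 3.0
```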

Parametrizing a Learned Predictor

Family of Functions

The idea is to compose simple transformations in order to obtain highly non-linear ones.

A multi-layer neural network with more than one hidden layer can be defined by generalizing the single-hidden-layer network above (here choosing hyperbolic tangent activation functions):

$h^k = \tanh(b^k + W^k h^{k-1})$
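A minimal NumPy sketch of this recursion (the list-of-matrices representation and names are my own choices):

```python
import numpy as np

def mlp_forward(x, weights, biases):
    """Forward pass h^k = tanh(b^k + W^k h^(k-1)), with h^0 = x.
    `weights` and `biases` are lists with one entry per layer."""
    h = x
    for W, b in zip(weights, biases):
        h = np.tanh(b + W @ h)
    return h

# Example: a network with two hidden layers of width 2 on a 3-dimensional input.
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(2, 3)), rng.normal(size=(2, 2))]
bs = [np.zeros(2), np.zeros(2)]
print(mlp_forward(np.ones(3), Ws, bs))
```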

Non-linearities

There are several non-linearities; most of them are typically combined with an affine transformation and applied element-wise:

$a = b + Wx$

$h = \sigma(a) \;\Leftrightarrow\; h_i = \sigma(a_i) = \sigma(b_i + W_{i,:}\,x)$

Rectifier or rectified linear unit (ReLU) or positive part:

$\sigma(a) = \max(0, a)$, also written $\sigma(a) = (a)_+$

effective variants:

$h_i = \sigma(a, \alpha)_i = \max(0, a_i) + \alpha_i \min(0, a_i)$

where $\alpha_i$ can be a small fixed value like 0.01.

Hyperbolic tangent:

$\sigma(a) = \tanh(a)$

Sigmoid:

$\sigma(a) = 1/(1 + e^{-a})$

Softmax:

$\sigma_i(a) = \mathrm{softmax}(a)_i = e^{a_i} \big/ \textstyle\sum_j e^{a_j}$

where $\sum_i \sigma_i(a) = 1$ and $\sigma_i(a) > 0$.

The softmax output can be considered as a probability distribution over a finite set of outcomes.

Softplus:

$\sigma(a) = \zeta(a) = \log(1 + e^{a})$

A smooth version of the rectifier.

Hard tanh:

$\sigma(a) = \max(-1, \min(1, a))$

Absolute value rectification:

$\sigma(a) = |a|$

It makes sense to seek features that are invariant under a polarity reversal.
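For reference, here is a compact NumPy sketch of the non-linearities listed above (the function names are my own; the softmax uses the max-shift trick discussed at the end of this post):

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def leaky_relu(a, alpha=0.01):
    return np.maximum(0.0, a) + alpha * np.minimum(0.0, a)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - np.max(a))        # shift for numerical stability
    return e / e.sum()

def softplus(a):
    return np.log1p(np.exp(a))

def hard_tanh(a):
    return np.clip(a, -1.0, 1.0)

def abs_rectify(a):
    return np.abs(a)

# np.tanh is the hyperbolic tangent itself.
```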

Loss Function and Conditional Log-Likelihood

For classification problems, loss functions such as the Bernoulli negative log-likelihood have been found to be more appropriate than the squared error:

$L(f_\theta(x), y) = -y \log f_\theta(x) - (1 - y)\log\big(1 - f_\theta(x)\big)$

where $y \in \{0, 1\}$.

This loss is also known as the cross-entropy objective function.

The optimal f minimizing this loss function is:

$f(x) = P(y = 1 \mid x)$

When maximizing the conditional log-likelihood objective function, we are training the neural net output to estimate conditional probabilities as well as possible in the sense of the KL divergence.

In order for the above expression of the criterion to make sense, $f_\theta(x)$ must be strictly between 0 and 1.

To achieve this, it is common to use the sigmoid as the output non-linearity.

Any loss consisting of a negative log-likelihood is a cross entropy between the empirical distribution defined by the training set and the model.

For example, mean squared error is the cross entropy between the empirical distribution and a Gaussian model.
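A minimal NumPy sketch of the Bernoulli negative log-likelihood with a sigmoid output (the clipping constant `eps` is my own addition, used here to keep the logs finite when the sigmoid saturates):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def bernoulli_nll(a, y, eps=1e-12):
    """Cross-entropy L = -y log f - (1-y) log(1-f), with f = sigmoid(a)."""
    f = np.clip(sigmoid(a), eps, 1.0 - eps)
    return -y * np.log(f) - (1.0 - y) * np.log(1.0 - f)

print(bernoulli_nll(np.array([2.0, -1.0]), np.array([1.0, 0.0])))
```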

KL divergence

If we have two separate probability distributions P(x) and Q(x) over the same random variable x, we can measure how different these two distributions are using the Kullback-Leibler (KL) divergence:

$D_{\mathrm{KL}}(P \,\|\, Q) = \mathbb{E}_{x \sim P}\!\left[\log \frac{P(x)}{Q(x)}\right]$

$D_{\mathrm{KL}}(P \,\|\, Q) = \mathbb{E}_{x \sim P}\big[\log P(x) - \log Q(x)\big]$

In the case of discrete variables, it is the extra amount of information needed to send a message containing symbols drawn from probability distribution P, when we use a code that was designed to minimize the length of messages drawn from probability distribution Q.

It is not a true distance measure because it is not symmetric, i.e.:

$D_{\mathrm{KL}}(P \,\|\, Q) \neq D_{\mathrm{KL}}(Q \,\|\, P)$ for some $P$ and $Q$.
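A small NumPy sketch for discrete distributions given as probability vectors (the helper is my own; terms with $P(x) = 0$ are taken to contribute 0):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) * (log P(x) - log Q(x)) for discrete
    distributions represented as probability vectors."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                      # terms with P(x) = 0 contribute 0
    return np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask])))

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q), kl_divergence(q, p))   # different: KL is asymmetric
```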

Learning a Conditional Probability Model

The negative log-likelihood (NLL) cost function:

$L_{\mathrm{NLL}}(f_\theta(x), y) = -\log P(\mathrm{y} = y \mid \mathrm{x} = x; \theta)$

This criterion corresponds to minimizing the KL divergence between the model P of the conditional probability of y given x and the data generating distribution Q, approximated by the finite training set.

For discrete variables, the binomial negative log-likelihood cost function corresponds to the conditional log-likelihood associated with the Bernoulli distribution:

$L_{\mathrm{NLL}} = -\log P(y \mid x; \theta) = -\mathbf{1}_{y=1}\log p - \mathbf{1}_{y=0}\log(1 - p)$

$L_{\mathrm{NLL}} = -y \log f_\theta(x) - (1 - y)\log\big(1 - f_\theta(x)\big)$

where $\mathbf{1}_{y=1}$ is the usual binary indicator.

Softmax

When y is discrete and has a finite domain but is not binary, the Bernoulli distribution is extended to the multinoulli distribution.

The softmax non-linearity:

$p = \mathrm{softmax}(a) \;\Longleftrightarrow\; p_i = \frac{e^{a_i}}{\sum_j e^{a_j}}$

The gradient with respect to $a$:

$\frac{\partial}{\partial a_k} L_{\mathrm{NLL}}(p, y) = p_k - \mathbf{1}_{y=k}$

$\frac{\partial}{\partial a} L_{\mathrm{NLL}}(p, y) = (p - e_y)$

where $e_y = [0, \ldots, 0, 1, 0, \ldots, 0]$ is the one-hot vector with a 1 at position $y$.
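A quick NumPy check of the $p - e_y$ formula against a central-difference numerical gradient (a sketch of my own, not from the book):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def nll_grad(a, y):
    """Analytic gradient of -log softmax(a)[y] with respect to a: p - e_y."""
    p = softmax(a)
    g = p.copy()
    g[y] -= 1.0
    return g

a, y, h = np.array([1.0, -0.5, 2.0]), 2, 1e-6
num = np.zeros_like(a)
for i in range(a.size):
    d = np.zeros_like(a); d[i] = h
    num[i] = (-np.log(softmax(a + d)[y]) + np.log(softmax(a - d)[y])) / (2 * h)
print(np.allclose(num, nll_grad(a, y)))   # True
```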

The Squared Error Applied to Softmax

The squared error applied to softmax outputs has a vanishing gradient when an output unit saturates (when the derivative of the non-linearity is near 0), even if the output is completely wrong.

The Squared Error Loss:

$L_2(p(a), y) = \|p(a) - y\|^2 = \sum_k \big(p_k(a) - y_k\big)^2$

where $y = e_i = [0, \ldots, 0, 1, 0, \ldots, 0]$ and $p = \mathrm{softmax}(a)$.

The gradient of the loss is given by:

$\frac{\partial}{\partial a_i} L_2(p(a), y) = \frac{\partial L_2(p(a), y)}{\partial p(a)} \, \frac{\partial p(a)}{\partial a_i}$

$\frac{\partial}{\partial a_i} L_2(p(a), y) = \sum_j 2\big(p_j(a) - y_j\big)\, p_j \big(\mathbf{1}_{i=j} - p_i\big)$

If the model incorrectly predicts a low probability for the correct class $y = i$, i.e., if $p_y = p_i \approx 0$, then the score for the correct class, $a_y$, does not get pushed up in spite of a large error, i.e., $\frac{\partial}{\partial a_i} L_2(p(a), y) \approx 0$.
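The following sketch (my own) makes this concrete: for a confidently wrong softmax, the squared-error gradient is essentially zero, while the NLL gradient $p - e_y$ stays large.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def l2_grad(a, y):
    """Gradient of ||softmax(a) - e_y||^2 with respect to a."""
    p = softmax(a)
    t = np.zeros_like(p); t[y] = 1.0
    d = 2.0 * (p - t)                      # dL/dp
    return p * (d - np.dot(d, p))          # softmax Jacobian applied to dL/dp

a = np.array([10.0, 0.0, -10.0])           # confident and wrong: true class is 2
y = 2
print(l2_grad(a, y))                       # all entries ~ 0: learning stalls
print(softmax(a) - np.eye(3)[y])           # NLL gradient p - e_y stays large
```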

The softmax output is invariant to adding a scalar $b$ to all of its inputs:

$\mathrm{softmax}(a) = \mathrm{softmax}(a + b)$

The numerically stable variant of the softmax:

$\mathrm{softmax}(a) = \mathrm{softmax}\big(a - \max_i a_i\big)$

This allows us to evaluate the softmax with only small numerical errors even when $a$ contains extremely large or extremely negative numbers. (This trick is used in Caffe.)
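A minimal NumPy illustration of the trick (my own sketch; the naive version overflows):

```python
import numpy as np

def softmax_naive(a):
    e = np.exp(a)
    return e / e.sum()

def softmax_stable(a):
    e = np.exp(a - np.max(a))    # subtracting max(a) leaves the output unchanged
    return e / e.sum()

a = np.array([1000.0, 1001.0, 1002.0])
print(softmax_naive(a))          # overflow: nan (with a RuntimeWarning)
print(softmax_stable(a))         # ~ [0.090, 0.245, 0.665]
```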

To be continued...