
Machine Learning Study Notes: PRML Chapter 1.2: Probability Theory


Chapter 1.2 : Probability Theory

PRML, Oxford University Deep Learning Course, Machine Learning, Pattern Recognition

Christopher M. Bishop, PRML, Chapter 1 Introduction

Chapter 1.2 Probability Theory

1. Uncertainty
2. Example discussed throughout this chapter
3. Basic Terminology
   3.1 Probability densities
   3.2 Expectations and covariances
   3.3 Joint, Marginal, Conditional Probability
   3.4 The Rules of Probability
4. An Important Interpretation of Bayes' Theorem
   Interpretation of Bayes' Theorem (See Page 17 in PRML)
5. Bayesian Probability
   5.1 Two Interpretations of Probabilities
   5.2 Excerpt from the PRML notes [Ref 1]
   5.3 Bayes' theorem and Bayesian Probability
       Using examples to understand Bayesian Probability and Bayes' theorem
       Bayes' theorem
       How to interpret the likelihood function in the Bayesian and frequentist paradigms
   5.4 Pros and Cons
   5.5 Dealing with over-fitting (from Ref 1)
   5.6 Difficulties in Carrying through the Full Bayesian Procedure: Marginalization
6. Maximum-likelihood Estimation (MLE) for a univariate Gaussian Case
   6.1 Gaussian distribution
   6.2 Sampling from a Gaussian distribution (see Ref 2)
   6.3 Take the univariate Gaussian for example
       Why take the log?
   6.4 One Limitation of the Maximum Likelihood Approach
       Exercise 1.12: Proof of (1.57) and (1.58), with solution
7. Curve fitting re-visited
   7.1 Purpose: MLE (point estimate) → probabilistic model → MAP → Bayesian
   7.2 Goal in the curve fitting problem
   7.3 Uncertainty over the value of the target variable t
   7.4 The Likelihood for Linear Regression and its MLE Solution in the Point-Estimate Category (see Ref-2)
   7.5 Making predictions
   7.6 Taking a step towards a more Bayesian approach
8. Bayesian Curve fitting
9. Curve fitting as an example demonstrating the three approaches (see Ref-1)
10. References

1. Uncertainty

A key concept in the field of pattern recognition is that of uncertainty. It arises both through noise on measurements, as well as through the finite size of data sets. Probability theory provides a consistent framework for the quantification and manipulation of uncertainty and forms one of the central foundations for pattern recognition. When combined with decision theory, discussed in Section 1.5 (see PRML), it allows us to make optimal predictions given all the information available to us, even though that information may be incomplete or ambiguous.

2. Example discussed throughout this chapter

We will introduce the basic concepts of probability theory by considering a simple example. Imagine we have two boxes, one red and one blue, and in the red box we have 2 apples and 6 oranges, and in the blue box we have 3 apples and 1 orange. This is illustrated in Figure 1.9.



Now suppose we randomly pick one of the boxes and from that box we randomly select an item of fruit, and having observed which sort of fruit it is we put it back in the box from which it came. We could imagine repeating this process many times. Let us suppose that in so doing we pick the red box 40% of the time and we pick the blue box 60% of the time, and that when we pick an item of fruit from a box we are equally likely to select any of the pieces of fruit in the box.

In this example, the identity of the box that will be chosen is a random variable, which we shall denote by B. This random variable can take one of two possible values, namely r (corresponding to the red box) or b (corresponding to the blue box). Similarly, the identity of the fruit is also a random variable and will be denoted by F . It can take either of the values a (for apple) or o (for orange). To begin with, we shall define the probability of an event to be the fraction of times that event occurs out of the total number of trials, in the limit that the total number of trials goes to infinity. Thus the probability of selecting the red box is 4/10.

3. Basic Terminology

3.1 Probability densities

PDF, Probability Density Function: If the probability of a real-valued variable x falling in the interval (x,x+δx) is given by p(x)δx for δx→0, then p(x) is called the probability density over x.



and the pdf p(x) must satisfy the two conditions p(x) ≥ 0 (1.25) and ∫_{−∞}^{+∞} p(x) dx = 1 (1.26).

PMF, Probability Mass Function: Note that if x is a discrete variable, then p(x) is called a probability mass function because it can be regarded as a set of “probability masses” concentrated at the allowed values of x.

CDF, Cumulative Distribution Function: The probability that x lies in the interval (−∞, z) is given by the cumulative distribution function defined by P(z) = ∫_{−∞}^{z} p(x) dx (1.28)

which satisfies P′(x)=p(x).

3.2 Expectations and covariances

Expectation of f(x): the average value of some function f(x) under a probability distribution p(x) is called the expectation of f(x) and will be denoted by E[f], given by E[f] = ∑_x p(x) f(x) for discrete variables and E[f] = ∫ p(x) f(x) dx for continuous variables.

Approximating the expectation using sampling methods: if we are given a finite number N of points drawn from the pdf, then the expectation can be approximated as a finite sum over these points:

E[f] ≈ (1/N) ∑_{n=1}^{N} f(x_n)

Expectations of functions of several variables: here we can use a subscript to indicate which variable is being averaged over, so that for instance Ex[f(x,y)] denotes the average of the function f(x,y) with respect to the distribution of x. Note that Ex[f(x,y)] will be a function of y.
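As a quick illustration of the finite-sum approximation above, here is a minimal NumPy sketch (my own, not from PRML) that estimates E[x²] under a standard Gaussian, whose exact value is 1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw N points from p(x) = N(0, 1) and average f(x) = x^2 over them:
# E[f] ≈ (1/N) * sum_n f(x_n), which becomes exact as N → ∞.
N = 100_000
x = rng.standard_normal(N)
f = x ** 2

print("Monte Carlo estimate of E[x^2]:", f.mean())  # ≈ 1.0 (exact value: 1)
```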

Variance of f(x): is defined by var[f] = E[(f(x) − E[f(x)])²], and provides a measure of how much variability there is in f(x) around its mean value E[f(x)]. Expanding out the square, we get var[f] = E[f(x)²] − E[f(x)]².

Variance of the variable x itself: var[x] = E[x²] − E[x]².

Covariance of two r.v.'s x and y: is defined by

cov[x, y] = E_{x,y}[(x − E[x])(y − E[y])] = E_{x,y}[xy] − E[x]E[y]

Covariance of two vectors of r.v.'s x and y: is defined by

cov[x, y] = E_{x,y}[(x − E[x])(y^T − E[y^T])] = E_{x,y}[x y^T] − E[x]E[y^T]

Covariance of the components of a vector x with each other: then we use a slightly simpler notation cov[x] ≡ cov[x, x].

3.3 Joint, Marginal, Conditional Probability

In order to derive the rules of probability, consider the following example shown in Figure 1.10 involving two random variables X and Y. We shall suppose that X can take any of the values xi where i=1,...,M , and Y can take the values yj, where j=1,...,L. Consider a total of N trials in which we sample both of the variables X and Y , and let the number of such trials in which X=xi and Y=yj be nij . Also, let the number of trials in which X takes the value xi (irrespective of the value that Y takes) be denoted by ci , and similarly let the number of trials in which Y takes the value yj be denoted by rj.



joint probability: p(X=x_i, Y=y_j) is called the joint probability of X=x_i and Y=y_j, and is given by p(X=x_i, Y=y_j) = n_ij / N (1.5)

Here we are implicitly considering the limit N→∞.

marginal probability: p(X=x_i) is called the marginal probability, because it is obtained by marginalizing, or summing out, the other variables (in this case Y). Since c_i = ∑_j n_ij, it is given by p(X=x_i) = c_i / N (1.6), or equivalently p(X=x_i) = ∑_{j=1}^{L} p(X=x_i, Y=y_j) (1.7)

conditional probability: p(Y=y_j ∣ X=x_i) is called the conditional probability of Y=y_j given X=x_i, obtained by p(Y=y_j ∣ X=x_i) = n_ij / c_i (1.8)

From (1.5), (1.6), and (1.8), we can then derive the following relationship

p(X=x_i, Y=y_j) = n_ij / N = (n_ij / c_i) ⋅ (c_i / N) = p(Y=y_j ∣ X=x_i) ⋅ p(X=x_i), which is called the product rule of probability.
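The relations (1.5)-(1.8) and the product rule are easy to check numerically. The following sketch (my own illustration, with a made-up table of counts n_ij) builds the joint, marginal, and conditional probabilities from counts, as in Figure 1.10:

```python
import numpy as np

# Counts n_ij for M = 2 values of X (rows) and L = 3 values of Y (columns);
# any table of non-negative counts works here.
n = np.array([[10.0, 20.0, 30.0],
              [ 5.0, 15.0, 20.0]])
N = n.sum()                     # total number of trials

joint = n / N                   # p(X=x_i, Y=y_j) = n_ij / N        (1.5)
c = n.sum(axis=1)               # c_i: trials with X = x_i
p_x = c / N                     # p(X=x_i) = c_i / N                (1.6)
cond = n / c[:, None]           # p(Y=y_j | X=x_i) = n_ij / c_i     (1.8)

# Sum rule (1.7) and product rule: p(X,Y) = p(Y|X) p(X).
assert np.allclose(p_x, joint.sum(axis=1))
assert np.allclose(joint, cond * p_x[:, None])
print("Sum and product rules verified on this table.")
```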

3.4 The Rules of Probability

Discrete Variables:

sum rule: p(X) = ∑_Y p(X, Y)    (1.10)
product rule: p(X, Y) = p(Y ∣ X) p(X)    (1.11)

Continuous Variables: if x and y are two real continuous variables, then the sum and product rules take the form

sum rule: p(x) = ∫ p(x, y) dy
product rule: p(x, y) = p(y ∣ x) p(x)

Bayes’ theorem: From the product rule, together with the symmetry property p(X,Y) = p(Y,X), we immediately obtain the following relationship between conditional probabilities p(Y∣X) = p(X∣Y) p(Y) / p(X) (1.12)

Using the sum rule, the denominator in Bayes’ theorem can be expressed in terms of the quantities appearing in the numerator: p(X) = ∑_Y p(X∣Y) p(Y) (1.13)

We can view the denominator in Bayes’ theorem as being the normalization constant required to ensure that the sum of the conditional probability on the left-hand side of (1.12) over all values of Y equals 1.

4. An Important Interpretation of Bayes’ Theorem

Let us now return to our example involving boxes of fruit. For the moment, we shall once again be explicit about distinguishing between the random variables and their instantiations. We have seen that the probabilities of selecting either the red or the blue boxes are given by p(B=r)=4/10, and p(B=b)=6/10, respectively. Note that these satisfy p(B=r)+p(B=b)=1.

Now suppose that we pick a box at random, and it turns out to be the blue box. Then the probability of selecting an apple is just the fraction of apples in the blue box, which is 3/4, so p(F=a∣B=b) = 3/4. In fact, we can write out all four conditional probabilities for the type of fruit, given the selected box:

p(F=a∣B=r) = 1/4    (1.16)
p(F=o∣B=r) = 3/4    (1.17)
p(F=a∣B=b) = 3/4    (1.18)
p(F=o∣B=b) = 1/4    (1.19)

We can now use the sum and product rules of probability to evaluate the overall probability of choosing an apple:

p(F=a) = p(F=a∣B=r) p(B=r) + p(F=a∣B=b) p(B=b) = 1/4 × 4/10 + 3/4 × 6/10 = 11/20    (1.20)

from which it follows, using the sum rule, that p(F=o)=1−11/20=9/20.

Interpretation of Bayes’ Theorem (See Page 17 in PRML)

Suppose instead we are told that a piece of fruit has been selected and it is an orange, and we would like to know which box it came from. This requires that we evaluate the probability distribution over boxes conditioned on the identity of the fruit, whereas the probabilities in (1.16)–(1.19) give the probability distribution over the fruit conditioned on the identity of the box. We can solve the problem of reversing the conditional probability by using Bayes’ theorem to give

p(B=r∣F=o) = p(F=o∣B=r) p(B=r) / p(F=o) = 3/4 × 4/10 × 20/9 = 2/3

From the sum rule, it then follows that p(B=b∣F=o)=1−2/3=1/3.
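The entire fruit calculation fits in a few lines of Python. This sketch (my own) reproduces p(F=a) = 11/20 and the posterior p(B=r∣F=o) = 2/3:

```python
# Prior over boxes, and fruit composition of each box.
p_B = {"r": 4 / 10, "b": 6 / 10}
p_F_given_B = {"r": {"a": 1 / 4, "o": 3 / 4},   # red box: 2 apples, 6 oranges
               "b": {"a": 3 / 4, "o": 1 / 4}}   # blue box: 3 apples, 1 orange

# Sum and product rules: p(F=f) = sum_B p(F=f|B) p(B).
p_F = {f: sum(p_F_given_B[b][f] * p_B[b] for b in p_B) for f in ("a", "o")}
print("p(F=a) =", p_F["a"])                      # 0.55 = 11/20

# Bayes' theorem: p(B=r|F=o) = p(F=o|B=r) p(B=r) / p(F=o).
posterior_red = p_F_given_B["r"]["o"] * p_B["r"] / p_F["o"]
print("p(B=r|F=o) =", posterior_red)             # 0.666... = 2/3
```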

We can provide an important interpretation of Bayes’ theorem as follows.

Prior probability: If we had been asked which box had been chosen before being told the identity of the selected item of fruit, then the most complete information we have available is provided by the probability p(B). We call this the prior probability because it is the probability available before we observe the identity of the fruit.

Posterior probability : Once we are told that the fruit is an orange, we can then use Bayes’ theorem to compute the probability p(B∣F), which we shall call the posterior probability because it is the probability obtained after we have observed F .

Evidence: Note that in this example, the prior probability of selecting the red box was 4/10, so that we were more likely to select the blue box than the red one. However, once we have observed that the piece of selected fruit is an orange, we find that the posterior probability of the red box is now 2/3, so that it is now more likely that the box we selected was in fact the red one. This result accords with our intuition, as the proportion of oranges is much higher in the red box than it is in the blue box, and so the observation that the fruit was an orange provides significant evidence favoring the red box. In fact, the evidence is sufficiently strong that it outweighs the prior and makes it more likely that the red box was chosen rather than the blue one.

Independent: Finally, we note that if the joint distribution of two variables factorizes into the product of the marginals, so that p(X,Y) = p(X)p(Y), then X and Y are said to be independent. From the product rule, we see that p(Y∣X) = p(Y), and so the conditional distribution of Y given X is indeed independent of the value of X. For instance, in our boxes of fruit example, if each box contained the same fraction of apples and oranges, then p(F∣B) = p(F), so that the probability of selecting, say, an apple is independent of which box is chosen.

5. Bayesian Probability

5.1 Two Interpretations of Probabilities:

Classical or Frequentist Interpretation: we have viewed probabilities in terms of the frequencies of random, repeatable events, and have defined the probability of an event to be the fraction of times that event occurs out of the total number of trials, in the limit that the total number of trials goes to infinity.

Bayesian Interpretation of Probability: introduces the notions of “uncertainty” and “degrees of belief”. Consider an uncertain event, for example whether the Arctic ice cap will have disappeared by the end of the century. These are not events that can be repeated numerous times in order to define a notion of probability as we did earlier in the context of boxes of fruit. Nevertheless, we will generally have some idea, for example, of how quickly we think the polar ice is melting. If we now obtain fresh evidence, for instance from a new Earth observation satellite gathering novel forms of diagnostic information, we may revise our opinion on the rate of ice loss. Our assessment of such matters will affect the actions we take, for instance the extent to which we endeavour to reduce the emission of greenhouse gasses. In such circumstances, we would like to be able to quantify our expression of uncertainty and make precise revisions of uncertainty in the light of new evidence, as well as subsequently to be able to take optimal actions or decisions as a consequence. This can all be achieved through the elegant, and very general, Bayesian interpretation of probability.

How to interpret probability?

1. Frequentist paradigm: the classical or frequentist interpretation, which views probabilities in terms of the frequencies of random, repeatable events.

2. Bayesian paradigm: the Bayesian interpretation, in which probabilities provide a quantification of uncertainty, or degrees of belief. It involves a prior (held before observing new data), fresh evidence, and a posterior (obtained by incorporating the evidence to revise the prior).

Conclusion: the Bayesian paradigm is a more general view than the frequentist interpretation.

5.2 Excerpt from the PRML notes [1]

Rather than saying that Bayesianism is an interpretation of the concept of “probability”, it would be more accurate to say that probability happens to provide a means of quantifying the Bayesian notion of “degree of belief”. The Bayesian starting point is “uncertainty”, which is represented by a “degree of belief”. The use of probability to represent uncertainty, however, is not an ad-hoc choice, but is inevitable if we are to respect common sense while making rational coherent inferences. Cox showed that if numerical values are used to represent degrees of belief, then a simple set of axioms encoding common sense properties of such beliefs leads uniquely to a set of rules for manipulating degrees of belief that are equivalent to the sum and product rules of probability. It is for this reason that we can use the machinery of probability theory to describe the uncertainty in model parameters.

Views on parameters, and the Bayesian interpretation of prior and posterior probabilities: For a Frequentist, the model parameter w is a fixed quantity, to be estimated with an “estimator” (the most common being maximum likelihood estimation). For a Bayesian, however, w is itself an uncertain quantity, whose uncertainty is expressed by the prior probability p(w). To pin down the fixed w, the Frequentist repeats the experiment many times, obtaining different data sets D; for the Bayesian, there is only a single data set D, namely the one that is actually observed. Having obtained an observation D, the Bayesian revises the original belief about the parameter w (the prior probability), representing the adjusted belief by the posterior probability p(w∣D). The adjustment is made via Bayes’ theorem, the central theorem of the Bayesian school, which converts a prior probability into a posterior probability by incorporating the evidence provided by the observed data:

p(w∣D) = p(D∣w) p(w) / p(D), where p(D) = ∫ p(D∣w) p(w) dw

The conditional probability p(D∣w) is called the likelihood, and expresses how probable the observed data set is for different settings of the parameter vector w. The denominator p(D) serves only for normalization, ensuring that p(w∣D) on the left-hand side is indeed a probability; its computation is given by the integral above.

Understanding the posterior probability: it is the revised prior. For example, suppose there are classes C_1, ..., C_k with priors P(C_1), ..., P(C_k). If we are given a data point of unknown class and asked to guess its class, we should clearly guess the class with the largest prior probability. After observing the data x, we compute the posterior probabilities P(C_1∣x), ..., P(C_k∣x), and the “priors” are thereby revised to P′(C_1) = P(C_1∣x), ..., P′(C_k) = P(C_k∣x). If another data point of unknown class now arrives, we guess in the same way, by picking the class with the largest prior probability, except that the priors are now P′(C_1), ..., P′(C_k).

5.3 Bayes’ theorem and Bayesian Probability

Using examples to understand Bayesian Probability and Bayes’ theorem:

Fruit Example: Recall that in the boxes of fruit example, the observation of the identity of the fruit provided relevant information that altered the probability that the chosen box was the red one. In that example, Bayes’ theorem was used to convert a prior probability into a posterior probability by incorporating the evidence provided by the observed data.

Polynomial curve fitting example: we can adopt a similar approach when making inferences about quantities such as the parameters w in the polynomial curve fitting example. We capture our assumptions about w, before observing the data, in the form of a prior probability distribution p(w). The effect of the observed data D=t1,...,tN is expressed through the conditional probability p(D∣w), and we shall see later, in Section 1.2.5, how this can be represented explicitly.

Bayes’ theorem:

Bayes’ theorem, which takes the form

p(w∣D) = p(D∣w) p(w) / p(D) = p(D∣w) p(w) / ∫ p(D∣w) p(w) dw    (1.43)

i.e.,

Posterior = (Likelihood × Prior) / Evidence ∝ Likelihood × Prior    (1.44)

then allows us to evaluate the uncertainty in w after we have observed D in the form of the posterior probability p(w∣D).

Bayes’ theorem, where all of these quantities are viewed as functions of w, incorporates four notions:

- Prior: p(w)

- Likelihood: The quantity p(D∣w) on the right-hand side of Bayes’ theorem is evaluated for the observed data set D and can be viewed as a function of the parameter vector w, in which case it is called the likelihood function. It expresses how probable the observed data set is for different settings of the parameter vector w. Note that the likelihood is not a probability distribution over w, and its integral with respect to w does not (necessarily) equal 1.

- Evidence: p(D), the denominator in (1.43) is the normalization constant, which ensures that the posterior distribution on the left-hand side is a valid probability density and integrates to 1.

- Posterior: p(w∣D).

How to interpret the likelihood function in the Bayesian and frequentist paradigms?

In a Frequentist setting: w is considered to be a fixed parameter, whose value is determined by some form of “estimator” (A widely used frequentist estimator is maximum likelihood, in which w is set to the value that maximizes the likelihood function p(D∣w). This corresponds to choosing the value of w for which the probability of the observed data set is maximized), and error bars (One approach to determining frequentist error bars is the bootstrap, in which multiple data sets are created by repeated sampling from the original data set) on this estimate are obtained by considering the distribution of possible data sets D.

From the Bayesian viewpoint, there is only a single data set D (namely the one that is actually observed), and the uncertainty in the parameters is expressed through a probability distribution over w.

5.4 Pros (+) and Cons (-)

Pros(+) of Bayes over Frequentist: the inclusion of prior knowledge arises naturally. Suppose, for instance, that a fair-looking coin is tossed three times and lands heads each time. A classical maximum likelihood estimate of the probability of landing heads would give 1, implying that all future tosses will land heads! By contrast, a Bayesian approach with any reasonable prior will lead to a much less extreme conclusion.
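To make the coin example concrete, here is a hedged sketch (my own; the Beta(2, 2) prior is an assumption for illustration, and PRML only introduces the beta distribution in Chapter 2) contrasting the maximum likelihood estimate with the Bayesian posterior mean after three heads:

```python
# Three tosses, all heads.
heads, tails = 3, 0

# Maximum likelihood: fraction of heads -> 1.0, i.e., heads forever.
p_mle = heads / (heads + tails)

# Bayesian: with a Beta(a, b) prior on p(heads), the posterior after the
# data is Beta(a + heads, b + tails); its mean is far less extreme.
a, b = 2.0, 2.0                                  # assumed mildly fair prior
post_mean = (a + heads) / (a + b + heads + tails)

print("MLE estimate:           ", p_mle)         # 1.0
print("Bayesian posterior mean:", post_mean)     # 5/7 ≈ 0.714
```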

Cons(-) of Bayes against Frequentist: one common criticism of the Bayesian approach is that the prior distribution is often selected on the basis of mathematical convenience rather than as a reflection of any prior beliefs. Even the subjective nature of the conclusions through their dependence on the choice of prior is seen by some as a source of difficulty. Reducing the dependence on the prior is one motivation for so-called noninformative priors. However, these lead to difficulties when comparing different models, and indeed Bayesian methods based on poor choices of prior can give poor results with high confidence. Frequentist evaluation methods offer some protection from such problems, and techniques such as cross-validation remain useful in areas such as model comparison.

Cons of Frequentist: Over-fitting problem can be understood as a general property of maximum likelihood.

5.5 Dealing with Over-fitting [from Ref 1]

Frequentist ways to control over-fitting:

Regularization, i.e., adding a penalty term to the objective function: an L2 regularizer gives ridge regression, and an L1 regularizer gives lasso regression. Adding a penalty is also called a shrinkage method, because it reduces the value of the coefficients.

Cross-validation, i.e., holding out part of the data for validation. Cross-validation is also a method for model selection: using the held-out validation data, one can pick the best among several trained models.

The Bayesian way to control over-fitting: the prior probability.

5.6 Difficulties in Carrying through the Full Bayesian Procedure: Marginalization

The practical application of Bayesian methods was for a long time severely limited by the difficulties in carrying through the full Bayesian procedure, particularly the need to marginalize (sum or integrate) over the whole of parameter space, which, as we shall see, is required in order to make predictions or to compare different models.


The door to the practical use of Bayesian techniques in an impressive range of problem domains was opened by the following developments:

1. the development of sampling methods, e.g., Markov Chain Monte Carlo (MCMC). Monte Carlo methods are very flexible and can be applied to a wide range of models. However, they are computationally intensive and have mainly been used for small-scale problems.

2. Dramatic improvements in the speed (i.e. CPU) and memory capacity of computers.

3. Highly efficient deterministic approximation schemes, such as variational Bayes and expectation propagation (discussed in Chapter 10) have been developed. These offer a complementary alternative to sampling methods and have allowed Bayesian techniques to be used in large-scale applications (Blei et al., 2003).

6. Maximum-likelihood Estimation (MLE) for a univariate Gaussian Case

6.1 Gaussian distribution:

1-dimension:

N(x ∣ μ, σ²) = (1 / (2πσ²)^{1/2}) exp{ −(x − μ)² / (2σ²) }    (1.46)

D-dimension:

N(x ∣ μ, Σ) = (1 / (2π)^{D/2}) (1 / |Σ|^{1/2}) exp{ −(1/2) (x − μ)^T Σ^{−1} (x − μ) }    (1.52)

where the D-dimensional vector μ is called the mean, the D×D matrix Σ is called the covariance, and |Σ| denotes the determinant of Σ.

6.2 Sampling from a Gaussian distribution [see Ref 2]
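The slide from [Ref 2] is not reproduced here; as a stand-in, the following sketch (my own) draws univariate samples via the standard reparameterization x = μ + σz with z ∼ N(0, 1), and D-dimensional samples via a Cholesky factor of Σ:

```python
import numpy as np

rng = np.random.default_rng(0)

# Univariate: x = mu + sigma * z, with z ~ N(0, 1).
mu, sigma = 1.0, 2.0
x = mu + sigma * rng.standard_normal(10_000)
print(x.mean(), x.std())                  # ≈ 1.0 and ≈ 2.0

# D-dimensional: x = mu + L z with Sigma = L L^T (Cholesky), so that
# cov[x] = L cov[z] L^T = Sigma.
mu_vec = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
L = np.linalg.cholesky(Sigma)
z = rng.standard_normal((10_000, 2))
samples = mu_vec + z @ L.T
print(np.cov(samples.T))                  # ≈ Sigma
```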



6.3 Take the univariate Gaussian for example.

Now suppose that we have a data set of observations x=(x1,...,xN)T , representing N observations of the scalar variable x. We shall suppose that the observations are drawn independently from a Gaussian distribution whose mean μ and variance σ2 are unknown, and we would like to determine these parameters from the data set. Data points that are drawn independently from the same distribution are said to be independent and identically distributed, which is often abbreviated to i.i.d.

Because our data set x is i.i.d., we can therefore write the probability of the data set, given μ and σ², in the form

p(x ∣ μ, σ²) = ∏_{n=1}^{N} N(x_n ∣ μ, σ²)    (1.53)

In practice, it is more convenient to maximize the log of the likelihood function, written in the form

ln p(x ∣ μ, σ²) = −(1/(2σ²)) ∑_{n=1}^{N} (x_n − μ)² − (N/2) ln σ² − (N/2) ln(2π)    (1.54)

Why take the log?

The logarithm is a monotonically increasing function of its argument, so maximization of the log of a function is equivalent to maximization of the function itself.

Taking the log simplifies the subsequent mathematical analysis;

Taking the log helps numerically because the product of a large number of small probabilities can easily underflow the numerical precision of the computer, and this is resolved by computing instead the sum of the log probabilities.

When viewed as a function of μ and σ2 , this is the likelihood function for the Gaussian and is interpreted diagrammatically in Figure 1.14.


Maximizing (1.54) with respect to μ, we obtain the maximum likelihood solution given by μ_ML = (1/N) ∑_{n=1}^{N} x_n (1.55)

which is called the sample mean, i.e., the mean of the observed values {x_n}. Similarly, maximizing (1.54) with respect to σ², we obtain the so-called sample variance σ²_ML, measured with respect to the sample mean μ_ML, in the form σ²_ML = (1/N) ∑_{n=1}^{N} (x_n − μ_ML)² (1.56)
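A minimal sketch (my own) of (1.54)-(1.56) on synthetic data, computing the log likelihood as a sum of log terms (which avoids the underflow noted above) and checking the closed-form maximizers:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=1_000)   # i.i.d. draws, mu=2, sigma=3
N = x.size

# Closed-form maximum likelihood solutions (1.55) and (1.56).
mu_ml = x.mean()                                 # sample mean
var_ml = ((x - mu_ml) ** 2).mean()               # sample variance (biased)

# Log likelihood (1.54) at the maximum, as a sum rather than a product.
log_lik = (-0.5 / var_ml * ((x - mu_ml) ** 2).sum()
           - 0.5 * N * np.log(var_ml)
           - 0.5 * N * np.log(2 * np.pi))

print("mu_ML  =", mu_ml)                         # ≈ 2
print("var_ML =", var_ml)                        # ≈ 9, slightly low on average
print("log likelihood:", log_lik)
```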

6.4 One Limitation of the Maximum Likelihood Approach

Limitation: The maximum likelihood approach systematically underestimates the variance of the distribution. This is an example of a phenomenon called bias and is related to the problem of over-fitting encountered in the context of polynomial curve fitting.



We first note that the maximum likelihood solutions μ_ML and σ²_ML are functions of the data set values x_1, ..., x_N. Consider the expectations of these quantities with respect to the data set values, which themselves come from a Gaussian distribution with parameters μ and σ², i.e., x ∼ N(μ, σ²). It is straightforward to show that

E[μ_ML] = μ    (1.57)

E[σ²_ML] = ((N − 1)/N) σ² < σ²    (1.58)

From (1.58) it follows that the following estimate for the variance parameter is unbiased:

σ̃² = (N/(N−1)) σ²_ML = (1/(N−1)) ∑_{n=1}^{N} (x_n − μ_ML)²    (1.59)
In fact, as we shall see, the issue of bias in maximum likelihood lies at the root of the over-fitting problem that we encountered earlier in the context of polynomial curve fitting.
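The bias in (1.58) is easy to see by simulation. This sketch (my own) averages σ²_ML over many data sets of size N = 5 drawn from N(0, 1), for which (N−1)/N = 0.8:

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 5, 200_000

data = rng.standard_normal((trials, N))          # each row is one data set
mu_ml = data.mean(axis=1, keepdims=True)
var_ml = ((data - mu_ml) ** 2).mean(axis=1)      # sigma^2_ML per data set

print("E[var_ML] ≈", var_ml.mean())                   # ≈ 0.8, not 1.0
print("unbiased  ≈", (N / (N - 1)) * var_ml.mean())   # ≈ 1.0, as in (1.59)
```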

Exercise 1.12: Proof of (1.57) and (1.58)



Solution: Using E[x_n] = μ together with E[x_n x_m] = μ² + δ_nm σ² (which follows from the independence of the draws), we have

E[μ_ML] = (1/N) ∑_n E[x_n] = μ

which proves (1.57). For (1.58), note that E[x_n μ_ML] = μ² + σ²/N and E[μ_ML²] = μ² + σ²/N, so

E[σ²_ML] = (1/N) ∑_n E[x_n² − 2 x_n μ_ML + μ_ML²] = (μ² + σ²) − 2(μ² + σ²/N) + (μ² + σ²/N) = ((N−1)/N) σ²

7. Curve fitting re-visited

7.1 Purpose:

MLE, Point estimate → Probabilistic Model → MAP → Bayesian

We have seen how the problem of polynomial curve fitting can be expressed in terms of error minimization in Section 1.1. Here we return to the curve fitting example and view it from a probabilistic perspective, thereby gaining some insights into error functions and regularization, as well as taking us towards a full Bayesian treatment.

7.2 Goal in the curve fitting problem:

The goal in the curve fitting problem is to be able to make predictions for the target variable t given some new value of the input variable x on the basis of a set of training data {x,t}.

7.3 Uncertainty over the value of the target variable t

We can express our uncertainty over the value of the target variable t using a probability distribution. For this purpose, we shall assume that, given the value of x, the corresponding value of t has a Gaussian distribution with a mean equal to the value y(x,w) of the polynomial curve given by (1.1). Thus we have

p(t ∣ x, w, β) = N(t ∣ y(x, w), β^{−1})    (1.60)

where, for consistency with the notation in later chapters, we have defined a precision parameter β corresponding to the inverse variance of the distribution. This is illustrated schematically in Figure 1.16.



For the i.i.d. training data {x, t}, the likelihood function is given by

p(t ∣ x, w, β) = ∏_{n=1}^{N} N(t_n ∣ y(x_n, w), β^{−1})    (1.61)

and the log likelihood function in the form

ln p(t ∣ x, w, β) = −(β/2) ∑_{n=1}^{N} { y(x_n, w) − t_n }² + (N/2) ln β − (N/2) ln(2π)    (1.62)

Maximizing (1.62) with respect to w is equivalent to minimizing the sum-of-squares error function, which gives w_ML. We can then use maximum likelihood to determine the precision parameter β of the Gaussian conditional distribution:

1/β_ML = (1/N) ∑_{n=1}^{N} { y(x_n, w_ML) − t_n }²    (1.63)
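To make (1.61)-(1.63) concrete, here is a sketch (my own, with an assumed polynomial order M = 3 and synthetic sinusoidal data in the style of Section 1.1): maximizing the log likelihood over w reduces to least squares on the design matrix of powers of x, after which (1.63) gives β_ML, and substituting both into (1.60) yields the plug-in predictive distribution used below in Section 7.5:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic sinusoidal data, as in PRML Section 1.1.
N, beta_true = 20, 25.0                       # noise std = beta^(-1/2) = 0.2
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, beta_true ** -0.5, N)

M = 3                                         # assumed polynomial order
Phi = x[:, None] ** np.arange(M + 1)          # rows are phi(x_n)^T

# w_ML: maximizing (1.62) over w = minimizing the sum-of-squares error.
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# beta_ML from (1.63): inverse mean squared residual.
beta_ml = 1.0 / ((Phi @ w_ml - t) ** 2).mean()

# Plug-in predictive distribution (1.64) at a new point x*.
x_star = 0.5
mean = (x_star ** np.arange(M + 1)) @ w_ml
print("w_ML =", w_ml)
print("beta_ML =", beta_ml)
print(f"predictive at x*={x_star}: N(t | {mean:.3f}, {1 / beta_ml:.4f})")
```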



7.4 The Likelihood for Linear Regression and its MLE Solution in the Point-Estimate Category [see Ref-2]

The same idea can be found in Lecture 3 of [Ref-2].



Please note that here y_i is used to represent the target variable. The maximum likelihood estimate (MLE) of θ is obtained by taking the derivative of the log-likelihood, log p(y∣X, θ, σ). The goal is to maximize the likelihood of seeing the training data {X, y} by modifying the parameters (θ, σ).



The MLE of θ is:

θ_ML = (X^T X)^{−1} X^T y

The MLE of σ is given by:

σ²_ML = (1/N) ∑_{i=1}^{N} (y_i − x_i^T θ_ML)²

7.5 Making predictions:

Because we now have a probabilistic model, predictions are expressed in terms of the predictive distribution, which gives the probability distribution over t rather than simply a point estimate, and is obtained by substituting the maximum likelihood parameters into (1.60) to give

p(t ∣ x, w_ML, β_ML) = N(t ∣ y(x, w_ML), β_ML^{−1})    (1.64)

As a special case, for linear regression the MLE plug-in prediction [Ref-2], given the training data D = (X, y), for a new input x∗ and known σ², is

p(y ∣ x∗, D, σ²) = N(y ∣ x∗^T θ_ML, σ²)




7.6 Taking a step towards a more Bayesian approach

Prior distribution over the polynomial coefficients w: For simplicity, let us consider a Gaussian distribution of the form

p(w ∣ α) = N(w ∣ 0, α^{−1} I) = (α/(2π))^{(M+1)/2} exp{ −(α/2) w^T w }    (1.65)
where α is the precision of the distribution, and M+1 is the total number of elements in the vector w for an Mth order polynomial.

Hyperparameters: Variables such as α, which control the distribution of model parameters, are called hyperparameters.

Calculate the posterior distribution for w: Using Bayes’ theorem, the posterior distribution for w is given by

p(w ∣ x, t, α, β) ∝ p(t ∣ x, w, β) p(w ∣ α)    (1.66)
MAP, maximum posterior: We can now determine w by finding the most probable value of w given the data, in other words by maximizing the posterior distribution. This technique is called maximum posterior, or simply MAP. Taking the negative logarithm of (1.66) and combining with (1.62) and (1.65), we find that the maximum of the posterior is given by the minimum of

(β/2) ∑_{n=1}^{N} { y(x_n, w) − t_n }² + (α/2) w^T w    (1.67)

Equivalence between Posterior and Regularized sum-of-squares Error function: Thus we see that maximizing the posterior distribution is equivalent to minimizing the regularized sum-of-squares error function encountered earlier in the form (1.4), with a regularization parameter given by λ=α/β.
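This equivalence is easy to verify numerically. A sketch (my own; the data setup and the values of α and β are assumptions for illustration) compares the minimizer of (1.67) with ridge regression at λ = α/β:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 20, 9
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, N)
Phi = x[:, None] ** np.arange(M + 1)          # rows are phi(x_n)^T

alpha, beta = 5e-3, 25.0                      # assumed hyperparameters
lam = alpha / beta                            # regularization parameter

# Minimizing (1.67) in closed form:
# (beta Phi^T Phi + alpha I) w = beta Phi^T t.
A = beta * Phi.T @ Phi + alpha * np.eye(M + 1)
w_map = np.linalg.solve(A, beta * Phi.T @ t)

# Ridge regression with lambda = alpha / beta gives the same solution.
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)
print("max |w_map - w_ridge| =", np.abs(w_map - w_ridge).max())  # ≈ 0
```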

Note:

Although we have included a prior distribution p(w∣α), we are so far still making a point estimate of w and so this does not yet amount to a Bayesian treatment, discussed in the following section.

8. Bayesian Curve fitting

In a fully Bayesian approach, we should consistently apply the sum and product rules of probability, which requires, as we shall see shortly, that we integrate over (i.e., to marginalize) all values of w. Such marginalizations lie at the heart of Bayesian methods for pattern recognition.

In the curve fitting problem, we are given the training data x ≡ (x_1, ..., x_N)^T and t ≡ (t_1, ..., t_N)^T, along with a new test point x, and our goal is to predict the value of t. We therefore wish to evaluate the predictive distribution p(t∣x, x, t). Here we shall assume that the parameters α and β are fixed and known in advance (in later chapters we shall discuss how such parameters can be inferred from data in a Bayesian setting).

A Bayesian treatment simply corresponds to a consistent application of the sum and product rules of probability, which allow the predictive distribution to be written in the form

p(t ∣ x, x, t) = ∫ p(t ∣ x, w) p(w ∣ x, t) dw    (1.68)

- p(t∣x,w) in RHS: is given by (1.60), and we have omitted the dependence on α and β to simplify the notation.

- p(w∣x,t) in RHS: is the posterior distribution over parameters, and can be found by normalizing the right-hand side of equation (1.66). It will be shown in Section 3.3 that this posterior distribution is a Gaussian and can be evaluated analytically.

- LHS: the integration in (1.68) can also be performed analytically, with the result that the predictive distribution is given by a Gaussian of the form

p(t ∣ x, x, t) = N(t ∣ m(x), s²(x))    (1.69)

where the mean and variance are given by

m(x) = β φ(x)^T S ∑_{n=1}^{N} φ(x_n) t_n    (1.70)
s²(x) = β^{−1} + φ(x)^T S φ(x)    (1.71)

Here the matrix S is given by

S^{−1} = α I + β ∑_{n=1}^{N} φ(x_n) φ(x_n)^T    (1.72)

where I is the unit matrix, and the vector φ(x) = (1, x, x², ..., x^M)^T.

Analysis of (1.71):

- the first term β^{−1}: represents the uncertainty in the predicted value of t due to the noise on the target variables, and was expressed already in the maximum likelihood predictive distribution (1.64) through β_ML^{−1}

- the second term φ(x)^T S φ(x): arises from the uncertainty in the parameters w and is a consequence of the Bayesian treatment.

The predictive distribution for the synthetic sinusoidal regression problem is illustrated in Figure 1.17.
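A sketch (my own, using a synthetic sinusoidal data set and the settings PRML reports for Figure 1.17: α = 5×10⁻³, β = 11.1, M = 9) of the predictive mean and variance (1.70)-(1.72):

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 10, 9
alpha, beta = 5e-3, 11.1                      # values used for Figure 1.17
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, beta ** -0.5, N)

def phi(x):
    """Polynomial basis phi(x) = (1, x, x^2, ..., x^M)^T, one row per input."""
    return np.power.outer(x, np.arange(M + 1))

Phi = phi(x)                                  # rows are phi(x_n)^T

# S^{-1} = alpha I + beta sum_n phi(x_n) phi(x_n)^T            (1.72)
S = np.linalg.inv(alpha * np.eye(M + 1) + beta * Phi.T @ Phi)

x_new = np.linspace(0, 1, 5)
P = phi(x_new)
mean = beta * P @ S @ Phi.T @ t               # m(x)            (1.70)
var = 1.0 / beta + np.sum(P @ S * P, axis=1)  # s^2(x)          (1.71)

for xi, m, v in zip(x_new, mean, var):
    print(f"x = {xi:.2f}   m(x) = {m: .3f}   s^2(x) = {v:.4f}")
```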



9. Curve Fitting as an Example Demonstrating the Three Approaches [See Ref-1]

1) MLE: directly maximize the likelihood function to obtain the parameters w. This method is a point estimate.

2) MAP (“poor man’s Bayes”): introduce a prior probability and maximize the posterior probability to obtain w. Here MAP amounts to adding an L2 penalty term to the MLE objective (the likelihood function). This method is still a point estimate.

3) Fully Bayesian approach: this requires the sum rule and the product rule (because the machinery of “degrees of belief” is the same as that of probability, these two rules hold for degrees of belief as well), and obtaining the predictive distribution further requires marginalizing (summing or integrating) over the whole parameter space of w:

p(t ∣ x, X, t) = ∫ p(t ∣ x, w) p(w ∣ X, t) dw

Here x is the point to be predicted, X is the observed data set, and t contains the labels of the data points. In effect this is a weighted average of the probability p(t ∣ x, w), with the posterior probability of the parameters w as the weights; the process therefore requires integrating over w, i.e., marginalization.

10. References

[1] PRML笔记 (Notes on Pattern Recognition and Machine Learning), Chapter 01, pages 4-6: http://www.cvrobot.net/wp-content/uploads/2015/09/PRML%E7%AC%94%E8%AE%B0-Notes-on-Pattern-Recognition-and-Machine-Learning-1.pdf

[2] Slides of the Deep Learning course at Oxford University: https://www.cs.ox.ac.uk/people/nando.defreitas/machinelearning/