
Naive Bayes Theorem and Application - Theorem


Naive Bayes model:

1. Naive Bayes model: discrete attributes with a finite number of values

2. Parameter density estimation

3. Naive Bayes classification algorithm

4. AutoClass clustering algorithm

1. Naive Bayes model

In this model, we want to estimate P(X1,...,Xn) under the assumption that all attributes are independent of each other. This is the same assumption as in k-means, except that this is a discrete model:

P(X1,...,Xn) = ∏i P(Xi)

where each P(Xi) can be any discrete distribution you like, e.g. {red: 0.5, blue: 0.2, yellow: 0.3}.

To simplify this problem, we assume all attributes are Boolean. With no independence assumptions, the model has 2^n states of X1,...,Xn and 2^n − 1 independent parameters; with the independence assumption, the number of parameters drops to one per attribute, n in total (or 2n+1 in the classification model below).

For example, in a classification problem, we assume θC is the probability that the class is true. Then we have 2n+1 parameters:

P(C=T) = θC,  P(C=F) = 1 − θC

P(Xi=T|C=T) = θTi,  P(Xi=F|C=T) = 1 − θTi

P(Xi=T|C=F) = θFi,  P(Xi=F|C=F) = 1 − θFi

θ = ⟨θC, θT1, ..., θTn, θF1, ..., θFn⟩

As you can see above, this yields incredible savings in the number of parameters. Representing P(X1,...,Xn) explicitly suffers from the curse of dimensionality, while ∏i P(Xi) does not. These savings result from very strong independence assumptions. In fact, the Naive Bayes model performs very well when the assumptions hold, but performs badly when the variables are dependent. For example, the NB model performs well on English text but badly on Chinese, because context is more important for understanding Chinese correctly. So we should be cautious about applying this strong assumption to models whose variables are substantially related.
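
To make the savings concrete, here is a minimal Python sketch (my own illustration, not code from the source material): under the independence assumption we store just one parameter per Boolean attribute and multiply them to evaluate any joint probability.

```python
# Minimal sketch: joint probability of independent Boolean attributes.
# We store n parameters (theta_i = P(X_i = True)) instead of 2^n - 1.
from functools import reduce

theta = [0.9, 0.2, 0.7]   # made-up parameters for n = 3 attributes

def joint(x, theta):
    """P(X_1=x_1, ..., X_n=x_n) under the naive independence assumption."""
    return reduce(lambda p, pair: p * (pair[1] if pair[0] else 1 - pair[1]),
                  zip(x, theta), 1.0)

print(joint([True, False, True], theta))   # 0.9 * 0.8 * 0.7 = 0.504
```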

Naive Bayes classifier

For an NB classification problem we should learn:

1. P(X1,...,Xn|C) = ∏i P(Xi|C) for each class; this assumes that Xi and Xj are conditionally independent of each other given C.

2. P(C)

To classify: given x, choose c that maximizes:

P(c|x) ∝ P(c) ∏i P(xi|c)
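
As a rough sketch of this decision rule (illustrative code with made-up probabilities, not taken from the source), we score each class by the log of P(c) ∏i P(xi|c) and return the best one; working in log space avoids numerical underflow when n is large.

```python
# Naive Bayes decision rule for Boolean attributes (illustrative sketch).
import math

prior = {True: 0.6, False: 0.4}                            # P(C = c), made-up numbers
cond = {True: [0.8, 0.1, 0.7], False: [0.3, 0.5, 0.2]}     # P(X_i = True | C = c)

def classify(x):
    """Return the class c maximizing log P(c) + sum_i log P(x_i | c)."""
    best_c, best_score = None, -math.inf
    for c in prior:
        score = math.log(prior[c])
        for xi, p in zip(x, cond[c]):
            score += math.log(p if xi else 1 - p)
        if score > best_score:
            best_c, best_score = c, score
    return best_c

print(classify([True, False, True]))   # -> True
```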

The NB classifier is a linear separator. Attributes act independently to produce the classification and do not interact, so NB cannot capture concepts like XOR, just like perceptrons.

There is an important point about linearly separable problems. Many real-world domains are not linearly separable, but even for those domains there may be a pretty good linearly separable hypothesis. We may be better off learning a linearly separable hypothesis than learning a richer hypothesis, because the strong inductive bias makes it easier to learn.

In the following discussion we will assume that the attributes and the class are Boolean; this is only to keep the notation simple. Everything generalizes to the case where they have many possible values.

A simple problem:

There is a soccer team, and we have observed a sequence of games of the team. Based on this, we want to estimate the probability that the team will win a future game. The formulation of the problem is below:

* Variable X has states {f, t}(t = win)

* Parameter θ=P(X=t)

* Observations X1=t,X2=f,X3=f

* These comprise the data D

* Task: estimate θ

* Use θ to estimate P(X4=t)

First we will introduce the maximum likelihood (ML) principle; the relevant definitions are:

* Likelihood: L(θ)=P(D|θ)=P(X1,X2,X3|θ)

* ML Principle: Choose θ so as to maximize L(θ)

* L(θ)=P(X1|θ)P(X2|θ)P(X3|θ)

* Log likelihood: LL(θ)=logP(X1|θ)+logP(X2|θ)+logP(X3|θ)

* ML Principle equivalent: Choose θ so as to maximize LL(θ)

In this example:

P(Xi=t|θ)=θ

L(θ)=P(X1=t,X2=f,X3=f|θ)=θ(1−θ)(1−θ)

LL(θ)=logθ+2log(1−θ)

Set the derivative to 0: 1/θ − 2/(1−θ) = 0

Solve to find: θ=1/3

In the example just now, you can see that θ=1/3 is exactly the fraction of observed games in which the team won. This is no coincidence: the ML estimate of the probability of an event is always the fraction of the time in which the event happened. In other words, the ML estimate is exactly the one most suggested by the data. More generally, given observations X1, X2, ..., XN, let Nt be the number of instances with value t and Nf be the number of instances with value f. Then the maximum likelihood estimate for θ is: θ̂ = Nt/(Nt+Nf) = Nt/N
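
A minimal sketch of this estimate for the soccer example (illustrative code, not from the source):

```python
# ML estimate for a Boolean variable: the empirical fraction of "true" outcomes.
data = [True, False, False]          # X1 = t, X2 = f, X3 = f

def ml_estimate(data):
    n_t = sum(data)                  # Nt: number of wins
    return n_t / len(data)           # theta_hat = Nt / N

print(ml_estimate(data))             # 0.333..., matching theta = 1/3
```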

Problems with this approach

Overfits: it pays too much attention to noise in the data. For example, if the team happened to play the Chinese national soccer team recently, we will overestimate the team's performance.

Ignores prior experience: if some experts told you that this is a small team, you should not be confident even if it has beaten the Chinese national soccer team.

Events that do not occur in the data are deemed impossible, for example a match ending 1-1.

Incorporating a prior

* Prior: P(θ) before seeing any data

* Posterior: P(θ|D)

* Maximum a Posteriori principle (MAP): Choose θ to maximize P(θ|D), where P(θ|D) is proportional to P(θ)L(θ)

For learning the parameter of a Boolean random variable, an appropriate prior over θ is the beta distribution. As you know, the beta distribution has 2 parameters, α and β, and these parameters control the shape of the prior. α and β control how relatively likely true and false outcomes are: if α is large relative to β, θ is more likely to be large.

As the graph below shows:

[Figure: Beta(α, β) densities for different values of α relative to β]

And the magnitude of α and β controls how peaked the beta distribution is: if α and β are large, the beta will be sharply peaked.

The magnitude of α and β:

[Figure: Beta(α, β) densities showing how larger α and β give a more sharply peaked distribution]

Updating the prior

To get the hyperparameters for the posterior, we take the hyperparameters in the prior and add to them the actual observations that we get.

For example, if the prior is Beta(4, 7) and we observe 1 "+" and 4 "−", then the posterior is Beta(5, 11).
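
A small sketch of this update (illustrative code; `update_beta` is just a hypothetical helper):

```python
# Beta-Bernoulli update: posterior hyperparameters are the prior
# hyperparameters plus the observed counts.
def update_beta(alpha, beta, observations):
    """Return the posterior (alpha, beta) after a list of Boolean observations."""
    n_pos = sum(observations)
    n_neg = len(observations) - n_pos
    return alpha + n_pos, beta + n_neg

# Prior Beta(4, 7), observe 1 "+" and 4 "-"  ->  posterior Beta(5, 11)
print(update_beta(4, 7, [True, False, False, False, False]))   # (5, 11)
```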

Understanding the hyperparameters

Hyperparameter α represents the number of previous positive observations that we have had, plus 1; similarly, β represents the number of previous "−" observations that we have had, plus 1. The hyperparameters in the prior can be seen as representing imaginary observations from our prior experience. The more we trust our prior experience, the larger the hyperparameters in the prior.

Mode and mean of the beta distribution

The mode of Beta(α,β) is (α−1)/(α+β−2), e.g. the mode of Beta(2,3) is 1/3. The mean of Beta(α,β) is α/(α+β), e.g. the mean of Beta(2,3) is 2/5.



MAP estimate

Looking at the shape of the posterior, we can see that the MAP estimate is the mode of the posterior. This is the fraction of the total (real plus imaginary) observations that are true: for m positive instances out of N total, θ̂MAP = (m+α−1)/(N+α+β−2).

In the example above, the prior is Beta(5, 3) and the observations are X1=t, X2=f, X3=f, so the posterior is Beta(6, 5). The MAP estimate is θ̂MAP = (m+α−1)/(N+α+β−2) = 5/9.
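
As a quick check of this formula (illustrative code only):

```python
# MAP estimate with a Beta(alpha, beta) prior: the mode of the Beta posterior.
def map_estimate(m, n, alpha, beta):
    return (m + alpha - 1) / (n + alpha + beta - 2)

# Prior Beta(5, 3), observations t, f, f  ->  m = 1, N = 3
print(map_estimate(1, 3, 5, 3))    # 5/9 ≈ 0.556
```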

ML vs MAP

Maximum likelihood estimate: θ̂ML = m/N; MAP estimate: θ̂MAP = (m+α−1)/(N+α+β−2). Maximum likelihood is equivalent to MAP with a uniform prior Beta(1, 1), meaning that there are no imaginary observations.

Drawback of MAP

MAP does not fully consider the range of possible values for θ; it only chooses the maximum value, and this value may not be representative even though it has the greatest probability. So we introduce another approach, the Bayesian approach. This approach does not make a point estimate of θ; instead, a posterior distribution is maintained over the value of θ. E.g., given X1, X2, X3 we want to predict X4 using the entire distribution over θ.

P(X4=t|X1,X2,X3) = ∫₀¹ P(X4=t|θ) P(θ|X1,X2,X3) dθ = ∫₀¹ θ P(θ|X1,X2,X3) dθ = E[θ|X1,X2,X3] = mean of the posterior

For the beta posterior, E[θ] = (m+α)/(N+α+β).

E[θ] is the estimate of the probability that a new instance is true. And this is obtained by integrating over the posterior distribution.

In the problem above, with prior Beta(5, 3), the maximum a posteriori estimate is θ̂MAP = (m+α−1)/(N+α+β−2) = 5/9, while the Bayesian expectation is E[θ] = (m+α)/(N+α+β) = 6/11; we can see the latter is closer to 1/2.
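
A companion sketch for the Bayesian estimate (illustrative code; compare with the MAP sketch above):

```python
# Bayesian estimate: the mean of the Beta posterior,
# E[theta] = (m + alpha) / (N + alpha + beta).
def posterior_mean(m, n, alpha, beta):
    return (m + alpha) / (n + alpha + beta)

# Prior Beta(5, 3), observations t, f, f  ->  m = 1, N = 3
print(posterior_mean(1, 3, 5, 3))   # 6/11 ≈ 0.545, closer to 1/2 than the MAP 5/9
```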

Multi-valued class and attributes

In the first part of this essay we simplified the attributes and class to Boolean; now we generalize to multi-valued classes and attributes.

Given that |C| = k and |Xi| = m, the parameters are listed below:

P(C=c) = θc

P(Xi=x|C=c) = θc,i,x

So the number of parameters grows to k*m*n + k − 1. The counts of instances are listed below:

* N = total number of instances

* Nc = number of instances with class c

* Nc,i,x = number of instances with class c and Xi = x

* k*m*n + k − 1 parameters in total
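
As a rough sketch of how these counts turn into parameter estimates (illustrative code with a made-up toy dataset; `fit_naive_bayes` is a hypothetical helper, not from the source), the ML estimates are θc = Nc/N and θc,i,x = Nc,i,x/Nc:

```python
# Count-based ML estimation for a multi-valued Naive Bayes model.
from collections import Counter, defaultdict

def fit_naive_bayes(X, y):
    """X: list of attribute tuples, y: list of class labels."""
    n = len(y)
    class_counts = Counter(y)                            # N_c
    prior = {c: cnt / n for c, cnt in class_counts.items()}        # theta_c
    counts = defaultdict(lambda: defaultdict(Counter))   # counts[c][i][x] = N_{c,i,x}
    for xs, c in zip(X, y):
        for i, x in enumerate(xs):
            counts[c][i][x] += 1
    likelihood = {c: {i: {x: cnt / class_counts[c] for x, cnt in counter.items()}
                      for i, counter in attrs.items()}
                  for c, attrs in counts.items()}         # theta_{c,i,x}
    return prior, likelihood

X = [("sunny", "hot"), ("rainy", "cool"), ("sunny", "cool")]
y = ["yes", "no", "yes"]
prior, likelihood = fit_naive_bayes(X, y)
print(prior)                   # ≈ {'yes': 0.67, 'no': 0.33}
print(likelihood["yes"][0])    # {'sunny': 1.0}
```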

From the analysis above, we can make a brief conclusion about Naive Bayes:

The advantages are that NB does not suffer from the curse of dimensionality, it is very easy to implement, and it learns fast. In each dimension, it makes no assumption about the form of the distribution. On the other hand, Naive Bayes makes a strong independence assumption and may perform poorly when it does not hold (Chinese text, for example), and the maximum likelihood estimate can overfit the data.
