
Implementing the AdaBoost Classification Algorithm

2014-09-18 22:43
AdaBoost is generally considered one of the most effective classifiers. The overall idea is to combine several weak classifiers into a single strong classifier. The procedure works as follows: assign every sample in the training data a weight, all equal at the start. Using the current weights, construct a weak classifier that minimizes the weighted error rate, use that error rate to compute the classifier's weight, and predict the samples' classes; the weights of correctly classified samples are then decreased, while the weights of misclassified samples are increased. Repeating this process several times produces a sequence of weak classifiers and their weights; finally, a linear combination of the weak classifiers' predictions is fed into the sign function to obtain the final classification.

In pseudocode, the algorithm runs as follows:
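1. Initialize the sample weights uniformly: w_i = 1/m for i = 1, ..., m.
2. For t = 1, ..., T (T corresponds to n_estimators in the code below):
   (a) Train the weak classifier h_t that minimizes the weighted error rate e_t = sum_i w_i * I(h_t(x_i) != y_i).
   (b) Compute the classifier weight alpha_t = (1/2) * ln((1 - e_t) / e_t).
   (c) Update each sample weight w_i <- w_i * exp(-alpha_t * y_i * h_t(x_i)) and renormalize so the weights sum to 1; this lowers the weights of correctly classified samples and raises the weights of misclassified ones.
3. Output the final classifier G(x) = sign(sum_t alpha_t * h_t(x)).
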
This post implements the AdaBoost classification algorithm in Python.

The weak classifier used here is a single-level decision tree (a decision stump): it predicts -1 on one side of a split, using either the rule "feature <= threshold" or the rule "feature > threshold", and +1 on the other side.

All of the code lives in a single file, adaboost.py:

from __future__ import division, print_function
import numpy as np


class AdaBoostClassifier:
    def __init__(self, n_estimators=20):
        self.n_estimators = n_estimators
        # each trained weak classifier is stored as [dimension, threshold, inequality, alpha]
        self.list_weakClassifier = []

    def weakClassify(self, X, dimen, val, threshIneq):
        # decision stump: label one side of the threshold -1 and the other +1
        results = np.ones((X.shape[0], 1))
        if threshIneq == 'lt':
            results[X[:, dimen] <= val] = -1.0
        else:
            results[X[:, dimen] > val] = -1.0
        return results

    def findBestClassifier(self, X, y, w):
        # exhaustively search every feature, threshold and direction for the
        # stump with the smallest weighted error under the current weights w
        bestClassifier = []
        m = X.shape[0]
        n = X.shape[1]
        labelEst = np.zeros((m, 1))
        numSteps = 10
        minErr = np.inf
        for i in range(n):
            rangeMin = X[:, i].min()
            rangeMax = X[:, i].max()
            stepSize = (rangeMax - rangeMin) / numSteps
            for j in range(numSteps + 1):
                for inequal in ['lt', 'gt']:
                    val = rangeMin + j * stepSize
                    predictResults = self.weakClassify(X, i, val, inequal)
                    errArr = np.ones((m, 1))
                    errArr[predictResults == y] = 0   # 1 where wrong, 0 where correct
                    weightedErr = (w * errArr).sum()
                    if weightedErr < minErr:
                        labelEst = predictResults
                        minErr = weightedErr
                        bestDim = i
                        bestVal = val
                        bestIneq = inequal
        bestClassifier.extend([bestDim, bestVal, bestIneq])
        return bestClassifier, minErr, labelEst

    def fit(self, X, y):
        m = X.shape[0]
        w = np.ones((m, 1)) / m   # start from uniform sample weights
        weightedLabelEst = np.zeros((m, 1))
        for i in range(self.n_estimators):
            bestClassifier, minErr, labelEst = self.findBestClassifier(X, y, w)
            if minErr > 0.5:
                break   # the best stump is worse than random guessing; give up
            print("weighted vector: ", w.T)
            if minErr == 0:
                alpha = 1000   # a perfect stump gets a very large weight
            else:
                alpha = 0.5 * np.log((1 - minErr) / minErr)
            bestClassifier.append(alpha)
            self.list_weakClassifier.append(bestClassifier)
            print("current label estimation: ", labelEst.T)
            weightedLabelEst = weightedLabelEst + alpha * labelEst
            print("weighted label estimation: ", weightedLabelEst.T)
            finalLabelEst = np.sign(weightedLabelEst)
            errorVector = np.zeros((m, 1))
            errorVector[finalLabelEst != y] = 1
            errorRate = errorVector.sum() / m
            print("errorRate: ", errorRate)
            # stop when the stump is perfect, the ensemble is perfect,
            # or the stump is no better than random (alpha == 0)
            if minErr == 0 or errorRate == 0 or alpha == 0:
                break
            # raise the weights of misclassified samples and lower the rest
            w = w * np.exp(-alpha * y * labelEst)
            w = w / w.sum()   # renormalize so the weights form a distribution

    def predict(self, observations):
        # weighted vote of all stored stumps, passed through the sign function
        m = observations.shape[0]
        results = np.zeros((m, 1))
        for i in range(len(self.list_weakClassifier)):
            dim, val, ineq, alpha = self.list_weakClassifier[i]
            currentLabelEst = self.weakClassify(observations, dim, val, ineq)
            results = results + alpha * currentLabelEst
        return np.sign(results)
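
As a quick illustration of the decision stump on its own, the snippet below calls weakClassify directly (the threshold 2.7 and the 'gt' direction are chosen here just for demonstration; they happen to be the first stump the demo run below selects):

import numpy as np
import adaboost
stump_demo = adaboost.AdaBoostClassifier()
labels = stump_demo.weakClassify(np.arange(10).reshape(10, 1), 0, 2.7, 'gt')
print(labels.T)   # [[ 1.  1.  1. -1. -1. -1. -1. -1. -1. -1.]]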


To check that the code works, we use the training data from Example 8.1 on page 140 of Statistical Learning Methods (《统计学习方法》) and run the following in a Python shell:

import numpy as np
import adaboost
classifier = adaboost.AdaBoostClassifier()
X = np.arange(10).reshape(10,1)
y = np.array([1,1,1,-1,-1,-1,1,1,1,-1]).reshape(10,1)
classifier.fit(X, y)
Running the code above shows how the algorithm performs on the training set:

weighted vector:  [[ 0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1  0.1]]
current label estimation:  [[ 1.  1.  1. -1. -1. -1. -1. -1. -1. -1.]]
weighted label estimation:  [[ 0.42364893  0.42364893  0.42364893 -0.42364893 -0.42364893 -0.42364893
-0.42364893 -0.42364893 -0.42364893 -0.42364893]]
errorRate:  0.3
weighted vector:  [[ 0.07142857  0.07142857  0.07142857  0.07142857  0.07142857  0.07142857
0.16666667  0.16666667  0.16666667  0.07142857]]
current label estimation:  [[ 1.  1.  1.  1.  1.  1.  1.  1.  1. -1.]]
weighted label estimation:  [[ 1.07329042  1.07329042  1.07329042  0.22599256  0.22599256  0.22599256
0.22599256  0.22599256  0.22599256 -1.07329042]]
errorRate:  0.3
weighted vector:  [[ 0.04545455  0.04545455  0.04545455  0.16666667  0.16666667  0.16666667
0.10606061  0.10606061  0.10606061  0.04545455]]
current label estimation:  [[-1. -1. -1. -1. -1. -1.  1.  1.  1.  1.]]
weighted label estimation:  [[ 0.32125172  0.32125172  0.32125172 -0.52604614 -0.52604614 -0.52604614
0.97803126  0.97803126  0.97803126 -0.32125172]]
errorRate:  0.0
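
As a sanity check on the first round: the best stump misclassifies samples 6, 7 and 8, so its weighted error is e_1 = 0.3 and its weight is alpha_1 = 0.5 * ln(0.7 / 0.3) ≈ 0.4236, which is exactly the ±0.42364893 appearing in the first weighted label estimation.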

Note that although the default number of weak classifiers is 20, the loop stops after only 3 rounds with a training error rate of 0, and the result essentially matches the hand-worked computation of Example 8.1 on page 140 of Statistical Learning Methods.
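
Once fitted, the classifier can also label new observations via predict, which sums the alpha-weighted votes of the stored stumps and takes the sign. A minimal sketch, continuing the session above (the three query points are arbitrary):

results = classifier.predict(np.array([[1.5], [4.0], [7.5]]))
print(results.T)   # [[ 1. -1.  1.]] -- agrees with the classes of the nearby training points
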
Tags: python, machine learning