
weka[7] - Adaboost

2014-06-25 19:51
Having finished analyzing bagging, we naturally have to talk about boosting. The most famous boosting method is AdaBoost.

I remember reading a vivid analogy on someone's blog that explains how AdaBoost works.

AdaBoost training is like a child learning to read an article. On each pass through the same article, the characters that were misread last time get extra practice (their weight is increased), so in the next round they are naturally easier to get right. Finally, combining the memories of all the passes, the child can read the whole article well.

The base learner AdaBoost typically uses is called a decision stump: a one-level decision tree, which Weka also implements. The algorithm itself is very simple, but the implementation is fairly long because it has to handle many special cases, so I won't walk through it here (a rough conceptual sketch follows below).
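Just for intuition, here is a minimal conceptual sketch of a stump over a numeric attribute. This is emphatically not Weka's weka.classifiers.trees.DecisionStump, which also handles nominal attributes, missing values, and class distributions; the class and field names here are made up for illustration.

// Conceptual sketch only, NOT Weka's DecisionStump.
// A stump tests one attribute against one threshold and
// predicts a single class on each side of the split.
class NumericStump {
    int attrIndex;        // the single attribute to test
    double splitPoint;    // threshold chosen to minimize weighted training error
    double leftClass;     // class predicted when value <= splitPoint
    double rightClass;    // class predicted otherwise

    double classify(double[] x) {
        return (x[attrIndex] <= splitPoint) ? leftClass : rightClass;
    }
}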

AdaBoost's main strength is its generalization ability: the training error of the boosted ensemble has a theoretical upper bound that depends on the number of base learners and on how accurate each one is.
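Concretely, in Freund and Schapire's analysis, if base learner $t$ has weighted error $\epsilon_t = \tfrac{1}{2} - \gamma_t$, the training error of the final ensemble satisfies

$$\mathrm{err}_{\mathrm{train}} \;\le\; \prod_{t=1}^{T} 2\sqrt{\epsilon_t(1-\epsilon_t)} \;=\; \prod_{t=1}^{T}\sqrt{1-4\gamma_t^2} \;\le\; \exp\!\Big(-2\sum_{t=1}^{T}\gamma_t^2\Big),$$

so as long as every base learner is even slightly better than random ($\gamma_t > 0$), the bound drops exponentially in the number of rounds $T$.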

AdaBoost's weakness is equally clear: it is very sensitive to outliers. This is easy to understand, because the weight of a repeatedly misclassified instance keeps growing round after round (a property that can also be turned around and used for outlier detection).
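This can be read straight off the code below: each round multiplies the weight of every misclassified instance by reweight = (1 − ε)/ε. An outlier that is misclassified in all $T$ rounds therefore has its weight (before renormalization) multiplied by

$$\prod_{t=1}^{T} \frac{1-\epsilon_t}{\epsilon_t},$$

which blows up quickly whenever every $\epsilon_t < \tfrac{1}{2}$.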

Basic AdaBoost is a binary classifier. There are also modified versions of AdaBoost (such as M1 and MH) that handle multi-class problems.

Now to the main topic: how Weka implements AdaBoost.

constructor:

public AdaBoostM1() {
    m_Classifier = new weka.classifiers.trees.DecisionStump();
}
Here we can see that the default base learner of Weka's AdaBoost is a decision stump.
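Since AdaBoostM1 sits in Weka's iterated single-classifier enhancer hierarchy, the default stump can be swapped at configuration time. A minimal usage sketch, where the file path and the choice of J48 are hypothetical, and setClassifier / setNumIterations are the standard setters inherited from that hierarchy:

import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BoostDemo {
    public static void main(String[] args) throws Exception {
        // Load a dataset (path is hypothetical) and set the class attribute.
        Instances data = DataSource.read("iris.arff");
        data.setClassIndex(data.numAttributes() - 1);

        AdaBoostM1 booster = new AdaBoostM1();
        booster.setClassifier(new J48());   // swap out the default DecisionStump
        booster.setNumIterations(50);       // number of boosting rounds
        booster.buildClassifier(data);
    }
}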

buildClassifier:

public void buildClassifier(Instances data) throws Exception {

    super.buildClassifier(data);

    // can classifier handle the data?
    getCapabilities().testWithFail(data);

    // remove instances with missing class
    data = new Instances(data);
    data.deleteWithMissingClass();

    // only class? -> build ZeroR model
    if (data.numAttributes() == 1) {
        System.err.println(
            "Cannot build model (only class attribute present in data!), "
            + "using ZeroR model instead!");
        m_ZeroR = new weka.classifiers.rules.ZeroR();
        m_ZeroR.buildClassifier(data);
        return;
    } else {
        m_ZeroR = null;
    }

    m_NumClasses = data.numClasses();
    if ((!m_UseResampling) &&
        (m_Classifier instanceof WeightedInstancesHandler)) {
        buildClassifierWithWeights(data);
    } else {
        buildClassifierUsingResampling(data);
    }
}
The real work happens in two methods: buildClassifierWithWeights is used when resampling is off and the base learner implements WeightedInstancesHandler (i.e., it can consume instance weights directly); otherwise buildClassifierUsingResampling is used.

buildClassifierUsingResampling:

protected void buildClassifierUsingResampling(Instances data)
    throws Exception {

    Instances trainData, sample, training;
    double epsilon, reweight, sumProbs;
    Evaluation evaluation;
    int numInstances = data.numInstances();
    Random randomInstance = new Random(m_Seed);
    int resamplingIterations = 0;

    // Initialize data
    m_Betas = new double[m_Classifiers.length];
    m_NumIterationsPerformed = 0;
    // Create a copy of the data so that when the weights are diddled
    // with it doesn't mess up the weights for anyone else
    training = new Instances(data, 0, numInstances);
    sumProbs = training.sumOfWeights();
    // normalize the weights
    for (int i = 0; i < training.numInstances(); i++) {
        training.instance(i).setWeight(training.instance(i).weight() / sumProbs);
    }

    // Do bootstrap iterations
    for (m_NumIterationsPerformed = 0; m_NumIterationsPerformed < m_Classifiers.length;
         m_NumIterationsPerformed++) {
        if (m_Debug) {
            System.err.println("Training classifier " + (m_NumIterationsPerformed + 1));
        }

        // Select instances to train the classifier on
        if (m_WeightThreshold < 100) {
            trainData = selectWeightQuantile(training,
                (double) m_WeightThreshold / 100);
        } else {
            trainData = new Instances(training);
        }

        // Resample
        resamplingIterations = 0;
        double[] weights = new double[trainData.numInstances()];
        for (int i = 0; i < weights.length; i++) {
            weights[i] = trainData.instance(i).weight();
        }
        do {
            sample = trainData.resampleWithWeights(randomInstance, weights);

            // Build and evaluate classifier
            m_Classifiers[m_NumIterationsPerformed].buildClassifier(sample);
            evaluation = new Evaluation(data);
            evaluation.evaluateModel(m_Classifiers[m_NumIterationsPerformed],
                training);
            epsilon = evaluation.errorRate();
            resamplingIterations++;
        } while (Utils.eq(epsilon, 0) &&
                 (resamplingIterations < MAX_NUM_RESAMPLING_ITERATIONS));

        // Stop if error too big or 0
        if (Utils.grOrEq(epsilon, 0.5) || Utils.eq(epsilon, 0)) {
            if (m_NumIterationsPerformed == 0) {
                m_NumIterationsPerformed = 1; // If we're the first we have to use it
            }
            break;
        }

        // Determine the weight to assign to this model
        m_Betas[m_NumIterationsPerformed] = Math.log((1 - epsilon) / epsilon);
        reweight = (1 - epsilon) / epsilon;
        if (m_Debug) {
            System.err.println("\terror rate = " + epsilon
                + "  beta = " + m_Betas[m_NumIterationsPerformed]);
        }

        // Update instance weights
        setWeights(training, reweight);
    }
}
This is the resampling-based version. First the original instance weights are normalized so they sum to 1. Then the loop builds up to k base learners (decision stumps by default).

When building each stump, the training sample is first drawn by weighted sampling with replacement (resampleWithWeights). Because the stump's error rate is needed to update the weights for the next round, the code keeps resampling until the error is greater than 0 (up to MAX_NUM_RESAMPLING_ITERATIONS); boosting also stops entirely once the error reaches 0.5 or stays at 0.

Once the error ε is obtained, the model weight beta = log((1 − ε)/ε) is recorded, the instance weights are updated via setWeights, and the next iteration begins, until all k iterations finish; the k classifiers end up stored in m_Classifiers. A sketch of what setWeights does follows below.
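setWeights itself is not shown in the excerpt. Here is a minimal sketch of what it has to do, based on the standard AdaBoost.M1 update: multiply the weight of each misclassified instance by reweight = (1 − ε)/ε, then renormalize so the total weight is unchanged. Weka's actual implementation may differ in details.

// Sketch of the AdaBoost.M1 weight update (not Weka's exact code).
protected void setWeights(Instances training, double reweight) throws Exception {
    double oldSumOfWeights = training.sumOfWeights();

    // Boost the weight of every instance the current model gets wrong.
    for (int i = 0; i < training.numInstances(); i++) {
        Instance instance = training.instance(i);
        double predicted =
            m_Classifiers[m_NumIterationsPerformed].classifyInstance(instance);
        if (!Utils.eq(predicted, instance.classValue())) {
            instance.setWeight(instance.weight() * reweight);
        }
    }

    // Renormalize so the total weight stays the same as before.
    double newSumOfWeights = training.sumOfWeights();
    for (int i = 0; i < training.numInstances(); i++) {
        Instance instance = training.instance(i);
        instance.setWeight(instance.weight() * oldSumOfWeights / newSumOfWeights);
    }
}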

buildClassifierWithWeights:

protected void buildClassifierWithWeights(Instances data)
    throws Exception {

    Instances trainData, training;
    double epsilon, reweight;
    Evaluation evaluation;
    int numInstances = data.numInstances();
    Random randomInstance = new Random(m_Seed);

    // Initialize data
    m_Betas = new double[m_Classifiers.length];
    m_NumIterationsPerformed = 0;

    // Create a copy of the data so that when the weights are diddled
    // with it doesn't mess up the weights for anyone else
    training = new Instances(data, 0, numInstances);

    // Do bootstrap iterations
    for (m_NumIterationsPerformed = 0; m_NumIterationsPerformed < m_Classifiers.length;
         m_NumIterationsPerformed++) {
        if (m_Debug) {
            System.err.println("Training classifier " + (m_NumIterationsPerformed + 1));
        }
        // Select instances to train the classifier on
        if (m_WeightThreshold < 100) {
            trainData = selectWeightQuantile(training,
                (double) m_WeightThreshold / 100);
        } else {
            trainData = new Instances(training, 0, numInstances);
        }

        // Build the classifier
        if (m_Classifiers[m_NumIterationsPerformed] instanceof Randomizable)
            ((Randomizable) m_Classifiers[m_NumIterationsPerformed])
                .setSeed(randomInstance.nextInt());
        m_Classifiers[m_NumIterationsPerformed].buildClassifier(trainData);

        // Evaluate the classifier
        evaluation = new Evaluation(data);
        evaluation.evaluateModel(m_Classifiers[m_NumIterationsPerformed], training);
        epsilon = evaluation.errorRate();

        // Stop if error too small or error too big and ignore this model
        if (Utils.grOrEq(epsilon, 0.5) || Utils.eq(epsilon, 0)) {
            if (m_NumIterationsPerformed == 0) {
                m_NumIterationsPerformed = 1; // If we're the first we have to use it
            }
            break;
        }

        // Determine the weight to assign to this model
        m_Betas[m_NumIterationsPerformed] = Math.log((1 - epsilon) / epsilon);
        reweight = (1 - epsilon) / epsilon;
        if (m_Debug) {
            System.err.println("\terror rate = " + epsilon
                + "  beta = " + m_Betas[m_NumIterationsPerformed]);
        }

        // Update instance weights
        setWeights(training, reweight);
    }
}
This is essentially the same as the method above! The only real difference is where the training sample comes from: the former trains each base learner on a weighted resample, while the latter passes the weighted original dataset directly to a base learner that can handle instance weights itself. One piece the excerpt leaves out is how the trained ensemble makes predictions; a sketch follows.
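A minimal sketch of the standard AdaBoost.M1 voting rule at prediction time, using the fields from the code above. Weka's actual distributionForInstance additionally handles the ZeroR fallback and may normalize the votes differently.

// Sketch of weighted majority voting (not Weka's exact code).
public double[] distributionForInstance(Instance instance) throws Exception {
    double[] sums = new double[instance.numClasses()];

    // Each base learner votes for its predicted class,
    // weighted by its beta = log((1 - epsilon) / epsilon).
    for (int i = 0; i < m_NumIterationsPerformed; i++) {
        int predicted = (int) m_Classifiers[i].classifyInstance(instance);
        sums[predicted] += m_Betas[i];
    }

    // Normalize the weighted votes into a class distribution.
    Utils.normalize(sums);
    return sums;
}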

Finally, a word on why Weka calls this class AdaBoostM1.

M1 refers to AdaBoost.M1, the variant whose base learner may itself be multi-class; the decision stump here supports multiple classes. This is also why the loop aborts as soon as the weighted error reaches 0.5: AdaBoost.M1 requires every base learner to do better than that.

AdaBoost.MH instead converts the original classes into the {+1, -1}^n form, which amounts to boosting n binary classifiers, as the small illustration below shows.
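For illustration only (the helper name mhEncode is mine, and this shows just the label encoding, not MH's full weight distribution over instance-label pairs):

// The {+1,-1}^n encoding for n classes: an instance whose (0-based)
// class is 1 out of n = 3 classes becomes three binary labels,
// one per "is it class k?" subproblem: {-1, +1, -1}.
static int[] mhEncode(int classValue, int numClasses) {
    int[] codes = new int[numClasses];
    for (int k = 0; k < numClasses; k++) {
        codes[k] = (k == classValue) ? +1 : -1;
    }
    return codes;   // mhEncode(1, 3) -> {-1, +1, -1}
}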