
Python Machine Learning: XGBoost from the Basics to Practice (with Code)

2017-11-19 21:00
# -*- coding: utf-8 -*-
__author__ = 'gerry'

'''
XGBoost example: is a mushroom poisonous?
Task: decide whether a mushroom is poisonous based on its 22 features
Dataset:
Total samples: 8124
- edible: 4208 (51.8%)
- poisonous: 3916 (48.2%)

- training samples: 6513
- test samples: 1611
'''
# Import the required packages
import xgboost as xgb
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Read the data from file and prepare it for XGBoost training

my_workpath = './data/'
dtrain = xgb.DMatrix(my_workpath+'agaricus.txt.train')
dtest = xgb.DMatrix(my_workpath+'agaricus.txt.test')

'''
The data files are libsvm-format text (sparse features):
- one sample per line, e.g.: 1 3:1 9:1 19:1 21:1 30:1
  * the leading "1" is the sample's label; 3 and 9 are feature indices, and the 1s after the colons are the feature values
  * in binary classification, 1 marks a positive sample and 0 a negative one; a label in [0,1] is also supported and is read as the probability of the positive class
XGBoost stores the loaded data in a DMatrix object, which is optimized for storage efficiency and training speed.
Three data interfaces are supported:
  * libsvm-format text files
  * regular matrices (2D numpy arrays)
  * xgboost binary buffer files
'''
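# A minimal sketch of the other two DMatrix interfaces; the toy array and the
# buffer file name below are illustrative, not part of the original example:
import numpy as np

data = np.random.rand(100, 10)            # 100 samples, 10 features (numpy 2D array)
label = np.random.randint(2, size=100)    # binary labels
dmat = xgb.DMatrix(data, label=label)     # build a DMatrix from the array

dmat.save_binary('train.buffer')          # write XGBoost's binary buffer format
dmat_again = xgb.DMatrix('train.buffer')  # reloading a buffer is faster than re-parsing text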

# Set the training parameters
# specify parameters via map
param = {
    'max_depth': 3,
    'eta': 1,
    'silent': 0,
    'objective': 'binary:logistic'
}

'''
max_depth: maximum depth of a tree; default 6, valid range [1, inf)
eta: shrinkage step size used in each update to prevent overfitting; eta shrinks the
     feature weights to make the boosting process more conservative; default 0.3, range [0,1]
silent: 0 prints runtime messages, 1 runs silently; default 0
objective: the learning task and corresponding objective; 'binary:logistic' is logistic
     regression for binary classification, with probabilities as output
'''

# Train the model
# set the number of boosting rounds
num_round = 2
bst = xgb.train(param,dtrain,num_round)
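# Optionally, xgb.train accepts an evals list (a "watchlist") and then prints the
# evaluation metric on the listed sets after every boosting round; a minimal sketch
# (the 'train'/'eval' names are arbitrary display tags):
watchlist = [(dtrain, 'train'), (dtest, 'eval')]
bst = xgb.train(param, dtrain, num_round, evals=watchlist)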

'''
Integration with scikit-learn
- XGBoost provides a wrapper class that lets the model be treated like any other
  classifier or regressor in the scikit-learn framework
- the XGBoost classifier is XGBClassifier; hyperparameters are passed to its constructor
'''

#bst = xgb.XGBClassifier(max_depth=2,learning_rate=1,n_estimators=num_round,silent=True,objective='binary:logistic')

# Prediction (evaluation on the training data)
# Once trained, the model can be used to make predictions
# XGBoost outputs probabilities: each value is the probability that the sample is
# in the positive class, so we round it to 0 or 1

train_preds = bst.predict(dtrain)
train_predictions = [round(value) for value in train_preds]
y_train = dtrain.get_label()
train_accuracy = accuracy_score(y_train,train_predictions)

print("Train Accuracy:%.2f%%"%(train_accuracy*100.0))

# Prediction (on the test set)

preds = bst.predict(dtest)
predictions = [round(value) for value in preds]
y_test = dtest.get_label()
test_accuracy = accuracy_score(y_test,predictions)

print("Test Accuracy:%.2f%%"%(test_accuracy*100.0))

# Model visualization
'''
Visualize a single tree in the model by calling XGBoost's plot_tree() / to_graphviz() APIs
'''
xgb.plot_tree(bst,num_trees=0,rankdir='LR')
xgb.plot_importance(bst)
plt.show()
'''
* the first argument is the trained model
* the second is the index of the tree to plot (0-based)
* the third is the layout direction of the plot
'''
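# To save the tree to a file instead of showing it on screen, to_graphviz returns
# a graphviz object that can be rendered to disk; a sketch assuming the graphviz
# Python package is installed (the output file name is illustrative):
graph = xgb.to_graphviz(bst, num_trees=0, rankdir='LR')
graph.render('mushroom_tree_0')  # writes mushroom_tree_0.pdf by default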


# -*- coding: utf-8 -*-
__author__ = 'gerry'

'''
XGBoost quick start: using it together with scikit-learn
'''
from xgboost import XGBClassifier

# Module for loading LibSVM-format data
from sklearn.datasets import load_svmlight_file
from sklearn.metrics import accuracy_score
from matplotlib import pyplot

my_workpath = './data/'
X_train,Y_train = load_svmlight_file(my_workpath+'agaricus.txt.train')
X_test,Y_test = load_svmlight_file(my_workpath+'agaricus.txt.test')

print(X_train.shape)
print(X_test.shape)

# Set the number of boosting rounds
num_round = 2

bst = XGBClassifier(max_depth=2,learning_rate=1,n_estimators=num_round,silent=True,objective='binary:logistic')

bst.fit(X_train,Y_train)

# XGBoost outputs probabilities; mushroom classification is binary, and each output
# value is the probability of the positive class, so we convert it to 0 or 1
train_preds = bst.predict(X_train)
train_predictions = [round(value) for value in train_preds]
train_accuracy = accuracy_score(Y_train,train_predictions)

print("Train Accuracy:%.2f%%"%(train_accuracy*100.0))

# Prediction (on the test set)

preds = bst.predict(X_test)
predictions = [round(value) for value in preds]
test_accuracy = accuracy_score(Y_test,predictions)

print("Test Accuracy:%.2f%%"%(test_accuracy*100.0))


# -*- coding: utf-8 -*-
__author__ = 'gerry'

'''
The two previous examples checked model performance on both the training and test sets.
In real scenarios the test data is unknown, so how do we evaluate the model?
- Answer: a validation set
Validation set: hold out part of the training data; it takes no part in training the
model parameters
'''

from xgboost import XGBClassifier
from sklearn.datasets import load_svmlight_file
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

my_workpath = './data/'
X_train, Y_train = load_svmlight_file(my_workpath + 'agaricus.txt.train')
X_test, Y_test = load_svmlight_file(my_workpath + 'agaricus.txt.test')

print(X_train.shape)
print(X_test.shape)

'''
Split the training data into a training part and a validation part
'''

# split the data: 1/3 of the training data is held out for validation
seed = 7
test_size = 0.33

X_train_part, X_validate, y_train_part, y_validate = train_test_split(
    X_train, Y_train, test_size=test_size, random_state=seed)
print(X_train_part.shape)

# Set the number of boosting rounds
num_round = 2

bst = XGBClassifier(max_depth=2, learning_rate=1, n_estimators=num_round, silent=True, objective='binary:logistic')

bst.fit(X_train_part, y_train_part)

# XGBoost outputs probabilities; mushroom classification is binary, and each output
# value is the probability of the positive class, so we convert it to 0 or 1
# Performance on the validation set
validate_preds = bst.predict(X_validate)
validate_predictions = [round(value) for value in validate_preds]
validate_accuracy = accuracy_score(y_validate, validate_predictions)

print("Validation Accuracy: %.2f%%" % (validate_accuracy * 100.0))

# Performance on the training set

train_preds = bst.predict(X_train_part)
train_predictions = [round(value) for value in train_preds]
train_accuracy = accuracy_score(y_train_part, train_predictions)

print("Train Accuracy:%.2f%%" % (train_accuracy * 100.0))

# Prediction (on the test set)

preds = bst.predict(X_test)
predictions = [round(value) for value in preds]
test_accuracy = accuracy_score(Y_test, predictions)

print("Test Accuracy:%.2f%%" % (test_accuracy * 100.0))


# -*- coding: utf-8 -*-
__author__ = 'gerry'

'''
Learning curves:
how model performance changes as some learning parameter varies
(e.g., the number of training samples or the number of boosting rounds);
here we vary the number of XGBoost boosting rounds (the number of trees)
'''

from xgboost import XGBClassifier
from sklearn.datasets import load_svmlight_file
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

my_workpath = './data/'
X_train, Y_train = load_svmlight_file(my_workpath + 'agaricus.txt.train')
X_test, Y_test = load_svmlight_file(my_workpath + 'agaricus.txt.test')

print(X_train.shape)
print(X_test.shape)

'''
Split the training data into a training part and a validation part
'''

# split the data: 1/3 of the training data is held out for validation
seed = 7
test_size = 0.33

X_train_part, X_validate, y_train_part, y_validate = train_test_split(
    X_train, Y_train, test_size=test_size, random_state=seed)
print(X_train_part.shape)

# Set the number of boosting rounds
num_round = 100

bst = XGBClassifier(max_depth=2, learning_rate=1, n_estimators=num_round, silent=True, objective='binary:logistic')

eval_set = [(X_train_part, y_train_part), (X_validate, y_validate)]
bst.fit(X_train_part, y_train_part, eval_metric=["error", "logloss"], eval_set=eval_set, verbose=True)

# Plot the learning curves
# retrieve performance metrics
results = bst.evals_result()
print(results)

epochs_logloss = len(results['validation_0']['logloss'])
epochs_error = len(results['validation_0']['error'])
print(epochs_logloss)
print(epochs_error)
x_axis_logloss = range(0,epochs_logloss)
x_axis_error = range(0,epochs_error)

#plot log loss
fig,ax = plt.subplots()
ax.plot(x_axis_logloss, results['validation_0']['logloss'], label='Train')
ax.plot(x_axis_logloss, results['validation_1']['logloss'], label='Validation')
ax.legend()
plt.ylabel('Log Loss')
plt.title('XGBoost Log Loss')
plt.show()

#plot classification error
fig,ax = plt.subplots()
ax.plot(x_axis_error, results['validation_0']['error'], label='Train')
ax.plot(x_axis_error, results['validation_1']['error'], label='Validation')
ax.legend()
plt.ylabel('Classification Error')
plt.title('XGBoost Classification Error')
plt.show()

# make prediction
preds = bst.predict(X_test)
predictions = [round(value) for value in preds]

test_accuracy = accuracy_score(Y_test, predictions)
print("Test Accuracy: %.2f%%" % (test_accuracy * 100.0))


# -*- coding: utf-8 -*-
__author__ = 'gerry'

'''
Early stopping: a way to keep complex models from overfitting during training
- monitor the model's performance on a validation set: if performance has not improved
  after a fixed number of rounds, stop training
- overfitting is happening when performance on the validation set is dropping while
  performance on the training set is still improving
'''

from xgboost import XGBClassifier
from sklearn.datasets import load_svmlight_file
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

my_workpath = './data/'
X_train, Y_train = load_svmlight_file(my_workpath + 'agaricus.txt.train')
X_test, Y_test = load_svmlight_file(my_workpath + 'agaricus.txt.test')

print(X_train.shape)
print(X_test.shape)

'''
Split the training data into a training part and a validation part
'''

# split the data: 1/3 of the training data is held out for validation
seed = 7
test_size = 0.33

X_train_part, X_validate, y_train_part, y_validate = train_test_split(
    X_train, Y_train, test_size=test_size, random_state=seed)
print(X_train_part.shape)

# Set the number of boosting rounds
num_round = 100
bst = XGBClassifier(max_depth=2, learning_rate=0.1, n_estimators=num_round, silent=True, objective='binary:logistic')
eval_set = [(X_validate, y_validate)]
bst.fit(X_train_part, y_train_part, early_stopping_rounds=10, eval_metric="error", eval_set=eval_set, verbose=True)

# retrieve performance metrics
results = bst.evals_result()
#print(results)

epochs = len(results['validation_0']['error'])
x_axis = range(0, epochs)

# plot classification error on the validation set
fig, ax = plt.subplots()
ax.plot(x_axis, results['validation_0']['error'], label='Validation')
ax.legend()
plt.ylabel('Error')
plt.xlabel('Round')
plt.title('XGBoost Early Stop')
plt.show()
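# When early stopping triggers, the sklearn wrapper records the best round during
# fit(); a minimal sketch of reading it back (best_iteration / best_score are the
# attribute names exposed by xgboost's sklearn API of this era; treat them as an
# assumption if your version differs):
print("Best iteration: %d" % bst.best_iteration)
print("Best validation error: %.4f" % bst.best_score)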


# -*- coding: utf-8 -*-
__author__ = 'gerry'

'''
Cross-validation:
train_test_split holds out part of the training data for validation; the held-out
part takes no part in training the model parameters
- pro: fast
- con: less data is left for training, and a single split introduces randomness
Remedy: cross-validation (CV), at the cost of longer training time
(train_test_split remains the practical choice when the training data is very large,
e.g., millions of records, or the model is slow to train)
K-fold cross-validation: split the training data into k equal folds (k is typically 3, 5, or 10)
- repeat k times:
  * each time hold out one fold for validation and train on the remaining k-1 folds
- the average performance over the k validation folds serves as an estimate of the
  model's test performance
  * this estimate has lower variance than the one from train_test_split
A hands-on sketch of the procedure follows below.
'''

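# A minimal sketch of K-fold CV done by hand -- the loop that cross_val_score
# automates below; the model hyperparameters mirror the earlier examples and the
# function name is illustrative:
def kfold_accuracy(X, y, n_splits=10, seed=7):
    from xgboost import XGBClassifier
    from sklearn.model_selection import KFold
    from sklearn.metrics import accuracy_score
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in kf.split(X):
        model = XGBClassifier(max_depth=2, learning_rate=1, n_estimators=2)
        model.fit(X[train_idx], y[train_idx])      # train on the k-1 folds
        preds = model.predict(X[val_idx])          # evaluate on the held-out fold
        scores.append(accuracy_score(y[val_idx], preds))
    return scores  # mean/std give the performance estimate and its variance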

from xgboost import XGBClassifier
from sklearn.datasets import load_svmlight_file
from sklearn.model_selection import cross_val_score  # evaluates a single model with the given parameters
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold

# Note: if the classes are imbalanced or there are many classes, use StratifiedKFold,
# which keeps each class's proportion the same in every fold

from sklearn.metrics import accuracy_score
import  matplotlib.pyplot as plt

my_workpath = './data/'
X_train, Y_train = load_svmlight_file(my_workpath + 'agaricus.txt.train')
X_test, Y_test = load_svmlight_file(my_workpath + 'agaricus.txt.test')

# Build the model
# set the number of boosting rounds
num_round = 2
bst = XGBClassifier(max_depth=2, learning_rate=1, n_estimators=num_round, silent=True, objective='binary:logistic')

# Cross-validation (this can be slow)
# k-fold cross-validation evaluation of the xgboost model
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
#kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
fit_params = {'eval_metric': 'logloss'}
results = cross_val_score(bst, X_train, Y_train, cv=kfold, fit_params=fit_params)
print(results)
print("CV Accuracy:%2f%% (%.2f%%)" %(results.mean()*100,results.std()*100)


# -*- coding: utf-8 -*-
__author__ = 'gerry'

'''
Hyperparameter tuning with GridSearchCV: we can pick the best model based on
cross-validation results
- given a grid of candidate parameter values, evaluate a model for each parameter
  combination and report the best model and parameters
'''
# Based on the example program shipped with the xgboost package
from xgboost import XGBClassifier

# Module for loading LibSVM-format data
from sklearn.datasets import load_svmlight_file

from sklearn.model_selection import GridSearchCV  # the old sklearn.grid_search module is deprecated

from sklearn.metrics import accuracy_score

from matplotlib import pyplot

# read in data; the files ship in the demo directory of the xgboost installation
# and were copied to ./data under the code directory
my_workpath = './data/'
X_train,y_train = load_svmlight_file(my_workpath + 'agaricus.txt.train')
X_test,y_test = load_svmlight_file(my_workpath + 'agaricus.txt.test')

# Set the model training parameters
# specify parameters via map
params = {'max_depth': 2, 'eta': 0.1, 'silent': 0, 'objective': 'binary:logistic'}
print(params)

# Build the model
bst = XGBClassifier(max_depth=2, learning_rate=0.1, silent=True, objective='binary:logistic')

# Cross-validated grid search

# grid of boosting-round counts to try
param_test = {
    'n_estimators': range(1, 51, 1)
}
clf = GridSearchCV(estimator=bst, param_grid=param_test, scoring='accuracy', cv=5)
clf.fit(X_train, y_train)
# grid_scores_ belonged to the old sklearn.grid_search API; with
# sklearn.model_selection the full search history lives in cv_results_
print(clf.best_params_, clf.best_score_)

# Test

#make prediction
preds = clf.predict(X_test)
predictions = [round(value) for value in preds]

test_accuracy = accuracy_score(y_test, predictions)
print("Test Accuracy of gridsearchcv: %.2f%%" % (test_accuracy * 100.0))
Tags: python, machine learning