Python Machine Learning: XGBoost from Beginner to Practice (Code Implementation)
2017-11-19 21:00
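# -*- coding: utf-8 -*-
__author__ = 'gerry'

'''
XGBoost example: is a mushroom poisonous?
Task: decide from 22 features of a mushroom whether it is poisonous.
Data:
    Total samples: 8124
      - edible: 4208 (51.8%)
      - poisonous: 3916 (48.2%)
      - training samples: 6513
      - test samples: 1611
'''

# Import the required packages
import xgboost as xgb
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Read the data from file and prepare it for XGBoost training.
# Note: recent XGBoost releases may require the format in the path,
# e.g. 'agaricus.txt.train?format=libsvm'.
my_workpath = './data/'
dtrain = xgb.DMatrix(my_workpath + 'agaricus.txt.train')
dtest = xgb.DMatrix(my_workpath + 'agaricus.txt.test')

'''
The data is text in libsvm format (sparse features).
  - Each line is one sample, e.g.: 1 3:1 9:1 19:1 21:1 30:1
    * The leading "1" is the sample label. 3 and 9 are feature indices,
      and the 1 after each colon is the value of that feature.
    * In binary classification, 1 marks a positive sample and 0 a negative one.
      A label in [0, 1] is also supported and is read as the probability
      that the sample is positive.
XGBoost stores loaded data in a DMatrix object, which is optimized for
storage efficiency and running speed.
Three data interfaces are supported:
    * libsvm text files
    * regular matrices (numpy 2D array)
    * xgboost binary buffer files
'''

# Set the training parameters
# specify parameters via map
param = {
    'max_depth': 3,
    'eta': 1,
    'silent': 0,  # replaced by `verbosity` in newer XGBoost releases
    'objective': 'binary:logistic'
}
'''
max_depth: maximum depth of a tree. Default 6, range [1, ∞].
eta: shrinkage step size used in each update to prevent overfitting. eta shrinks
     the feature weights so that the boosting process is more conservative.
     Default 0.3, range [0, 1].
silent: 0 prints run-time messages, 1 runs silently. Default 0.
objective: the learning task and its learning objective. 'binary:logistic' is
     logistic regression for binary classification; the output is a probability.
'''

# Train the model
# Number of boosting iterations
num_round = 2
bst = xgb.train(param, dtrain, num_round)

'''
Working with scikit-learn
  - XGBoost provides a wrapper class so that the model can be used like any
    other classifier or regressor in the scikit-learn framework.
  - The classifier is XGBClassifier; its parameters are passed to the constructor.
'''
# bst = xgb.XGBClassifier(max_depth=2, learning_rate=1, n_estimators=num_round, silent=True, objective='binary:logistic')

# Prediction (evaluated on the training data)
# Once trained, the model can be used to predict on a DMatrix.
# With 'binary:logistic' the output is the probability that a sample belongs to
# the positive class, so each probability is converted to 0 or 1.
train_preds = bst.predict(dtrain)
train_predictions = [round(value) for value in train_preds]
y_train = dtrain.get_label()
train_accuracy = accuracy_score(y_train, train_predictions)
print("Train Accuracy:%.2f%%" % (train_accuracy * 100.0))

# Prediction (on the test set)
preds = bst.predict(dtest)
predictions = [round(value) for value in preds]
y_test = dtest.get_label()
test_accuracy = accuracy_score(y_test, predictions)
print("Test Accuracy:%.2f%%" % (test_accuracy * 100.0))

# Model visualization
'''
Visualize a single tree of the model with the XGBoost API
plot_tree() / to_graphviz().
plot_tree arguments:
    * first argument: the trained model
    * second argument: index of the tree to draw (starting from 0)
    * third argument: layout direction of the drawing ('LR' = left to right)
'''
xgb.plot_tree(bst, num_trees=0, rankdir='LR')
xgb.plot_importance(bst)
plt.show()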
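The libsvm line format described in the script above can be made concrete with a few lines of plain Python. This is only a sketch for illustration: `parse_libsvm_line` is a hypothetical helper, not part of XGBoost, and it ignores comments and query ids that real libsvm files may contain.

```python
def parse_libsvm_line(line):
    """Split one libsvm-format line into (label, {feature_index: value})."""
    tokens = line.split()
    label = float(tokens[0])          # the first token is the sample label
    features = {}
    for token in tokens[1:]:          # the remaining tokens are index:value pairs
        index, value = token.split(":")
        features[int(index)] = float(value)
    return label, features


label, features = parse_libsvm_line("1 3:1 9:1 19:1 21:1 30:1")
print(label)             # 1.0
print(sorted(features))  # [3, 9, 19, 21, 30]
```

Only the non-zero features are listed on a line, which is why the format suits sparse data such as the one-hot encoded mushroom attributes.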
# -*- coding: utf-8 -*-
__author__ = 'gerry'

'''
XGBoost quick start: using it together with scikit-learn
'''
from xgboost import XGBClassifier
# Loader for LibSVM-format data
from sklearn.datasets import load_svmlight_file
from sklearn.metrics import accuracy_score

my_workpath = './data/'
X_train, Y_train = load_svmlight_file(my_workpath + 'agaricus.txt.train')
X_test, Y_test = load_svmlight_file(my_workpath + 'agaricus.txt.test')
print(X_train.shape)
print(X_test.shape)

# Number of boosting iterations
num_round = 2
# `silent` is replaced by `verbosity` in newer XGBoost releases
bst = XGBClassifier(max_depth=2, learning_rate=1, n_estimators=num_round, silent=True, objective='binary:logistic')
bst.fit(X_train, Y_train)

# Unlike the native API, XGBClassifier.predict() already returns class labels
# (use predict_proba() for probabilities), so round() below leaves them unchanged.
train_preds = bst.predict(X_train)
train_predictions = [round(value) for value in train_preds]
train_accuracy = accuracy_score(Y_train, train_predictions)
print("Train Accuracy:%.2f%%" % (train_accuracy * 100.0))

# Prediction (on the test set)
preds = bst.predict(X_test)
predictions = [round(value) for value in preds]
test_accuracy = accuracy_score(Y_test, predictions)
print("Test Accuracy:%.2f%%" % (test_accuracy * 100.0))
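The scripts turn probabilities into labels with `round()`. In Python 3, `round()` rounds halves to the nearest even integer, so `round(0.5)` is 0. The sketch below spells the rule out with an explicit threshold instead. Both helper names are hypothetical and the 0.5 cut-off is an assumption, not something the original code fixes.

```python
def probabilities_to_labels(probabilities, threshold=0.5):
    """Map each positive-class probability to 1 (>= threshold) or 0."""
    return [1 if p >= threshold else 0 for p in probabilities]


def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


labels = probabilities_to_labels([0.1, 0.5, 0.9])
print(labels)                           # [0, 1, 1]
print(accuracy([0, 1, 0], labels))      # 2 of 3 labels match
```

An explicit threshold also makes it easy to trade precision against recall later, which `round()` does not allow.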
# -*- coding: utf-8 -*-
__author__ = 'gerry'

'''
The two previous examples checked model performance on both the training set
and the test set. In a real scenario the test data is unknown, so how can the
model be evaluated?
  - Answer: a validation set.
Validation set: a part of the training data that is held out and does not
take part in fitting the model parameters.
'''
from xgboost import XGBClassifier
from sklearn.datasets import load_svmlight_file
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

my_workpath = './data/'
X_train, Y_train = load_svmlight_file(my_workpath + 'agaricus.txt.train')
X_test, Y_test = load_svmlight_file(my_workpath + 'agaricus.txt.test')
print(X_train.shape)
print(X_test.shape)

'''
Split the training data into a training part and a validation part
'''
# Hold out about 1/3 of the training data as validation data
seed = 7
test_size = 0.33
X_train_part, X_validate, y_train_part, y_validate = train_test_split(X_train, Y_train, test_size=test_size, random_state=seed)
print(X_train_part.shape)

# Number of boosting iterations
num_round = 2
# `silent` is replaced by `verbosity` in newer XGBoost releases
bst = XGBClassifier(max_depth=2, learning_rate=1, n_estimators=num_round, silent=True, objective='binary:logistic')
bst.fit(X_train_part, y_train_part)

# XGBClassifier.predict() already returns class labels (use predict_proba()
# for probabilities), so round() below leaves them unchanged.
# Performance on the validation set
validate_preds = bst.predict(X_validate)
validate_predictions = [round(value) for value in validate_preds]
validate_accuracy = accuracy_score(y_validate, validate_predictions)
print("Validation Accuracy:%.2f%%" % (validate_accuracy * 100.0))

# Performance on the training set
train_preds = bst.predict(X_train_part)
train_predictions = [round(value) for value in train_preds]
train_accuracy = accuracy_score(y_train_part, train_predictions)
print("Train Accuracy:%.2f%%" % (train_accuracy * 100.0))

# Prediction (on the test set)
preds = bst.predict(X_test)
predictions = [round(value) for value in preds]
test_accuracy = accuracy_score(Y_test, predictions)
print("Test Accuracy:%.2f%%" % (test_accuracy * 100.0))
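What a hold-out split does can be sketched with the standard library alone. `holdout_split` is a hypothetical helper written for illustration. It shuffles sample indices with a fixed seed and sets aside a fraction of them, and it makes no attempt to reproduce the exact shuffle that `train_test_split` would produce.

```python
import math
import random


def holdout_split(n_samples, validation_fraction=0.33, seed=7):
    """Return (train_indices, validation_indices) after a seeded shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # reproducible shuffle
    n_validation = math.ceil(n_samples * validation_fraction)
    return indices[n_validation:], indices[:n_validation]


# 6513 is the number of training samples in the mushroom data set.
train_idx, validation_idx = holdout_split(6513)
print(len(train_idx), len(validation_idx))  # 4363 2150
```

Every sample lands in exactly one of the two parts, so the validation part never influences the fitted trees.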
# -*- coding: utf-8 -*-
__author__ = 'gerry'

'''
Learning curves
How the predictive performance of the model changes as one learning parameter
varies (for example the number of training samples or the number of iterations).
Here: the number of XGBoost iterations (number of trees).
'''
from xgboost import XGBClassifier
from sklearn.datasets import load_svmlight_file
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

my_workpath = './data/'
X_train, Y_train = load_svmlight_file(my_workpath + 'agaricus.txt.train')
X_test, Y_test = load_svmlight_file(my_workpath + 'agaricus.txt.test')
print(X_train.shape)
print(X_test.shape)

'''
Split the training data into a training part and a validation part
'''
# Hold out about 1/3 of the training data as validation data
seed = 7
test_size = 0.33
X_train_part, X_validate, y_train_part, y_validate = train_test_split(X_train, Y_train, test_size=test_size, random_state=seed)
print(X_train_part.shape)

# Number of boosting iterations
num_round = 100
# `silent` is replaced by `verbosity` in newer XGBoost releases
bst = XGBClassifier(max_depth=2, learning_rate=1, n_estimators=num_round, silent=True, objective='binary:logistic')
eval_set = [(X_train_part, y_train_part), (X_validate, y_validate)]
# Newer XGBoost releases take eval_metric in the constructor rather than in fit()
bst.fit(X_train_part, y_train_part, eval_metric=["error", "logloss"], eval_set=eval_set, verbose=True)

# Show the learning curves
# retrieve performance metrics
# validation_0 is the first entry of eval_set (training part),
# validation_1 is the second entry (validation part)
results = bst.evals_result()
print(results)
epochs_logloss = len(results['validation_0']['logloss'])
epochs_error = len(results['validation_0']['error'])
print(epochs_logloss)
print(epochs_error)
x_axis_logloss = range(0, epochs_logloss)
x_axis_error = range(0, epochs_error)

# plot log loss
fig, ax = plt.subplots()
ax.plot(x_axis_logloss, results['validation_0']['logloss'], label='Train')
ax.plot(x_axis_logloss, results['validation_1']['logloss'], label='Validation')
ax.legend()
plt.ylabel('Log Loss')
plt.title('XGBoost Log Loss')
plt.show()

# plot classification error
fig, ax = plt.subplots()
ax.plot(x_axis_error, results['validation_0']['error'], label='Train')
ax.plot(x_axis_error, results['validation_1']['error'], label='Validation')
ax.legend()
plt.ylabel('Classification Error')
plt.title('XGBoost Classification Error')
plt.show()

# make prediction
preds = bst.predict(X_test)
predictions = [round(value) for value in preds]
test_accuracy = accuracy_score(Y_test, predictions)
print("Test Accuracy: %.2f%%" % (test_accuracy * 100.0))
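The two metrics tracked in the learning curves, "logloss" and "error", can be written out by hand to show what the curves measure. The sketch below is a plain-Python illustration and not XGBoost's implementation. The function names and the clipping constant `eps` are assumptions. The error metric counts a prediction as positive when its probability is larger than 0.5.

```python
import math


def log_loss(y_true, probabilities, eps=1e-15):
    """Mean negative log-likelihood of binary labels under the predictions."""
    total = 0.0
    for y, p in zip(y_true, probabilities):
        p = min(max(p, eps), 1.0 - eps)            # avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


def error_rate(y_true, probabilities, threshold=0.5):
    """Fraction of samples whose thresholded prediction is wrong."""
    wrong = sum(1 for y, p in zip(y_true, probabilities)
                if (p > threshold) != (y == 1))
    return wrong / len(y_true)


print(log_loss([1, 0], [0.5, 0.5]))                      # equals ln(2)
print(error_rate([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6]))    # 0.5
```

Log loss keeps falling as predictions grow more confident, even after the error rate has reached zero, which is one reason the two curves can look different.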
# -*- coding: utf-8 -*-
__author__ = 'gerry'

'''
Early stopping: a way to keep a complex model from overfitting during training
  - Monitor the performance of the model on the validation set: if it has not
    improved after a fixed number of iterations, stop training.
  - When performance on the validation set declines while performance on the
    training set keeps improving, the model is overfitting.
'''
from xgboost import XGBClassifier
from sklearn.datasets import load_svmlight_file
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

my_workpath = './data/'
X_train, Y_train = load_svmlight_file(my_workpath + 'agaricus.txt.train')
X_test, Y_test = load_svmlight_file(my_workpath + 'agaricus.txt.test')
print(X_train.shape)
print(X_test.shape)

'''
Split the training data into a training part and a validation part
'''
# Hold out about 1/3 of the training data as validation data
seed = 7
test_size = 0.33
X_train_part, X_validate, y_train_part, y_validate = train_test_split(X_train, Y_train, test_size=test_size, random_state=seed)
print(X_train_part.shape)

# Number of boosting iterations
num_round = 100
# `silent` is replaced by `verbosity` in newer XGBoost releases
bst = XGBClassifier(max_depth=2, learning_rate=0.1, n_estimators=num_round, silent=True, objective='binary:logistic')
eval_set = [(X_validate, y_validate)]
# Newer XGBoost releases take early_stopping_rounds and eval_metric
# in the constructor rather than in fit()
bst.fit(X_train_part, y_train_part, early_stopping_rounds=10, eval_metric="error", eval_set=eval_set, verbose=True)

# retrieve performance metrics
results = bst.evals_result()
# print(results)
epochs = len(results['validation_0']['error'])
x_axis = range(0, epochs)

# plot classification error on the validation set
fig, ax = plt.subplots()
ax.plot(x_axis, results['validation_0']['error'], label='Validation')
ax.legend()
plt.ylabel('Error')
plt.xlabel('Round')
plt.title('XGBoost Early Stop')
plt.show()
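The stopping rule itself fits in a few lines. The sketch below is a simplified illustration of the idea, not XGBoost's internal code. `early_stopping_round` is a hypothetical helper, `patience` plays the role of `early_stopping_rounds`, and the error values are made up for the example.

```python
def early_stopping_round(validation_errors, patience=10):
    """Return (best_round, last_round_trained) for a sequence of errors.

    Training stops once `patience` rounds have passed without a new best.
    """
    best_round = 0
    best_error = float("inf")
    for round_index, error in enumerate(validation_errors):
        if error < best_error:                      # new best: remember it
            best_error = error
            best_round = round_index
        elif round_index - best_round >= patience:  # waited long enough
            return best_round, round_index
    return best_round, len(validation_errors) - 1   # never triggered


# Made-up validation errors: the best value first appears at round 2.
errors = [0.5, 0.4, 0.3, 0.3, 0.31, 0.3, 0.32]
print(early_stopping_round(errors, patience=3))  # (2, 5)
```

Note that a tie with the current best does not reset the counter in this sketch. Only a strictly lower error does.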
# -*- coding: utf-8 -*-
__author__ = 'gerry'

'''
Cross-validation
train_test_split holds out a part of the training data for validation; that
part does not take part in fitting the model parameters.
  - Pros: fast, so it suits large training sets (e.g. millions of records)
    and models that are slow to train.
  - Cons: less data is left for training, and a single split of the
    validation set introduces randomness.
Answer: cross-validation (CV), at the cost of a longer training time.

K-fold cross-validation: split the training data into k equal parts
(k is usually 3, 5 or 10).
  - Repeat k times:
    * each time hold out one part for validation and train on the other k-1 parts.
  - The average performance over the k validation parts is taken as the
    estimate of the model's performance on the test set.
    * This estimate has lower variance than the one from train_test_split.
    * The k results give both the mean of the estimate and its variance.
'''
from xgboost import XGBClassifier
from sklearn.datasets import load_svmlight_file
from sklearn.model_selection import cross_val_score  # evaluates a single model with given parameters
from sklearn.model_selection import KFold
# Note: if the classes are imbalanced or there are many classes, use
# StratifiedKFold, which splits the samples of each class evenly across folds.
from sklearn.model_selection import StratifiedKFold

my_workpath = './data/'
X_train, Y_train = load_svmlight_file(my_workpath + 'agaricus.txt.train')
X_test, Y_test = load_svmlight_file(my_workpath + 'agaricus.txt.test')

# Build the model
# Number of boosting iterations
num_round = 2
# num_round = range(1, 101)
# param_grid = dict(n_estimators=num_round)
# `silent` is replaced by `verbosity` in newer XGBoost releases
bst = XGBClassifier(max_depth=2, learning_rate=1, n_estimators=num_round, silent=True, objective='binary:logistic')

# Cross-validation (this can be slow)
# k-fold cross-validation evaluation of the xgboost model;
# random_state only takes effect together with shuffle=True
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
# kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
results = cross_val_score(bst, X_train, Y_train, cv=kfold)
print(results)
print("CV Accuracy:%.2f%% (%.2f%%)" % (results.mean() * 100, results.std() * 100))
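How the k folds are formed can be shown without any library. The sketch below splits sample indices into k contiguous folds without shuffling. `k_fold_indices` is a hypothetical helper, and the three fold scores at the end are made-up numbers used only to show how the mean and spread are reported.

```python
import statistics


def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # The first n_samples % k folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size


folds = list(k_fold_indices(10, 3))
print([len(validation) for _, validation in folds])  # [4, 3, 3]

# Made-up per-fold accuracies, summarized as mean and standard deviation.
scores = [0.90, 1.00, 0.95]
print(statistics.mean(scores), statistics.pstdev(scores))
```

Each sample is used for validation exactly once and for training k-1 times, which is what makes the averaged score less dependent on one particular split.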
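# -*- coding: utf-8 -*-
__author__ = 'gerry'

'''
Parameter tuning with GridSearchCV: the best parameters can be chosen from
the cross-validation results.
  - Given the ranges of the parameters to tune (the grid), it evaluates the
    model for each parameter combination and reports the best model and
    its parameters.
'''
# Runs the example that ships with the xgboost package
from xgboost import XGBClassifier
# Loader for LibSVM-format data
from sklearn.datasets import load_svmlight_file
# sklearn.grid_search was removed from scikit-learn; GridSearchCV lives in model_selection
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

# read in data: it ships in the demo directory of the xgboost installation
# and has been copied to the data directory next to this script
my_workpath = './data/'
X_train, y_train = load_svmlight_file(my_workpath + 'agaricus.txt.train')
X_test, y_test = load_svmlight_file(my_workpath + 'agaricus.txt.test')

# Set the training parameters
# specify parameters via map
params = {
    'max_depth': 2,
    'eta': 0.1,
    'silent': 0,
    'objective': 'binary:logistic'
}
print(params)

# Build the model
# `silent` is replaced by `verbosity` in newer XGBoost releases
bst = XGBClassifier(max_depth=2, learning_rate=0.1, silent=True, objective='binary:logistic')

# Cross-validation
# Grid over the number of boosting iterations
param_test = {
    'n_estimators': list(range(1, 51, 1))
}
clf = GridSearchCV(estimator=bst, param_grid=param_test, scoring='accuracy', cv=5)
clf.fit(X_train, y_train)
# grid_scores_ was removed from scikit-learn; cv_results_ holds the per-candidate scores
print(clf.cv_results_['mean_test_score'], clf.best_estimator_, clf.best_score_)

# Test
# make prediction (clf.predict() returns class labels, so round() leaves them unchanged)
preds = clf.predict(X_test)
predictions = [round(value) for value in preds]
test_accuracy = accuracy_score(y_test, predictions)
print("Test Accuracy of gridsearchcv: %.2f%%" % (test_accuracy * 100.0))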
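Stripped of the cross-validation machinery, a grid search is a loop that scores every candidate value and keeps the best one. The sketch below shows only that selection step. `grid_search` and `toy_score` are hypothetical, and the score function is a made-up stand-in for the mean cross-validated accuracy that `GridSearchCV` would compute for each `n_estimators`.

```python
def grid_search(candidate_values, score_fn):
    """Return (best_value, best_score); the first candidate wins ties."""
    best_value, best_score = None, float("-inf")
    for value in candidate_values:
        score = score_fn(value)          # in practice: mean CV score
        if score > best_score:
            best_value, best_score = value, score
    return best_value, best_score


def toy_score(n_estimators):
    """Made-up score that peaks at 7 trees; not measured on real data."""
    return 1.0 - 0.001 * (n_estimators - 7) ** 2


print(grid_search(range(1, 51), toy_score))  # (7, 1.0)
```

Because every candidate requires a full round of cross-validation, the cost grows with the size of the grid multiplied by the number of folds.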