
R︱Yandex's gradient-boosting CatBoost algorithm (official claim: beats XGBoost/lightGBM/h2o)

2017-07-21 11:18
Russian search giant Yandex announced yesterday that it has open-sourced CatBoost, a machine learning method based on gradient-boosted decision trees that natively supports categorical features.

CatBoost was developed by Yandex researchers and engineers as the successor to the MatrixNet algorithm, which is used widely inside the company for ranking, forecasting, and recommendations. Yandex describes it as general-purpose, applicable across a wide range of domains and problems.

Related posts by the author:

R + industrial-grade GBDT︱Microsoft's open-source LightGBM (R package now available)

R︱XGBoost extreme gradient boosting, plus forecastxgb (forecasting) + xgboost (regression): two worked cases

R︱Some R practice with H2O deep learning: the h2o package

Main advantages of CatBoost:

Better quality than competing libraries

Support for both numerical and categorical features (a minimal sketch follows this list)

Built-in data visualization tools
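To make the categorical-feature support concrete, below is a minimal sketch using the package's native Pool API (catboost.load_pool / catboost.train / catboost.predict); the toy data frame is invented for illustration, and factor columns of a data.frame are treated as categorical features:

library(catboost)

# Toy data: one numeric and one categorical (factor) column;
# factor columns of a data.frame are handled as categorical features
features <- data.frame(
  age  = c(23, 31, 45, 52, 19, 38),
  city = factor(c("Moscow", "SPb", "Moscow", "Kazan", "SPb", "Kazan"))
)
labels <- c(0, 0, 1, 1, 0, 1)

# Wrap the data in a Pool, catboost's native data container
train_pool <- catboost.load_pool(data = features, label = labels)

# Train a small binary classifier with the Logloss objective
model <- catboost.train(train_pool,
                        params = list(loss_function = "Logloss",
                                      iterations = 50,
                                      logging_level = "Silent"))

# Predicted probabilities of class 1 on the training pool
catboost.predict(model, train_pool, prediction_type = "Probability")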

Official site: https://tech.yandex.com/CatBoost/

GitHub: https://github.com/catboost/catboost

Both R and Python versions are available. According to the official benchmarks, it beats the three strongest existing ML libraries: XGBoost/lightGBM/h2o.

The benchmark metric is Logloss; lower is better.
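For reference, binary Logloss is the mean negative log-likelihood of the true labels; a small R version (my own sketch, not the benchmark code) makes the "lower is better" reading concrete:

# Binary Logloss: y is a 0/1 label vector, p the predicted probability of class 1
logloss <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)  # clip to avoid log(0)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

logloss(c(1, 0, 1), c(0.9, 0.2, 0.7))  # ~0.228; chance level is log(2) ~ 0.693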



A breakdown of the default parameters used in the comparison is available on [github](https://github.com/catboost/benchmarks/blob/master/comparison_description.pdf).





Installation

On Windows, I ran into the following build error:

* installing *source* package 'catboost' ...
** libs
running 'src/Makefile.win' ...
/cygdrive/c/Users/mzheng50/Desktop/R-package/src/../../../ya.bat make -r -o ../../..
make: /cygdrive/c/Users/mzheng50/Desktop/R-package/src/../../../ya.bat: Command not found
make: *** [all] Error 127
Warning: running command 'make --no-print-directory -f "Makefile.win"' had status 2
ERROR: compilation failed for package 'catboost'
* removing 'C:/Users/mzheng50/Documents/R/win-library/3.1/catboost'
Error: Command failed (1)


On Linux, the following single command works end to end:

devtools::install_github('catboost/catboost', subdir = 'catboost/R-package')
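
After installation, a quick sanity check is simply loading the package and printing its version (no assumptions beyond the package name):

library(catboost)
packageVersion("catboost")  # should print the installed version without error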


An official example

library(caret)
library(titanic)
library(catboost)

set.seed(12345)

# Convert all columns to factors (via a character matrix), as in the official example
data <- as.data.frame(as.matrix(titanic_train), stringsAsFactors = TRUE)

# Drop identifiers and free-text columns; keep the label separately
drop_columns <- c("PassengerId", "Survived", "Name", "Ticket", "Cabin")
x <- data[, !(names(data) %in% drop_columns)]
y <- data[, c("Survived")]

# 4-fold cross-validation with class probabilities
fit_control <- trainControl(method = "cv",
                            number = 4,
                            classProbs = TRUE)

# Tune tree depth; hold the other CatBoost parameters constant
grid <- expand.grid(depth = c(4, 6, 8),
                    learning_rate = 0.1,
                    iterations = 100,
                    l2_leaf_reg = 1e-3,
                    rsm = 0.95,
                    border_count = 64)

# catboost.caret is the caret method object shipped with the catboost package
report <- train(x, as.factor(make.names(y)),
                method = catboost.caret,
                verbose = TRUE, preProc = NULL,
                tuneGrid = grid, trControl = fit_control)

print(report)
--------------------------
> Catboost
>
> 891 samples
>   7 predictors
>   2 classes: 'X0', 'X1'
>
> No pre-processing
> Resampling: Cross-Validated (4 fold)
> Summary of sample sizes: 669, 668, 668, 668
> Resampling results across tuning parameters:
>
>   depth  Accuracy   Kappa
>   4      0.8091544  0.5861049
>   6      0.8035642  0.5728401
>   8      0.7026674  0.2672683
>
> Tuning parameter 'learning_rate' was held constant at a value of 0.1
> Tuning parameter 'rsm' was held constant at a value of 0.95
> Tuning parameter 'border_count' was held constant at a value of 64
> Accuracy was used to select the optimal model using the largest value.
> The final values used for the model were depth = 4, learning_rate = 0.1,
> iterations = 100, l2_leaf_reg = 0.001, rsm = 0.95 and border_count = 64.


importance <- varImp(report, scale = FALSE)
print(importance)
--------------------------
custom variable importance

         Overall
Fare      25.918
Parch     19.419
Sex       17.999
Pclass    17.410
Age       10.372
Embarked   5.879
SibSp      3.004
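
Once the caret model is trained, prediction follows the standard caret interface; the snippet below re-uses the training frame x purely for illustration (in practice you would pass held-out data):

# Standard caret predictions from the fitted model
pred_class <- predict(report, newdata = x)                 # class labels 'X0'/'X1'
pred_prob  <- predict(report, newdata = x, type = "prob")  # class probabilities
head(pred_prob)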


