CatBoost: A machine learning library to handle categorical (CAT) data automatically
2017-10-29 16:57
Introduction
How many of you have seen this error while building your machine learning models using “sklearn”? I bet most of us have, at least in the early days.
This error occurs when dealing with categorical (string) variables. In sklearn, you are required to convert these categories into a numerical format.
To do this conversion, we use pre-processing methods such as “label encoding”, “one-hot encoding” and others.
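To make the two methods named above concrete, here is a minimal sketch in plain Python. The `colors` column and the helper names `label_encode` and `one_hot_encode` are made up for illustration; this is not sklearn's implementation, only the idea behind it.

```python
# Toy categorical column; the values are invented for illustration.
colors = ["red", "green", "blue", "green", "red"]


def label_encode(values):
    """Map each distinct category to an integer code.

    Categories are sorted so the mapping is deterministic.
    """
    categories = sorted(set(values))
    code_of = {category: index for index, category in enumerate(categories)}
    return [code_of[value] for value in values], categories


def one_hot_encode(values):
    """Turn each value into a 0/1 vector with one position per category."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return rows, categories


label_codes, label_categories = label_encode(colors)
one_hot_rows, one_hot_categories = one_hot_encode(colors)

print("categories:", label_categories)
print("label codes:", label_codes)
for value, row in zip(colors, one_hot_rows):
    print(value, "->", row)
```

Label encoding keeps a single column but imposes an arbitrary order on the categories. One-hot encoding avoids that order at the cost of one new column per category.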
In this article, I will discuss a recently open-sourced library, “CatBoost”, developed and contributed by Yandex. CatBoost can use categorical features directly and is scalable.
“This is the first Russian machine learning technology that’s an open source,” said Mikhail Bilenko, Yandex’s head of machine intelligence and research.
P.S. You can also read my earlier article, “How to deal with categorical variables?”.
Table of Contents
What is CatBoost?
Advantages of CatBoost library
CatBoost in comparison to other boosting algorithms
Installing CatBoost
Solving ML challenge using CatBoost
End Notes
1. What is CatBoost?
CatBoost is a recently open-sourced machine learning algorithm from Yandex. It integrates easily with deep learning frameworks like Google’s TensorFlow and Apple’s Core ML. It can work with diverse data types to help solve a wide range of problems that businesses face today. To top it off, it provides best-in-class accuracy. It is especially powerful in two ways:
It yields state-of-the-art results without extensive data training typically required by other machine learning methods, and
Provides powerful out-of-the-box support for the more descriptive data formats that accompany many business problems.
The name “CatBoost” comes from two words: “**Cat**egory” and “**Boost**ing”.
As discussed, the library works well with multiple categories of data, such as audio, text, and images, including historical data.
“Boost” comes from the gradient boosting machine learning algorithm, as this library is based on a gradient boosting library. Gradient boosting is a powerful machine learning algorithm that is widely applied to many types of business challenges, such as fraud detection, item recommendation, and forecasting, and it performs well on them. It can also return very good results with relatively little data, unlike deep learning (DL) models, which need to learn from a massive amount of data.
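For readers new to the idea, gradient boosting can be sketched in a few lines of plain Python. This is a minimal illustration, assuming squared-error loss, a single numeric feature, and one-split trees (“stumps”). It is not how CatBoost itself is implemented, and the data and function names are made up.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_gradient_boosting(xs, ys, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mean_squared_error(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


# Invented toy data: a step from roughly 1 to roughly 3.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

model = fit_gradient_boosting(xs, ys)
baseline_mse = mean_squared_error(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mean_squared_error(ys, [predict(model, x) for x in xs])
print("baseline MSE:", baseline_mse)
print("boosted MSE:", boosted_mse)
```

Each round corrects a fraction of the error left by the previous rounds, which is why many weak learners together can fit the data well.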
Here is a video message from Mikhail Bilenko, Yandex’s head of machine intelligence and research, and Anna Veronika Dorogush, head of Yandex machine learning systems.