Several Important Parameters of Random Forest
2016-11-05 11:17
Translated from: https://www.analyticsvidhya.com/blog/2015/06/tuning-random-forest-model/
There are primarily 3 parameters that can be tuned to improve the predictive power of the model:
Note: a random forest has 3 particularly important parameters that strongly affect the result: max_features, n_estimators, and min_samples_leaf.
1.a. max_features:
This is the maximum number of features Random Forest is allowed to try when searching for a split in an individual tree. There are multiple options available in Python to assign maximum features. Here are a few of them:
None : This simply takes all the features at every split; no restriction is placed on the individual tree. (Older scikit-learn versions also accepted "auto", which meant all features for the regressor but sqrt for the classifier.)
sqrt : This option takes the square root of the total number of features at each split. For instance, if the total number of variables is 100, only 10 of them are considered. "log2" is another option of the same type.
0.2 : This option allows the random forest to consider 20% of the variables at each split. Any fraction between 0 and 1 can be given, so 0.35 means 35% of the features are considered.
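To make the options above concrete, here is a minimal sketch of how each setting could translate into a feature count. The helper name resolve_max_features is hypothetical and the rounding rule (truncate, but never go below 1) is an assumption chosen to mirror common practice; it is not taken from the original article.

```python
import math


def resolve_max_features(max_features, n_features):
    """Translate a max_features setting into a number of features.

    Hypothetical helper for illustration only; the truncation rule is an assumption.
    """
    if max_features is None:
        # No restriction: every feature is a candidate.
        return n_features
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, float):
        # A fraction such as 0.2 means 20% of the features.
        return max(1, int(max_features * n_features))
    if isinstance(max_features, int):
        return max_features
    raise ValueError(f"unsupported max_features: {max_features!r}")


# With 100 variables, as in the example in the text.
counts = {opt: resolve_max_features(opt, 100) for opt in (None, "sqrt", "log2", 0.2)}
print(counts)
```

With 100 variables, "sqrt" gives 10 candidates, matching the example in the text, while 0.2 gives 20.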
How does "max_features" impact performance and speed?
Increasing max_features generally improves the performance of the model, because each node has more candidate features to choose from. However, this is not guaranteed, because it also decreases the diversity of the individual trees, which is the unique selling point of random forest. What is certain is that increasing max_features slows the algorithm down. Hence, you need to strike the right balance and choose the optimal max_features.
1.b. n_estimators:
This is the number of trees you want to build before taking the majority vote or the average of the predictions. A higher number of trees generally gives better performance but makes your code slower. You should choose as high a value as your processor can handle, because this makes your predictions stronger and more stable.
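The aggregation step described above can be sketched in a few lines. This only illustrates the idea of voting and averaging over per-tree outputs; the function names and sample predictions are made up for the example. Note that scikit-learn's classifier averages the trees' class probabilities rather than counting hard votes.

```python
from collections import Counter
from statistics import mean


def majority_vote(tree_predictions):
    """Return the class label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


def average_prediction(tree_predictions):
    """Return the mean of the trees' numeric predictions (regression case)."""
    return mean(tree_predictions)


# Made-up outputs from five classification trees and three regression trees.
class_votes = ["spam", "ham", "spam", "spam", "ham"]
regression_outputs = [2.0, 3.0, 4.0]

forest_class = majority_vote(class_votes)
forest_value = average_prediction(regression_outputs)
print(forest_class, forest_value)
```

The more trees contribute to the vote or the average, the less a single unusual tree can sway the final prediction, which is why larger forests tend to be more stable.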
1.c. min_samples_leaf:
If you have built a decision tree before, you can appreciate the importance of the minimum sample leaf size. A leaf is the end node of a decision tree. A smaller leaf makes the model more prone to capturing noise in the training data. Generally I prefer a minimum leaf size of more than 50. However, you should try multiple leaf sizes to find the optimum for your use case.
Note: if min_samples_leaf is too small, the model overfits easily and learns noise.
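Trying several values for all three parameters can be automated with a grid search. The following is a minimal sketch using scikit-learn; the synthetic dataset, the candidate values, and the 3-fold cross-validation are assumptions made for illustration, not recommendations from the original article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for a real training set (assumption).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Candidate values for the three parameters discussed above (illustrative only).
param_grid = {
    "max_features": ["sqrt", 0.2],
    "n_estimators": [20, 50],
    "min_samples_leaf": [1, 10, 50],
}

# 3-fold cross-validation scores every combination in the grid.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)

best_params = search.best_params_
print(best_params, round(search.best_score_, 3))
```

On a real dataset the grid would normally cover wider ranges, and the best leaf size depends heavily on how many training rows are available.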