
Random Forest Usage in scikit-learn

2017-06-18 17:36
Random forest is a widely used ensemble classifier that takes decision trees as its base classifiers and combines them by averaging their predictions to determine a sample's class. A ready-made implementation is available in scikit-learn, Python's machine learning package.

Basic usage is shown below:

from sklearn.ensemble import RandomForestClassifier

# train_x: feature matrix of the training samples; train_y: their class labels
model = RandomForestClassifier(n_estimators=10)
model.fit(train_x, train_y)


Here train_x is the feature matrix of the training samples and train_y is the vector of corresponding labels.
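For a self-contained, runnable version of the above, here is a minimal sketch; the Iris dataset and the train/test split are illustrative choices, not part of the original:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Iris is used here only to make the snippet self-contained.
X, y = load_iris(return_X_y=True)
train_x, test_x, train_y, test_y = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=10)
model.fit(train_x, train_y)
print(model.predict(test_x[:5]))    # predicted class labels
print(model.score(test_x, test_y))  # mean accuracy on the held-out split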

The full parameter list of the RandomForestClassifier constructor is:

sklearn.ensemble.RandomForestClassifier(n_estimators=10,
criterion='gini', max_depth=None, min_samples_split=2,
min_samples_leaf=1, min_weight_fraction_leaf=0.0,
max_features='auto', max_leaf_nodes=None, min_impurity_split=1e-07,
bootstrap=True, oob_score=False, n_jobs=1, random_state=None,
verbose=0, warm_start=False, class_weight=None)


The main parameters are:

n_estimators : the number of trees in the forest; default 10.

criterion : the impurity measure used to evaluate candidate splits at each node. Either “gini” (Gini impurity) or “entropy” (information gain); default “gini”. This parameter is specific to the decision-tree base classifiers.

max_features : the number of features to consider when looking for the best split (how each option maps to a concrete feature count is illustrated in the sketch after this list).

If int, then consider max_features features at each split.

If float, then max_features is a percentage and int(max_features * n_features) features are considered at each split.

If “auto”, then max_features=sqrt(n_features).

If “sqrt”, then max_features=sqrt(n_features) (same as “auto”).

If “log2”, then max_features=log2(n_features).

If None, then max_features=n_features.
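A minimal sketch of how these options translate into feature counts, assuming a hypothetical n_features = 64 (the value is illustrative):

import math

n_features = 64  # illustrative; in practice this is the width of your feature matrix

print(int(math.sqrt(n_features)))  # "sqrt" (and the old "auto" default): 8
print(int(math.log2(n_features)))  # "log2": 6
print(int(0.5 * n_features))       # a float such as 0.5: 32
print(n_features)                  # None: all 64 features at every split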

max_depth : the maximum depth of a tree. Default None, in which case nodes are expanded until every leaf is pure (all samples at the node share one class) or contains fewer than min_samples_split samples.

min_samples_split : (default=2) the minimum number of samples required to split an internal node. If int, a node is split only when it contains at least min_samples_split samples. If float, min_samples_split is a fraction and the minimum count is ceil(min_samples_split * n_samples).

min_samples_leaf : the minimum number of samples required at a leaf node.
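A sketch of the int-versus-float forms of these two parameters, assuming a hypothetical training set of 200 samples (the equivalence below follows from ceil(0.05 * 200) = 10):

from sklearn.ensemble import RandomForestClassifier

# An int is an absolute sample count; a float is a fraction of n_samples.
# For 200 training samples the two models below impose the same constraint.
m1 = RandomForestClassifier(min_samples_split=10, min_samples_leaf=5)
m2 = RandomForestClassifier(min_samples_split=0.05, min_samples_leaf=5)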

min_weight_fraction_leaf : The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.

max_leaf_nodes : the maximum number of leaf nodes per tree; if None, the number of leaves is unlimited.

min_impurity_split : float, optional (default=1e-7)

Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold, otherwise it is a leaf.

bootstrap : boolean, optional (default=True)

Whether bootstrap samples are used when building trees.

oob_score : bool (default=False)

Whether to use out-of-bag samples to estimate the generalization accuracy.
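A short sketch of the out-of-bag estimate; the Iris dataset, the tree count, and random_state are illustrative choices:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# With bootstrap=True (the default) each tree is trained on a bootstrap sample,
# leaving out roughly one third of the data; those out-of-bag samples provide
# a generalization estimate without a separate validation set.
model = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
model.fit(X, y)
print(model.oob_score_)  # out-of-bag accuracy estimate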

n_jobs : integer, optional (default=1)

The number of cores to use for parallel computation. With -1, all available cores are used.

random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

verbose : int, optional (default=0)

Controls the verbosity of the tree building process.

warm_start : bool, optional (default=False)

When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest.
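A sketch of growing a forest incrementally with warm_start; the dataset and tree counts are illustrative:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

model = RandomForestClassifier(n_estimators=10, warm_start=True, random_state=0)
model.fit(X, y)                # fits the first 10 trees

model.n_estimators = 20        # request 10 more trees
model.fit(X, y)                # trains only the 10 new trees, keeping the old ones
print(len(model.estimators_))  # -> 20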

class_weight : dict, list of dicts, “balanced”, “balanced_subsample” or None, optional (default=None)

Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. For multi-output problems, a list of dicts can be provided in the same order as the columns of y.

The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))

The “balanced_subsample” mode is the same as “balanced” except that weights are computed based on the bootstrap sample for every tree grown.

For multi-output, the weights of each column of y will be multiplied.

Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.
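A sketch of the “balanced” mode on a deliberately imbalanced toy problem; the synthetic data below is illustrative:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# 90 samples of class 0, 10 samples of class 1.
rng = np.random.RandomState(0)
X = rng.randn(100, 4)
y = np.array([0] * 90 + [1] * 10)

# "balanced" assigns each class the weight n_samples / (n_classes * np.bincount(y)):
# here class 1 gets 100 / (2 * 10) = 5.0 and class 0 gets 100 / (2 * 90) ≈ 0.56.
model = RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=0)
model.fit(X, y)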

Some of the more important fitted model attributes:

feature_importances_ : array of shape = [n_features]

The feature importances; the larger the value, the more important the feature relative to the others.

n_features_ : int. The number of features used when fitting the model.
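A sketch of inspecting these attributes after fitting; the Iris dataset is an illustrative choice, and note that recent scikit-learn releases renamed the feature-count attribute from n_features_ to n_features_in_:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(data.data, data.target)

# Impurity-based importances: one value per feature, summing to 1.
for name, score in sorted(zip(data.feature_names, model.feature_importances_),
                          key=lambda pair: pair[1], reverse=True):
    print(name, round(score, 3))

# n_features_ in the release documented here; n_features_in_ in current releases.
print(model.n_features_in_)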

Reference: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier