
sklearn source code walkthrough: scattered notes on ensemble models; how to read sklearn code, using the tree feature_importance as an example

2016-07-12 10:14




Random forests and GBDT





In random forests, each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. In addition, when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features. ==> A bootstrap sample is drawn before each tree is trained; features are sampled only when each node is split.
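As a minimal sketch of the two sources of randomness described above, the snippet below fits a small forest with scikit-learn. The dataset, the number of trees, and the choice of `max_features="sqrt"` are illustrative assumptions, not values taken from the post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data (an assumption for illustration): 200 samples, 10 features.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=20,
    bootstrap=True,       # each tree is fit on a sample drawn with replacement
    max_features="sqrt",  # each split looks at a random subset of the features
    random_state=0,
)
forest.fit(X, y)

n_trees = len(forest.estimators_)
# Number of features each fitted tree considers per split (sqrt of 10, truncated).
features_per_split = forest.estimators_[0].max_features_
print(n_trees, features_per_split)
```

The bootstrap sample is drawn once per tree, while the feature subset is redrawn at every node, which matches the note above.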

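In extremely randomized trees, randomness goes one step further in the way splits are computed. As in random forests, a random subset of candidate features is used, but instead of looking for the most discriminative thresholds, thresholds are drawn at random for each candidate feature and the best of these randomly-generated thresholds is picked as the splitting rule. This usually reduces the variance of the model a bit more, at the expense of a slightly greater increase in bias. ==> As with random forests, features are sampled only when each node is split. The difference is that the thresholds on those features are also randomly sampled and the best of [these few] is chosen, rather than the best of [all possible thresholds]. (Note that scikit-learn's extra-trees estimators default to bootstrap=False, so unless you turn it on, each tree sees the whole training set rather than a bootstrap sample.)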
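To make the threshold difference concrete, here is a toy sketch in plain Python. It is not scikit-learn's implementation: the helper names (`gini`, `split_impurity`, `best_threshold`, `random_threshold`) and the data are hypothetical, and it handles a single feature only.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def split_impurity(values, labels, threshold):
    """Sample-weighted impurity of the two children produced by `threshold`."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_threshold(values, labels):
    """Random-forest style: scan every midpoint between sorted values."""
    xs = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return min(candidates, key=lambda t: split_impurity(values, labels, t))

def random_threshold(values, rng):
    """Extra-trees style: a single uniform draw between min and max."""
    return rng.uniform(min(values), max(values))

values = [0.1, 0.4, 0.5, 0.9, 1.3, 1.7]
labels = [0, 0, 0, 1, 1, 1]
rng = random.Random(0)

exhaustive = best_threshold(values, labels)
drawn = random_threshold(values, rng)
print(exhaustive, split_impurity(values, labels, exhaustive))
print(drawn, split_impurity(values, labels, drawn))
```

The exhaustive scan can never do worse than the random draw on the training data, which is the intuition behind the bias and variance trade-off quoted above.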

An unsupervised transformation of the data: using a forest of completely random trees, the data is encoded by the indices of the leaves a data point ends up in. This index is then encoded in a one-of-K manner, leading to a high-dimensional, sparse binary coding. This coding can be computed very efficiently and can then be used as a basis for other learning tasks. The size and sparsity of the code can be influenced by choosing the number of trees and the maximum depth per tree. For each tree in the ensemble, the coding contains one entry of one. The size of the coding is at most n_estimators * 2 ** max_depth, the maximum number of leaves in the forest. As neighboring data points are more likely to lie within the same leaf of a tree, the transformation performs an implicit, non-parametric density estimation. ==> Samples are encoded through the trees into one-hot form and then used for training.
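The size bound and the "one entry of one per tree" property can be checked with a short sketch. It assumes the passage refers to `sklearn.ensemble.RandomTreesEmbedding`; the data and parameter values are made up for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomTreesEmbedding

# Toy data (an assumption for illustration).
X, _ = make_blobs(n_samples=100, n_features=2, random_state=0)

n_estimators, max_depth = 10, 3
embedder = RandomTreesEmbedding(
    n_estimators=n_estimators, max_depth=max_depth, random_state=0
)
codes = embedder.fit_transform(X)  # sparse one-hot coding of leaf indices

# Each sample lands in exactly one leaf per tree, so each row has
# n_estimators ones.
ones_per_row = codes.sum(axis=1)
print(codes.shape, n_estimators * 2 ** max_depth)
```

The number of columns depends on how many leaves the random trees actually produce, so only the upper bound is fixed in advance.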


All of the randomized models above stress how [bootstrap, feature sampling, max_depth, and GBDT's own shrinkage (which other models call learning_rate)] interact with n_estimators. The figure (not reproduced here) illustrates the effect of shrinkage and subsampling on the goodness-of-fit of the model. We can clearly see that shrinkage outperforms no-shrinkage. Subsampling with shrinkage can further increase the accuracy of the model. Subsampling without shrinkage, on the other hand, does poorly. ==> Use bootstrap (0.5), feature sampling (0.8), and shrinkage (< 0.1) together, and control the number of trees through early stopping on a validation set, to get the best results.
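A hedged sketch of that recipe with scikit-learn's `GradientBoostingRegressor` follows. It assumes the "bootstrap (0.5)" in the note maps to `subsample=0.5` (which samples rows without replacement) and "feature sampling (0.8)" to `max_features=0.8`; the dataset and the validation split are invented for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Toy regression data (an assumption for illustration).
X, y = make_regression(n_samples=400, n_features=10, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

gbdt = GradientBoostingRegressor(
    n_estimators=300,
    learning_rate=0.05,  # shrinkage
    subsample=0.5,       # fraction of rows used to fit each stage
    max_features=0.8,    # fraction of features considered at each split
    max_depth=3,
    random_state=0,
)
gbdt.fit(X_train, y_train)

# Validation error after each boosting stage; the best stage count is the
# point where one would stop early.
val_errors = [
    mean_squared_error(y_val, pred) for pred in gbdt.staged_predict(X_val)
]
best_n_estimators = int(np.argmin(val_errors)) + 1
print(best_n_estimators, val_errors[best_n_estimators - 1])
```

Here early stopping is emulated after the fact by scanning `staged_predict`; it picks how many stages to keep rather than changing the shape of individual trees.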

Features used at the top of the tree contribute to the final prediction decision of a larger fraction of the input samples. The expected fraction of the samples they contribute to can thus be used as an estimate of the relative importance of the features. By averaging those expected activity rates over several randomized trees one can reduce the variance of such an estimate and use it for feature selection. ==> I assume everyone already understands what [they contribute to] means. If not, look at the code (using GBDT as the example):


def feature_importances_(self):
    """Return the feature importances (the higher, the more important the
    feature).

    feature_importances_ : array, shape = [n_features]
    """
    # Accumulate the per-stage average importance over all boosting stages.
    total_sum = np.zeros((self.n_features, ), dtype=np.float64)
    for stage in self.estimators_:
        # Each stage may hold several trees; average their importances.
        stage_sum = sum(tree.feature_importances_
                        for tree in stage) / len(stage)
        total_sum += stage_sum

    importances = total_sum / len(self.estimators_)
    return importances
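The averaging in this excerpt can be reproduced on toy numbers. The sketch below is an illustration only: `ensemble_importances` is a hypothetical helper, and the per-tree importance vectors are invented; it mirrors the excerpt shown here and may not match what newer scikit-learn releases do.

```python
import numpy as np

def ensemble_importances(stages):
    """Average tree importances within each stage, then across stages.

    `stages` is a list of stages; each stage is a list of per-tree
    importance vectors (stand-ins for tree.feature_importances_).
    """
    n_features = len(stages[0][0])
    total_sum = np.zeros(n_features, dtype=np.float64)
    for stage in stages:
        total_sum += sum(stage) / len(stage)
    return total_sum / len(stages)

# Two stages with two trees each, two features (made-up values).
stages = [
    [np.array([1.0, 0.0]), np.array([0.5, 0.5])],
    [np.array([0.2, 0.8]), np.array([0.4, 0.6])],
]
averaged = ensemble_importances(stages)
print(averaged)
```

Because every per-tree vector sums to 1, the averaged vector also sums to 1.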

As mentioned above, the base tree that GBDT uses during training is [tree=

def feature_importances_(self):
    """Return the feature importances.

    The importance of a feature is computed as the (normalized) total
    reduction of the criterion brought by that feature.
    It is also known as the Gini importance.

    feature_importances_ : array, shape = [n_features]
    """
    if self.tree_ is None:
        raise NotFittedError("Estimator not fitted, call `fit` before"
                             " `feature_importances_`.")

    # The actual computation is delegated to the Cython Tree object.
    return self.tree_.compute_feature_importances()

self.tree_ = Tree(self.n_features_, self.n_classes_, self.n_outputs_)

from ._tree import Tree
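Well, I have to tell you that from this point on, the only way to see the real computation is to download the source code and keep digging through the Cython code. The exact location is line 1033 of E:\scikit-learn-master\sklearn\tree\_tree.pyx. All .pyx files are Cython; it is not hard, but it does test your patience.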
cpdef compute_feature_importances(self, normalize=True):
    """Computes the importance of each feature (aka variable)."""
    cdef Node* left
    cdef Node* right
    cdef Node* nodes = self.nodes
    cdef Node* node = nodes
    cdef Node* end_node = node + self.node_count

    cdef double normalizer = 0.

    cdef np.ndarray[np.float64_t, ndim=1] importances
    importances = np.zeros((self.n_features,))
    cdef DOUBLE_t* importance_data = <DOUBLE_t*>importances.data

    with nogil:
        while node != end_node:
            if node.left_child != _TREE_LEAF:
                # ... and node.right_child != _TREE_LEAF:
                left = &nodes[node.left_child]
                right = &nodes[node.right_child]

                importance_data[node.feature] += (
                    node.weighted_n_node_samples * node.impurity -
                    left.weighted_n_node_samples * left.impurity -
                    right.weighted_n_node_samples * right.impurity)
            node += 1

    importances /= nodes[0].weighted_n_node_samples

    if normalize:
        normalizer = np.sum(importances)

        if normalizer > 0.0:
            # Avoid dividing by zero (e.g., when root is pure)
            importances /= normalizer

    return importances


Point 1: weighted_n_node_samples : array of int, shape [node_count]

        weighted_n_node_samples[i] holds the weighted number of training samples reaching node i.

Point 2: impurity : array of double, shape [node_count]

        impurity[i] holds the impurity (i.e., the value of the splitting criterion) at node i.

Compare this with the description quoted earlier, [The expected fraction of the samples they contribute to]: it only covers [the samples of point 1], and it does not even mention that they are weighted.

So where is point 2 described? ==> In the description of feature_importances_ for DecisionTreeRegressor and DecisionTreeClassifier: The feature importances. The higher, the more important the feature. The importance of a feature is computed as the (normalized) [total reduction of the criterion] brought by that feature. It is also known as the Gini importance [R70]
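To see both points at work, the weighted impurity decrease can be recomputed in plain Python from the arrays a fitted tree exposes and compared with scikit-learn's own result. This is a sketch: `manual_feature_importances` is a hypothetical helper, and the iris dataset is just a convenient example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

TREE_LEAF = -1  # marker stored in children_left / children_right for leaves

def manual_feature_importances(tree, n_features, normalize=True):
    """Recompute importances from a fitted tree_ object's public arrays."""
    importances = np.zeros(n_features)
    for node in range(tree.node_count):
        left = tree.children_left[node]
        right = tree.children_right[node]
        if left == TREE_LEAF:
            continue  # leaves do not split on any feature
        # Point 1 (weighted sample counts) times point 2 (impurity):
        # parent's weighted impurity minus that of its two children.
        importances[tree.feature[node]] += (
            tree.weighted_n_node_samples[node] * tree.impurity[node]
            - tree.weighted_n_node_samples[left] * tree.impurity[left]
            - tree.weighted_n_node_samples[right] * tree.impurity[right]
        )
    importances /= tree.weighted_n_node_samples[0]
    if normalize:
        total = importances.sum()
        if total > 0.0:  # avoid dividing by zero when the root is pure
            importances /= total
    return importances

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
manual = manual_feature_importances(clf.tree_, X.shape[1])
print(manual)
print(clf.feature_importances_)
```

A split near the root reaches more weighted samples, so the same impurity drop counts for more there than deep in the tree.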



First, see when each of them is used: max_leaf_nodes : int or None, optional (default=None)

Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes. If not None then max_depth will be ignored.

In other words, if max_leaf_nodes is set, best-first is used; otherwise depth-first is used.
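The effect of the parameter is easy to observe from the outside. The sketch below assumes the iris dataset and a hypothetical helper `count_leaves`; it only shows that setting `max_leaf_nodes` caps the number of leaves, not which builder class runs internally.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

def count_leaves(clf):
    """Count leaf nodes: leaves have children_left == -1."""
    return int((clf.tree_.children_left == -1).sum())

# max_leaf_nodes left at None: the tree grows until its leaves are pure.
unlimited = DecisionTreeClassifier(random_state=0).fit(X, y)
# max_leaf_nodes set: growth stops once 4 leaves exist.
capped = DecisionTreeClassifier(max_leaf_nodes=4, random_state=0).fit(X, y)

print(count_leaves(unlimited), count_leaves(capped))
```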

Let's look at depth-first first [the left child is built first, then the right child]:


with nogil:
    # push root node onto stack
    rc = stack.push(0, n_node_samples, 0, _TREE_UNDEFINED, 0, INFINITY, 0)
    if rc == -1:
        # got return code -1 - out-of-memory
        with gil:
            raise MemoryError()

    while not stack.is_empty():
        ......
        if not is_leaf:
            # Push right child on stack
            rc = stack.push(split.pos, end, depth + 1, node_id, 0,
                            split.impurity_right, n_constant_features)
            if rc == -1:
                ......

            # Push left child on stack
            rc = stack.push(start, split.pos, depth + 1, node_id, 1,
                            split.impurity_left, n_constant_features)
            if rc == -1:
                ......
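Why does pushing the right child first lead to the left child being built first? A stack is last-in, first-out, so the left child, pushed last, is popped next. The toy sketch below illustrates this with a Python list as the stack; `depth_first_order` and the tree layout are hypothetical and stand in for the real builder.

```python
def depth_first_order(children, root=0):
    """Return node ids in the order an explicit LIFO stack visits them.

    `children` maps a node id to (left_id, right_id), or None for a leaf.
    """
    order = []
    stack = [root]
    while stack:
        node = stack.pop()
        order.append(node)
        kids = children.get(node)
        if kids is not None:
            left, right = kids
            stack.append(right)  # pushed first, so popped later
            stack.append(left)   # pushed last, so popped next
    return order

# Node 0 is the root; node 1 (left) has two leaf children, node 2 is a leaf.
toy_tree = {0: (1, 2), 1: (3, 4), 2: None, 3: None, 4: None}
visit_order = depth_first_order(toy_tree)
print(visit_order)
```

The whole left subtree (nodes 1, 3, 4) is finished before the right child (node 2) is touched.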

Now look at best-first:

The frontier below is a priority queue that stores the relative reduction in impurity of each tree node, so whichever node has the largest relative reduction in impurity is the one that gets split next.


with nogil:
    # add root to frontier
    rc = self._add_split_node(splitter, tree, 0, n_node_samples,
                              ......
    if rc >= 0:
        rc = _add_to_frontier(&split_node_left, frontier)

    if rc == -1:
        with gil:
            raise MemoryError()

    while not frontier.is_empty():
        ......
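The priority-queue idea can be sketched with Python's `heapq`. This is a toy model rather than the Cython builder: `best_first_order`, the tree layout, and the improvement values are all hypothetical, and the improvements are given up front instead of being computed by a splitter.

```python
import heapq

def best_first_order(improvement, children, max_leaf_nodes, root=0):
    """Split nodes in decreasing order of impurity improvement.

    improvement: node id -> impurity reduction gained by splitting it
    children:    node id -> (left_id, right_id), or None if unsplittable
    """
    # heapq is a min-heap, so store the negated improvement.
    frontier = [(-improvement[root], root)]
    split_order = []
    n_leaves = 1  # the unsplit root counts as one leaf
    while frontier and n_leaves < max_leaf_nodes:
        _, node = heapq.heappop(frontier)
        kids = children.get(node)
        if kids is None:
            continue  # stays a leaf
        split_order.append(node)
        n_leaves += 1  # one leaf is replaced by two
        for child in kids:
            heapq.heappush(frontier, (-improvement[child], child))
    return split_order

children = {0: (1, 2), 1: (3, 4), 2: (5, 6),
            3: None, 4: None, 5: None, 6: None}
improvement = {0: 0.5, 1: 0.1, 2: 0.3, 3: 0.0, 4: 0.0, 5: 0.0, 6: 0.0}

split_order = best_first_order(improvement, children, max_leaf_nodes=3)
print(split_order)
```

With a budget of three leaves, the right child (node 2, improvement 0.3) is split before the left child (node 1, improvement 0.1), which a depth-first builder would have expanded first.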

