
Python Machine Learning Algorithms: Decision Trees

2017-07-31 22:57

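Introduction to decision trees

Decision trees are among the most common machine learning algorithms. They are used mainly for classification and regression, and are a non-parametric supervised learning method.

Key terms for decision trees: attribute, feature, attribute selection measure, the topology of attributes and features, and splitting on the features of an attribute.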

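Take, for example, the decision process of a woman choosing a partner:

Attributes: the man's age, income, looks and so on are called attributes.

Features: older or younger than 30; high or low income; handsome, average or ugly. These are the features (values) of the corresponding attributes. They are discrete here for ease of understanding; in practice they can be continuous.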

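Constructing a decision tree means defining an attribute selection measure and using it to work out how important each attribute is and how the features of the attributes are arranged in the tree topology. The second key step is splitting on an attribute: at a given node, different branches are created according to the different features of the chosen attribute. The goal is to make each branch as "pure" as possible, that is, to have the items in each branch belong to the same class as far as possible. The central question in building a decision tree is therefore how to choose the attribute selection measure. It is a splitting criterion, and it determines both the topology and the choice of split points. Common attribute selection strategies are ID3 and C4.5.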

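The ID3 strategy

From information theory we know that the smaller the expected information, the larger the information gain, and hence the higher the purity. The core idea of ID3 is therefore to use information gain as the attribute selection measure and to split on the attribute whose split yields the largest information gain. A few concepts need to be defined first.

Let D be the partition of the training tuples by class. The entropy of D is then: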

$info(D)=-\sum ^m_{i=1}p_ilog_2(p_i)$

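Here $p_i$ is the probability that class i occurs in the training tuples; it can be estimated as the number of elements belonging to that class divided by the total number of training tuples. In practical terms, the entropy is the average amount of information needed to identify the class label of a tuple in D.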
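The entropy formula can be sketched in a few lines of Python. This is only an illustration: the function name `entropy` and the sample labels are made up for the example, and $p_i$ is estimated by simple counting as described above.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # p_i is estimated as (elements of class i) / (all elements)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical label lists, chosen only to show the two extremes
pure = entropy(["yes", "yes", "yes", "yes"])   # one class only
mixed = entropy(["yes", "yes", "no", "no"])    # two equally likely classes
print(pure, mixed)
```

A set containing a single class has entropy 0, and a two-class set split evenly has entropy 1 bit, the maximum for two classes.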

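Now suppose the training tuples D are partitioned by attribute A into subsets $D_1, \dots, D_v$, one for each value of A. The expected information of this partition of D by A is: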

$info_A(D)=\sum ^v_{j=1}\frac{|D_j|}{|D|}info(D_j)$

The information gain is the difference between the two:

$gain(A)=info(D)-info_A(D)$
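As a minimal sketch of how the gain could be computed for one attribute column, the snippet below groups the labels by attribute value and applies the two formulas above. The helper names and the toy columns are assumptions made for this example only.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total)
               for c in Counter(labels).values())


def info_gain(values, labels):
    """gain(A) = info(D) - info_A(D) for a single attribute column."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)          # D_j: labels of rows where A takes value v
    total = len(labels)
    info_a = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - info_a


# Hypothetical toy data
labels = ["yes", "yes", "no", "no"]
separating = ["a", "a", "b", "b"]    # splits the classes perfectly
unrelated = ["a", "b", "a", "b"]     # tells us nothing about the class

gain_separating = info_gain(separating, labels)
gain_unrelated = info_gain(unrelated, labels)
print(gain_separating, gain_unrelated)
```

The attribute that separates the classes perfectly recovers all of info(D) as gain, while the unrelated attribute has a gain of zero.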

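Each time a split is needed, ID3 computes the information gain of every attribute and splits on the attribute with the largest gain. We use the example of detecting fake accounts in an SNS community to show how ID3 builds a decision tree. For simplicity, assume the training set contains 10 elements:

Here s, m and l stand for small, medium and large.

Let L, F, H and R denote log density, friend density, whether a real profile picture is used, and whether the account is real. We now compute the information gain of each attribute.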

$info_L(D)=0.3*(-\frac{0}{3}log_2\frac{0}{3}-\frac{3}{3}log_2\frac{3}{3})+0.4*(-\frac{1}{4}log_2\frac{1}{4}-\frac{3}{4}log_2\frac{3}{4})+0.3*(-\frac{1}{3}log_2\frac{1}{3}-\frac{2}{3}log_2\frac{2}{3})=0+0.326+0.277=0.603$

$gain(L)=0.879-0.603=0.276$

Here 0.879 is the value of info(D) for the training set, so the information gain of log density is 0.276.
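The arithmetic for log density can be checked with a short script. This is only a sketch: the helper `h` is a made-up name, the class counts are read off the fractions in the expression for $info_L(D)$, and 0.879 is taken from the text as the value of info(D).

```python
import math


def h(*counts):
    """Entropy in bits of a vector of class counts; 0 * log2(0) is taken as 0."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)


# Weights 0.3, 0.4, 0.3 and class counts as they appear in info_L(D)
info_L = 0.3 * h(0, 3) + 0.4 * h(1, 3) + 0.3 * h(1, 2)
gain_L = 0.879 - info_L   # 0.879 is the info(D) value used in the text
print(round(info_L, 3), round(gain_L, 3))
```

Evaluated in floating point, the expression comes out at roughly 0.600 and the gain at roughly 0.279, slightly different from the 0.603 and 0.276 quoted above, which look like the result of rounding the intermediate terms.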

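In the same way, the information gains of H and F are found to be 0.033 and 0.553 respectively.

Because F has the largest information gain, F is chosen as the splitting attribute for the first split. The result of the split is shown in the figure below:

Starting from the tree in the figure above, the same method is applied recursively to determine the splitting attribute of each child node, which eventually yields the whole decision tree.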
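The recursive procedure might look roughly like the sketch below. It is a simplified illustration rather than a full ID3 implementation: the function names, the nested-dict tree representation and the four-row data set are all assumptions made for this example, and the data is not the SNS training set from the article.

```python
import math
from collections import Counter


def entropy(labels):
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total)
               for c in Counter(labels).values())


def info_gain(rows, labels, attr):
    """Information gain of splitting on the attribute at column index `attr`."""
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    info_a = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - info_a


def id3(rows, labels, attrs):
    """Return a class label (leaf) or a nested dict {attr: {value: subtree}}."""
    if len(set(labels)) == 1:              # pure node: stop splitting
        return labels[0]
    if not attrs:                          # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    # Pick the attribute with the largest information gain
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    remaining = [a for a in attrs if a != best]
    children = {}
    for value in sorted(set(row[best] for row in rows)):
        sub_rows = [r for r in rows if r[best] == value]
        sub_labels = [y for r, y in zip(rows, labels) if r[best] == value]
        children[value] = id3(sub_rows, sub_labels, remaining)
    return {best: children}


# Hypothetical toy data: two attribute columns per row
rows = [("s", "no"), ("s", "yes"), ("l", "no"), ("l", "yes")]
labels = ["fake", "fake", "real", "real"]
decision_tree = id3(rows, labels, attrs=[0, 1])
print(decision_tree)
```

In this toy data, column 0 separates the classes completely, so the sketch splits on it once and both children become leaves.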

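For simplicity the attributes above were discretised, but log density and friend density are in fact continuous. ID3 can handle a continuous attribute as follows:

First sort the elements of D by the attribute. The midpoint between each pair of adjacent elements is a potential split point. Starting from the first potential split point, split D there and compute the expected information of the two resulting sets. The point with the smallest expected information is the best split point for this attribute, and its expected information is taken as the expected information of the attribute.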
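The search for a split point on a continuous attribute can be sketched as follows. The function name `best_split_point` and the sample values are hypothetical, and ties between equally good split points are resolved by keeping the first one found.

```python
import math
from collections import Counter


def entropy(labels):
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total)
               for c in Counter(labels).values())


def best_split_point(values, labels):
    """Return (threshold, expected_info) with the smallest expected information."""
    pairs = sorted(zip(values, labels))        # sort elements by the attribute
    n = len(pairs)
    best_threshold, best_info = None, float("inf")
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue                           # equal values have no midpoint between them
        threshold = (low + high) / 2           # candidate split point
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        info = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if info < best_info:
            best_threshold, best_info = threshold, info
    return best_threshold, best_info


# Hypothetical continuous attribute values and class labels
values = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
labels = ["no", "no", "no", "yes", "yes", "yes"]
threshold, info = best_split_point(values, labels)
print(threshold, info)
```

Here the midpoint between 3.0 and 7.0 separates the two classes completely, so it is returned with an expected information of zero.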

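The C4.5 strategy

ID3 has a problem: it is biased towards attributes with many values. For example, if there is a unique identifier attribute ID, ID3 will choose it as the splitting attribute. The resulting partition is perfectly pure, but it is almost useless for classification. C4.5, the successor of ID3, uses an extension of information gain called the gain ratio to try to overcome this bias.

C4.5 first defines the "split information", which can be written as: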

$split\_info_A(D)=-\sum ^v_{j=1}\frac{|D_j|}{|D|}log_2(\frac{|D_j|}{|D|})$

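where the symbols have the same meaning as in ID3. The gain ratio is then defined as:

$gain\_ratio(A)=\frac{gain(A)}{split\_info_A(D)}$

C4.5 chooses the attribute with the largest gain ratio as the splitting attribute. Its application is otherwise similar to ID3 and is not repeated here.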
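To illustrate the difference between the two measures, the sketch below compares information gain and gain ratio for a unique identifier column and for an ordinary attribute. The helper names and the six-row data set are invented for this example; returning 0 for a single-valued attribute is a choice made here to avoid dividing by zero, not something the text prescribes.

```python
import math
from collections import Counter, defaultdict


def entropy(items):
    total = len(items)
    return sum(-(c / total) * math.log2(c / total)
               for c in Counter(items).values())


def info_gain(values, labels):
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    info_a = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - info_a


def gain_ratio(values, labels):
    # split_info_A(D) is the entropy of the attribute's own value distribution
    split_info = entropy(values)
    if split_info == 0:
        return 0.0          # single-valued attribute: nothing to split on
    return info_gain(values, labels) / split_info


# Hypothetical toy data, not the training set from the article
labels = ["yes", "yes", "no", "no", "yes", "no"]
account_id = [1, 2, 3, 4, 5, 6]             # unique for every row
feature = ["a", "a", "b", "b", "a", "a"]    # an ordinary two-valued attribute

for name, column in [("account_id", account_id), ("feature", feature)]:
    print(name,
          round(info_gain(column, labels), 3),
          round(gain_ratio(column, labels), 3))
```

On this data the identifier column has the larger information gain, so ID3 would pick it, while the ordinary attribute has the larger gain ratio, so C4.5 would prefer it.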

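Decision tree classification and regression with sklearn

Classification: DecisionTreeClassifier

Binary classification

The classifier takes two arrays as input:

an array X of shape [n_samples, n_features] holding the training samples;

an array Y of shape [n_samples] holding the class labels of the training samples.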

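from sklearn import tree

# Two training samples with two features each, and their class labels
X = [[0, 0],
     [1, 1]]
Y = [0, 1]
model = tree.DecisionTreeClassifier()
model = model.fit(X, Y)
a = model.predict([[0, 0]])          # predicted class label
b = model.predict_proba([[0., 2.]])  # predicted probability of each class

print(X, a, b)
# Output of the original run:
# [[0, 0], [1, 1]] [0] [[1. 0.]]
# Both features separate the two samples equally well, so b depends on
# which feature the tree happens to split on and can also be [[0. 1.]].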

Multiclass classification

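from sklearn import tree
from sklearn.datasets import load_iris

# Load the three-class iris data set and fit a tree on all of it
iris = load_iris()
model = tree.DecisionTreeClassifier()
model = model.fit(iris.data, iris.target)
# Optional: write the fitted tree in Graphviz DOT format
# with open("iris.dot", "w") as f:
#     tree.export_graphviz(model, out_file=f)
a = model.predict(iris.data[:1])  # predict the class of the first sample
print(iris.data[:1, :])
print(a)
# [[5.1 3.5 1.4 0.2]]
# [0]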
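The splits a fitted tree has learned can also be inspected as plain text, without Graphviz. The sketch below assumes a scikit-learn version that provides `tree.export_text` (0.21 or later); `max_depth=2` and `random_state=0` are choices made here only to keep the printed tree small and repeatable.

```python
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
# A shallow tree keeps the printed rules short (assumption for this example)
model = tree.DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(iris.data, iris.target)

# export_text renders the learned splits as indented plain text
rules = tree.export_text(model, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line is either a threshold test on one feature or a leaf giving the predicted class.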

Regression: DecisionTreeRegressor

Regression

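from sklearn import tree

# Two training samples and their continuous target values
X = [[0, 0], [2, 2]]
Y = [0.5, 2.5]
model = tree.DecisionTreeRegressor()
model = model.fit(X, Y)
a = model.predict([[0, 2]])
print(a)
# [0.5] in the original run. Both features separate the two samples equally
# well, so a tree that happens to split on the second feature gives [2.5].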

Fitting a sine curve

# Import the necessary modules and libraries
import numpy as np
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt

# Create a random dataset
rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(16))

# Fit regression model
regr_1 = DecisionTreeRegressor(max_depth=2)
regr_2 = DecisionTreeRegressor(max_depth=5)
regr_1.fit(X, y)
regr_2.fit(X, y)

# Predict
X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
y_1 = regr_1.predict(X_test)
y_2 = regr_2.predict(X_test)

# Plot the results
plt.figure()
plt.scatter(X, y, c="darkorange", label="data")
plt.plot(X_test, y_1, color="cornflowerblue", label="max_depth=2", linewidth=2)
plt.plot(X_test, y_2, color="yellowgreen", label="max_depth=5", linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()

Multi-output regression

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import fetch_olivetti_faces
from sklearn.utils.validation import check_random_state

from sklearn.ensemble import ExtraTreesRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import RidgeCV

# Load the faces datasets
data = fetch_olivetti_faces()
targets = data.target

data = data.images.reshape((len(data.images), -1))
train = data[targets < 30]
test = data[targets >= 30]  # Test on independent people

# Test on a subset of people
n_faces = 5
rng = check_random_state(4)
face_ids = rng.randint(test.shape[0], size=(n_faces, ))
test = test[face_ids, :]

n_pixels = data.shape[1]
# Upper half of the faces
X_train = train[:, :(n_pixels + 1) // 2]
# Lower half of the faces
y_train = train[:, n_pixels // 2:]
X_test = test[:, :(n_pixels + 1) // 2]
y_test = test[:, n_pixels // 2:]

# Fit estimators
ESTIMATORS = {
    "Extra trees": ExtraTreesRegressor(n_estimators=10, max_features=32,
                                       random_state=0),
    "K-nn": KNeighborsRegressor(),
    "Linear regression": LinearRegression(),
    "Ridge": RidgeCV(),
}

y_test_predict = dict()
for name, estimator in ESTIMATORS.items():
    estimator.fit(X_train, y_train)
    y_test_predict[name] = estimator.predict(X_test)

# Plot the completed faces
image_shape = (64, 64)

n_cols = 1 + len(ESTIMATORS)
plt.figure(figsize=(2. * n_cols, 2.26 * n_faces))
plt.suptitle("Face completion with multi-output estimators", size=16)

for i in range(n_faces):
    true_face = np.hstack((X_test[i], y_test[i]))

    if i:
        sub = plt.subplot(n_faces, n_cols, i * n_cols + 1)
    else:
        sub = plt.subplot(n_faces, n_cols, i * n_cols + 1,
                          title="true faces")

    sub.axis("off")
    sub.imshow(true_face.reshape(image_shape),
               cmap=plt.cm.gray,
               interpolation="nearest")

    for j, est in enumerate(sorted(ESTIMATORS)):
        completed_face = np.hstack((X_test[i], y_test_predict[est][i]))

        if i:
            sub = plt.subplot(n_faces, n_cols, i * n_cols + 2 + j)

        else:
            sub = plt.subplot(n_faces, n_cols, i * n_cols + 2 + j,
                              title=est)

        sub.axis("off")
        sub.imshow(completed_face.reshape(image_shape),
                   cmap=plt.cm.gray,
                   interpolation="nearest")

plt.show()

Intelligent Systems and Network Intelligence Laboratory, Donghua University (ysding)
