您的位置：首页 > 编程语言 > Python开发

逻辑回归python实现实例

2016-10-28 10:18 471 查看

这个例子是《机器学习实战》（<machine learning in action>）逻辑回归的一个实例：从疝气病症预测病马的死亡率。

疝病是描述马胃肠痛的术语。该数据集中包含了医院检查马疝病的一些指标，我们的目标是通过这些指标（特征），来预测马是否会死亡。

数据集包括229个训练样本和67各测试样本，特征数量为22。数据集中包含缺失值，采取的措施是用0替换所有缺失值，这样做的原因是，我们在更新theta值时使用下式（具体参考上一篇），当某个特征值缺失时，我们如果用0来替代，则对应的theta值不会被更新（公式右边第二项=0），即缺失值不会对参数造成影响。

以下为python代码，由于训练数据比较少，这边使用了批处理梯度下降法，没有使用增量梯度下降法。

##author:lijiayan
##data:2016/10/27
##name:logReg.py
from numpy import *
import matplotlib.pyplot as plt

def loadData(filename):
data = loadtxt(filename)
m,n = data.shape
print 'the number of  examples:',m
print 'the number of features:',n-1
x = data[:,0:n-1]
y = data[:,n-1:n]
return x,y

#the sigmoid function
def sigmoid(z):
return 1.0 / (1 + exp(-z))

#the cost function
def costfunction(y,h):
y = array(y)
h = array(h)
J = sum(y*log(h))+sum((1-y)*log(1-h))
return J

# the batch gradient descent algrithm
def gradescent(x,y):
m,n = shape(x)     #m: number of training example; n: number of features
x = c_[ones(m),x]     #add x0
x = mat(x)      # to matrix
y = mat(y)
a = 0.0000025       # learning rate
maxcycle = 4000
theta = zeros((n+1,1))  #initial theta

J = []
for i in range(maxcycle):
h = sigmoid(x*theta)
theta = theta + a * (x.T)*(y-h)
cost = costfunction(y,h)
J.append(cost)

plt.plot(J)
plt.show()
return theta,cost

#the stochastic gradient descent (m should be large,if you want the result is good)
def stocGraddescent(x,y):
m,n = shape(x)     #m: number of training example; n: number of features
x = c_[ones(m),x]     #add x0
x = mat(x)      # to matrix
y = mat(y)
a = 0.01       # learning rate
theta = ones((n+1,1))    #initial theta

J = []
for i in range(m):
h = sigmoid(x[i]*theta)
theta = theta + a * x[i].transpose()*(y[i]-h)
cost = costfunction(y,h)
J.append(cost)
plt.plot(J)
plt.show()
return theta,cost

#plot the decision boundary
def plotbestfit(x,y,theta):
plt.plot(x[:,0:1][where(y==1)],x[:,1:2][where(y==1)],'ro')
plt.plot(x[:,0:1][where(y!=1)],x[:,1:2][where(y!=1)],'bx')
x1= arange(-4,4,0.1)
x2 =(-float(theta[0])-float(theta[1])*x1) /float(theta[2])

plt.plot(x1,x2)
plt.xlabel('x1')
plt.ylabel(('x2'))
plt.show()

def classifyVector(inX,theta):
prob = sigmoid((inX*theta).sum(1))
return where(prob >= 0.5, 1, 0)

def accuracy(x, y, theta):
m = shape(y)[0]
x = c_[ones(m),x]
y_p = classifyVector(x,theta)
accuracy = sum(y_p==y)/float(m)
return accuracy

调用上面代码：

from logReg import *
x,y = loadData("horseColicTraining.txt")
theta,cost = gradescent(x,y)
print 'J:',cost

ac_train = accuracy(x, y, theta)
print 'accuracy of the training examples:', ac_train

x_test,y_test = loadData('horseColicTest.txt')
ac_test = accuracy(x_test, y_test, theta)
print 'accuracy of the test examples:', ac_test

学习速率=0.0000025，迭代次数=4000时的结果：

似然函数走势（J = sum(y*log(h))+sum((1-y)*log(1-h))），似然函数是求最大值，一般是要稳定了才算最好。

下图为计算结果，可以看到训练集的准确率为73%，测试集的准确率为78%。

这个时候，我去看了一下数据集，发现没个特征的数量级不一致，于是我想到要进行归一化处理：

归一化处理句修改列loadData(filename)函数：

def loadData(filename):
data = loadtxt(filename)
m,n = data.shape
print 'the number of  examples:',m
print 'the number of features:',n-1
x = data[:,0:n-1]
max = x.max(0)
min = x.min(0)
x = (x - min)/((max-min)*1.0)     #scaling
y = data[:,n-1:n]
return x,y

在没有归一化的时候，我的学习速率取了0.0000025（加大就会震荡，因为有些特征的值很大，学习速率取的稍大，波动就很大），由于学习速率小，迭代了4000次也没有完全稳定。现在当把特征归一化后（所有特征的值都在0~1之间），这样学习速率可以加大，迭代次数就可以大大减少，以下是学习速率=0.005，迭代次数=500的结果：

此时的训练集的准确率为72%，测试集的准确率为73%

从上面这个例子，我们可以看到对特征进行归一化操作的重要性。

内容来自用户分享和网络整理，不保证内容的准确性，如有侵权内容，可联系管理员处理

标签： python 算法 numpy 机器学习数据分析

相关文章推荐

新的分享

章节导航