
Machine Learning: Implementing Handwritten Digit Recognition with k-Nearest Neighbors (KNN) (Part 3)

2016-06-16 14:41
    This post builds on the theory covered in the previous two installments, KNN Theory (Part 1) and KNN Theory (Part 2), and turns it into a working KNN implementation. Following the book *Machine Learning in Action*, we work through a classic and fun example: using the k-nearest-neighbor algorithm to recognize handwritten digits. The Python code below is annotated step by step so that the purpose of each part is clear; questions and discussion are welcome.

    The training data is stored in the trainingDigits directory. We use 100 samples in total: ten digits, 0 through 9, with ten handwritten samples per digit, giving us 100 reference samples to compute distances against.


 
Each handwritten digit is stored as a text file and has been binarized in advance, i.e. it is represented using only the two digits 0 and 1, as shown below:



This makes it straightforward to convert each image into a feature row vector for distance computation, while still preserving enough of the original information to distinguish the digits.
The test data is stored in the testDigits directory: 50 samples in total, five test samples for each of the ten digits. (Note the file naming convention here, since it encodes the true label.)
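As a quick sanity check of that naming convention, the digit label can be parsed straight out of a file name such as 1_120.txt. This is just a minimal sketch; `label_from_filename` is a hypothetical helper name, not part of the book's code:

```python
# Parse the class label from a file named like "1_120.txt":
# the part before '_' is the digit, the part after is the sample index.
def label_from_filename(name):
    stem = name.split('.')[0]        # "1_120"
    return int(stem.split('_')[0])   # 1

print(label_from_filename('1_120.txt'))  # -> 1
print(label_from_filename('9_4.txt'))    # -> 9
```

This is exactly the string slicing the test driver below performs inline for every file.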



The Python implementation below consists of three blocks:
Block 1 converts each 32*32 binary digit matrix into a 1*1024 row vector.
Block 2 is the test driver: it builds the label list hwLabels, loads the training data into the big matrix trainingMat, then reads each test image one by one, feeds it to the classify0 classifier, and prints the predicted and true labels along with the final error rate.
Block 3 is the core of the KNN algorithm: classify0 computes Euclidean distances, finds the k nearest neighbors, and returns the majority class among them as the prediction.

# -*- coding: utf-8 -*-
# Each sample is a 32x32 binary image stored as text;
# convert it into a 1x1024 feature row vector.
# Block 1
from numpy import *
import operator
from os import listdir

def img2vector(filename):
    returnVect = zeros((1,1024))            # array that will hold the image data
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()             # read one line of 32 binary digits
        for j in range(32):
            returnVect[0,32*i+j] = int(lineStr[j])   # fill the 1024 slots, 32 per row
    return returnVect
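To see the flattening in action without the real data files, a small self-contained check can generate a fake 32x32 digit file and run it through the conversion. This is a sketch: the temp-file handling is purely illustrative, and img2vector is restated here in Python 3 form so the snippet runs on its own:

```python
import os
import tempfile
import numpy as np

def img2vector(filename):
    # Flatten a 32x32 text image of 0/1 characters into a 1x1024 row vector.
    vect = np.zeros((1, 1024))
    with open(filename) as fr:
        for i in range(32):
            line = fr.readline()
            for j in range(32):
                vect[0, 32 * i + j] = int(line[j])
    return vect

# Build a fake digit file: all zeros except a vertical bar of ones in column 16.
rows = []
for i in range(32):
    row = ['0'] * 32
    row[16] = '1'
    rows.append(''.join(row))

with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write('\n'.join(rows) + '\n')
    path = f.name

v = img2vector(path)
os.remove(path)
print(v.shape)        # (1, 1024)
print(int(v.sum()))   # 32 -- one '1' per row
```

The resulting row vector has exactly one nonzero entry per image row, at offset 32*i + 16.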

# Block 2
def handwritingClassTest():
    # Load the training set into the big matrix trainingMat
    hwLabels = []
    trainingFileList = listdir('C:\\Anaconda\\trainingDigits')   # os.listdir(path) returns a list of all file names under path
    m = len(trainingFileList)                 # m is the total number of training samples
    trainingMat = zeros((m,1024))             # holds the m samples
    for i in range(m):
        fileNameStr = trainingFileList[i]                  # training files are named like 1_120.txt
        fileStr = fileNameStr.split('.')[0]                # split on '.' and take [0], giving e.g. 1_120
        classNumStr = int(fileStr.split('_')[0])           # split on '_' to get 1, i.e. the digit class
        hwLabels.append(classNumStr)                       # hwLabels now stores the class of each of the m samples
        trainingMat[i,:] = img2vector('C:\\Anaconda\\trainingDigits\\%s' % fileNameStr)   # store each sample in the m x 1024 matrix

    # Read the test images one by one and classify each of them
    testFileList = listdir('C:\\Anaconda\\testDigits')     # list of test file names
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]            # name of the file being tested
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])          # true digit label; rebinds the variable of the same name used above
        vectorUnderTest = img2vector('C:\\Anaconda\\testDigits\\%s' % fileNameStr)  # convert the test digit into a row vector
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)     # run the classifier with k = 3
        print "the classifier came back with: %d, the real answer is: %d" % (classifierResult, classNumStr)
        if (classifierResult != classNumStr): errorCount += 1.0
    print "\nthe total number of errors is: %d" % errorCount
    print "\nthe total error rate is: %f" % (errorCount/float(mTest))

# Block 3
# Core classifier: compute Euclidean distances, take the k nearest samples,
# and return the class that occurs most often among them.
# inX is the vector to classify
# dataSet is the training set, one sample per row; labels holds their classes
# k is the number of nearest neighbors to consider
def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]                       # shape[0] is the number of rows, i.e. training samples
    diffMat = tile(inX, (dataSetSize,1)) - dataSet       # tile(A,(m,n)) stacks copies of A into an m x n grid; here a 100 x 1024 array
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)                  # sum(axis=1) sums along each row; axis=0 would sum down each column
    distances = sqDistances**0.5                         # square root, i.e. the Euclidean distance
    sortedDistIndicies = distances.argsort()             # argsort() gives the indices that would sort the array in ascending order
    classCount = {}                                      # sortedDistIndicies[0] is the index of the nearest training sample
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]       # label of the i-th nearest sample
        classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1   # get(key,0) returns 0 for a missing key, so the dict accumulates vote counts
        # classCount ends up looking like {5:3, 0:6, 1:7, 2:1}
    sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)   # sort the (label, count) pairs by count, descending
    return sortedClassCount[0][0]                        # sorted() yields a list of (key, value) tuples; take the top label
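The voting logic of classify0 can be checked on a toy 2-D dataset without any digit files. The sketch below is a Python 3 port of the same function (dict.iteritems() became dict.items() in Python 3) so that it runs standalone; the toy points are made up for illustration:

```python
import operator
import numpy as np

def classify0(inX, dataSet, labels, k):
    # Python 3 port of the classifier above: Euclidean distance + majority vote.
    dataSetSize = dataSet.shape[0]
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    distances = ((diffMat ** 2).sum(axis=1)) ** 0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # items() replaces the Python 2 iteritems()
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

# Four toy points: two near (1,1) labeled 'A', two near the origin labeled 'B'.
group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']

print(classify0(np.array([0.1, 0.1]), group, labels, 3))  # -> B
print(classify0(np.array([0.9, 1.0]), group, labels, 3))  # -> A
```

With k = 3, the query near the origin picks up two 'B' neighbors against one 'A', so the majority vote returns 'B', and symmetrically for the query near (1,1).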


Test results:

import knn

knn.handwritingClassTest()
the classifier came back with: 0, the real answer is: 0
the classifier came back with: 0, the real answer is: 0
the classifier came back with: 0, the real answer is: 0
the classifier came back with: 0, the real answer is: 0
the classifier came back with: 0, the real answer is: 0
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 4, the real answer is: 4
the classifier came back with: 4, the real answer is: 4
the classifier came back with: 4, the real answer is: 4
the classifier came back with: 4, the real answer is: 4
the classifier came back with: 4, the real answer is: 4
the classifier came back with: 5, the real answer is: 5
the classifier came back with: 5, the real answer is: 5
the classifier came back with: 5, the real answer is: 5
the classifier came back with: 5, the real answer is: 5
the classifier came back with: 5, the real answer is: 5
the classifier came back with: 6, the real answer is: 6
the classifier came back with: 6, the real answer is: 6
the classifier came back with: 6, the real answer is: 6
the classifier came back with: 6, the real answer is: 6
the classifier came back with: 6, the real answer is: 6
the classifier came back with: 7, the real answer is: 7
the classifier came back with: 7, the real answer is: 7
the classifier came back with: 7, the real answer is: 7
the classifier came back with: 7, the real answer is: 7
the classifier came back with: 7, the real answer is: 7
the classifier came back with: 8, the real answer is: 8
the classifier came back with: 8, the real answer is: 8
the classifier came back with: 8, the real answer is: 8
the classifier came back with: 8, the real answer is: 8
the classifier came back with: 8, the real answer is: 8
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 9, the real answer is: 9

the total number of errors is: 0

the total error rate is: 0.000000


The accuracy turns out to be high: not a single error. This is largely because the dataset is small; with only 100 training and 50 test samples, we happen to classify every test case correctly.

Reference: http://blog.csdn.net/u012162613/article/details/41768407#t2