Machine Learning in Action: kMeans
2016-03-25 16:26
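The previous chapters mainly covered supervised learning; starting with this section we move on to unsupervised learning. This section covers kMeans, post-processing of kMeans clusters, and bisecting kMeans.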
![](http://img.blog.csdn.net/20160325161808674)
![](http://img.blog.csdn.net/20160325162657731)
![](http://img.blog.csdn.net/20160325162854062)
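I. kMeans
1. Algorithm principle: pick k initial centroids, assign every point to its nearest centroid, move each centroid to the mean of its assigned points, and repeat until no assignment changes.
2. Algorithm implementation:
1. Choosing the initial centroids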
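    # The listings in this post assume this import (dataSet is a numpy matrix).
    # Note: NumPy 2.0 removed the alias `mat`; use `asmatrix` there instead.
    from numpy import *

    def randCent(dataSet, k):
        n = shape(dataSet)[1]
        centroids = mat(zeros((k, n)))  # create centroid mat, one row per centroid
        for j in range(n):  # create random cluster centers, within bounds of each dimension
            minJ = min(dataSet[:, j])
            rangeJ = float(max(dataSet[:, j]) - minJ)
            centroids[:, j] = mat(minJ + rangeJ * random.rand(k, 1))
        return centroids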
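As a rough illustration of the same idea on a plain ndarray instead of a numpy matrix, the sketch below draws centroids inside the per-feature bounding box of the data. The name `rand_centroids`, the seeded generator, and the toy data are assumptions made for this example and are not part of the original listing.

```python
import numpy as np

def rand_centroids(data, k, rng):
    """Draw k centroids uniformly inside the bounding box of data (m x n ndarray)."""
    lo = data.min(axis=0)  # per-feature minimum, shape (n,)
    hi = data.max(axis=0)  # per-feature maximum, shape (n,)
    # rng.random((k, n)) is uniform in [0, 1); scale each column into [lo, hi)
    return lo + (hi - lo) * rng.random((k, data.shape[1]))

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
data = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
cents = rand_centroids(data, k=4, rng=rng)
print(cents.shape)  # (4, 2)
```

Scaling per column matters here because the two features have very different ranges; a single global range would place most centroids far from the data in the first dimension.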
2. Distance computation
    def distEclud(vecA, vecB):
        return sqrt(sum(power(vecA - vecB, 2)))  # Euclidean distance, same value as linalg.norm(vecA - vecB)
3. Clustering with kMeans
    def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
        m = shape(dataSet)[0]
        # column 0: index of the assigned centroid; column 1: squared error of the point
        clusterAssment = mat(zeros((m, 2)))
        centroids = createCent(dataSet, k)
        clusterChanged = True
        while clusterChanged:
            clusterChanged = False
            for i in range(m):  # for each data point, assign it to the closest centroid
                minDist = inf
                minIndex = -1
                for j in range(k):
                    distJI = distMeas(centroids[j, :], dataSet[i, :])
                    if distJI < minDist:
                        minDist = distJI
                        minIndex = j
                # stop condition: keep iterating while any assignment still changes
                if clusterAssment[i, 0] != minIndex:
                    clusterChanged = True
                clusterAssment[i, :] = minIndex, minDist ** 2
            print(centroids)
            # update centroids
            for cent in range(k):
                ptsInClust = dataSet[nonzero(clusterAssment[:, 0].A == cent)[0]]  # all points in this cluster
                centroids[cent, :] = mean(ptsInClust, axis=0)  # move the centroid to the cluster mean
        return centroids, clusterAssment
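The loop above can also be sketched in vectorized form on plain ndarrays. This is a minimal sketch under stated assumptions, not the listing's method: the function name `kmeans`, the `max_iter` cap, and the choice to start from k randomly chosen data points (instead of random points in the bounding box) are all introduced for this example, as is the two-blob toy data.

```python
import numpy as np

def kmeans(data, k, rng, max_iter=100):
    """Plain k-means on an (m, n) ndarray; returns (centroids, labels, sq_dists)."""
    # Assumption: initial centroids are k distinct data points
    centroids = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    labels = np.full(len(data), -1)
    for _ in range(max_iter):
        # squared distance from every point to every centroid, shape (m, k)
        d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # no assignment changed, so the algorithm has converged
        labels = new_labels
        for c in range(k):
            members = data[labels == c]
            if len(members):  # an empty cluster keeps its previous centroid
                centroids[c] = members.mean(axis=0)
    # squared error of each point with respect to its assigned centroid
    sq_dists = ((data - centroids[labels]) ** 2).sum(axis=1)
    return centroids, labels, sq_dists

rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(20, 2))   # points around (0, 0)
blob_b = rng.normal(loc=10.0, scale=0.5, size=(20, 2))  # points around (10, 10)
data = np.vstack([blob_a, blob_b])
centroids, labels, sq_dists = kmeans(data, k=2, rng=rng)
print(centroids)
print("SSE:", sq_dists.sum())
```

The guard for empty clusters is a deliberate difference from the matrix version, where taking the mean of an empty cluster would produce NaN centroids.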
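II. Bisecting kMeans
Because kMeans easily converges to a local optimum, bisecting kMeans is introduced.
1. Algorithm principle: start with all points in a single cluster, then repeatedly split the cluster whose two-way kMeans split gives the lowest total SSE, until there are k clusters.
2. Main algorithm implementation
1. Bisecting kMeans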
    def biKmeans(dataSet, k, distMeas=distEclud):
        m = shape(dataSet)[0]
        clusterAssment = mat(zeros((m, 2)))
        # initialize: a single cluster whose centroid is the mean of all points
        centroid0 = mean(dataSet, axis=0).tolist()[0]
        centList = [centroid0]  # create a list with one centroid
        for j in range(m):  # calc initial error
            clusterAssment[j, 1] = distMeas(mat(centroid0), dataSet[j, :]) ** 2
        while len(centList) < k:
            lowestSSE = inf
            for i in range(len(centList)):
                # get the data points currently in cluster i
                ptsInCurrCluster = dataSet[nonzero(clusterAssment[:, 0].A == i)[0], :]
                centroidMat, splitClustAss = kMeans(ptsInCurrCluster, 2, distMeas)
                sseSplit = sum(splitClustAss[:, 1])  # SSE of cluster i after splitting it in two
                sseNotSplit = sum(clusterAssment[nonzero(clusterAssment[:, 0].A != i)[0], 1])  # SSE of all other clusters
                print("sseSplit, and notSplit: ", sseSplit, sseNotSplit)
                if (sseSplit + sseNotSplit) < lowestSSE:  # compare with the current minimum
                    bestCentToSplit = i
                    bestNewCents = centroidMat
                    bestClustAss = splitClustAss.copy()
                    lowestSSE = sseSplit + sseNotSplit
            # relabel the two halves: label 1 becomes a new cluster index, label 0 keeps the old one
            bestClustAss[nonzero(bestClustAss[:, 0].A == 1)[0], 0] = len(centList)
            bestClustAss[nonzero(bestClustAss[:, 0].A == 0)[0], 0] = bestCentToSplit
            print('the bestCentToSplit is: ', bestCentToSplit)
            print('the len of bestClustAss is: ', len(bestClustAss))
            centList[bestCentToSplit] = bestNewCents[0, :].tolist()[0]  # replace a centroid with two best centroids
            centList.append(bestNewCents[1, :].tolist()[0])
            # reassign new clusters, and SSE
            clusterAssment[nonzero(clusterAssment[:, 0].A == bestCentToSplit)[0], :] = bestClustAss
        return mat(centList), clusterAssment
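For readers who prefer plain ndarrays, the split-the-best-cluster loop can be sketched as follows. This is an illustrative sketch rather than the listing's implementation: the helper `two_means`, the function `bisecting_kmeans`, the `max_iter` cap, the data-point initialization, and the three-blob toy data are all assumptions introduced here.

```python
import numpy as np

def two_means(points, rng, max_iter=50):
    """Split points into two clusters with plain k-means (k fixed at 2)."""
    cents = points[rng.choice(len(points), size=2, replace=False)].astype(float)
    labels = np.full(len(points), -1)
    for _ in range(max_iter):
        d2 = ((points[:, None, :] - cents[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
        for c in (0, 1):
            if np.any(labels == c):
                cents[c] = points[labels == c].mean(axis=0)
    sq = ((points - cents[labels]) ** 2).sum(axis=1)  # squared error per point
    return cents, labels, sq

def bisecting_kmeans(data, k, rng):
    labels = np.zeros(len(data), dtype=int)    # every point starts in cluster 0
    cents = [data.mean(axis=0)]
    sq = ((data - cents[0]) ** 2).sum(axis=1)  # squared error of each point
    while len(cents) < k:
        best = None
        for i in range(len(cents)):
            mask = labels == i
            if mask.sum() < 2:
                continue  # a single point cannot be split
            c2, l2, s2 = two_means(data[mask], rng)
            total = s2.sum() + sq[~mask].sum()  # total SSE if cluster i were split
            if best is None or total < best[0]:
                best = (total, i, mask, c2, l2, s2)
        if best is None:
            break  # nothing left to split
        _, i, mask, c2, l2, s2 = best
        idx = np.flatnonzero(mask)
        labels[idx[l2 == 1]] = len(cents)  # second half gets a new cluster id
        sq[idx] = s2
        cents[i] = c2[0]                   # first half keeps the old id
        cents.append(c2[1])
    return np.array(cents), labels, sq

rng = np.random.default_rng(0)
centers = [(0.0, 0.0), (10.0, 0.0), (40.0, 0.0)]  # three well-separated blobs
data = np.vstack([rng.normal(loc=c, scale=0.3, size=(15, 2)) for c in centers])
cents, labels, sq = bisecting_kmeans(data, k=3, rng=rng)
print(cents)
print("SSE:", sq.sum())
```

On toy data this well separated, each blob should end up as its own cluster; on real data the result still depends on how each two-way split happens to initialize.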
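Notes:
1. kMeans only converges to a local optimum. The main factors that affect clustering quality are the number of clusters k and the distance measure, so different choices of k and distMeas produce different results, and the results are quite sensitive to them.
2. To improve the result, the clusters can be post-processed after kMeans finishes; SSE (sum of squared errors) is used here as the measure of cluster quality.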
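To make the local-optimum point concrete, the small sketch below compares the SSE of two different 2-cluster partitions of the same four points. The helper `sse` and the toy rectangle are assumptions made for this example. Both partitions are stable under the kMeans update (every point is already closest to the mean of its own cluster), so kMeans could stop at either one, yet their SSE values differ widely.

```python
import numpy as np

def sse(data, labels, k):
    """Sum of squared distances from each point to the mean of its own cluster."""
    total = 0.0
    for c in range(k):
        members = data[labels == c]
        if len(members):
            total += ((members - members.mean(axis=0)) ** 2).sum()
    return total

# Four corners of a 4 x 1 rectangle
data = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])
left_right = np.array([0, 0, 1, 1])  # groups the two left corners and the two right corners
top_bottom = np.array([0, 1, 0, 1])  # groups the two bottom corners and the two top corners

print(sse(data, left_right, 2))  # 1.0
print(sse(data, top_bottom, 2))  # 16.0
```

A lower SSE indicates tighter clusters, which is why it is a natural criterion for deciding which cluster to split or whether a post-processing step helped.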