
Clustering algorithms: implementing k-means in Python

2017-05-11 15:54

The idea behind the algorithm

Put simply, k-means splits a set of samples into k clusters according to their similarity.

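Given a sample set D = {x1, x2, x3, ..., xm}, k-means partitions it into k clusters C = {C1, C2, ..., Ck}, where each Ci (1 <= i <= k) is a set containing some of the samples xj (1 <= j <= m), such that the squared error is minimized:

$$E = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - u_i \rVert_2^2$$

where

$$u_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$$

ui is the mean vector of all samples in cluster Ci. Minimizing E exactly is an NP-hard problem, however, so an iterative optimization is used to find an approximate solution.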
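To make the objective concrete, here is a minimal sketch that evaluates E for a given partition. The function name `squared_error` and the four toy points are illustrative assumptions and do not come from the original post.

```python
import numpy as np

def squared_error(X, labels):
    """Sum of squared Euclidean distances from each sample to its cluster mean."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for c in np.unique(labels):
        members = X[labels == c]      # samples assigned to cluster c
        u = members.mean(axis=0)      # mean vector of that cluster
        total += ((members - u) ** 2).sum()
    return total

# Hypothetical toy data: two tight pairs of points.
X = [[0, 0], [0, 2], [10, 0], [10, 2]]
good = squared_error(X, [0, 0, 1, 1])   # each pair forms its own cluster
bad = squared_error(X, [0, 1, 0, 1])    # each cluster mixes the two pairs
print(good, bad)
```

Here the first partition gives E = 4 and the second gives E = 100, so a lower E corresponds to the tighter grouping.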

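The rough procedure is as follows:

1. Randomly pick k samples from the data set as the initial values of the mean vectors.

2. Go through every sample, compute its Euclidean distance to each of the k mean vectors, pick the nearest mean vector, and assign the sample to that cluster.

3. Update the mean vectors: recompute each ui from the current partition. If ui has changed, update it; otherwise leave it as it is. Repeat step 2 until no mean vector is updated. (To keep the running time bounded, a maximum number of iterations or a minimum adjustment threshold is usually set; the algorithm stops once the maximum number of iterations is reached or the adjustment falls below the threshold.)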
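The three steps can be sketched compactly with NumPy. This is only an illustration under stated assumptions: the name `kmeans_sketch`, the parameters `tol` and `seed`, the choice of the largest coordinate change as the "adjustment", and the six toy points are all hypothetical and are not the code used later in this post.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, tol=1e-9, seed=0):
    """Alternate assignment and mean-update steps until the means stop moving."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: k distinct samples serve as the initial mean vectors.
    U = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: distance from every sample to every mean, then pick the nearest.
        d = np.linalg.norm(X[:, None, :] - U[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 3: recompute each mean; an empty cluster keeps its old mean.
        U_new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else U[j]
                          for j in range(k)])
        shift = np.abs(U_new - U).max()   # largest change in any coordinate
        U = U_new
        if shift <= tol:                  # adjustment below the threshold: stop
            break
    return labels, U

# Hypothetical toy data: two well-separated groups of three points.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, U = kmeans_sketch(X, k=2)
print(labels)
print(U)
```

Both stopping rules from step 3 appear here: `max_iter` bounds the number of passes, and `tol` stops the loop once the mean vectors barely move.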

An example

The data is the watermelon data set from Prof. Zhou Zhihua's "watermelon book" (Machine Learning).

No.,density,sugar content,good melon
1,0.697,0.46,Y
2,0.774,0.376,Y
3,0.634,0.264,Y
4,0.608,0.318,Y
5,0.556,0.215,Y
6,0.403,0.237,Y
7,0.481,0.149,Y
8,0.437,0.211,Y
9,0.666,0.091,N
10,0.243,0.267,N
11,0.245,0.057,N
12,0.343,0.099,N
13,0.639,0.161,N
14,0.657,0.198,N
15,0.36,0.37,N
16,0.593,0.042,N
17,0.719,0.103,N


# -*- coding:utf-8 -*-
import re
import math
import numpy as np
import matplotlib.pyplot as pl   # pyplot is preferred over the legacy pylab interface
data = \
"1,0.697,0.46,Y,\
2,0.774,0.376,Y,\
3,0.634,0.264,Y,\
4,0.608,0.318,Y,\
5,0.556,0.215,Y,\
6,0.403,0.237,Y,\
7,0.481,0.149,Y,\
8,0.437,0.211,Y,\
9,0.666,0.091,N,\
10,0.243,0.267,N,\
11,0.245,0.057,N,\
12,0.343,0.099,N,\
13,0.639,0.161,N,\
14,0.657,0.198,N,\
15,0.36,0.37,N,\
16,0.593,0.042,N,\
17,0.719,0.103,N"
# A watermelon class with four attributes: number, density, sugar content, and whether it is a good melon
class watermelon:
    def __init__(self, properties):
        self.number = properties[0]
        self.density = float(properties[1])
        self.sweet = float(properties[2])
        self.good = properties[3]

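# Simple preprocessing: split the string into fields and build one object per four fields
a = re.split(',|\n|\t', data.strip(" "))
dataset = []     # dataset: list of watermelon objects
for i in range(len(a) // 4):
    temp = tuple(a[i * 4: i * 4 + 4])
    dataset.append(watermelon(temp))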

# Euclidean distance between a and b, each a tuple of two coordinates
def dist(a, b):
    return math.sqrt(math.pow(a[0]-b[0], 2)+math.pow(a[1]-b[1], 2))

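# The k-means procedure
def k_means(k, dataset, max_iter):
    # Pick k distinct samples as the initial mean vectors
    U = np.random.choice(dataset, k, replace=False)
    U = [(wm.density, wm.sweet) for wm in U]    # list of mean vectors
    C = [[] for i in range(k)]      # list of clusters
    while max_iter > 0:
        # Assignment: put every sample into the cluster of its nearest mean vector
        C = [[] for i in range(k)]
        for i in dataset:
            temp = np.argmin([dist((i.density, i.sweet), U[j]) for j in range(len(U))])
            C[temp].append(i)
        # Update the mean vectors
        U_update = []
        for i in range(k):
            if not C[i]:
                U_update.append(U[i])   # an empty cluster keeps its old mean vector
                continue
            ui_density = 0.0
            ui_sweet = 0.0
            for j in C[i]:
                ui_density += j.density
                ui_sweet += j.sweet
            U_update.append((ui_density/len(C[i]), ui_sweet/len(C[i])))
        # Plot the clustering once every five iterations
        if max_iter % 5 == 0:
            draw(C, U)
        # Compare U and U_update; stop if nothing changed
        if U == U_update:
            break
        U = U_update
        max_iter -= 1

    return C, U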

# Plotting
def draw(C, U):
    colValue = ['r', 'y', 'g', 'b', 'c', 'k', 'm']
    for i in range(len(C)):
        coo_X = []    # x coordinates
        coo_Y = []    # y coordinates
        for j in range(len(C[i])):
            coo_X.append(C[i][j].density)
            coo_Y.append(C[i][j].sweet)
        # Cycle through the colour list so any k gets a valid colour
        pl.scatter(coo_X, coo_Y, marker='x', color=colValue[i % len(colValue)], label=str(i))
    # Show the mean vectors
    U_x = []
    U_y = []
    for i in U:
        U_x.append(i[0])
        U_y.append(i[1])
    pl.scatter(U_x, U_y, marker='.', color=colValue[6], label="avg_vector")
    pl.legend(loc='upper right')
    pl.show()

C, U = k_means(3, dataset, 30)
draw(C, U)
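Because the initial mean vectors are drawn at random, two runs of the script can give different clusterings. Below is a minimal sketch of how the draw could be made repeatable, assuming the script keeps using NumPy's global random state through `np.random.choice`. The helper `pick_initial` and the seed value 42 are hypothetical.

```python
import numpy as np

def pick_initial(n_samples, k, seed):
    """Return k distinct sample indices, reproducible for a given seed."""
    np.random.seed(seed)    # reset NumPy's global random state
    return np.random.choice(n_samples, k, replace=False)

first = pick_initial(17, 3, seed=42)
second = pick_initial(17, 3, seed=42)
print(first, second)   # the two draws are identical
```

In the script above, calling `np.random.seed(42)` just before `k_means(3, dataset, 30)` should have the same effect on the initial selection.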


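Results

The two figures below come from the iterations of a single run of the program.

The first figure shows the initial state, in which the mean vectors coincide with sample points.

The second figure shows the final clustering.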





Reference: Zhou Zhihua, Machine Learning (《机器学习》), Chapter 9