
Clustering (1): k-means (Python implementation)

2015-04-14 10:38

1. What is clustering

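Clustering (聚类 in Chinese) means, put simply, grouping similar things together. It differs from classification. A classifier usually needs labeled examples of the form "this item belongs to such-and-such class"; ideally it learns from that training set and becomes able to classify unseen data. Supplying training data in this way is called supervised learning. In clustering, by contrast, we do not care what any particular class is; the only goal is to bring similar things together. A clustering algorithm therefore usually needs nothing more than a way to compute similarity before it can start working, and it generally does not need training data to learn from. In machine learning this is called unsupervised learning.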
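To make "a way to compute similarity" concrete, here is a minimal sketch that uses Euclidean distance as the (dis)similarity measure. The three points and the choice of Euclidean distance are assumptions made for illustration; other measures could serve equally well.

```python
import numpy as np

# Three hypothetical 2-D points: a and b are close together, c is far away.
a = np.array([1.0, 1.0])
b = np.array([1.5, 1.2])
c = np.array([8.0, 9.0])


def distance(p, q):
    """Euclidean distance; a smaller value means the points are more similar."""
    return float(np.linalg.norm(p - q))


d_ab = distance(a, b)
d_ac = distance(a, c)
print(d_ab < d_ac)  # a is more similar to b than to c
```

No labels are involved anywhere: the only ingredient is the distance function.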

2. The k-means algorithm

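The objective function that k-means optimizes: suppose we have N data points $x_1, \dots, x_N$ to be divided into K clusters. k-means minimizes

$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert x_n - \mu_k \rVert^2$$

where $\mu_k$ is the center of cluster k, and $r_{nk}$ is 1 when data point n is assigned to cluster k and 0 otherwise. Finding the $r_{nk}$ and $\mu_k$ that minimize $J$ directly is not easy, but we can iterate. First fix $\mu_k$ and choose the optimal $r_{nk}$: assigning each data point to its nearest center clearly makes $J$ as small as possible. Next fix $r_{nk}$ and find the optimal $\mu_k$. Differentiating $J$ with respect to $\mu_k$ gives $\partial J / \partial \mu_k = -2 \sum_{n} r_{nk} (x_n - \mu_k)$, and setting this to zero shows that $J$ is minimized when $\mu_k$ satisfies

$$\mu_k = \frac{\sum_{n} r_{nk} \, x_n}{\sum_{n} r_{nk}}$$

that is, $\mu_k$ should be the mean of all data points in cluster k. Because each of the two steps minimizes $J$ with the other set of variables held fixed, $J$ can only decrease or stay the same, never increase, which guarantees that k-means eventually reaches a local minimum. k-means is not guaranteed to find the global optimum, but for a problem of this kind that is a good result for an algorithm this simple.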
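The claim that the objective never increases can be checked numerically. The following is a minimal sketch, not the article's implementation: the helper names `objective`, `assign`, and `update`, the synthetic two-blob data, and the fixed seed are all assumptions made for illustration. It records the objective after every half-step so both the center update and the reassignment can be inspected.

```python
import numpy as np


def objective(X, centers, labels):
    """J: sum of squared distances from each point to its assigned center."""
    return float(((X - centers[labels]) ** 2).sum())


def assign(X, centers):
    """Fix the centers, choose the assignments: nearest center for each point."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def update(X, labels, centers):
    """Fix the assignments, choose the centers: mean of each cluster's points."""
    new_centers = centers.copy()
    for k in range(len(centers)):
        members = X[labels == k]
        if len(members) > 0:  # an empty cluster keeps its previous center
            new_centers[k] = members.mean(axis=0)
    return new_centers


rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(5.0, 1.0, (50, 2))])
centers = X[rng.choice(len(X), size=2, replace=False)].copy()
labels = assign(X, centers)

history = [objective(X, centers, labels)]
for _ in range(10):
    centers = update(X, labels, centers)
    history.append(objective(X, centers, labels))  # after the center step
    labels = assign(X, centers)
    history.append(objective(X, centers, labels))  # after the assignment step

print(history)
```

Each entry in `history` should be no larger than the one before it, up to floating-point rounding, because each half-step is an exact minimization over one set of variables.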

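To summarize, the k-means algorithm proceeds as follows:

1. Choose initial values for the K centers $\mu_k$. There are often problem-specific heuristics for this; otherwise, in most cases, the centers are chosen at random. As noted above, k-means is not guaranteed to find the global optimum, and whether it does depends heavily on the initial values, so k-means is sometimes run several times from different initial values and the best result is kept.
2. Assign each data point to the cluster whose center is nearest to it.
3. Compute the new center of each cluster with the formula $\mu_k = \frac{\sum_{n} r_{nk} \, x_n}{\sum_{n} r_{nk}}$.
4. Repeat steps 2 and 3 until the maximum number of iterations is reached or the value of $J$ changes by less than a threshold between consecutive iterations.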
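The advice in step 1 about running k-means several times can be sketched as follows. This is an illustration under stated assumptions rather than the article's code: `kmeans_once` and `kmeans_restarts` are hypothetical helper names, the number of restarts `n_init=10` is arbitrary, and "best" is taken to mean the lowest final value of the objective.

```python
import numpy as np


def kmeans_once(X, k, rng, maxiter=300, threshold=1e-12):
    """One k-means run from a random start; returns (J, labels, centers)."""
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    J_prev = np.inf
    for _ in range(maxiter):
        # step 2: assign each point to its nearest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # step 3: move each center to the mean of its points
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old center
                centers[j] = members.mean(axis=0)
        J = float(((X - centers[labels]) ** 2).sum())
        # step 4: stop once J improves by less than the threshold
        if J_prev - J < threshold:
            break
        J_prev = J
    return J, labels, centers


def kmeans_restarts(X, k, n_init=10, seed=0):
    """Run k-means n_init times and keep the run with the lowest J."""
    rng = np.random.default_rng(seed)
    runs = [kmeans_once(X, k, rng) for _ in range(n_init)]
    return min(runs, key=lambda run: run[0])


# Hypothetical data: three Gaussian blobs of 30 points each.
data_rng = np.random.default_rng(1)
X = np.vstack([data_rng.normal(m, 0.5, (30, 2)) for m in (0.0, 4.0, 8.0)])

best_J, best_labels, best_centers = kmeans_restarts(X, k=3)
single_J, _, _ = kmeans_once(X, 3, np.random.default_rng(0))
print(best_J <= single_J)
```

The single run uses the same seed as the first restart, so the best of ten restarts can never be worse than it.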

3. Python implementation
Dataset: 100 × 2 (100 two-dimensional points)
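The data file itself is not included with the article. For readers who want to follow along, here is a sketch that generates a stand-in dataset of the same shape and writes it in a space-separated format, one point per line. The two Gaussian blobs, the seed, and the file name `data01_standin.txt` are assumptions; the article's actual data may look quite different.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(42)
# Two hypothetical Gaussian blobs of 50 points each: 100 x 2 in total.
blob_a = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=1.0, size=(50, 2))
data = np.vstack([blob_a, blob_b])

# Write one point per line, coordinates separated by a single space.
path = os.path.join(tempfile.mkdtemp(), "data01_standin.txt")
np.savetxt(path, data, fmt="%.6f", delimiter=" ")

# Read it back; loadtxt splits on whitespace by default.
X = np.loadtxt(path)
print(X.shape)  # (100, 2)
```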

Python code:

# -*- coding: utf-8 -*-
"""
Created on Mon Apr 13 19:59:59 2015

@author: Administrator
"""
import pickle
import random

import numpy as np
from matplotlib import pyplot


def kmeans(X, k, observer=None, threshold=1e-15, maxiter=300):
    N = len(X)
    labels = np.zeros(N, dtype=int)
    # pick k distinct data points as the initial centers
    centers = X[random.sample(range(N), k)].astype(float)
    n_iter = 0

    def calc_J():
        # objective J: sum of squared distances from each point to its center
        return ((X - centers[labels]) ** 2).sum()

    def distmat(X, Y):
        # squared Euclidean distances via ||x||^2 + ||y||^2 - 2 x.y
        xx = np.sum(X * X, axis=1)
        yy = np.sum(Y * Y, axis=1)
        xy = np.dot(X, Y.T)
        return xx[:, None] + yy[None, :] - 2 * xy

    Jprev = calc_J()
    while True:
        # notify the observer
        if observer is not None:
            observer(n_iter, labels, centers)

        # calculate the squared distance from each x to each center
        dist = distmat(X, centers)
        # assign x to the nearest center
        labels = dist.argmin(axis=1)
        # re-calculate each center; an empty cluster keeps its old center
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)

        J = calc_J()
        n_iter += 1

        # stop when J no longer improves or the iteration limit is reached
        if Jprev - J < threshold:
            break
        Jprev = J
        if n_iter >= maxiter:
            break

    # final notification
    if observer is not None:
        observer(n_iter, labels, centers)

    return labels, centers


if __name__ == '__main__':
    # load previously generated points (kept from the original; not used below)
    with open('E:\\Graduate\\Python\\machine_learning_in_action\\cluster.pkl', 'rb') as inf:
        samples = pickle.load(inf)

    def str_to_array(string):
        # split one line of the data file into its whitespace-separated fields
        return string.split()

    # read the 100 two-dimensional points, one point per line
    txtpath = 'E:\\Graduate\\Python\\machine_learning_in_action\\data01.txt'
    arr = np.zeros([100, 2], float)
    with open(txtpath) as fp:
        for i, line in enumerate(fp):
            fields = str_to_array(line)
            arr[i][0] = float(fields[0])
            arr[i][1] = float(fields[1])
    X = arr.copy()

    def observer(n_iter, labels, centers):
        print("iter %d." % n_iter)
        colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
        pyplot.clf()  # clear previous plot

        # draw points, colored by their current cluster
        data_colors = colors[labels]
        pyplot.scatter(X[:, 0], X[:, 1], c=data_colors, alpha=0.5)
        # draw centers, one color per center
        pyplot.scatter(centers[:, 0], centers[:, 1], s=200, c=colors[:len(centers)])

        pyplot.savefig('E:\\Graduate\\Python\\machine_learning_in_action\\kmeans\\iter_%02d.png' % n_iter, format='png')

    kmeans(X, 2, observer=observer)
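The `distmat` helper in the listing relies on the identity ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x·y to compute all point-to-center squared distances at once. As a sanity check, the sketch below compares a broadcasting version of that computation against a plain double loop. `distmat_loops` is a hypothetical reference function written for this comparison, and the random inputs are assumptions.

```python
import numpy as np


def distmat(X, Y):
    """Squared distances via ||x||^2 + ||y||^2 - 2 x.y, using broadcasting."""
    xx = np.sum(X * X, axis=1)
    yy = np.sum(Y * Y, axis=1)
    return xx[:, None] + yy[None, :] - 2 * np.dot(X, Y.T)


def distmat_loops(X, Y):
    """Reference version: compute every squared distance one pair at a time."""
    out = np.zeros((len(X), len(Y)))
    for i, x in enumerate(X):
        for j, y in enumerate(Y):
            out[i, j] = np.sum((x - y) ** 2)
    return out


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
centers = rng.normal(size=(3, 2))

fast = distmat(X, centers)
slow = distmat_loops(X, centers)
print(np.allclose(fast, slow))
print(fast.shape)
```

The result holds squared distances, not distances. That is sufficient for the assignment step, since taking the square root does not change which center is nearest.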

4. Results
