
Programming Collective Intelligence, Chapter 3: Notes

2014-04-18 16:43
Hierarchical Clustering

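1 Description

Hierarchical clustering builds a hierarchy of groups by repeatedly merging the two most similar groups, where every group starts out as a single element. In each iteration the algorithm computes the distance between every pair of groups and merges the closest pair into a new group, until only one group remains.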

2 Python prerequisites

(1) Reading data from a file

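file(filename[, mode[, bufsize]])
This is the constructor of the file type in Python 2. It takes the same arguments as the built-in open() function, and calling it directly to work with files is generally discouraged. The Python documentation says: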

Constructor function for the file type, described further in section File Objects. The constructor's arguments are the same as those of the open() built-in function described below.

When opening a file, it's preferable to use open() instead of invoking this constructor directly. file is more suited to type testing (for example, writing isinstance(f, file)).

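Note: when filename is a Windows path, you may get "IOError: [Errno 22] invalid mode ('r') or filename:". The cause is the backslash ('\') in the path, which Python treats as the start of an escape sequence such as '\t'. Prefixing the path literal with r (a raw string) avoids the problem; doubling each backslash or using forward slashes works as well.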
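The escaping problem in the note can be seen with a small sketch. The Windows path below is hypothetical and is never opened; the helper read_lines is a name made up for this example, and the data is written to a temporary file only so that the snippet runs anywhere.

```python
import os
import tempfile

# Hypothetical Windows path, used only to show how escapes behave.
plain = 'C:\temp\blogdata.txt'   # '\t' becomes a tab, '\b' a backspace
raw = r'C:\temp\blogdata.txt'    # raw string: backslashes are kept literally


def read_lines(path):
    """Return the lines of a text file without their trailing newlines."""
    with open(path, 'r') as f:   # open() is preferred over file()
        return [line.rstrip('\n') for line in f]


# Write a tiny tab-separated file to a temporary location and read it back.
tmp = tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False)
tmp.write('Blog\tchina\tkids\n')
tmp.write('Blog A\t0\t3\n')
tmp.close()
lines = read_lines(tmp.name)
os.remove(tmp.name)
```

Using a with block closes the file even if reading fails, which is one more reason to go through open() rather than the constructor.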

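(2) Cleaning up text data

pList = string.strip().split('\t') strips leading and trailing whitespace (spaces, tabs, newlines) from the string and then splits it on tabs into a list.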

string.strip(s[, chars]) removes leading and trailing characters that appear in chars. When chars is omitted, whitespace such as spaces and tabs is removed by default.
Return a copy of the string with leading and trailing characters removed. If chars is omitted or None, whitespace characters are removed. If given and not None, chars must be a string; the characters in the string will be stripped from the both ends of the string this method is called on.

string.split(s[, sep[, maxsplit]]) splits s into a list of words.
Return a list of the words of the string s. If the optional second argument sep is absent or None, the words are separated by arbitrary strings of whitespace characters (space, tab, newline, return, formfeed). If the second argument sep is present and not None, it specifies a string to be used as the word separator. The returned list will then have one more item than the number of non-overlapping occurrences of the separator in the string. The optional third argument maxsplit defaults to 0. If it is nonzero, at most maxsplit number of splits occur, and the remainder of the string is returned as the final element of the list (thus, the list will have at most maxsplit+1 elements).

The behavior of split on an empty string depends on the value of sep. If sep is not specified, or specified as None, the result will be an empty list. If sep is specified as any string, the result will be a list containing one element which is an empty string.
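A short sketch of the strip/split behaviour described above, written with the string methods (which work in both Python 2 and Python 3) rather than the string module functions. The sample line is invented for illustration.

```python
# An invented tab-separated line with stray whitespace at both ends.
line = '  Blog A\t0\t3\t1\n'

# strip() removes the leading spaces and the trailing newline,
# then split('\t') cuts the remainder at every tab.
p_list = line.strip().split('\t')

# strip(chars) removes any of the given characters from both ends.
trimmed = '--title--'.strip('-')

# maxsplit limits the number of cuts; the rest stays in the last item.
head_and_rest = 'a\tb\tc'.split('\t', 1)

# Splitting an empty string depends on whether sep is given.
empty_no_sep = ''.split()
empty_with_sep = ''.split('\t')
```

One thing to keep in mind: because strip() also removes trailing tabs, a line whose last fields are empty loses those fields before split('\t') runs.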

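(3) Class definition

class bicluster:
    def __init__(self, vec, left=None, right=None, distance=0.0, id=None):
        self.left = left          # left child node
        self.right = right        # right child node
        self.distance = distance  # distance between the two merged children
        self.id = id              # identifier of this node
        self.vec = vec            # data vector of this node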

3 The hierarchical clustering algorithm

In essence, the algorithm builds a binary tree.

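Building the tree:

while more than one cluster remains:

    go through every pair of clusters and find the pair with the smallest distance

    merge the closest pair into a new cluster

    remove the two merged clusters and add the new cluster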
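The loop above can be sketched as follows. This is a minimal illustration and not necessarily the book's exact code: the Euclidean distance, the averaging of the two vectors for the merged cluster, and the negative ids for merged nodes are assumptions made for this example, and the sample rows are invented.

```python
import math


class bicluster:
    def __init__(self, vec, left=None, right=None, distance=0.0, id=None):
        self.left = left
        self.right = right
        self.distance = distance
        self.id = id
        self.vec = vec


def euclidean(v1, v2):
    """Assumed distance measure; any function of two vectors can be passed in."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))


def hcluster(rows, distance=euclidean):
    """Merge the closest pair repeatedly and return the root of the tree."""
    # Every row starts as its own cluster; leaves get ids 0..n-1.
    clust = [bicluster(list(row), id=i) for i, row in enumerate(rows)]
    next_id = -1  # assumption: merged clusters get negative ids
    while len(clust) > 1:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clust)):
            for j in range(i + 1, len(clust)):
                d = distance(clust[i].vec, clust[j].vec)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # Assumption: the merged cluster sits at the average of its children.
        merged = [(a + b) / 2.0 for a, b in zip(clust[i].vec, clust[j].vec)]
        new = bicluster(merged, left=clust[i], right=clust[j],
                        distance=d, id=next_id)
        next_id -= 1
        del clust[j]  # j > i, so delete the larger index first
        del clust[i]
        clust.append(new)
    return clust[0]


rows = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
root = hcluster(rows)
```

With these four points the two nearby pairs are merged first and the root joins the two resulting clusters. Every pass recomputes all pairwise distances, which is simple but slow for large inputs; caching the distances is a common refinement.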

Printing:

Walk the tree recursively, in the same way as an ordinary binary tree traversal.
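A recursive traversal of this kind might look like the sketch below. The Node class is a hypothetical stand-in that carries only the fields the traversal needs (id, left, right), and the convention that leaves have non-negative ids indexing into a list of labels is an assumption of this example.

```python
class Node:
    """Minimal stand-in for a cluster node: only id, left and right."""
    def __init__(self, id, left=None, right=None):
        self.id = id
        self.left = left
        self.right = right


def tree_lines(node, labels=None, depth=0):
    """Return one text line per node, indented by its depth in the tree."""
    indent = ' ' * depth
    if node.left is None and node.right is None:
        # Leaf: show its label if labels are given, otherwise its id.
        name = labels[node.id] if labels is not None else str(node.id)
        lines = [indent + name]
    else:
        # Internal node: a dash marks a merge point.
        lines = [indent + '-']
    # Visit the left subtree, then the right one.
    for child in (node.left, node.right):
        if child is not None:
            lines.extend(tree_lines(child, labels, depth + 1))
    return lines


# A hand-built tree: leaves 0 and 1 merged first, then joined with leaf 2.
tree = Node(-2, Node(-1, Node(0), Node(1)), Node(2))
lines = tree_lines(tree, labels=['blog A', 'blog B', 'blog C'])
print('\n'.join(lines))
```

Returning the lines instead of printing inside the recursion keeps the traversal easy to test and lets the caller decide where the output goes.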

4 Drawing the dendrogram with PIL