Comparing different clustering algorithms on toy datasets
2015-05-26 13:56
Original article: http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html
This example aims at showing characteristics of different clustering algorithms on datasets that are "interesting" but still in 2D. The last dataset is an example of a "null" situation for clustering: the data is homogeneous, and there is no good clustering.
While these examples give some intuition about the algorithms, this intuition might not apply to very high dimensional data.
The results could be improved by tweaking the parameters for each clustering strategy, for instance setting the number of clusters for the methods that need this parameter specified. Note that affinity propagation has a tendency to create many clusters. Thus in this example its two parameters (damping and per-point preference) were set to mitigate this behavior.
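As a small sketch of that last point (not part of the original script; the blob dataset and sample size below are illustrative), lowering the per-point `preference` makes AffinityPropagation select far fewer exemplars, which is why the example pins it to -200 together with a high damping factor:

```python
from sklearn import cluster, datasets
from sklearn.preprocessing import StandardScaler

# Illustrative data: three well-separated Gaussian blobs.
X, _ = datasets.make_blobs(n_samples=300, random_state=8)
X = StandardScaler().fit_transform(X)

n_found = {}
for preference in (None, -200):  # None = sklearn's default (median similarity)
    ap = cluster.AffinityPropagation(damping=.9, preference=preference)
    ap.fit(X)
    # Each exemplar chosen by affinity propagation is one cluster.
    n_found[preference] = len(ap.cluster_centers_indices_)

# A strongly negative preference penalizes exemplars, so fewer clusters emerge.
print(n_found)
```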
Python source code:
plot_cluster_comparison.py
```python
print(__doc__)

import time

import numpy as np
import matplotlib.pyplot as plt

from sklearn import cluster, datasets
from sklearn.neighbors import kneighbors_graph
from sklearn.preprocessing import StandardScaler

np.random.seed(0)

# Generate datasets. We choose the size big enough to see the scalability
# of the algorithms, but not too big to avoid too long running times
n_samples = 1500
noisy_circles = datasets.make_circles(n_samples=n_samples, factor=.5,
                                      noise=.05)
noisy_moons = datasets.make_moons(n_samples=n_samples, noise=.05)
blobs = datasets.make_blobs(n_samples=n_samples, random_state=8)
no_structure = np.random.rand(n_samples, 2), None

colors = np.array([x for x in 'bgrcmykbgrcmykbgrcmykbgrcmyk'])
colors = np.hstack([colors] * 20)

clustering_names = [
    'MiniBatchKMeans', 'AffinityPropagation', 'MeanShift',
    'SpectralClustering', 'Ward', 'AgglomerativeClustering',
    'DBSCAN', 'Birch']

plt.figure(figsize=(len(clustering_names) * 2 + 3, 9.5))
plt.subplots_adjust(left=.02, right=.98, bottom=.001, top=.96,
                    wspace=.05, hspace=.01)

plot_num = 1

datasets = [noisy_circles, noisy_moons, blobs, no_structure]
for i_dataset, dataset in enumerate(datasets):
    X, y = dataset
    # normalize dataset for easier parameter selection
    X = StandardScaler().fit_transform(X)

    # estimate bandwidth for mean shift
    bandwidth = cluster.estimate_bandwidth(X, quantile=0.3)

    # connectivity matrix for structured Ward
    connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)
    # make connectivity symmetric
    connectivity = 0.5 * (connectivity + connectivity.T)

    # create clustering estimators
    ms = cluster.MeanShift(bandwidth=bandwidth, bin_seeding=True)
    two_means = cluster.MiniBatchKMeans(n_clusters=2)
    ward = cluster.AgglomerativeClustering(n_clusters=2, linkage='ward',
                                           connectivity=connectivity)
    spectral = cluster.SpectralClustering(n_clusters=2,
                                          eigen_solver='arpack',
                                          affinity="nearest_neighbors")
    dbscan = cluster.DBSCAN(eps=.2)
    affinity_propagation = cluster.AffinityPropagation(damping=.9,
                                                       preference=-200)
    average_linkage = cluster.AgglomerativeClustering(
        linkage="average", affinity="cityblock", n_clusters=2,
        connectivity=connectivity)
    birch = cluster.Birch(n_clusters=2)
    clustering_algorithms = [
        two_means, affinity_propagation, ms, spectral, ward,
        average_linkage, dbscan, birch]

    for name, algorithm in zip(clustering_names, clustering_algorithms):
        # predict cluster memberships
        t0 = time.time()
        algorithm.fit(X)
        t1 = time.time()
        if hasattr(algorithm, 'labels_'):
            y_pred = algorithm.labels_.astype(int)
        else:
            y_pred = algorithm.predict(X)

        # plot
        plt.subplot(4, len(clustering_algorithms), plot_num)
        if i_dataset == 0:
            plt.title(name, size=18)
        plt.scatter(X[:, 0], X[:, 1], color=colors[y_pred].tolist(), s=10)

        if hasattr(algorithm, 'cluster_centers_'):
            centers = algorithm.cluster_centers_
            center_colors = colors[:len(centers)]
            plt.scatter(centers[:, 0], centers[:, 1], s=100, c=center_colors)
        plt.xlim(-2, 2)
        plt.ylim(-2, 2)
        plt.xticks(())
        plt.yticks(())
        plt.text(.99, .01, ('%.2fs' % (t1 - t0)).lstrip('0'),
                 transform=plt.gca().transAxes, size=15,
                 horizontalalignment='right')
        plot_num += 1

plt.show()
```
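The plot above is purely qualitative. As a supplementary sketch (not part of the original example; the sample size, `eps`, and choice of algorithms are illustrative), the ground-truth labels returned by the toy-dataset generators can also be scored numerically with the adjusted Rand index, which makes the contrast between centroid-based and density-based methods on the "moons" data explicit:

```python
from sklearn import cluster, datasets
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Two interlocking half-moons: a shape k-means-style methods cannot
# separate, but density-based methods can.
X, y_true = datasets.make_moons(n_samples=500, noise=.05, random_state=0)
X = StandardScaler().fit_transform(X)

scores = {}
for algo in (cluster.MiniBatchKMeans(n_clusters=2),
             cluster.DBSCAN(eps=.3)):
    y_pred = algo.fit_predict(X)
    # Adjusted Rand index: 1.0 = perfect agreement with ground truth,
    # ~0.0 = no better than random labeling.
    scores[type(algo).__name__] = adjusted_rand_score(y_true, y_pred)

print(scores)
```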