
scikit-learn (models used relatively often in engineering practice): 2.3. Clustering (can be used for unsupervised dimensionality reduction of features)

2015-08-11 08:37
Reference: http://scikit-learn.org/stable/modules/clustering.html

In real projects we rarely use the simple models such as LR, kNN, and NB. They are classics, but they are not very practical in engineering work.

Today we look not at a specific model but at unsupervised clustering methods.

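The reason is that in real projects, besides reducing dimensionality with methods such as PCA, we sometimes also consider using clustering to reduce the dimensionality of the features.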
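As a rough illustration of clustering used for feature reduction, the sketch below shows two options that scikit-learn provides: `FeatureAgglomeration`, which merges similar features into groups, and `KMeans.transform`, which re-expresses each sample by its distances to the cluster centers. The synthetic data, the choice of 5 clusters, and the variable names are assumptions made for this example only; they do not come from the original post.

```python
import numpy as np
from sklearn.cluster import FeatureAgglomeration, KMeans

# Synthetic data (assumption for illustration): 100 samples, 20 features.
rng = np.random.RandomState(0)
X = rng.rand(100, 20)

# Option 1: cluster the features themselves and pool each group
# (mean by default), so 20 features become 5.
agglo = FeatureAgglomeration(n_clusters=5)
X_agglo = agglo.fit_transform(X)

# Option 2: cluster the samples with K-Means, then describe each sample
# by its distance to each of the 5 cluster centers.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
X_kmeans = kmeans.fit_transform(X)

print(X_agglo.shape)   # (100, 5)
print(X_kmeans.shape)  # (100, 5)
```

Which option suits a project depends on the data: the first keeps the result interpretable as pooled original features, while the second produces new distance-based features.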

Overview of clustering methods:



A comparison of the clustering algorithms in scikit-learn

| Method name | Parameters | Scalability | Use case | Geometry (metric used) |
|---|---|---|---|---|
| K-Means | number of clusters | Very large n_samples, medium n_clusters with MiniBatch code | General-purpose, even cluster size, flat geometry, not too many clusters | Distances between points |
| Affinity propagation | damping, sample preference | Not scalable with n_samples | Many clusters, uneven cluster size, non-flat geometry | Graph distance (e.g. nearest-neighbor graph) |
| Mean-shift | bandwidth | Not scalable with n_samples | Many clusters, uneven cluster size, non-flat geometry | Distances between points |
| Spectral clustering | number of clusters | Medium n_samples, small n_clusters | Few clusters, even cluster size, non-flat geometry | Graph distance (e.g. nearest-neighbor graph) |
| Ward hierarchical clustering | number of clusters | Large n_samples and n_clusters | Many clusters, possibly connectivity constraints | Distances between points |
| Agglomerative clustering | number of clusters, linkage type, distance | Large n_samples and n_clusters | Many clusters, possibly connectivity constraints, non Euclidean distances | Any pairwise distance |
| DBSCAN | neighborhood size | Very large n_samples, medium n_clusters | Non-flat geometry, uneven cluster sizes | Distances between nearest points |
| Gaussian mixtures | many | Not scalable | Flat geometry, good for density estimation | Mahalanobis distances to centers |
| Birch | branching factor, threshold, optional global clusterer | Large n_clusters and n_samples | Large dataset, outlier removal, data reduction | Euclidean distance between points |
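
Most of the estimators in the table above share the same `fit_predict` interface, so it is easy to try several on the same data. The following is a minimal sketch on synthetic blobs; the dataset, the number of clusters, and the DBSCAN values `eps=0.5` and `min_samples=5` are assumptions chosen for illustration, not recommendations from the original post.

```python
from sklearn.cluster import DBSCAN, KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic data (assumption for illustration): 300 points around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

# K-Means and its mini-batch variant need the number of clusters;
# DBSCAN instead takes a neighborhood size (eps) and a density threshold.
models = {
    "K-Means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "MiniBatch K-Means": MiniBatchKMeans(n_clusters=3, n_init=3, random_state=0),
    "DBSCAN": DBSCAN(eps=0.5, min_samples=5),
}

# fit_predict returns one integer label per sample.
labels = {name: model.fit_predict(X) for name, model in models.items()}

for name, y in labels.items():
    # DBSCAN marks noise points with the label -1, so exclude it from the count.
    n_found = len(set(y) - {-1})
    print(name, "found", n_found, "clusters")
```

On larger datasets the scalability column matters more than in this toy setting; for example, `MiniBatchKMeans` trades a little accuracy for much lower fitting cost.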
