Big Data Mining Algorithms: A K-Means Example
2013-12-19 11:20
1. Introduction
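K-Means is one of the most widely used clustering algorithms. This article uses the Euclidean distance formula $d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$ to compute the distance between two-dimensional vectors, and uses that distance as the criterion for assigning points to clusters. The input is two-dimensional data in two columns; the output is the cluster centers and the assignment of each element to a cluster. The input file starts with three integers (the number of patterns, the vector dimension, and the number of clusters), followed by the coordinates of each pattern: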
18 2 2
0.0 0.0
1.0 0.0
0.0 1.0
2.0 1.0
1.0 2.0
2.0 2.0
2.0 0.0
0.0 2.0
7.0 6.0
7.0 7.0
7.0 8.0
8.0 6.0
8.0 7.0
8.0 8.0
8.0 9.0
9.0 7.0
9.0 8.0
9.0 9.0
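The layout above can be read with ordinary stream extraction, since whitespace and line breaks are interchangeable. The following is a minimal sketch, not the article's loader: `PatternFile`, `parsePatterns`, and `parsePatternText` are hypothetical names introduced only for this illustration, and the header order (pattern count, dimension, cluster count) is taken from the comments in `LoadPatterns`.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container for the parsed file; not part of the original listing.
struct PatternFile {
    int numClusters = 0;
    std::vector<std::vector<double>> patterns;  // one row per pattern
};

// Reads "count dim k" followed by count*dim coordinates.
// Returns false if the header or any coordinate is missing or malformed.
bool parsePatterns(std::istream &in, PatternFile &out) {
    int count = 0, dim = 0, k = 0;
    if (!(in >> count >> dim >> k) || count < 1 || dim < 1 || k < 1 || k > count)
        return false;
    out.numClusters = k;
    out.patterns.assign(count, std::vector<double>(dim, 0.0));
    for (auto &row : out.patterns)
        for (double &value : row)
            if (!(in >> value))
                return false;
    // Sanity check in debug builds: one row per declared pattern.
    assert(static_cast<int>(out.patterns.size()) == count);
    return true;
}

// Convenience wrapper for parsing from an in-memory string.
bool parsePatternText(const std::string &text, PatternFile &out) {
    std::istringstream in(text);
    return parsePatterns(in, out);
}
```

Passing an opened `std::ifstream` to `parsePatterns` would read a file such as KM2.DAT in the same way; a truncated file makes the function return false instead of leaving patterns half filled.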
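2. Euclidean Distance
Definition: the Euclidean distance between two points in n-dimensional space is the length of the shortest line joining them, that is, the straight segment between them. It is the most commonly used definition of distance and corresponds to the true distance between two points in n-dimensional space.
In two and three dimensions the Euclidean distance is simply the distance between the two points. The two-dimensional formula is
$$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$
The three-dimensional formula is
$$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}$$
Generalized to n-dimensional space, the formula is
$$d = \sqrt{\sum_{i=1}^{n} (x_{i1} - x_{i2})^2}$$
where $x_{i1}$ is the i-th coordinate of the first point and $x_{i2}$ is the i-th coordinate of the second point.
An n-dimensional Euclidean space is a set of points, each of which can be written as $x = (x(1), x(2), \ldots, x(n))$, where each $x(i)$ ($i = 1, 2, \ldots, n$) is a real number called the i-th coordinate of $x$. The distance $d(x, y)$ between two points $x$ and $y = (y(1), y(2), \ldots, y(n))$ is defined by the formula above.
Euclidean distance can also be read as a measure of how similar two signals are: the smaller the distance, the more similar the signals, the more easily they interfere with each other, and the higher the bit error rate.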
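The n-dimensional formula translates directly into a loop over coordinates. The sketch below is an illustration only: `squaredDistance` and `euclideanDistance` are hypothetical helper names, and the points are assumed to be stored as `std::vector<double>` of equal length.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sum of squared coordinate differences (the quantity under the square root).
double squaredDistance(const std::vector<double> &a, const std::vector<double> &b) {
    assert(a.size() == b.size());  // both points must have the same dimension
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double diff = a[i] - b[i];
        sum += diff * diff;
    }
    return sum;
}

// Euclidean distance: the square root of the squared distance.
double euclideanDistance(const std::vector<double> &a, const std::vector<double> &b) {
    return std::sqrt(squaredDistance(a, b));
}
```

Because the square root is monotonic, comparing squared distances selects the same nearest point as comparing distances, so the root is only needed when the distance itself has to be reported.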
3. Code Example
/****************************************************************************
*                                                                           *
*                                  KMEANS                                   *
*                                                                           *
*****************************************************************************/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

// DEFINES
#define SUCCESS 1
#define FAILURE 0
#define TRUE 1
#define FALSE 0
#define MAXVECTDIM 20
#define MAXPATTERN 20
#define MAXCLUSTER 10

// Format x with `width` decimals, a leading sign column and no leading zero
// (0.5 becomes " .5000"). The result lives in a static buffer that the next
// call overwrites, so use or copy it before calling f2a again.
char *f2a(double x, int width) {
    static char cbuf[255];
    char digits[255];
    snprintf(digits, sizeof digits, "%.*f", width, fabs(x));
    const char *p = (digits[0] == '0' && digits[1] == '.') ? digits + 1 : digits;
    snprintf(cbuf, sizeof cbuf, "%c%s", (x < 0) ? '-' : ' ', p);
    return cbuf;
}

// ***** Defined structures & classes *****
struct aCluster {
    double Center[MAXVECTDIM];
    int Member[MAXPATTERN];   // Indices of the vectors belonging to this cluster
    int NumMembers;
};

struct aVector {
    double Center[MAXVECTDIM];
    int Size;
};

class System {
private:
    double Pattern[MAXPATTERN][MAXVECTDIM+1];
    aCluster Cluster[MAXCLUSTER];
    int NumPatterns;               // Number of patterns
    int SizeVector;                // Number of dimensions in vector
    int NumClusters;               // Number of clusters
    void DistributeSamples();      // Step 2 of K-means algorithm
    int CalcNewClustCenters();     // Step 3 of K-means algorithm
    double EucNorm(int, int);      // Squared Euclidean norm of (pattern - center)
    int FindClosestCluster(int);   // Index of the cluster closest to the given pattern
public:
    int LoadPatterns(const char *fname);    // Get pattern data to be clustered
    void InitClusters();                    // Step 1 of K-means algorithm
    void RunKMeans();                       // Overall control of the K-means process
    void ShowClusters();                    // Show results on screen
    void SaveClusters(const char *fname);   // Save results to file
    void ShowCenters();
};

// Print the current cluster centers
void System::ShowCenters() {
    int i;
    printf("Cluster centers:\n");
    for (i = 0; i < NumClusters; i++) {
        printf("ClusterCenter[%d]=(%f,%f)\n", i, Cluster[i].Center[0], Cluster[i].Center[1]);
    }
    printf("\n");
    getchar();   // wait for Enter
}

// Read the pattern file
int System::LoadPatterns(const char *fname) {
    FILE *InFilePtr;
    int i, j;
    double x;
    if ((InFilePtr = fopen(fname, "r")) == NULL)
        return FAILURE;
    // Header: # of patterns (18 in the sample), vector dimension (2), # of clusters (2).
    // Reject anything that is missing or does not fit the fixed-size arrays.
    if (fscanf(InFilePtr, "%d %d %d", &NumPatterns, &SizeVector, &NumClusters) != 3 ||
        NumPatterns < 1 || NumPatterns > MAXPATTERN ||
        SizeVector < 1 || SizeVector > MAXVECTDIM ||
        NumClusters < 1 || NumClusters > MAXCLUSTER || NumClusters > NumPatterns) {
        fclose(InFilePtr);
        return FAILURE;
    }
    for (i = 0; i < NumPatterns; i++) {         // For each vector
        for (j = 0; j < SizeVector; j++) {      // create a pattern
            if (fscanf(InFilePtr, "%lg", &x) != 1) {   // consisting of all elements
                fclose(InFilePtr);
                return FAILURE;
            }
            Pattern[i][j] = x;
        }
    }
    fclose(InFilePtr);
    // Print all data elements (the first two coordinates of each pattern)
    printf("Input patterns:\n");
    for (i = 0; i < NumPatterns; i++) {
        printf("Pattern[%d]=(%2.3f,%2.3f)\n", i, Pattern[i][0], Pattern[i][1]);
    }
    printf("\n--------------------\n");
    getchar();   // wait for Enter
    return SUCCESS;
}

//***************************************************************************
// InitClusters                                                             *
//   Arbitrarily assign a vector to each of the K clusters                  *
//   We choose the first K vectors to do this                               *
//***************************************************************************
// Initialize the cluster centers
void System::InitClusters() {
    int i, j;
    printf("Initial cluster centers:\n");
    for (i = 0; i < NumClusters; i++) {
        Cluster[i].Member[0] = i;
        for (j = 0; j < SizeVector; j++) {
            Cluster[i].Center[j] = Pattern[i][j];
        }
    }
    for (i = 0; i < NumClusters; i++) {
        printf("ClusterCenter[%d]=(%f,%f)\n", i, Cluster[i].Center[0], Cluster[i].Center[1]);
    }
    printf("\n");
    getchar();   // wait for Enter
}

// Run K-means until the centers stop moving
void System::RunKMeans() {
    int converged;
    int pass;
    pass = 1;
    converged = FALSE;
    while (converged == FALSE) {   // one clustering pass per iteration
        printf("PASS=%d\n", pass++);
        DistributeSamples();
        converged = CalcNewClustCenters();
        ShowCenters();
        getchar();
    }
}

// Squared Euclidean distance between pattern p and cluster center c.
// In two dimensions the distance is d = sqrt((x1-x2)^2 + (y1-y2)^2); summing one
// squared difference per column takes every attribute into account.
// The caller applies sqrt only for display: comparing squared distances
// selects the same nearest cluster.
double System::EucNorm(int p, int c) {
    double dist, x;
    int i;
    printf("The distance from pattern %d to cluster %d is calculated as:\n", p, c);
    printf("d=sqrt(");
    dist = 0;
    for (i = 0; i < SizeVector; i++) {
        x = (Cluster[c].Center[i]-Pattern[p][i]) * (Cluster[c].Center[i]-Pattern[p][i]);
        printf("%s", f2a(x, 4));                 // print the term of the formula
        if (i < SizeVector-1) printf("+");
        dist += x;                               // accumulate the squared distance
    }
    printf(")\n");
    return dist;
}

// Find the closest cluster
int System::FindClosestCluster(int pat) {
    int i, ClustID;
    double MinDist, d;
    MinDist = 9.9e+99;
    ClustID = -1;
    for (i = 0; i < NumClusters; i++) {
        d = EucNorm(pat, i);
        printf("Distance from pattern %d to cluster %d is %f\n\n", pat, i, sqrt(d));
        if (d < MinDist) {
            MinDist = d;
            ClustID = i;
        }
    }
    if (ClustID < 0) {
        printf("Aaargh");
        exit(1);
    }
    return ClustID;
}

// Assign every pattern to its nearest cluster
void System::DistributeSamples() {
    int i, pat, Clustid, MemberIndex;
    // Clear membership list for all current clusters
    for (i = 0; i < NumClusters; i++) {
        Cluster[i].NumMembers = 0;
    }
    for (pat = 0; pat < NumPatterns; pat++) {
        // Find cluster center to which the pattern is closest
        Clustid = FindClosestCluster(pat);
        printf("patern %d assigned to cluster %d\n\n", pat, Clustid);
        // Post this pattern to the cluster
        MemberIndex = Cluster[Clustid].NumMembers;
        Cluster[Clustid].Member[MemberIndex] = pat;
        Cluster[Clustid].NumMembers++;
    }
}

// Compute the new cluster centers; returns TRUE when no center moved
int System::CalcNewClustCenters() {
    int ConvFlag, VectID, i, j, k;
    double tmp[MAXVECTDIM];
    char xs[4096];   // text of the first-coordinate average, for display
    char ys[4096];   // text of the remaining coordinates, for display
    ConvFlag = TRUE;
    printf("The new cluster centers are now calculated as:\n");
    for (i = 0; i < NumClusters; i++) {              // for each cluster
        if (Cluster[i].NumMembers == 0) continue;    // empty cluster: keep its center, avoid dividing by zero
        sprintf(xs, "Cluster Center%d(1/%d)(", i, Cluster[i].NumMembers);
        sprintf(ys, "(1/%d)(", Cluster[i].NumMembers);
        for (j = 0; j < SizeVector; j++) {           // clear workspace
            tmp[j] = 0.0;
        }
        for (j = 0; j < Cluster[i].NumMembers; j++) {    // traverse member vectors
            VectID = Cluster[i].Member[j];
            for (k = 0; k < SizeVector; k++) {           // traverse elements of vector
                tmp[k] += Pattern[VectID][k];            // add member element into temp
                if (k == 0) {
                    strcat(xs, f2a(Pattern[VectID][k], 3));
                } else {
                    strcat(ys, f2a(Pattern[VectID][k], 3));
                }
            }
            if (j < Cluster[i].NumMembers-1) {
                strcat(xs, "+");
                strcat(ys, "+");
            } else {
                strcat(xs, ")");
                strcat(ys, ")");
            }
        }
        for (k = 0; k < SizeVector; k++) {               // traverse elements of vector
            tmp[k] = tmp[k] / Cluster[i].NumMembers;
            if (tmp[k] != Cluster[i].Center[k])          // a moved center means another pass is needed
                ConvFlag = FALSE;
            Cluster[i].Center[k] = tmp[k];
        }
        printf("%s,\n", xs);
        printf("%s\n", ys);
    }
    return ConvFlag;
}

// Print the clusters
void System::ShowClusters() {
    int cl;
    for (cl = 0; cl < NumClusters; cl++) {
        printf("\nCLUSTER %d ==>[%f,%f]\n", cl, Cluster[cl].Center[0], Cluster[cl].Center[1]);
    }
}

// Left empty in the original listing
void System::SaveClusters(const char *fname) {
}
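The three steps in the listing above (take the first K patterns as centers, assign each pattern to its nearest center, recompute each center as the mean of its members) can also be sketched without the fixed-size arrays and the tracing output. This is an illustrative variant rather than the article's implementation: `Point`, `nearestCenter`, and `kMeans` are hypothetical names, and it stops when no assignment changes instead of comparing center coordinates.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

using Point = std::vector<double>;

// Index of the center nearest to p, by squared distance (ties go to the lower index).
std::size_t nearestCenter(const Point &p, const std::vector<Point> &centers) {
    std::size_t best = 0;
    double bestDist = std::numeric_limits<double>::max();
    for (std::size_t c = 0; c < centers.size(); ++c) {
        double d = 0.0;
        for (std::size_t i = 0; i < p.size(); ++i) {
            const double diff = p[i] - centers[c][i];
            d += diff * diff;
        }
        if (d < bestDist) { bestDist = d; best = c; }
    }
    return best;
}

// Runs K-means with the first k points as initial centers and returns the final
// centers. `assignment` receives the cluster index of every point.
std::vector<Point> kMeans(const std::vector<Point> &points, std::size_t k,
                          std::vector<std::size_t> &assignment) {
    assert(k >= 1 && k <= points.size());
    // Step 1: the first k points become the initial centers.
    std::vector<Point> centers(points.begin(),
                               points.begin() + static_cast<std::ptrdiff_t>(k));
    assignment.assign(points.size(), k);  // k is not a valid index, so pass 1 always runs
    bool changed = true;
    while (changed) {
        changed = false;
        // Step 2: assign every point to its nearest center.
        for (std::size_t p = 0; p < points.size(); ++p) {
            const std::size_t c = nearestCenter(points[p], centers);
            if (c != assignment[p]) { assignment[p] = c; changed = true; }
        }
        // Step 3: move each center to the mean of its members.
        std::vector<Point> sums(k, Point(points[0].size(), 0.0));
        std::vector<std::size_t> counts(k, 0);
        for (std::size_t p = 0; p < points.size(); ++p) {
            ++counts[assignment[p]];
            for (std::size_t i = 0; i < points[p].size(); ++i)
                sums[assignment[p]][i] += points[p][i];
        }
        for (std::size_t c = 0; c < k; ++c) {
            if (counts[c] == 0) continue;  // empty cluster: leave its center unchanged
            for (std::size_t i = 0; i < sums[c].size(); ++i)
                centers[c][i] = sums[c][i] / static_cast<double>(counts[c]);
        }
    }
    return centers;
}
```

Run on the 18 sample points with k = 2, this sketch moves the centers from (0, 0) and (1, 0) to (0, 1) and (5.866667, 5.333333) after the first pass, then settles on (1, 1) and (8, 7.5), with patterns 0 to 7 in one cluster and patterns 8 to 17 in the other.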
4. Main Program
int main(int argc, char *argv[]) {
    System kmeans;
    char fname[] = "KM2.DAT";   // the pattern file name is hard-coded
    /*
    if (argc<2) {
        printf("USAGE: KMEANS PATTERN_FILE\n");
        exit(0);
    }
    */
    if (kmeans.LoadPatterns(fname) == FAILURE) {
        printf("UNABLE TO READ PATTERN_FILE:%s\n", fname);
        exit(1);
    }
    kmeans.InitClusters();
    kmeans.RunKMeans();
    kmeans.ShowClusters();
    return 0;
}
5. Output
Input patterns:
Pattern[0]=(0.000,0.000)
Pattern[1]=(1.000,0.000)
Pattern[2]=(0.000,1.000)
Pattern[3]=(2.000,1.000)
Pattern[4]=(1.000,2.000)
Pattern[5]=(2.000,2.000)
Pattern[6]=(2.000,0.000)
Pattern[7]=(0.000,2.000)
Pattern[8]=(7.000,6.000)
Pattern[9]=(7.000,7.000)
Pattern[10]=(7.000,8.000)
Pattern[11]=(8.000,6.000)
Pattern[12]=(8.000,7.000)
Pattern[13]=(8.000,8.000)
Pattern[14]=(8.000,9.000)
Pattern[15]=(9.000,7.000)
Pattern[16]=(9.000,8.000)
Pattern[17]=(9.000,9.000)

--------------------
Initial cluster centers:
ClusterCenter[0]=(0.000000,0.000000)
ClusterCenter[1]=(1.000000,0.000000)

PASS=1
The distance from pattern 0 to cluster 0 is calculated as:
d=sqrt( .0000+ .0000)
Distance from pattern 0 to cluster 0 is 0.000000
The distance from pattern 0 to cluster 1 is calculated as:
d=sqrt( 1.0000+ .0000)
Distance from pattern 0 to cluster 1 is 1.000000
patern 0 assigned to cluster 0

The distance from pattern 1 to cluster 0 is calculated as:
d=sqrt( 1.0000+ .0000)
Distance from pattern 1 to cluster 0 is 1.000000
The distance from pattern 1 to cluster 1 is calculated as:
d=sqrt( .0000+ .0000)
Distance from pattern 1 to cluster 1 is 0.000000
patern 1 assigned to cluster 1

The distance from pattern 2 to cluster 0 is calculated as:
d=sqrt( .0000+ 1.0000)
Distance from pattern 2 to cluster 0 is 1.000000
The distance from pattern 2 to cluster 1 is calculated as:
d=sqrt( 1.0000+ 1.0000)
Distance from pattern 2 to cluster 1 is 1.414214
patern 2 assigned to cluster 0

The distance from pattern 3 to cluster 0 is calculated as:
d=sqrt( 4.0000+ 1.0000)
Distance from pattern 3 to cluster 0 is 2.236068
The distance from pattern 3 to cluster 1 is calculated as:
d=sqrt( 1.0000+ 1.0000)
Distance from pattern 3 to cluster 1 is 1.414214
patern 3 assigned to cluster 1

The distance from pattern 4 to cluster 0 is calculated as:
d=sqrt( 1.0000+ 4.0000)
Distance from pattern 4 to cluster 0 is 2.236068
The distance from pattern 4 to cluster 1 is calculated as:
d=sqrt( .0000+ 4.0000)
Distance from pattern 4 to cluster 1 is 2.000000
patern 4 assigned to cluster 1

The distance from pattern 5 to cluster 0 is calculated as:
d=sqrt( 4.0000+ 4.0000)
Distance from pattern 5 to cluster 0 is 2.828427
The distance from pattern 5 to cluster 1 is calculated as:
d=sqrt( 1.0000+ 4.0000)
Distance from pattern 5 to cluster 1 is 2.236068
patern 5 assigned to cluster 1

The distance from pattern 6 to cluster 0 is calculated as:
d=sqrt( 4.0000+ .0000)
Distance from pattern 6 to cluster 0 is 2.000000
The distance from pattern 6 to cluster 1 is calculated as:
d=sqrt( 1.0000+ .0000)
Distance from pattern 6 to cluster 1 is 1.000000
patern 6 assigned to cluster 1

The distance from pattern 7 to cluster 0 is calculated as:
d=sqrt( .0000+ 4.0000)
Distance from pattern 7 to cluster 0 is 2.000000
The distance from pattern 7 to cluster 1 is calculated as:
d=sqrt( 1.0000+ 4.0000)
Distance from pattern 7 to cluster 1 is 2.236068
patern 7 assigned to cluster 0

The distance from pattern 8 to cluster 0 is calculated as:
d=sqrt( 49.0000+ 36.0000)
Distance from pattern 8 to cluster 0 is 9.219544
The distance from pattern 8 to cluster 1 is calculated as:
d=sqrt( 36.0000+ 36.0000)
Distance from pattern 8 to cluster 1 is 8.485281
patern 8 assigned to cluster 1

The distance from pattern 9 to cluster 0 is calculated as:
d=sqrt( 49.0000+ 49.0000)
Distance from pattern 9 to cluster 0 is 9.899495
The distance from pattern 9 to cluster 1 is calculated as:
d=sqrt( 36.0000+ 49.0000)
Distance from pattern 9 to cluster 1 is 9.219544
patern 9 assigned to cluster 1

The distance from pattern 10 to cluster 0 is calculated as:
d=sqrt( 49.0000+ 64.0000)
Distance from pattern 10 to cluster 0 is 10.630146
The distance from pattern 10 to cluster 1 is calculated as:
d=sqrt( 36.0000+ 64.0000)
Distance from pattern 10 to cluster 1 is 10.000000
patern 10 assigned to cluster 1

The distance from pattern 11 to cluster 0 is calculated as:
d=sqrt( 64.0000+ 36.0000)
Distance from pattern 11 to cluster 0 is 10.000000
The distance from pattern 11 to cluster 1 is calculated as:
d=sqrt( 49.0000+ 36.0000)
Distance from pattern 11 to cluster 1 is 9.219544
patern 11 assigned to cluster 1

The distance from pattern 12 to cluster 0 is calculated as:
d=sqrt( 64.0000+ 49.0000)
Distance from pattern 12 to cluster 0 is 10.630146
The distance from pattern 12 to cluster 1 is calculated as:
d=sqrt( 49.0000+ 49.0000)
Distance from pattern 12 to cluster 1 is 9.899495
patern 12 assigned to cluster 1

The distance from pattern 13 to cluster 0 is calculated as:
d=sqrt( 64.0000+ 64.0000)
Distance from pattern 13 to cluster 0 is 11.313708
The distance from pattern 13 to cluster 1 is calculated as:
d=sqrt( 49.0000+ 64.0000)
Distance from pattern 13 to cluster 1 is 10.630146
patern 13 assigned to cluster 1

The distance from pattern 14 to cluster 0 is calculated as:
d=sqrt( 64.0000+ 81.0000)
Distance from pattern 14 to cluster 0 is 12.041595
The distance from pattern 14 to cluster 1 is calculated as:
d=sqrt( 49.0000+ 81.0000)
Distance from pattern 14 to cluster 1 is 11.401754
patern 14 assigned to cluster 1

The distance from pattern 15 to cluster 0 is calculated as:
d=sqrt( 81.0000+ 49.0000)
Distance from pattern 15 to cluster 0 is 11.401754
The distance from pattern 15 to cluster 1 is calculated as:
d=sqrt( 64.0000+ 49.0000)
Distance from pattern 15 to cluster 1 is 10.630146
patern 15 assigned to cluster 1

The distance from pattern 16 to cluster 0 is calculated as:
d=sqrt( 81.0000+ 64.0000)
Distance from pattern 16 to cluster 0 is 12.041595
The distance from pattern 16 to cluster 1 is calculated as:
d=sqrt( 64.0000+ 64.0000)
Distance from pattern 16 to cluster 1 is 11.313708
patern 16 assigned to cluster 1

The distance from pattern 17 to cluster 0 is calculated as:
d=sqrt( 81.0000+ 81.0000)
Distance from pattern 17 to cluster 0 is 12.727922
The distance from pattern 17 to cluster 1 is calculated as:
d=sqrt( 64.0000+ 81.0000)
Distance from pattern 17 to cluster 1 is 12.041595
patern 17 assigned to cluster 1

The new cluster centers are now calculated as:
Cluster Center0(1/3)( .000+ .000+ .000),
(1/3)( .000+ 1.000+ 2.000)
Cluster Center1(1/15)( 1.000+ 2.000+ 1.000+ 2.000+ 2.000+ 7.000+ 7.000+ 7.000+ 8.000+ 8.000+ 8.000+ 8.000+ 9.000+ 9.000+ 9.000),
(1/15)( .000+ 1.000+ 2.000+ 2.000+ .000+ 6.000+ 7.000+ 8.000+ 6.000+ 7.000+ 8.000+ 9.000+ 7.000+ 8.000+ 9.000)
Cluster centers:
ClusterCenter[0]=(0.000000,1.000000)
ClusterCenter[1]=(5.866667,5.333333)
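As a check on the first pass, the printed centers can be reproduced by hand from the member sums in the log. The symbols $c_0$ and $c_1$ are introduced here only as shorthand for ClusterCenter[0] and ClusterCenter[1].

```latex
\begin{aligned}
c_0 &= \left(\tfrac{1}{3}(0 + 0 + 0),\ \tfrac{1}{3}(0 + 1 + 2)\right) = (0,\ 1) \\
c_1 &= \left(\tfrac{1}{15}\bigl(1 + 2 + 1 + 2 + 2 + 3\cdot 7 + 4\cdot 8 + 3\cdot 9\bigr),\
       \tfrac{1}{15}\bigl(0 + 1 + 2 + 2 + 0 + 6 + 7 + 8 + 6 + 7 + 8 + 9 + 7 + 8 + 9\bigr)\right) \\
    &= \left(\tfrac{88}{15},\ \tfrac{80}{15}\right) \approx (5.866667,\ 5.333333)
\end{aligned}
```

Both results agree with the two ClusterCenter lines at the end of the log.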