Coursera "Machine Learning" Programming Assignment 7: K-means Clustering and Principal Component Analysis
2017-09-30 19:24
K-means
K-means is an iterative algorithm. It takes an unlabeled dataset and groups the data into clusters. To cluster the data into K groups, the procedure is:
1. Choose K random points, called cluster centroids.
2. For each example in the dataset, compute its distance to each of the K centroids and associate it with the closest one; all examples associated with the same centroid form one cluster.
3. Compute the mean of each cluster and move its centroid to that mean.
4. Repeat steps 2-3 until the centroids no longer change.
In MATLAB, the loop can be described as follows:
% Initialize centroids
centroids = kMeansInitCentroids(X, K);
for iter = 1:iterations
    % Cluster assignment step: Assign each data point to the
    % closest centroid. idx(i) corresponds to c^(i), the index
    % of the centroid assigned to example i
    idx = findClosestCentroids(X, centroids);
    % Move centroid step: Compute means based on centroid
    % assignments
    centroids = computeMeans(X, idx, K);
end
We first run K-means on a simple 2D dataset to get an intuitive feel for the algorithm.
Random initialization
The assignment fixes K = 3 and initializes the cluster centroids to [3 3; 6 2; 8 5]. In practice, a common initialization strategy is to pick K data points at random as the centroids, implemented as follows:
function centroids = kMeansInitCentroids(X, K)
%KMEANSINITCENTROIDS This function initializes K centroids that are to be
%used in K-Means on the dataset X
%   centroids = KMEANSINITCENTROIDS(X, K) returns K initial centroids to be
%   used with the K-Means on the dataset X
%

% You should return these values correctly
centroids = zeros(K, size(X, 2));

% ====================== YOUR CODE HERE ======================
% Instructions: You should set centroids to randomly chosen examples from
%               the dataset X
%

% Initialize the centroids to be random examples
% Randomly reorder the indices of examples
randidx = randperm(size(X, 1));
% Take the first K examples as centroids
centroids = X(randidx(1:K), :);

% =============================================================
end
One problem with K-means is that it can get stuck at a local minimum, depending on the initialization.
To address this, we usually run K-means many times, re-initializing randomly each time, and then compare the results and keep the run with the lowest cost. This works well when K is small (roughly 2-10); when K is large, multiple restarts may not bring much improvement.
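The restart strategy above can be sketched in NumPy (an illustrative translation, not part of the assignment; function names are my own, and empty clusters simply keep their old centroid):

```python
import numpy as np

def kmeans_once(X, K, iters=10, seed=None):
    """One K-means run from a random initialization; returns the final
    centroids, assignments, and distortion (mean squared distance)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.permutation(len(X))[:K]]   # K random examples
    for _ in range(iters):
        # Squared distance from every point to every centroid: shape (m, K)
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        idx = d.argmin(axis=1)
        # Move each centroid to the mean of its assigned points; keep the
        # old centroid if a cluster ends up empty
        centroids = np.array([X[idx == k].mean(axis=0) if (idx == k).any()
                              else centroids[k] for k in range(K)])
    cost = d[np.arange(len(X)), idx].mean()
    return centroids, idx, cost

def kmeans_best_of(X, K, restarts=10):
    """Run K-means several times and keep the lowest-distortion result."""
    runs = [kmeans_once(X, K, seed=s) for s in range(restarts)]
    return min(runs, key=lambda r: r[2])
```

With two well-separated groups, at least one restart typically finds the right partition, and the minimum-cost rule selects it.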
Finding the closest centroids
For each example x^(i), find the centroid j that minimizes the squared Euclidean distance:

c^(i) := argmin_j ||x^(i) − μ_j||²

The code is as follows:
function idx = findClosestCentroids(X, centroids)
%FINDCLOSESTCENTROIDS computes the centroid memberships for every example
%   idx = FINDCLOSESTCENTROIDS (X, centroids) returns the closest centroids
%   in idx for a dataset X where each row is a single example. idx = m x 1
%   vector of centroid assignments (i.e. each entry in range [1..K])
%

% Set K
K = size(centroids, 1);

% You need to return the following variables correctly.
idx = zeros(size(X,1), 1);

% ====================== YOUR CODE HERE ======================
% Instructions: Go over every example, find its closest centroid, and store
%               the index inside idx at the appropriate location.
%               Concretely, idx(i) should contain the index of the centroid
%               closest to example i. Hence, it should be a value in the
%               range 1..K
%
% Note: You can use a for-loop over the examples to compute this.
%
for i = 1:size(X,1)
    % Squared Euclidean distance from example i to every centroid
    M = sum((repmat(X(i,:), K, 1) - centroids).^2, 2);
    % min returns a single index, so a tie cannot produce a multi-element
    % assignment (find(M == min(M)) would fail when two centroids tie)
    [~, idx(i)] = min(M);
end
% ===========================================================
end
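For reference, the same assignment step can be vectorized; here is an illustrative NumPy sketch (0-based indices, unlike the 1-based MATLAB version):

```python
import numpy as np

def find_closest_centroids(X, centroids):
    """Index of the nearest centroid for every row of X (0-based)."""
    # Broadcast to pairwise squared Euclidean distances, shape (m, K)
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)
```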
Computing centroid means
The centroid of cluster k is recomputed as the mean of the examples assigned to it:

μ_k := (1/|C_k|) Σ_{i ∈ C_k} x^(i)

where C_k is the set of examples assigned to centroid k. The code is as follows:
function centroids = computeCentroids(X, idx, K)
%COMPUTECENTROIDS returns the new centroids by computing the means of the
%data points assigned to each centroid.
%   centroids = COMPUTECENTROIDS(X, idx, K) returns the new centroids by
%   computing the means of the data points assigned to each centroid. It is
%   given a dataset X where each row is a single data point, a vector
%   idx of centroid assignments (i.e. each entry in range [1..K]) for each
%   example, and K, the number of centroids. You should return a matrix
%   centroids, where each row of centroids is the mean of the data points
%   assigned to it.
%

% Useful variables
[m n] = size(X);

% You need to return the following variables correctly.
centroids = zeros(K, n);

% ====================== YOUR CODE HERE ======================
% Instructions: Go over every centroid and compute mean of all points that
%               belong to it. Concretely, the row vector centroids(i, :)
%               should contain the mean of the data points assigned to
%               centroid i.
%
% Note: You can use a for-loop over the centroids to compute this.
%
for i = 1:K
    id = find(idx == i);
    % mean(..., 1) keeps a per-column mean even when only one point is
    % assigned (sum(X(id,:)) / numel(id) would collapse a single row
    % to a scalar)
    centroids(i,:) = mean(X(id,:), 1);
end
% =============================================================
end
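An equivalent NumPy version of the centroid update (illustrative; 0-based cluster indices, and, like the MATLAB version, an empty cluster yields NaN):

```python
import numpy as np

def compute_centroids(X, idx, K):
    """Mean of the points assigned to each of the K centroids."""
    # mean(axis=0) keeps the per-feature mean even for single-point clusters
    return np.array([X[idx == k].mean(axis=0) for k in range(K)])
```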
Running K-means
Combining the two steps gives a working K-means; after 10 iterations we obtain the figure below.

Principal Component Analysis
We first implement PCA on a simple 2D dataset.
Implementing PCA
The first step is mean normalization: compute the mean μ_j of every feature and set x_j := x_j − μ_j. If the features are on different scales, each should also be divided by its standard deviation σ_j. The second step is to compute the covariance matrix Σ:

Σ = (1/m) Σ_{i=1}^{m} (x^(i))(x^(i))ᵀ = (1/m) XᵀX

The third step is to compute the eigenvectors of the covariance matrix.
In MATLAB/Octave we can obtain them via the singular value decomposition: [U, S, V] = svd(Sigma).
Data normalization
function [X_norm, mu, sigma] = featureNormalize(X)
%FEATURENORMALIZE Normalizes the features in X
%   FEATURENORMALIZE(X) returns a normalized version of X where
%   the mean value of each feature is 0 and the standard deviation
%   is 1. This is often a good preprocessing step to do when
%   working with learning algorithms.

mu = mean(X);
X_norm = bsxfun(@minus, X, mu);

sigma = std(X_norm);
X_norm = bsxfun(@rdivide, X_norm, sigma);

% ============================================================
end
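The same normalization in NumPy (illustrative; note that MATLAB's std is the sample standard deviation, which corresponds to ddof=1):

```python
import numpy as np

def feature_normalize(X):
    """Zero-mean, unit-std columns; returns the statistics so the same
    transform can later be applied to new data."""
    mu = X.mean(axis=0)
    X_norm = X - mu
    sigma = X_norm.std(axis=0, ddof=1)   # sample std, matching MATLAB's std
    return X_norm / sigma, mu, sigma
```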
Implementation
Compute the covariance matrix and its eigenvectors; the code is as follows:

function [U, S] = pca(X)
%PCA Run principal component analysis on the dataset X
%   [U, S, X] = pca(X) computes eigenvectors of the covariance matrix of X
%   Returns the eigenvectors U, the eigenvalues (on diagonal) in S
%

% Useful values
[m, n] = size(X);

% You need to return the following variables correctly.
U = zeros(n);
S = zeros(n);

% ====================== YOUR CODE HERE ======================
% Instructions: You should first compute the covariance matrix. Then, you
%               should use the "svd" function to compute the eigenvectors
%               and eigenvalues of the covariance matrix.
%
% Note: When computing the covariance matrix, remember to divide by m (the
%       number of examples).
%
Sigma = X' * X / m;
[U, S, V] = svd(Sigma);

% =========================================================================
end
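The same computation in NumPy (an illustrative sketch; it assumes X has already been mean-normalized):

```python
import numpy as np

def pca(X):
    """Eigenvectors U (as columns) and eigenvalues S of the covariance
    matrix of X, obtained via SVD."""
    m = X.shape[0]
    Sigma = X.T @ X / m              # covariance matrix, shape (n, n)
    # Sigma is symmetric positive semidefinite, so its singular vectors
    # coincide with its eigenvectors
    U, S, _ = np.linalg.svd(Sigma)
    return U, S
```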
Visualizing the eigenvectors
Dimensionality reduction with PCA
U is an n×n matrix whose columns are the direction vectors with the smallest projection error onto the data. To reduce the data from n dimensions to K, we take the first K columns of U, obtaining an n×K matrix that we call U_reduce, and compute the new feature vectors as

Z = X · U_reduce

The code is as follows:
function Z = projectData(X, U, K)
%PROJECTDATA Computes the reduced data representation when projecting only
%on to the top k eigenvectors
%   Z = projectData(X, U, K) computes the projection of
%   the normalized inputs X into the reduced dimensional space spanned by
%   the first K columns of U. It returns the projected examples in Z.
%

% You need to return the following variables correctly.
Z = zeros(size(X, 1), K);

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the projection of the data using only the top K
%               eigenvectors in U (first K columns).
%               For the i-th example X(i,:), the projection on to the k-th
%               eigenvector is given as follows:
%                   x = X(i, :)';
%                   projection_k = x' * U(:, k);
%
Z = X * U(:, 1:K);

% =============================================================
end
Recovering the data
After compressing the data, we can approximately recover the original features as

X_rec = Z · U_reduceᵀ

The code is as follows:
function X_rec = recoverData(Z, U, K)
%RECOVERDATA Recovers an approximation of the original data when using the
%projected data
%   X_rec = RECOVERDATA(Z, U, K) recovers an approximation the
%   original data that has been reduced to K dimensions. It returns the
%   approximate reconstruction in X_rec.
%

% You need to return the following variables correctly.
X_rec = zeros(size(Z, 1), size(U, 1));

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the approximation of the data by projecting back
%               onto the original space using the top K eigenvectors in U.
%
%               For the i-th example Z(i,:), the (approximate)
%               recovered data for dimension j is given as follows:
%                   v = Z(i, :)';
%                   recovered_j = v' * U(j, 1:K)';
%
%               Notice that U(j, 1:K) is a row vector.
%
X_rec = Z * U(:, 1:K)';

% =============================================================
end
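The projection and recovery steps together in NumPy (illustrative; U is assumed orthonormal, as returned by the SVD):

```python
import numpy as np

def project_data(X, U, K):
    """Z = X * Ureduce: coordinates of X in the top-K eigenvector basis."""
    return X @ U[:, :K]

def recover_data(Z, U, K):
    """Approximate reconstruction X_rec = Z * Ureduce'."""
    return Z @ U[:, :K].T
```

When the data lies exactly in the span of the first K eigenvectors, this round trip is lossless; otherwise X_rec is the closest approximation within that subspace.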
Visualizing the projections