
Coursera "Machine Learning" Programming Exercise 7: K-means Clustering and Principal Component Analysis

2017-09-30 19:24

K-means

K-means is an iterative algorithm. It takes an unlabeled dataset and groups the data into clusters. Suppose we want to partition the data into K clusters; the method is:

1. First choose K random points, called the cluster centroids.

2. For every example in the dataset, compute its distance to each of the K centroids and associate it with the closest one; all points associated with the same centroid form one cluster.

3. For each cluster, compute the mean of its points and move the associated centroid to that mean.

4. Repeat steps 2-3 until the centroids no longer move.

In MATLAB, the outer loop looks like this:

% Initialize centroids
centroids = kMeansInitCentroids(X, K);
for iter = 1:iterations
    % Cluster assignment step: assign each data point to the
    % closest centroid. idx(i) corresponds to c^(i), the index
    % of the centroid assigned to example i
    idx = findClosestCentroids(X, centroids);

    % Move centroid step: compute means based on centroid
    % assignments
    centroids = computeCentroids(X, idx, K);
end


We first run K-means on a simple 2D dataset to get an intuitive feel for how the algorithm behaves.



Random Initialization

The assignment fixes K = 3 and sets the initial centroid positions to [3 3; 6 2; 8 5].

In practice, a common initialization strategy is to pick K examples from the dataset at random and use them as the centroids:

function centroids = kMeansInitCentroids(X, K)
%KMEANSINITCENTROIDS This function initializes K centroids that are to be
%used in K-Means on the dataset X
%   centroids = KMEANSINITCENTROIDS(X, K) returns K initial centroids to be
%   used with the K-Means on the dataset X
%

% You should return these values correctly
centroids = zeros(K, size(X, 2));

% ====================== YOUR CODE HERE ======================
% Instructions: You should set centroids to randomly chosen examples from
%               the dataset X
%
% Initialize the centroids to be random examples

% Randomly reorder the indices of examples
randidx = randperm(size(X, 1));
% Take the first K examples as centroids
centroids = X(randidx(1:K), :);

% =============================================================

end


One weakness of K-means is that it can converge to a local minimum of the cost function, depending on how the centroids are initialized.

To mitigate this, we usually run K-means many times, each run starting from a fresh random initialization, and keep the run whose cost (distortion) is lowest. This works well when K is small (say 2-10); for large K, multiple restarts may not bring a noticeable improvement.
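
A minimal sketch of this restart loop, built from the functions implemented below (the restart count of 100 and the explicit distortion computation are my own additions, not part of the assignment):

% Run K-means from 100 random initializations and keep the clustering
% with the lowest distortion J = (1/m) * sum_i ||x^(i) - mu_c(i)||^2
bestCost = Inf;
for t = 1:100
    centroids = kMeansInitCentroids(X, K);
    for iter = 1:iterations
        idx = findClosestCentroids(X, centroids);
        centroids = computeCentroids(X, idx, K);
    end
    J = mean(sum((X - centroids(idx, :)).^2, 2));   % distortion of this run
    if J < bestCost
        bestCost = J;
        bestCentroids = centroids;
        bestIdx = idx;
    end
end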

Finding the Closest Centroids

For each example, find the centroid j that minimizes the squared Euclidean distance:

c^(i) := argmin_j ||x^(i) - μ_j||^2

The implementation:

function idx = findClosestCentroids(X, centroids)
%FINDCLOSESTCENTROIDS computes the centroid memberships for every example
%   idx = FINDCLOSESTCENTROIDS (X, centroids) returns the closest centroids
%   in idx for a dataset X where each row is a single example. idx = m x 1
%   vector of centroid assignments (i.e. each entry in range [1..K])
%

% Set K
K = size(centroids, 1);

% You need to return the following variables correctly.
idx = zeros(size(X,1), 1);

% ====================== YOUR CODE HERE ======================
% Instructions: Go over every example, find its closest centroid, and store
%               the index inside idx at the appropriate location.
%               Concretely, idx(i) should contain the index of the centroid
%               closest to example i. Hence, it should be a value in the
%               range 1..K
%
% Note: You can use a for-loop over the examples to compute this.
%
for i = 1:size(X, 1)
    % Squared Euclidean distance from example i to every centroid
    M = sum((repmat(X(i, :), K, 1) - centroids).^2, 2);
    % Index of the closest centroid; using min avoids the problem that
    % find(M == min(M)) can return multiple indices on ties
    [~, idx(i)] = min(M);
end
% =============================================================

end

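
For larger datasets the loop can also be vectorized; this variant is my own sketch, not part of the assignment. It expands ||x - c||^2 = ||x||^2 + ||c||^2 - 2x'c into a single matrix expression:

% m x K matrix of squared distances between every example and every centroid
D = bsxfun(@plus, sum(X.^2, 2), sum(centroids.^2, 2)') - 2 * X * centroids';
% For each example (row), take the column index of the nearest centroid
[~, idx] = min(D, [], 2);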

Computing Centroid Means

The centroid of cluster k is recomputed as the mean of the examples assigned to it:

μ_k := (1 / |C_k|) Σ_{i ∈ C_k} x^(i),   where C_k is the set of examples assigned to centroid k

The implementation:

function centroids = computeCentroids(X, idx, K)
%COMPUTECENTROIDS returns the new centroids by computing the means of the
%data points assigned to each centroid.
%   centroids = COMPUTECENTROIDS(X, idx, K) returns the new centroids by
%   computing the means of the data points assigned to each centroid. It is
%   given a dataset X where each row is a single data point, a vector
%   idx of centroid assignments (i.e. each entry in range [1..K]) for each
%   example, and K, the number of centroids. You should return a matrix
%   centroids, where each row of centroids is the mean of the data points
%   assigned to it.
%

% Useful variables
[m, n] = size(X);

% You need to return the following variables correctly.
centroids = zeros(K, n);

% ====================== YOUR CODE HERE ======================
% Instructions: Go over every centroid and compute mean of all points that
%               belong to it. Concretely, the row vector centroids(i, :)
%               should contain the mean of the data points assigned to
%               centroid i.
%
% Note: You can use a for-loop over the centroids to compute this.
%
for i = 1:K
    % Examples currently assigned to centroid i
    id = find(idx == i);
    % Sum along dimension 1 so a single-member cluster still yields a
    % row vector (plain sum(X(id,:)) would collapse one row to a scalar)
    centroids(i, :) = sum(X(id, :), 1) / numel(id);
end

% =============================================================

end

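
This step can likewise be vectorized; a sketch of my own (assuming m examples, no empty clusters, and the idx vector from the assignment step):

% K x m indicator matrix: row k has ones at the examples assigned to centroid k
Ind = full(sparse(idx, 1:m, 1, K, m));
% Per-cluster sums divided by per-cluster counts
centroids = bsxfun(@rdivide, Ind * X, sum(Ind, 2));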

Running K-means

Putting the two steps together gives a runnable K-means; after 10 iterations it produces the figure below (a minimal driver is sketched after it).


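For reference, a minimal driver for the run above (a sketch: ex7data2.mat is the 2D dataset file shipped with the assignment, and plotting is omitted):

% Load the 2D dataset (provides the data matrix X)
load('ex7data2.mat');
K = 3;
max_iters = 10;
% The fixed initial centroids used by the assignment
centroids = [3 3; 6 2; 8 5];
for iter = 1:max_iters
    idx = findClosestCentroids(X, centroids);
    centroids = computeCentroids(X, idx, K);
end
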

Principal Component Analysis

We first implement PCA on a simple 2D dataset.

The 2D Dataset



Implementing PCA

The first step is mean normalization: compute the mean μ_j of every feature and replace x_j with x_j - μ_j. If the features are on different scales, we also divide each feature by its standard deviation σ_j.

The second step is to compute the covariance matrix Σ:

Σ = (1/m) Σ_{i=1}^{m} x^(i) (x^(i))^T = (1/m) X^T X

The third step is to compute the eigenvectors of the covariance matrix:

In MATLAB/Octave we can obtain them with the singular value decomposition: [U, S, V] = svd(Sigma). Because Sigma is symmetric positive semi-definite, the columns of U returned by svd are exactly its eigenvectors, and the diagonal of S holds the corresponding eigenvalues.

Feature Normalization

function [X_norm, mu, sigma] = featureNormalize(X)
%FEATURENORMALIZE Normalizes the features in X
%   FEATURENORMALIZE(X) returns a normalized version of X where
%   the mean value of each feature is 0 and the standard deviation
%   is 1. This is often a good preprocessing step to do when
%   working with learning algorithms.

% Subtract the per-feature mean
mu = mean(X);
X_norm = bsxfun(@minus, X, mu);

% Divide by the per-feature standard deviation
sigma = std(X_norm);
X_norm = bsxfun(@rdivide, X_norm, sigma);

% ============================================================

end


Implementation

Compute the covariance matrix and then its eigenvectors:

function [U, S] = pca(X)
%PCA Run principal component analysis on the dataset X
%   [U, S] = pca(X) computes eigenvectors of the covariance matrix of X
%   Returns the eigenvectors U, the eigenvalues (on diagonal) in S
%

% Useful values
[m, n] = size(X);

% You need to return the following variables correctly.
U = zeros(n);
S = zeros(n);

% ====================== YOUR CODE HERE ======================
% Instructions: You should first compute the covariance matrix. Then, you
%               should use the "svd" function to compute the eigenvectors
%               and eigenvalues of the covariance matrix.
%
% Note: When computing the covariance matrix, remember to divide by m (the
%       number of examples).
%
% Covariance matrix of the (already normalized) data
Sigma = X' * X / m;
% For the symmetric PSD matrix Sigma, svd returns its eigenvectors in U
% and its eigenvalues on the diagonal of S
[U, S, V] = svd(Sigma);

% =========================================================================

end

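A typical call sequence, consistent with the exercise's driver script: normalize first, then run pca on the normalized data.

[X_norm, mu, sigma] = featureNormalize(X);
[U, S] = pca(X_norm);
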

Visualizing the Eigenvectors



Dimensionality Reduction with PCA

The matrix U returned by svd is n×n; its columns are the directions onto which projecting the data gives the smallest projection error. To reduce the data from n dimensions to K dimensions, we take the first K columns of U, obtaining an n×K matrix U_reduce, and compute the new feature vector z as:

z = U_reduce^T x    (in matrix form: Z = X * U_reduce)

The implementation:

function Z = projectData(X, U, K)
%PROJECTDATA Computes the reduced data representation when projecting only
%   onto the top k eigenvectors
%   Z = projectData(X, U, K) computes the projection of
%   the normalized inputs X into the reduced dimensional space spanned by
%   the first K columns of U. It returns the projected examples in Z.
%

% You need to return the following variables correctly.
Z = zeros(size(X, 1), K);

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the projection of the data using only the top K
%               eigenvectors in U (first K columns).
%               For the i-th example X(i,:), the projection on to the k-th
%               eigenvector is given as follows:
%                    x = X(i, :)';
%                    projection_k = x' * U(:, k);
%
Z = X * U(:,1:K);
% =============================================================

end


Recovering the Data

After compressing the data, we can approximately recover the original features as follows:

x_approx = U_reduce z    (in matrix form: X_rec = Z * U_reduce^T)
The implementation:

function X_rec = recoverData(Z, U, K)
%RECOVERDATA Recovers an approximation of the original data when using the
%projected data
%   X_rec = RECOVERDATA(Z, U, K) recovers an approximation of the
%   original data that has been reduced to K dimensions. It returns the
%   approximate reconstruction in X_rec.
%

% You need to return the following variables correctly.
X_rec = zeros(size(Z, 1), size(U, 1));

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the approximation of the data by projecting back
%               onto the original space using the top K eigenvectors in U.
%
%               For the i-th example Z(i,:), the (approximate)
%               recovered data for dimension j is given as follows:
%                    v = Z(i, :)';
%                    recovered_j = v' * U(j, 1:K)';
%
%               Notice that U(j, 1:K) is a row vector.
%
X_rec = Z * U(:,1:K)';

% =============================================================

end

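A minimal usage sketch tying projection and recovery together (K = 1 on the normalized 2D data, as in the assignment's demo):

K = 1;
Z = projectData(X_norm, U, K);    % compress each example to 1 dimension
X_rec = recoverData(Z, U, K);     % approximate reconstruction back in 2D
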

Visualizing the Projections

Tags: Machine Learning