AI-017: Exercise: K-means Clustering

Article link: https://blog.youkuaiyun.com/hanjingjava/article/details/83585220

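This article walks through the K-means clustering workflow: randomly initializing the cluster centroids, assigning each sample to its closest centroid, updating the centroids, and iterating. Step-by-step code shows how the algorithm clusters two-dimensional data.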


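The K-means clustering algorithm:

1. Randomly initialize the cluster centroids.
2. Assign each sample to its closest centroid.
3. For each cluster formed in the previous step, compute the mean of its points and use that mean as the new centroid.
4. Repeat steps 2 and 3.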
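
To make the steps concrete, here is a minimal sketch of one round of steps 2 and 3 on a tiny data set. The four points and the choice of the first two samples as starting centroids are made-up values for illustration, not the exercise data; the starting centroids are fixed rather than random so the run is repeatable.

```octave
% Toy data: four 2-D points in two obvious groups (hypothetical values)
X = [1 1; 1 2; 8 8; 9 8];

% Step 1: pick K samples as the starting centroids (fixed here, not random)
centroids = X([1 2], :);
K = size(centroids, 1);

% Step 2: squared distance from every point to every centroid (m x K),
% then take the column index of the smallest distance in each row
D = zeros(size(X, 1), K);
for j = 1:K
  D(:, j) = sum(bsxfun(@minus, X, centroids(j, :)) .^ 2, 2);
end
[~, idx] = min(D, [], 2);

% Step 3: move each centroid to the mean of the points assigned to it
for j = 1:K
  centroids(j, :) = mean(X(idx == j, :), 1);
end

disp(idx');
disp(centroids);
```

With these values, the first round assigns only the first point to centroid 1 and moves the second centroid to (6, 6). Repeating steps 2 and 3 then pulls the two centroids toward the two groups.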

The two-dimensional data to be clustered (figure not included):

1. Randomly initialize the cluster centroids by picking K of the samples at random, where K is the desired number of clusters:

% Load an example dataset that we will be using
load('ex7data2.mat');

% Select an initial set of centroids
K = 3; % 3 Centroids
initial_centroids = kMeansInitCentroids(X, K);

function centroids = kMeansInitCentroids(X, K)
    % Randomly reorder the indices of examples
    randidx = randperm(size(X,1));
    % Take the first K examples as centroids
    centroids = X(randidx(1:K),:);
end
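
The reason for shuffling with randperm rather than drawing K random indices independently is that a permutation never repeats an index, so the K starting centroids come from K different rows. The sketch below checks this on a made-up five-point data set (hypothetical values). Note that if the data itself contains duplicate rows, two centroids can still coincide.

```octave
% Toy data (hypothetical values) and the number of centroids
X = [1 1; 1 2; 8 8; 9 8; 5 5];
K = 3;

% Same initialization as kMeansInitCentroids
randidx = randperm(size(X, 1));
centroids = X(randidx(1:K), :);

% Every centroid is an actual row of X ...
all_from_X = all(ismember(centroids, X, 'rows'));
% ... and no row index was used twice
distinct = numel(unique(randidx(1:K))) == K;

printf('from X: %d, distinct indices: %d\n', all_from_X, distinct);
```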

2. Assign each sample to its closest centroid

% Find the closest centroids for the examples using the
% initial_centroids
idx = findClosestCentroids(X, initial_centroids);

function idx = findClosestCentroids(X, centroids)
%FINDCLOSESTCENTROIDS computes the centroid memberships for every example
%   idx = FINDCLOSESTCENTROIDS (X, centroids) returns the closest centroids
%   in idx for a dataset X where each row is a single example. idx = m x 1 
%   vector of centroid assignments (i.e. each entry in range [1..K])
%

% Set K
K = size(centroids, 1);

% You need to return the following variables correctly.
idx = zeros(size(X,1), 1);

% ====================== YOUR CODE HERE ======================
% Instructions: Go over every example, find its closest centroid, and store
%               the index inside idx at the appropriate location.
%               Concretely, idx(i) should contain the index of the centroid
%               closest to example i. Hence, it should be a value in the 
%               range 1..K
%
% Note: You can use a for-loop over the examples to compute this.
%
% size of X
m = size(X,1);
for i=1:m
  % Start by assuming centroid 1 is the closest
  c_i = 1;
  minDist = sum((X(i,:) - centroids(1,:)).^2);
  for j=2:K
    anotherDist = sum((X(i,:) - centroids(j,:)).^2);
    if anotherDist < minDist
      minDist = anotherDist;
      c_i = j;
    end
  end
  idx(i) = c_i;
end

% =============================================================

end
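
The double loop can also be written without explicit loops. The following is a sketch of one such alternative, not the original author's code. It relies on the identity ||x - c||^2 = ||x||^2 - 2*x*c' + ||c||^2 to build the full distance matrix at once. The toy X and centroids are made-up values.

```octave
% Toy inputs (hypothetical values)
X = [1 1; 1 2; 8 8; 9 8];
centroids = [1 1.5; 8.5 8];

% D(i, j) is the squared distance between example i and centroid j.
% sum(X.^2, 2) is m x 1, sum(centroids.^2, 2)' is 1 x K; bsxfun expands
% them to m x K before the cross term is subtracted.
D = bsxfun(@plus, sum(X .^ 2, 2), sum(centroids .^ 2, 2)') ...
    - 2 * (X * centroids');

% The column index of the smallest entry in each row is the assignment
[~, idx] = min(D, [], 2);

disp(idx');
```

Like the loop with its strict `<` comparison, min returns the first centroid in case of an exact tie. Because the expanded formula rounds differently from the direct difference, the two versions can disagree on points that are almost exactly equidistant from two centroids.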

Result of the assignment step (figure not included):

3. Compute the mean of the points in each cluster from the previous step and use it as the new centroid

%  Compute means based on the closest centroids found in the previous part.
centroids = computeCentroids(X, idx, K);

function centroids = computeCentroids(X, idx, K)
%COMPUTECENTROIDS returns the new centroids by computing the means of the 
%data points assigned to each centroid.
%   centroids = COMPUTECENTROIDS(X, idx, K) returns the new centroids by 
%   computing the means of the data points assigned to each centroid. It is
%   given a dataset X where each row is a single data point, a vector
%   idx of centroid assignments (i.e. each entry in range [1..K]) for each
%   example, and K, the number of centroids. You should return a matrix
%   centroids, where each row of centroids is the mean of the data points
%   assigned to it.
%
% Useful variables
[m n] = size(X);

% You need to return the following variables correctly.
centroids = zeros(K, n);


% ====================== YOUR CODE HERE ======================
% Instructions: Go over every centroid and compute mean of all points that
%               belong to it. Concretely, the row vector centroids(i, :)
%               should contain the mean of the data points assigned to
%               centroid i.
%
% Note: You can use a for-loop over the centroids to compute this.
%
for i=1:K
 % Find all the points assigned to centroid i
 xi = X(idx==i,:);
 % Number of points assigned to centroid i
 ck = size(xi,1);
 % Calculate the mean; sum along dimension 1 so that a cluster with a
 % single point is not collapsed to a scalar
 centroids(i,:) = (1/ck) * sum(xi, 1);
end
% =============================================================


end
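
One case the update step does not cover is a centroid with no points assigned to it: the count is zero and the mean becomes NaN. A common workaround, shown here as a sketch and not as part of the original exercise, is to keep the previous position for such a centroid. The variable `previous` and all values below are hypothetical.

```octave
% Toy inputs (hypothetical values): no point is assigned to centroid 2
X = [1 1; 1 2; 8 8; 9 8];
idx = [1; 1; 3; 3];
previous = [1 1; 5 5; 8 8];   % centroids from the previous iteration
K = size(previous, 1);

% Start from the old positions and overwrite only non-empty clusters
centroids = previous;
for k = 1:K
  members = X(idx == k, :);
  if ~isempty(members)
    % Average along dimension 1 so a single-point cluster stays a row
    centroids(k, :) = mean(members, 1);
  end
end

disp(centroids);
```

Other strategies, such as re-seeding an empty centroid from a random sample, are equally reasonable; which one fits depends on the data.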

4. Repeat steps 2 and 3

% Settings for running K-Means
K = 3;
max_iters = 10;

% Run K-Means algorithm. The 'true' at the end tells our function to plot
% the progress of K-Means
[centroids, idx] = runkMeans(X, initial_centroids, max_iters, true);

function [centroids, idx] = runkMeans(X, initial_centroids, ...
                                      max_iters, plot_progress)
%RUNKMEANS runs the K-Means algorithm on data matrix X, where each row of X
%is a single example
%   [centroids, idx] = RUNKMEANS(X, initial_centroids, max_iters, ...
%   plot_progress) runs the K-Means algorithm on data matrix X, where each 
%   row of X is a single example. It uses initial_centroids as the
%   initial centroids. max_iters specifies the total number of iterations 
%   of K-Means to execute. plot_progress is a true/false flag that 
%   indicates if the function should also plot its progress as the 
%   learning happens. This is set to false by default. runkMeans returns 
%   centroids, a Kxn matrix of the computed centroids and idx, a m x 1 
%   vector of centroid assignments (i.e. each entry in range [1..K])
%
% Set default value for plot progress
if ~exist('plot_progress', 'var') || isempty(plot_progress)
    plot_progress = false;
end

% Plot the data if we are plotting progress
if plot_progress
    figure;
    hold on;
end

% Initialize values
[m n] = size(X);
K = size(initial_centroids, 1);
centroids = initial_centroids;
previous_centroids = centroids;
idx = zeros(m, 1);

% Run K-Means
for i=1:max_iters
    % Output progress
    fprintf('K-Means iteration %d/%d...\n', i, max_iters);
    if exist('OCTAVE_VERSION')
        fflush(stdout);
    end
    
    % For each example in X, assign it to the closest centroid
    idx = findClosestCentroids(X, centroids);
    
    % Optionally, plot progress here
    if plot_progress
        plotProgresskMeans(X, centroids, previous_centroids, idx, K, i);
        previous_centroids = centroids;
        fprintf('Press enter to continue.\n');
        pause;
    end
    
    % Given the memberships, compute new centroids
    centroids = computeCentroids(X, idx, K);
end
% Hold off if we are plotting progress
if plot_progress
    hold off;
end

end
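
runkMeans always runs for max_iters iterations. A possible variation, sketched here under assumptions and not taken from the original exercise, stops as soon as the centroids no longer move. The tolerance `tol` and the toy data are hypothetical.

```octave
% Toy data and starting centroids (made-up values for illustration)
X = [1 1; 1 2; 8 8; 9 8];
centroids = X([1 2], :);
K = size(centroids, 1);
max_iters = 10;
tol = 1e-6;   % hypothetical threshold, not part of the original exercise

for iter = 1:max_iters
  previous_centroids = centroids;

  % Assignment step: squared distance from every point to every centroid
  D = zeros(size(X, 1), K);
  for j = 1:K
    D(:, j) = sum(bsxfun(@minus, X, centroids(j, :)) .^ 2, 2);
  end
  [~, idx] = min(D, [], 2);

  % Update step: mean of the points assigned to each centroid
  for j = 1:K
    centroids(j, :) = mean(X(idx == j, :), 1);
  end

  % Stop once every centroid coordinate moved by less than tol
  if max(abs(centroids(:) - previous_centroids(:))) < tol
    break;
  end
end

fprintf('Stopped after %d of %d iterations\n', iter, max_iters);
```

Checking centroid movement is only one possible criterion; comparing the assignment vector idx between iterations would work as well.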

Result of the run: the three cluster centroids gradually settle into sensible positions (figure not included).