# K-Means for Octave

Googling for K-Means for Octave brought me to this Matlab/Octave script http://www.christianherta.de/kmeans.html by Christian Herta. Works great, except that it ran very slow. Here is my improved version to run faster, most of the expensive for loops have been replaced with faster ones. The new version runs orders of magnitude faster.

```
function[centroid, pointsInCluster, assignment]= myKmeans(data, nbCluster)
% usage
% function[centroid, pointsInCluster, assignment]=
% myKmeans(data, nbCluster)
%
% Output:
% centroid: matrix in each row are the Coordinates of a centroid
% pointsInCluster: row vector with the nbDatapoints belonging to
% the centroid
% assignment: row Vector with clusterAssignment of the dataRows
%
% Input:
% data in rows
% nbCluster : nb of centroids to determine
%
% (c) by Christian Herta ( www.christianherta.de )
% Modified by Nghia Ho to improve speed

data_dim = length(data(1,:));
nbData   = length(data(:,1));

% init the centroids randomly
data_min = min(data);
data_max = max(data);
data_diff = data_max .- data_min ;

% every row is a centroid
centroid = rand(nbCluster, data_dim);
centroid = centroid .* repmat(data_diff, nbCluster, 1) + repmat(data_min, nbCluster, 1);

% no stopping at start
pos_diff = 1.;

% main loop until

while pos_diff &gt; 0.0
% E-Step
assignment = [];

% assign each datapoint to the closest centroid

if(nbCluster == 1) % special case
assignment = ones(size(data,1), 1);
else
dists = [];
for c = 1: nbCluster
d = data - repmat(centroid(c,:), size(data,1), 1);
d = d .* d;
d = sum(d, 2); % sum the row values

dists = [dists d];
end

[a, assignment] = min(dists');

assignment = assignment';
end

% for the stoppingCriterion
oldPositions = centroid;

% M-Step
% recalculate the positions of the centroids
centroid = zeros(nbCluster, data_dim);
pointsInCluster = zeros(nbCluster, 1);

for c = 1: nbCluster
indexes = find(assignment == c);
d = data(indexes,:);
centroid(c,:) = sum(d,1);
pointsInCluster(c,1) = size(d,1);

if( pointsInCluster(c, 1) != 0)
centroid( c , : ) = centroid( c, : ) / pointsInCluster(c, 1);
else
% set cluster randomly to new position
centroid( c , : ) = (rand( 1, data_dim) .* data_diff) + data_min;
end
end

%stoppingCriterion
pos_diff = sum (sum( (centroid .- oldPositions).^2 ) );
end
end```

## 10 thoughts on “K-Means for Octave”

1. ConfusingResults says:

Hi I’m new to machine learning, and I’ve been experimenting with the code provided above with moderate size matrix of numeric data… and the output(s) i’m looking at doesn’t look like the [expected] output.

For k=5, I’m getting what looks like the following:

>> myKmeans(QuranStats, 5)
ans =

5.1429e+00 1.8314e+02 3.5771e+03 1.5215e+04 2.9000e+01
5.0522e+01 4.3739e+01 4.2713e+02 1.8140e+03 2.9000e+01
1.8846e+01 1.1015e+02 1.5193e+03 6.3109e+03 2.9000e+01
3.0842e+01 8.0526e+01 9.2779e+02 3.8708e+03 2.9000e+01
8.7038e+01 1.8962e+01 1.0837e+02 4.6113e+02 2.5558e+01

This output is a matrix of 5 rows, one for each [alleged] cluster? (My data set has 5 attributes, so I’m guessing each cluster is described according to each attribute.) I believe the numbers should be representing probabilities, but these numbers are very high. (I should double-check my feature-scaling arithmetic.)

I get similar results for other values of k. Is something going wrong with the process? If so, could you please explain. If not, would you please give me an idea of how to interpret this effectively?

Thank you greatly for your time and attention.

1. nghiaho12 says:

K-means has no concept of probabilities. What it outputs is the cluster’s centroids/centre in N-dimensional space.

1. ConfusingResults says:

Thank you very much for your reply. I had a suspicion that these rows are coordinates for the k # of cluster centers. How do I know how many of my data points are assigned to each centroid?

1. nghiaho12 says:

That would be the “assignment” matrix that gets returned.

2. ConfusingResults says:

There should be 3 components to my outputs. But there’s only (as I showed initially). Any ideas on what could be going wrong?

1. nghiaho12 says:

You need to run it like

[centroid, pointsInCluster, assignment]= myKmeans(QuranStats, 5)

3. Keren says:

I may find a bug:
input:
data = [1 2;3 4;1 5];
n = 2;
[center,npoint, ass] = myKmeans(data,n);

output:
npoint =
2
2

I think this is obviously wrong……

1. nghiaho12 says:

I’ve updated the code slightly. But interestingly the data you given puts it into an infinite loop. The centroid keeps getting resetted randomly when no points are assigned to it. If you add 2 or more points to ‘data’ it seems to work. Something unstable with so few points.

4. Dmitry Negoda says:

Found a bug here. Instead of
centroid(c,:) = sum(d);

there should be
centroid(c,:) = sum(d,1);

Otherwise when there is one row in ‘d’, sum(d) returns sum of its elements instead of ‘d’ itself.

1. nghiaho12 says:

Good catch!