K-means Clustering Overview

K-means clustering is an unsupervised learning algorithm that groups similar data points into clusters. K refers to the number of clusters required.

Core Concept

K-means represents each cluster by its centroid, the geometric center (mean) of the points assigned to that cluster, and uses these centroids to determine the clusters it finds in the dataset.

Visual Understanding

  • Before clustering, all the groups in the data appear mixed together (shown in a single colour); after clustering, similar data points are gathered into clusters and depicted in different colours.
  • To the human eye it is difficult to see any structure in the unclustered data, which is why we use K-means clustering.

Steps in K-means Clustering

  1. Choose the number of clusters K
  2. Initialize the centroids
  3. Assign each data point to the closest centroid
  4. Update the centroid by taking the mean of the cluster
  5. Repeat steps 3 and 4 until convergence, i.e. until no discernible change in the centroids is observed (a minimal sketch of the full loop follows this list)
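
The steps above translate almost line-for-line into code. Below is a minimal NumPy sketch (the function name kmeans and its parameters are illustrative, not a reference implementation):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal K-means for an (n_samples, n_features) array X."""
    rng = np.random.default_rng(seed)
    # Step 2: initialize centroids by picking k distinct data points at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        # (assumes no cluster ends up empty; robust code would guard for that)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```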

Technical Details

In step 3, each data point is assigned to the nearest centroid, where "nearest" is measured by the distance between the data point and each centroid. The distance can be found using the Euclidean distance:

d = √[(x2-x1)² + (y2-y1)²]
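
For example, the distance between the points (1, 2) and (4, 6) is d = √[(4-1)² + (6-2)²] = √(9 + 16) = √25 = 5.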

In step 4, the centroid is updated to the mean of the cluster because the mean is the point that minimizes the sum of squared Euclidean distances (SSD) to the cluster's points. The lower a cluster's SSD, the more compact the cluster is, so SSD can be used to quantify cluster quality.
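
As a small sketch, SSD can be computed directly from the labels and centroids returned by the kmeans sketch above:

```python
import numpy as np

def ssd(X, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid;
    lower values indicate more compact clusters."""
    return sum(
        np.sum((X[labels == j] - c) ** 2)
        for j, c in enumerate(centroids)
    )
```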

Important Considerations

  1. It's always good to standardize/normalize the data before applying distance-based algorithms like k-means; otherwise, features measured on larger scales dominate the distance calculation.
  2. Since we initialize the centroids randomly, different initializations may lead to different clusters, so the algorithm may converge to a local optimum instead of the global optimum. It's better to run it several times with different initializations and keep the result with the lowest SSD (see the sketch after this list).
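
Both considerations are handled by common library tooling. A sketch using scikit-learn (the make_blobs toy data is only for illustration): the n_init parameter of KMeans reruns the algorithm with different random initializations and keeps the run with the lowest SSD, which it exposes as inertia_.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # toy data

# Consideration 1: put all features on the same scale
X_scaled = StandardScaler().fit_transform(X)

# Consideration 2: n_init=10 runs K-means with 10 random initializations
# and keeps the run with the lowest SSD
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
print(km.inertia_)  # SSD of the best run
```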

Evaluation Methods

  • K-means clustering requires a fixed value of K up front
  • One method to find the optimal K is the "elbow method" (a sketch follows this list)
  • In the elbow method, we take a range of K values. For each value of K:
    • We calculate the within-cluster sum of squares (WCSS), i.e. the SSD summed over all clusters
    • Then we plot WCSS vs K; the shape of the plot looks like an elbow
    • As K increases, WCSS decreases
    • The optimal K is at the elbow
    • Beyond the elbow, increasing K further yields only an insignificant fall in WCSS
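
A sketch of the elbow method using scikit-learn, where inertia_ gives the WCSS for each fitted model (the make_blobs toy data and the range 1-10 are illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # toy data

ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

# Plot WCSS vs K; the "elbow" marks the point where adding clusters
# stops producing a significant drop in WCSS
plt.plot(ks, wcss, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("WCSS")
plt.title("Elbow method")
plt.show()
```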

Assumptions

  1. K-means clustering is limited to linear cluster boundaries (a brief demonstration follows this list)
  2. All clusters are of the same size (equally compact)
  3. Clusters are isotropic, i.e. they have the same extent in every direction. This assumption breaks down when different measurements are in different units, which is another reason to standardize the data first
  4. Clusters have similar numbers of points assigned to them
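
As a quick demonstration of assumption 1, K-means struggles on the classic two-moons toy dataset, whose true groups are not separable by a linear boundary:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# Two interleaving half-moons: the true groups are non-convex, so K-means,
# which draws a linear boundary between the two centroids, mislabels many
# points where the moons curve into each other.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```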