Overview

Unsupervised learning algorithms group data into clusters using distance (dissimilarity) or similarity measures computed between pairs of observations. This is typically done by building a distance matrix that contains the distance between every pair of observations.
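As a minimal sketch of the idea above, the following builds a symmetric pairwise distance matrix from a list of observations (the function name `distance_matrix` is illustrative, not from the original text):

```python
import math

def distance_matrix(points):
    """Return the symmetric matrix of pairwise Euclidean distances."""
    n = len(points)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.dist(points[i], points[j])  # Euclidean distance
            d[i][j] = d[j][i] = dist                # matrix is symmetric
    return d

observations = [(0, 0), (3, 4), (6, 8)]
print(distance_matrix(observations))
```

A clustering algorithm would then read cluster assignments off this matrix rather than the raw coordinates.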

Types of Distance Measures

1. Euclidean Distance

  • Definition: Calculated as the square root of the sum of squared differences between two vectors
  • Formula for two points P(x₁, y₁) and Q(x₂, y₂):
d = √[(x₂ - x₁)² + (y₂ - y₁)²]
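The formula generalizes to any number of dimensions; a small sketch (function name `euclidean` is illustrative):

```python
import math

def euclidean(p, q):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean((1, 2), (4, 6)))  # → 5.0, since √(3² + 4²) = 5
```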

2. Manhattan Distance

  • Also known as city block distance or L1 distance
  • Calculates distance between two points by following an orthogonal, grid-like path
  • Formula:
Manhattan(A,B) = |x₁ - x₂| + |y₁ - y₂|
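The same grid-path idea in code (function name `manhattan` is illustrative):

```python
def manhattan(p, q):
    """Sum of absolute coordinate differences (city block / L1 distance)."""
    return sum(abs(a - b) for a, b in zip(p, q))

print(manhattan((1, 2), (4, 6)))  # → 7, since |1-4| + |2-6| = 3 + 4
```

Note that for the same pair of points the Manhattan distance (7) is larger than the Euclidean distance (5): the grid path is never shorter than the straight line.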

Data Scaling Techniques

Why Scaling is Important

  1. Datasets often contain variables with different units (e.g., kg, hour, km)
  2. A quantity of 1000ml is not directly comparable to 1000g
  3. Ensures each variable carries a similar weight in the algorithm's decision-making
  4. Prevents variables with high numerical values from unduly influencing the algorithm's predictive power

1. Normalization (Min-Max Scaling)

  • Scales values to lie between 0 and 1
  • Formula:
y = (x - min)/(max - min)

Where:

  • y: normalized version of the variable
  • x: variable of interest
  • min: minimum value of variable x in the dataset
  • max: maximum value of variable x in the dataset
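A minimal sketch of min-max scaling applied to one variable (function name `min_max_scale` is illustrative; it assumes max ≠ min):

```python
def min_max_scale(values):
    """Rescale each value to [0, 1] via y = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

print(min_max_scale([2, 4, 6]))  # → [0.0, 0.5, 1.0]
```

The minimum always maps to 0 and the maximum to 1, so the result is sensitive to outliers in the data.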

2. Standardization (Z-score Normalization)

  • Transforms data to have a mean of 0 and standard deviation of 1
  • Formula:
y = (x - mean)/std_dev

Where:

  • y: standardized version of the variable
  • x: variable of interest
  • mean: arithmetic mean of the variable x in the dataset
  • std_dev: standard deviation of the variable x in the dataset
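A minimal sketch of z-score standardization (function name `standardize` is illustrative; it uses the population standard deviation, an assumption, since the text does not specify population vs. sample):

```python
import statistics

def standardize(values):
    """Transform values to mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumption)
    return [(x - mean) / std for x in values]

print(standardize([2, 4, 4, 4, 5, 5, 7, 9]))
```

Unlike min-max scaling, the result is not bounded to [0, 1], but the transformed variable has mean 0 and unit spread, so variables measured in different units become directly comparable.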

These distance measures and scaling techniques are foundational to clustering methods.