Overview

Unsupervised learning algorithms group data into clusters using distance (dissimilarity) or similarity measures computed between pairs of observations. This is typically done by building a distance matrix that contains the distance between every pair of observations.
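As a minimal sketch of the idea above, the following builds a symmetric pairwise distance matrix from a list of observations (the function name `distance_matrix` is illustrative, not from the original text):

```python
import math

def distance_matrix(points):
    """Return the symmetric matrix of pairwise Euclidean distances."""
    n = len(points)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.dist(points[i], points[j])  # Euclidean distance
            d[i][j] = d[j][i] = dist                # matrix is symmetric
    return d

observations = [(0, 0), (3, 4), (6, 8)]
print(distance_matrix(observations))
```

A clustering algorithm would then read cluster assignments off this matrix rather than the raw coordinates.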

Types of Distance Measures

1. Euclidean Distance

  • Definition: Calculated as the square root of the sum of squared differences between two vectors
  • Formula for two points P(x₁, y₁) and Q(x₂, y₂):
d = √[(x₂ - x₁)² + (y₂ - y₁)²]
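The formula generalizes to any number of dimensions; a small sketch (function name `euclidean` is illustrative):

```python
import math

def euclidean(p, q):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean((1, 2), (4, 6)))  # → 5.0, since √(3² + 4²) = 5
```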

2. Manhattan Distance

  • Also known as city block distance or L1 distance
  • Calculates distance between two points by following an orthogonal, grid-like path
  • Formula:
Manhattan(A,B) = |x₁ - x₂| + |y₁ - y₂|
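The same grid-path idea in code (function name `manhattan` is illustrative):

```python
def manhattan(p, q):
    """Sum of absolute coordinate differences (city block / L1 distance)."""
    return sum(abs(a - b) for a, b in zip(p, q))

print(manhattan((1, 2), (4, 6)))  # → 7, since |1-4| + |2-6| = 3 + 4
```

Note that for the same pair of points the Manhattan distance (7) is larger than the Euclidean distance (5): the grid path is never shorter than the straight line.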

Data Scaling Techniques

Why Scaling is Important

  1. Datasets often contain variables with different units (e.g., kg, hour, km)
  2. A quantity of 1000ml is not directly comparable to 1000g
  3. Ensures each variable carries a similar weight in the algorithm's decision-making
  4. Prevents variables with high numerical values from unduly influencing the algorithm's predictive power

1. Normalization (Min-Max Scaling)

  • Scales values to lie between 0 and 1
  • Formula:
y = (x - min)/(max - min)

Where:

  • y: normalized version of the variable
  • x: variable of interest
  • min: minimum value of variable x in the dataset
  • max: maximum value of variable x in the dataset
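A minimal sketch of min-max scaling applied to one variable (function name `min_max_scale` is illustrative; it assumes max ≠ min):

```python
def min_max_scale(values):
    """Rescale each value to [0, 1] via y = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

print(min_max_scale([2, 4, 6]))  # → [0.0, 0.5, 1.0]
```

The minimum always maps to 0 and the maximum to 1, so the result is sensitive to outliers in the data.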

2. Standardization (Z-score Normalization)

  • Transforms data to have a mean of 0 and standard deviation of 1
  • Formula:
y = (x - mean)/std_dev

Where:

  • y: standardized version of the variable
  • x: variable of interest
  • mean: arithmetic mean of the variable x in the dataset
  • std_dev: standard deviation of the variable x in the dataset
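A minimal sketch of z-score standardization (function name `standardize` is illustrative; it uses the population standard deviation, an assumption, since the text does not specify population vs. sample):

```python
import statistics

def standardize(values):
    """Transform values to mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumption)
    return [(x - mean) / std for x in values]

print(standardize([2, 4, 4, 4, 5, 5, 7, 9]))
```

Unlike min-max scaling, the result is not bounded to [0, 1], but the transformed variable has mean 0 and unit spread, so variables measured in different units become directly comparable.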

These distance measures and scaling techniques are foundational to clustering methods.