Overview
Unsupervised learning algorithms group data into clusters using distance or similarity/dissimilarity measures between pairs of observations. This is typically done by computing a distance matrix that contains the distance between each pair of observations.
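As a minimal sketch of the distance-matrix idea, the snippet below builds a pairwise Euclidean distance matrix for a few toy 2-D observations (the points themselves are made-up example data):

```python
import math

points = [(0, 0), (3, 4), (6, 8)]  # toy 2-D observations (assumed example data)

# Pairwise distance matrix: entry dist[i][j] is the distance between points i and j
dist = [[math.dist(p, q) for q in points] for p in points]
```

The matrix is symmetric with zeros on the diagonal; clustering algorithms such as hierarchical clustering can work directly from a matrix like this.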
Types of Distance Measures
1. Euclidean Distance
- Definition: Calculated as the square root of the sum of squared differences between two vectors
- Formula for two points P(x₁, y₁) and Q(x₂, y₂):
d = √[(x₂ - x₁)² + (y₂ - y₁)²]
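The formula above generalizes directly to vectors of any length. A small sketch in plain Python (the point coordinates are illustrative):

```python
import math

def euclidean_distance(p, q):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

# Distance between P(1, 2) and Q(4, 6): sqrt(3² + 4²) = 5.0
print(euclidean_distance((1, 2), (4, 6)))
```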
2. Manhattan Distance
- Also known as city block distance or L1 distance
- Calculates distance between two points by following an orthogonal, grid-like path
- Formula:
Manhattan(A,B) = |x₁ - x₂| + |y₁ - y₂|
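A corresponding sketch for the L1 distance, again with illustrative coordinates:

```python
def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (L1 / city block distance)."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

# |1 - 4| + |2 - 6| = 3 + 4 = 7
print(manhattan_distance((1, 2), (4, 6)))
```

Note that for the same pair of points the Manhattan distance (7) exceeds the Euclidean distance (5), since the grid path is never shorter than the straight line.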
Data Scaling Techniques
Why Scaling is Important
- Datasets often contain variables with different units (e.g., kg, hour, km)
- A quantity of 1000ml is not directly comparable to 1000g
- Ensures each variable carries a similar weight in the algorithm's decision-making
- Prevents variables with large numerical ranges from dominating the distance computations
1. Normalization (Min-Max Scaling)
- Scales values to lie between 0 and 1
- Formula:
y = (x - min)/(max - min)
Where:
- y: normalized version of the variable
- x: variable of interest
- min: minimum value of variable x in the dataset
- max: maximum value of variable x in the dataset
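The min-max formula can be applied to a whole column at once; a small sketch using made-up values:

```python
def min_max_scale(values):
    """Rescale a list of numbers to the [0, 1] range: y = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# The minimum maps to 0.0 and the maximum to 1.0
print(min_max_scale([10, 20, 30, 40]))
```

In practice a library scaler (e.g. scikit-learn's MinMaxScaler) is typically used so the same min and max can be reapplied to new data.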
2. Standardization (Z-score Normalization)
- Transforms data to have a mean of 0 and standard deviation of 1
- Formula:
y = (x - mean)/std_dev
Where:
- y: standardized version of the variable
- x: variable of interest
- mean: arithmetic mean of the variable x in the dataset
- std_dev: standard deviation of the variable x in the dataset
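A matching sketch for z-score standardization, using the population standard deviation and illustrative values:

```python
import statistics

def standardize(values):
    """Z-score each value: y = (x - mean) / std_dev."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

# The result has mean 0 and standard deviation 1
z = standardize([2, 4, 6, 8])
print(z)
```

Unlike min-max scaling, the standardized values are not bounded to a fixed range, but outliers distort them less severely.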
These distance measures and scaling techniques are foundational to clustering methods.