Partitioning Around Medoids (PAM)

K-Means Characteristics

  • Cluster centers can be arbitrary points in space
  • Very sensitive to outliers!
  • A single outlier can drag a cluster mean far away from the bulk of its points, as the sketch below shows
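
A quick sketch of this sensitivity (scikit-learn assumed; the blob positions and the outlier are made-up illustrative values):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two tight blobs of 10 points each, around (0, 0) and (5, 5).
X = np.vstack([rng.normal(0, 0.3, (10, 2)),
               rng.normal(5, 0.3, (10, 2))])
km_clean = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Add a single outlier and refit.
X_out = np.vstack([X, [[15.0, 15.0]]])
km_out = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_out)

print(km_clean.cluster_centers_)  # centers near (0, 0) and (5, 5)
print(km_out.cluster_centers_)    # one center dragged from (5, 5) toward the outlier
```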

PAM Overview

  • PAM is less sensitive to outliers than K-means
  • In exchange, PAM makes us pay a large price in computation time
  • Each cluster center (a medoid) must be an actual observation, i.e., every "mean" is itself one of the samples (a minimal sketch follows this list)
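
A minimal, unoptimized sketch of PAM's swap search in plain NumPy (the function and variable names are our own; in practice one would use an optimized library such as KMedoids from scikit-learn-extra):

```python
import numpy as np

def pam(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    # Pairwise Euclidean distances; PAM minimizes total distance to medoids.
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    medoids = rng.choice(n, size=k, replace=False)

    def cost(meds):
        return D[:, meds].min(axis=1).sum()

    best = cost(medoids)
    for _ in range(max_iter):
        improved = False
        # SWAP step: try replacing each medoid with each non-medoid point.
        for i in range(k):
            for h in range(n):
                if h in medoids:
                    continue
                cand = medoids.copy()
                cand[i] = h
                c = cost(cand)
                if c < best:  # keep the swap only if total cost drops
                    medoids, best, improved = cand, c, True
        if not improved:      # no swap helps: local optimum reached
            break
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels

# Medoids are actual data points, so one extreme outlier cannot drag them away.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),
               rng.normal(5, 0.5, (20, 2)),
               [[40.0, 40.0]]])          # one extreme outlier
meds, labels = pam(X, k=2)
print(X[meds])                           # both medoids stay inside the blobs
```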

K-means Applications

K-means can be useful in:

  • Wireless sensor networks
  • Diagnostic systems
  • Customer segmentation
  • Call detail record (CDR) analysis

PCA and K-means Integration

  • We can use PCA to identify outliers
  • We then remove those outliers and run ordinary K-means on what remains
  • Doing PCA first therefore lets us avoid paying PAM's computational price (a sketch of this pipeline follows the list)
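
A rough sketch of this pipeline with scikit-learn; flagging points by their PCA reconstruction error is one common choice, and the 2-component fit and 95th-percentile cutoff here are arbitrary illustrative values:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 5)),
               rng.normal(8, 1, (100, 5)),
               rng.normal(0, 25, (6, 5))])   # a few gross outliers

# Points that a low-dimensional PCA cannot reconstruct well are outlier candidates.
pca = PCA(n_components=2).fit(X)
recon = pca.inverse_transform(pca.transform(X))
err = np.linalg.norm(X - recon, axis=1)

inliers = err < np.percentile(err, 95)        # keep the 95% best-reconstructed points
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[inliers])
```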

K-means Disadvantages

  • It makes hard assignments: every data point is assigned to exactly 1 cluster
  • If a point is equally close to two clusters, it is arbitrarily assigned to one of them (see the sketch below)
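
A tiny sketch of the hard assignment (scikit-learn assumed): even a point exactly midway between two centers receives exactly one label, with no notion of "50/50":

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0, 0.0], [0.1, 0.0], [3.9, 0.0], [4.0, 0.0]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

midpoint = km.cluster_centers_.mean(axis=0, keepdims=True)
print(km.predict(midpoint))   # a single hard label, even at the exact midpoint
```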

Gaussian Mixture Model (GMM)

  • GMM is a soft version of K-means, so 1 sample is allowed to belong partially to several clusters
  • If a point is equally close to red and green, it will have a 50% probability of being red and a 50% probability of being green
  • If a point is slightly closer to red, it could have a 55% probability of being red, a 40% probability of being green, and a 5% probability of being blue
  • GMM removes K-means' restriction to round clusters and also allows ellipsoidal cluster shapes
  • GMM models each cluster with its own Gaussian distribution
  • The number of clusters can be determined in a statistically sound way, for example using the Bayesian Information Criterion (BIC), which gives us a well-defined optimum to select (see the sketch after this list)
  • But GMM is much slower to fit, since it is considerably more complex than K-means
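
A sketch of both points, soft assignments and BIC-based selection of the number of clusters, using scikit-learn's GaussianMixture (the synthetic data and the candidate range k = 1..6 are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)),
               rng.normal(6, 1, (200, 2))])

# Choose the number of clusters by minimizing BIC over candidate values of k.
models = [GaussianMixture(n_components=k, random_state=0).fit(X)
          for k in range(1, 7)]
best = min(models, key=lambda m: m.bic(X))
print("chosen k:", best.n_components)

# Soft assignment: a midway point gets a probability for each cluster
# (roughly 50/50 here) instead of a single hard label.
print(best.predict_proba([[3.0, 3.0]]))
```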