Partitioning Around Medoids (PAM)
K-Means Characteristics
- Cluster centers can be arbitrary points in space
- Very sensitive to outliers!
- A single outlier can drag a cluster mean far away from the bulk of the data (see the demo below)
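A tiny numpy demonstration of this, with made-up numbers: one extreme value pulls the mean of a small sample far from where the bulk of the data sits.

```python
import numpy as np

points = np.array([1.0, 2.0, 3.0, 4.0])
with_outlier = np.append(points, 100.0)  # add a single extreme value

print(points.mean())        # 2.5  -- a sensible center for the bulk
print(with_outlier.mean())  # 22.0 -- one outlier moved the mean far away
```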
PAM Overview
- PAM is less sensitive to outliers
- The trade-off is a large price in computation time
- Cluster centers must be actual observations (medoids), i.e., each center is one of the samples themselves (a minimal sketch follows this list)
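Below is a minimal alternating k-medoids sketch in numpy. It is a simplified relative of PAM (full PAM uses a BUILD phase plus exhaustive SWAP steps); the function name and defaults are assumptions for illustration.

```python
import numpy as np

def k_medoids(X, k, n_iter=100, seed=0):
    """Alternating k-medoids: cluster centers are always actual samples."""
    rng = np.random.default_rng(seed)
    # Precompute all pairwise Euclidean distances (this O(n^2) table is
    # part of why medoid methods cost more than K-means).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        # Assign each point to its nearest medoid.
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            # New medoid: the member with the smallest total distance
            # to everything else in its cluster.
            costs = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break  # converged: no medoid changed
        medoids = new_medoids
    return medoids, labels
```

Because every center is forced to be a real sample, an outlier can at worst become its own medoid; it cannot drag another cluster's center away from the data.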
K-means Applications
K-means can be useful in:
- Wireless sensor networks
- Diagnostic systems
- Customer segmentation (sketched below)
- Call detail record (CDR) analysis
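As one concrete illustration, here is a customer-segmentation sketch with scikit-learn; the feature names and synthetic data are assumptions, not a real dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical features per customer: [annual_spend, visits_per_month, avg_basket]
rng = np.random.default_rng(0)
customers = rng.normal(loc=[500, 4, 30], scale=[200, 2, 10], size=(300, 3))

# Scale features so no single unit dominates the Euclidean distance.
X = StandardScaler().fit_transform(customers)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])      # segment assignment for the first 10 customers
print(kmeans.cluster_centers_)  # segment centroids (in scaled space)
```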
PCA and K-means Integration
- We can use PCA to identify outliers
- Then we can remove those outliers and run K-means clustering
- This lets us keep fast K-means instead of paying PAM's computational price, at the cost of an extra preprocessing step (see the sketch below)
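A sketch of that pipeline, assuming synthetic data and a simple distance-quantile rule for flagging outliers (the 97th-percentile cutoff is an arbitrary choice for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 5)),
               rng.normal(12, 1, (5, 5))])  # a few synthetic outliers

# Project onto the leading principal components and measure how far
# each point lies from the center of the projected cloud.
scores = PCA(n_components=2).fit_transform(X)
dist = np.linalg.norm(scores - scores.mean(axis=0), axis=1)

keep = dist < np.quantile(dist, 0.97)  # keep all but the most extreme points
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[keep])
```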
K-means Disadvantages
- K-means makes hard assignments: every data point is assigned to exactly one cluster
- If a point is equally close to two clusters, it is arbitrarily assigned to one of them
Gaussian Mixture Model (GMM)
- GMM is a soft version of K-means: one sample is allowed to belong partially to several clusters
- If a point is equally close to red and green, it will have a 50% probability of being red and a 50% probability of being green
- If a point is a bit closer to red, it could have a 55% probability of being red, 40% of being green, and 5% of being blue
- GMM also solves the problem of round-only clusters: it allows ellipsoidal cluster shapes
- GMM models each cluster with a Gaussian distribution
- The number of clusters can be determined in a statistically sound way, e.g., with the Bayesian Information Criterion (BIC), which gives a single well-defined objective for model selection (see the sketch after this list)
- The downside is speed: GMM is considerably slower, since fitting means, covariances, and soft assignments is a lot more complex than K-means
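A sketch of BIC-based model selection and soft assignment with scikit-learn's GaussianMixture; the synthetic data and the candidate range 1 to 5 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, (150, 2)),
               rng.normal([6, 3], [2.0, 0.5], (150, 2))])  # an elongated cluster

# Fit a GMM for each candidate number of components and keep the
# model with the lowest BIC (lower is better).
models = [GaussianMixture(n_components=k, covariance_type="full",
                          random_state=0).fit(X) for k in range(1, 6)]
best = min(models, key=lambda m: m.bic(X))
print(best.n_components)

# Soft assignment: per-sample membership probabilities over components.
print(best.predict_proba(X[:3]).round(2))
```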