Clustering in Network Analysis
- Finding Groups in Network Communities
- When given a dataset, finding groups is often the first task
- Methods mentioned: PCA and t-SNE (dimensionality-reduction techniques often used to visualize group structure)
- Purpose: Identifying clusters within unstructured data
- Definition and Characteristics
- Clustering is a method for grouping data points so that points in the same group are more similar to each other than to points in other groups
- It falls under unsupervised learning
- Mathematical Complexity
- For N samples and k clusters: k^N possible assignments
- Example given: N=100, k=3: 3^100 ≈ 5.15 × 10^47 assignments
- This demonstrates why exhaustive search is computationally infeasible
- Common Clustering Methods
- k-means clustering
- Gaussian mixture models
- Hierarchical clustering
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
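The assignment count and the first method in the list can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `count_assignments` and `kmeans`, the toy points, and the evenly spaced initial centers are all assumptions made for this example (practical k-means implementations usually start from random or k-means++ centers).

```python
def count_assignments(n_samples, k):
    """Number of ways to give each of n_samples one of k cluster labels."""
    return k ** n_samples


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=20):
    """Plain k-means (Lloyd's algorithm) on a list of coordinate tuples.

    Assumption for this sketch: initial centers are evenly spaced picks
    from the input list, which keeps the example deterministic.
    """
    n = len(points)
    centers = [points[i * n // k] for i in range(k)]
    labels = [0] * n
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged: centers stopped moving
        centers = new_centers
    return labels, centers


# The example from the notes: N=100 samples, k=3 clusters.
print(f"{count_assignments(100, 3):.2e}")  # roughly 5.15e+47

# Hypothetical toy data: two well-separated groups of three points.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(toy_points, k=2)
print(labels)
```

k-means never enumerates the k^N assignments. It alternates two cheap steps and stops at a local optimum, which is why it stays practical where exhaustive search does not.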
Statistical Analysis and P-values
- P-value Definition
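- Probability, assuming the null hypothesis is true (e.g., the treatment has no effect), of observing a result at least as extreme as the one seen in the controlled experiment
- Significance level is typically 5%
- If the p-value is under 5%, the result is statistically significant and the method is considered effective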
- Example Case Study
- Treatment group: 29 deaths
- Control group: 63 deaths
- P-value: 0.012
- Conclusion: Since the p-value (0.012) is under 5%, the difference is statistically significant and the treatment is considered effective
- Error Types Mentioned
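- Type I Error: A Type I error means rejecting the null hypothesis when it's actually true. The null hypothesis distribution gives the probabilities of obtaining all possible results if the study were repeated with new samples and the null hypothesis were true in the population; the significance level is the probability of a Type I error under that distribution.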
- Type II Error: A Type II error means not rejecting the null hypothesis when it’s actually false. This is not quite the same as “accepting” the null hypothesis, because hypothesis testing can only tell you whether to reject the null hypothesis.
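The decision rule above (compare the p-value with a 5% significance level) can be sketched with an exact two-sided binomial test. This is an illustration under stated assumptions: the counts are hypothetical and are not the case-study data (the notes do not give group sizes, so the 0.012 figure cannot be recomputed here), and `binomial_p_value` and `is_significant` are names invented for this example.

```python
from math import comb

ALPHA = 0.05  # significance level from the notes (5%)


def binomial_p_value(successes, trials, p_null=0.5):
    """Two-sided exact binomial p-value.

    Sums the null-hypothesis probabilities of every outcome that is
    no more likely than the observed one.
    """
    def pmf(x):
        return comb(trials, x) * p_null ** x * (1 - p_null) ** (trials - x)

    observed = pmf(successes)
    # Small relative tolerance so outcomes tied with the observed one are counted.
    total = sum(pmf(x) for x in range(trials + 1)
                if pmf(x) <= observed * (1 + 1e-9))
    return min(1.0, total)


def is_significant(p_value, alpha=ALPHA):
    """Reject the null hypothesis when the p-value is below alpha."""
    return p_value < alpha


# Hypothetical data: 61 successes in 100 trials, null hypothesis p = 0.5.
p = binomial_p_value(61, 100)
print(f"p-value = {p:.4f}, reject null: {is_significant(p)}")
```

With a 5% threshold, a true null hypothesis is still rejected in up to 5% of repeated studies (a Type I error). A false null hypothesis that produces a p-value above 5% goes unrejected (a Type II error).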