Major Clustering Approaches
spectral clustering
mean shift
Partitioning Algorithms: Basic Concept
Goal:
Construct a partitioning of a database D of n objects into a set of k (k <= n) clusters minimizing an objective function.
Exhaustively enumerating all possible partitionings into k sets in order to find the global minimum is too expensive.
k-Means Clustering
k-Means Clustering: Algorithm
k-Means: Voronoi Model for Convex Cluster Regions
For a given set of points P = {p1, . . . , pk } (here: cluster representatives), a Voronoi diagram partitions the data space into Voronoi cells, one cell per point
The cell of a point pi covers all points in the data space for which p is the nearest neighbors among the points from P
k-Means: Discussion
k-Means variants
median: ordinal
mode: categorical
medoid: metric (=PAM)
K-Means/Median/Mode/Medoid Clustering: Discussion
Initialization of Partitioning Clustering Methods
Choice of the Parameter k
how to measure the quality of a clustering?
measure should not be monotonic over k
The measures for the compactness of a clustering SSE and TD are monotonously decreasing with increasing value of k
—> Silhouette-Coecient
Quality measure for k-means or k-medoid clusterings that is not monotonic over k.
The Silhouette Coecient
reading the silouhette coefficient
Probabilistic Model-Based Methods
EM- clustering model
Iterative Optimization
EM: Turning the Soft Clustering into a Partitioning
EM: Discussion
find parameter k
If k is too high, the mixture may overfit the data
If k is too low, the mixture may not be flexible enough to approximate the data
Mean shift
Iterative Mode Search
Mean Shift: Core Algorithm
Apply iterative mode search for each data point. Group those that converge to the same mode (called Basin of Attraction).
Mean Shift: Extensions
Weighted Mean
Use different weights for the points in the window, with weights wx , resp. calculated by some kernel
Binning
First quantise data points to grid. Apply iterative mode seeking only once per bin.
Mean Shift: Discussion
Density-Based Clustering
DBSCAN
Density-Reachable vs Density-Connected
DBSCAN Algorithm
Density-Based Spatial Clustering of Applications with Noise
with
Evaluation: based on recursive database traversal.
DBSCAN - find parameters
DBSCAN: Discussion
Spectral Clustering
Find global minimum cut
General Steps
make graph from data
Create adjacency Matrix
Caluclate Eigenvalues and eigenvectors
choose useful number of eigenvalues
apply k means on eingenvectors
map results back to orginal data
Spectral Clustering : Preliminaries
Spectral Clustering: Discussion
Hierarchical Clustering: Basic Notions
result: dendrogramm
Dendrogram can be constructed
bottom-up (agglomerative approach) or
top down (divisive approach)
Link Methods - agglomerative clustering
single, complete, and average
Divisive Hierarchical Clustering
Discussion Agglomerative vs. Divisive HC
Density-Based Hierarchical Clustering
Core Distance and Reachability Distance
OPTICS
OPTICS: The Reachability Plot
Hierarchical Clustering: Discussion
Cluster Evaluation - external
Sc : cluster label
SB: ground truth cluster label
Cluster Evaluation external measures
external: needs ground truth
recall, prec, f1, confusion matrix
rand index, Adjusted Rand Index (ARI), Jaccard Coecient
(Shannon) Entropy, Conditional Entropy:
Mutual Information, Normalized Mutual Information (NMI), Adjusted Mutual Information (AMI)
Cluser evaluation - internal measures
no additional information needed
Cohesion
Average distance between objects of the same cluster
Cohesion of clustering is equal to weighted mean of the clusters’
cohesions.
seperation
Separation between to clusters: Average distance between pairs
Separation of clustering is equal to weighted mean of the clusters’ separations.
Last changed2 years ago