What is clustering?
Clustering tries to find groups of similar data points and assigns each group a unique label
It is an unsupervised method (no labels are given in advance)
Name some hierarchical clustering methods
single linkage (dist between clusters is dist between closest points)
Pro: number of clusters
Con: comb-like trees (chaining effect)
complete linkage (dist between clusters is dist between furthest points)
Pro: compact classes
Con: one outlier can alter groups
average linkage (dist between clusters is avg dist over all pairs of points, one from each cluster)
Pro: similar size and variance
Con: also not robust
Ward's method (maximizes the between-group sum of squares while minimizing the within-group sum of squares)
Pro: minimizes within-group inertia
Con: small classes with high variability
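The four linkage rules above can be sketched with a naive agglomerative loop. This is only an illustration under assumptions: the helper names (`single`, `complete`, `average`, `ward`, `agglomerate`) and the toy points are made up for this card, and the Ward distance is written as the increase in within-group sum of squares caused by a merge. In practice a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` would normally be used instead.

```python
import math
from itertools import combinations

def centroid(cluster):
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

# Cluster-to-cluster distances for the four linkage rules.
def single(a, b):
    # Distance between the closest pair of points.
    return min(math.dist(p, q) for p in a for q in b)

def complete(a, b):
    # Distance between the furthest pair of points.
    return max(math.dist(p, q) for p in a for q in b)

def average(a, b):
    # Mean distance over all pairs (one point from each cluster).
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

def ward(a, b):
    # Increase in within-group sum of squares if a and b were merged.
    d2 = math.dist(centroid(a), centroid(b)) ** 2
    return len(a) * len(b) / (len(a) + len(b)) * d2

def agglomerate(points, linkage, k):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so index i stays valid
        del clusters[j]
    return clusters

# Hypothetical toy data: two well-separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

for name, rule in [("single", single), ("complete", complete),
                   ("average", average), ("ward", ward)]:
    print(name, [sorted(c) for c in agglomerate(points, rule, k=2)])
```

On data this cleanly separated all four rules agree; the pros and cons listed above only show up on noisier data, for example with outliers or elongated groups.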
Name a method to evaluate the quality of a clustering result
WSS (within-group sum of squared distances)
Can be used to find the ideal number of clusters by plotting WSS against the number of clusters and looking for the "elbow"
Calinski-Harabasz index
Combines WSS with BSS (between-group sum of squares); higher values indicate better separated clusters
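Both measures can be computed by hand. The sketch below assumes the usual definition of the Calinski-Harabasz index, CH = (BSS / (k - 1)) / (WSS / (n - k)) for n points and k clusters. The small k-means routine with farthest-point initialisation, the function names and the toy data are illustrative assumptions, not part of the original card.

```python
import math

def mean(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, iters=20):
    """Plain k-means; returns the list of non-empty clusters."""
    # Deterministic farthest-point initialisation (an illustrative choice).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster happens to be empty.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return [c for c in clusters if c]

def wss(clusters):
    # Within-group sum of squared distances to each cluster mean.
    return sum(math.dist(p, mean(c)) ** 2 for c in clusters for p in c)

def bss(clusters):
    # Between-group sum of squares: size-weighted distance of cluster means
    # to the overall mean.
    overall = mean([p for c in clusters for p in c])
    return sum(len(c) * math.dist(mean(c), overall) ** 2 for c in clusters)

def calinski_harabasz(clusters):
    n = sum(len(c) for c in clusters)
    k = len(clusters)
    return (bss(clusters) / (k - 1)) / (wss(clusters) / (n - k))

# Hypothetical toy data: three small groups of five points each.
offsets = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1)]
points = [(float(cx + dx), float(cy + dy))
          for cx, cy in [(0, 0), (10, 0), (5, 8)]
          for dx, dy in offsets]

# Print k, WSS and CH; the CH index is undefined for a single cluster.
for k in range(1, 7):
    clusters = kmeans(points, k)
    ch = calinski_harabasz(clusters) if len(clusters) > 1 else float("nan")
    print(k, round(wss(clusters), 2), round(ch, 2))
```

WSS can only decrease or stay the same as k grows, so the elbow (the point where the decrease flattens) is what matters rather than the minimum. The CH index, by contrast, is compared directly across values of k.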
What is cluster selection via stability?
The dataset is split in half and each half is clustered
Check the stability of the results by comparing both clusterings and how they differ
The less they differ, the better the chosen number of clusters
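A minimal sketch of this split-half idea follows. The card does not say how the two clusterings are compared, so the comparison used here is an assumption: every point of the full dataset is assigned to the nearest center found on each half, and the two resulting labelings are compared with the Rand index (fraction of point pairs on which both labelings agree). The k-means routine, function names and toy data are likewise illustrative.

```python
import math
import random
from itertools import combinations

def mean(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def nearest(p, centers):
    return min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))

def kmeans_centers(points, k, iters=20):
    """Plain k-means with farthest-point initialisation; returns the centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers

def rand_index(a, b):
    # Fraction of pairs treated the same way (together / apart) by both labelings.
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

def stability(points, k, n_splits=10, seed=0):
    """Average agreement between clusterings fitted on two random halves."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_splits):
        shuffled = points[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        centers_a = kmeans_centers(shuffled[:half], k)
        centers_b = kmeans_centers(shuffled[half:], k)
        # Assumption: compare by labeling the full dataset with each half's centers.
        labels_a = [nearest(p, centers_a) for p in points]
        labels_b = [nearest(p, centers_b) for p in points]
        scores.append(rand_index(labels_a, labels_b))
    return sum(scores) / len(scores)

# Hypothetical toy data: three groups on small 3x3 grids.
points = [(float(cx + dx), float(cy + dy))
          for cx, cy in [(0, 0), (10, 0), (5, 8)]
          for dx in (-1, 0, 1) for dy in (-1, 0, 1)]

for k in range(2, 6):
    print(k, round(stability(points, k), 3))
```

A score close to 1 means both halves lead to nearly the same grouping. Averaging over several random splits, as done here, is a design choice to reduce the influence of any single split.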
Last changed 5 months ago