What is the difference between clustering and classification?
In clustering the class labels of training data are unknown
Given a set of measurements/observations, the aim is to establish the existence of classes or clusters in the data
Classification is supervised: training data are accompanied by labels indicating the class of the observations
Classes are known in advance
Data is classified based on information extracted from the training set
What is clustering?
Grouping a set of data objects into clusters which are collections of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
What are typical usages of clustering?
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
What are general applications of clustering?
Preprocessing, e.g. as a data reduction step
Pattern recognition
Image processing
Spatial data analysis (creating thematic maps)
Business intelligence
Biology e.g. clustering of gene expression data
What are major clustering approaches?
Partitioning algorithms
Probabilistic Model-Based Clustering
Density-based (connectivity and density functions)
Hierarchical algorithms
What is the basic concept of partitioning algorithms?
Partition a database D of n objects into a set of k clusters such that an objective function (e.g. the sum of squared distances to the cluster representatives) is minimized; each object belongs to exactly one cluster
What is the popular heuristic method for partitioning?
Choose k representatives for clusters e.g. randomly
Improve these representatives iteratively:
assign each object to the cluster it “fits best” in the current clustering
Compute new cluster representatives based on these assignments
Repeat until the change in the objective function from one iteration to the next drops below a threshold
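A minimal NumPy sketch of this heuristic for the k-means case (Lloyd's algorithm); function and parameter names are illustrative, not taken from the source:

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, rng=None):
    """Lloyd's algorithm sketch: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(rng)
    # Choose k representatives (centroids) randomly from the data
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    prev_sse = np.inf
    for _ in range(max_iter):
        # Assignment step: each object goes to the cluster it "fits best"
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute representatives as cluster means
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
        # Stop when the objective (SSE) changes less than the threshold
        sse = ((X - centroids[labels]) ** 2).sum()
        if prev_sse - sse < tol:
            break
        prev_sse = sse
    return labels, centroids, sse
```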
In k-means what is each cluster represented by? How is this different to k-medoid?
By the center of the cluster in k-means
By one of its objects in k-medoid
What is the basic idea/ objective behind k-means Clustering?
Idea
Find a clustering such that the intra-cluster variation of each cluster is small and use the centroid (cluster mean) as representative
Objective
For a given k, form k groups so that the sum of the (squared) distances between the mean of the groups and their elements is minimal
How are objects and the measure for compactness in k-means Clustering mathematically expressed?
Objects p = (p1,…,pd) are points in a d-dimensional vector space
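The compactness measure itself is missing on this card; a reconstruction of the standard sum of squared errors (SSE), with the centroid $\mu_{C_j}$ taken as the mean of cluster $C_j$:

$$\mu_{C_j} = \frac{1}{|C_j|}\sum_{p \in C_j} p, \qquad SSE(\mathcal{C}) = \sum_{j=1}^{k} \sum_{p \in C_j} \lVert p - \mu_{C_j} \rVert^2$$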
Why are efficient heuristic algorithms needed for k-means Clustering?
Finding the optimal partitioning argmin_C SSE(C) is NP-hard
What is Lloyd’s algorithm?
The standard iterative heuristic for k-means: alternately assign each object to its closest centroid and recompute each centroid as the mean of its assigned objects, until the assignments (or the objective) no longer change
What is a Voronoi Diagram?
For a given set of points P = p1,…,pk, a Voronoi diagram partitions the data space into Voronoi cells, one cell per point
The cell of a point p covers all data points in the data space for which p is the nearest neighbor among the points from P
What are properties of Voronoi cells?
The Voronoi cells of two neighbouring points pi, pj are separated by the perpendicular hyperplane between pi and pj
Voronoi cells are intersections of half-spaces and thus convex regions
What is the strength of k-Means?
Relatively efficient: O(tkn)
n: # objects
k: # clusters
t: # iterations
Easy implementation
What are the weaknesses of k-means?
Applicable only when the mean is defined
Need to specify k, the number of clusters, in advance
Sensitive to noisy data and outliers
Clusters are forced to convex space partitions (Voronoi cells)
Result and runtime strongly depend on the initial partition; often terminates at a local optimum; however, methods for a good initialization exist
What are alternatives to mean representatives of k-means? Why have alternatives?
Median: (artificial) representative object “in the middle”
Mode: value that appears most often
Medoid: representative object “in the middle”
Find k representatives so that the sum of total distances (TD) between objects and their closest representative is minimal (more robust against outliers)
What is the idea behind k-median?
If there is an ordering on the data use median instead of mean
Compute median separately per dimension
What is the naive initialization method for partitioning clustering?
Choose sample A of the dataset
Cluster A and use centers as initialization
What is the k-means++ initialization clustering method?
Select first center uniformly at random
Choose next point with probability proportional to the squared distance to the nearest center already chosen
Repeat until k centers have been selected
Guarantees an approximation ratio of O(log k)
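A short sketch of this seeding procedure (illustrative names, assuming a NumPy array X of shape (n, d)):

```python
import numpy as np

def kmeans_pp_init(X, k, rng=None):
    """k-means++ seeding sketch: returns k initial centers chosen from X."""
    rng = np.random.default_rng(rng)
    n = len(X)
    centers = [X[rng.integers(n)]]                      # first center: uniform at random
    for _ in range(k - 1):
        # Squared distance of every point to its nearest already chosen center
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        # Next center: probability proportional to that squared distance
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(n, p=probs)])
    return np.array(centers)
```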
What is the general approach to initializing partitioning clustering methods?
Repeat with different initial centers and choose result with lowest clustering error
What is the Silhouette Coefficient?
Quality measure for k-means or k-medoid clusterings that is not monotonic over k
Not monotonic means that increasing k does not by itself guarantee a better (or worse) value of the measure
What is the basic idea behind the Silhouette Coefficient?
How good is the clustering = how appropriate is the mapping of objects to clusters
Elements in a cluster should be “similar” to their representative
measure the average distance of objects to their representative
Elements in different clusters should be “dissimilar”
measure the average distance of objects to alternative clusters (i.e. the second-closest cluster)
Silhouette Coefficient mathematical expression
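The expression is missing on this card; the standard definition, with a(o) the average distance of o to the objects of its own cluster and b(o) the average distance of o to the objects of the second-closest cluster, is:

$$s(o) = \begin{cases} 0 & \text{if } o \text{ is the only object in its cluster} \\[4pt] \dfrac{b(o) - a(o)}{\max\{a(o),\, b(o)\}} & \text{otherwise} \end{cases}$$

The silhouette coefficient of a whole clustering is the average of s(o) over all objects.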
How can one interpret the Silhouette Coefficient?
Values lie in [-1, 1]: values close to 1 indicate a good assignment, values around 0 indicate objects lying between two clusters, and negative values indicate objects that are likely assigned to the wrong cluster
What is expectation maximization?
Statistical approach for finding maximum likelihood estimates of parameters in probabilistic models
EM can be used as a clustering algorithm
What is the main idea behind using EM as a clustering method?
Define clusters as probability distributions -> each object has a certain probability of belonging to each cluster
Iteratively improve the parameters of each distribution until some quality/convergence criterion is met
What is a Gaussian distribution?
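No answer is given on this card; the standard d-dimensional Gaussian (normal) density with mean μ and covariance matrix Σ is:

$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\,\lvert\Sigma\rvert^{1/2}}\, \exp\!\Big(-\tfrac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\Big)$$

In one dimension this reduces to the familiar $\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\big(-\frac{(x-\mu)^2}{2\sigma^2}\big)$.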
How is EM clustering represented?
What is the goal of the maximum likelihood estimator (MLE)?
What is the mathematical approach to MLE?
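No formula is given on these cards; a sketch of the standard Gaussian-mixture log-likelihood that MLE maximizes (w_j are the mixture weights, an assumed notation):

$$\log \mathcal{L}(\theta \mid D) = \sum_{x \in D} \log \sum_{j=1}^{k} w_j\, \mathcal{N}(x \mid \mu_j, \Sigma_j), \qquad \theta^{*} = \operatorname*{arg\,max}_{\theta}\; \log \mathcal{L}(\theta \mid D)$$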
What are the problems of finding the optimal parameters for MLE?
Non-linear mutual dependencies
Optimizing the Gaussian of cluster j depends on all other Gaussians
There is no closed-form solution
Approximation through iterative optimization procedures
Break the mutual dependencies by optimizing the cluster means and s.d. independently
What does the iterative optimization of EM look like?
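A minimal sketch of the alternating E-step/M-step for a Gaussian mixture with diagonal covariances; the names and the diagonal simplification are illustrative, not from the source:

```python
import numpy as np

def em_gmm(X, k, n_iter=50, rng=None):
    """EM sketch for a diagonal-covariance Gaussian mixture (soft clustering)."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    means = X[rng.choice(n, size=k, replace=False)]     # initial means
    variances = np.tile(X.var(axis=0), (k, 1))          # initial per-dimension variances
    weights = np.full(k, 1.0 / k)                       # mixture weights
    for _ in range(n_iter):
        # E-step: responsibility of each cluster j for each object i
        log_p = np.stack([
            -0.5 * (((X - means[j]) ** 2 / variances[j]).sum(axis=1)
                    + np.log(2 * np.pi * variances[j]).sum())
            + np.log(weights[j])
            for j in range(k)], axis=1)                 # shape (n, k)
        log_p -= log_p.max(axis=1, keepdims=True)       # numerical stability
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the soft assignments
        nk = resp.sum(axis=0)                           # effective cluster sizes
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        variances = (resp.T @ (X ** 2)) / nk[:, None] - means ** 2 + 1e-9
    return resp, means, variances, weights
```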
What does EM clustering achieve?
EM obtains a soft clustering (each object belongs to each cluster with a certain probability) reflecting the uncertainty of the most appropriate assignment
It can be modified to obtain a partitioning variant: assign each object to the cluster to which it belongs with the highest probability
How does EM compare to k-Means?
Superior to k-Means for clusters of varying size or clusters having differing variances
more accurate data representation
What is the result and runtime of EM clustering dependent on?
Computational effort for t iterations = O(tnk), t is quite high in many cases
Strongly dependent on the initial assignment
do multiple random starts and choose the final estimate with highest likelihood
initialize with clustering algorithms
local maxima and initialization issues have been addressed in EM extensions
Also dependent on a proper choice of k
What is the k tradeoff problem in EM model selection?
If k is too high, the mixture may overfit the data
If k is too low, the mixture may not be flexible enough to approximate the data
How can EM model quality be measured?
What is the solution for model selection for determining the parameter k?
What is the idea behind the iterative mode search clustering algorithm?
Find modes in the point density
What is the algorithm behind the iterative mode search clustering algorithm?
Select a window size epsilon and a starting position m
Calculate the mean of all points inside the window W(m)
Shift the window to that position
Repeat until convergence
Apply iterative mode search for each data point. Group those that converge to the same mode (basin of attraction)
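A compact sketch of this procedure (flat window of radius epsilon; the tolerance for grouping converged modes is an assumption):

```python
import numpy as np

def mean_shift(X, eps, n_iter=100, merge_tol=None):
    """Iterative mode search with a flat window of radius eps (a sketch)."""
    merge_tol = eps / 2 if merge_tol is None else merge_tol  # tolerance for grouping modes
    modes = []
    for start in X:                                   # run one mode search per data point
        m = start.copy()
        for _ in range(n_iter):
            window = X[np.linalg.norm(X - m, axis=1) <= eps]  # points inside W(m)
            if len(window) == 0:
                break
            new_m = window.mean(axis=0)               # mean of the points in the window
            if np.linalg.norm(new_m - m) < 1e-6:      # converged to a mode
                break
            m = new_m                                 # shift the window to the mean
        modes.append(m)
    # Group points whose searches converged to (approximately) the same mode
    labels = -np.ones(len(X), dtype=int)
    centers = []
    for i, m in enumerate(modes):
        for j, c in enumerate(centers):
            if np.linalg.norm(m - c) <= merge_tol:
                labels[i] = j
                break
        else:
            centers.append(m)
            labels[i] = len(centers) - 1
    return labels, np.array(centers)
```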
What is the disadvantage of the mean shift clustering algorithm?
It has a relatively high complexity
What are the advantages of the mean shift clustering algorithm?
Clusters can have arbitrary shape/size -> no restriction to convex shapes
Number of clusters is determined automatically
Robust to outliers
Easy implementation and parallelization
Single parameter - epsilon
Support by spatial indexing to reduce complexity
What is the basic idea behind density-based clustering?
Clusters are dense regions in the data space, separated by regions of lower density
How is the local point density at a point q defined?
By 2 parameters:
epsilon: radius for the neighbourhood of point q
N_eps(q) = {p in D | dist(p,q) <= eps}
MinPts: minimum number of points in the given neighbourhood N_eps(q)
In density-based clustering what is meant by a core point, a density-reachable point, and density-connected?
Core point:
q is called a core point w.r.t eps, MinPts if |N_eps(q)| >= MinPts
Density-reachable:
p is directly density-reachable from q w.r.t. eps, MinPts if:
p is in N_eps(q) and
q is a core object w.r.t eps, MinPts
Density-connected:
p is density-connected to a point q w.r.t. eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t eps, MinPts
How are density-based clusters defined? (2 key concepts)
Maximality:
if q is in C and p is density-reachable from q then p is in C
Connectivity:
Each object in C is density-connected to all other objects in C
In density-based clustering, how are clusters and noise defined?
A cluster is a maximal set of density-connected points; noise is the set of all objects not belonging to any cluster
What is the DBSCAN algorithm and what does it stand for?
Density-Based Spatial Clustering of Applications with Noise
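A compact, unoptimized sketch of the DBSCAN idea using the definitions above (O(n²) distance matrix; names are illustrative):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """DBSCAN sketch: labels >= 0 are cluster ids, -1 marks noise."""
    n = len(X)
    # Precompute the eps-neighbourhoods N_eps(q) for all points (O(n^2))
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbours = [np.where(dist[i] <= eps)[0] for i in range(n)]
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0
    for i in range(n):
        if visited[i] or len(neighbours[i]) < min_pts:
            continue                                  # not an unvisited core point
        # i is a core point: expand a new cluster via density-reachability
        visited[i] = True
        labels[i] = cluster_id
        seeds = list(neighbours[i])
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster_id                # assign border/core point to cluster
            if not visited[j]:
                visited[j] = True
                if len(neighbours[j]) >= min_pts:     # j is also a core point
                    seeds.extend(neighbours[j])       # keep expanding the cluster
        cluster_id += 1
    return labels
```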
What is a heuristic used in determining eps and MinPts in DBSCAN?
Fix a value for MinPts (default 2d-1)
Compute the k-distance for all points p in D (distance from p to its k-th nearest neighbour), with k=MinPts
Create a k-distance plot, showing the k-distances of all objects, in decreasing order
The user selects a “border object” o from the MinPts-distance plot; eps is set to MinPts-distance(o)
How can databases support DBSCAN?
DBSCAN can be efficiently built on top of similarity join operations
For very large DB, efficient join techniques are available
What are the disadvantages of the DBSCAN algorithm?
Input parameters may be difficult to determine
In some situations very sensitive to the input parameter settings
What are the advantages of the DBSCAN algorithm?
Clusters can have arbitrary shape and size -> no restriction to convex shapes
Can separate clusters from surrounding noise
Good complexity: O(n^2) without index support
Can be supported by spatial index structures -> O(nlogn)
What are the general steps for spectral clustering?
Construct a similarity graph (e.g. eps-neighbourhood, k-nearest-neighbour, or fully connected graph)
Create the adjacency, degree, and Laplacian matrices
Compute the eigenvalues and eigenvectors of the Laplacian L and select the eigenvectors of the k smallest eigenvalues
Apply k-means to the rows of the matrix formed by these k eigenvectors
Map results back to original data
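A sketch of these steps for the unnormalized Laplacian and an eps-neighbourhood graph; it reuses the kmeans() sketch from the k-means card above (an assumption, any k-means routine works here):

```python
import numpy as np

def spectral_clustering(X, k, eps):
    """Unnormalized spectral clustering sketch on an eps-neighbourhood graph."""
    # 1. Similarity graph: adjacency A (here: eps-neighbourhood, unweighted)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    A = (dist <= eps).astype(float)
    np.fill_diagonal(A, 0.0)
    # 2. Degree matrix D and (unnormalized) graph Laplacian L = D - A
    D = np.diag(A.sum(axis=1))
    L = D - A
    # 3. Eigendecomposition; take eigenvectors of the k smallest eigenvalues
    eigvals, eigvecs = np.linalg.eigh(L)      # eigh returns eigenvalues in ascending order
    U = eigvecs[:, :k]                        # (n, k) spectral embedding
    # 4. Run k-means on the rows of U (kmeans() as defined in the earlier sketch)
    labels, _, _ = kmeans(U, k)
    # 5. Row i of U corresponds to data point i, so labels map back directly
    return labels
```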
What are the advantages of spectral clustering?
Easy to implement
No assumptions on the shape of the clusters
What are the disadvantages of spectral clustering?
May be sensitive to the construction of the similarity graph
Slow runtime - k smallest eigenvectors can be computed in O(n^3)
What are the basic notions behind hierarchical clustering?
Hierarchical decomposition of the dataset into a set of nested clusters
Results are represented by a dendrogram
Nodes in the dendrogram represent possible clusters
Dendrograms can be constructed bottom-up (agglomerative approach) or top-down (divisive approach)
How can one interpret a dendrogram?
The root represents the whole dataset
A leaf represents a single object in the data set
An internal node represents the union of all objects in its sub tree
The height of an internal node represents the distance between its two child nodes
What is the generic algorithm behind Agglomerative (Bottom up) Hierarchical Clustering?
Initially each object forms its own cluster
Consider all pairwise distances between initial clusters
Merge the closest pair (A,B) in the set of the current clusters into a new cluster C = A U B
Remove A and B from the set of current clusters, insert C
If the set of current clusters contains only C, STOP
Else: determine the distance between C and all other clusters in the set of current clusters and go to step 3
What is agglomerative hierarchical clustering and what does it require?
Agglomerative hierarchical clustering begins by treating each individual data point as its own cluster. It then iteratively merges the most similar clusters until all the data points belong to a single, large cluster.
It requires a distance function for clusters
What different types of distance functions are there for clusters?
Single-link: min distance between 2 elements in diff clusters
Complete-link: max distance between 2 elements in diff clusters
Average-link: the average distance between all points in diff clusters
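A naive sketch of the generic bottom-up algorithm with selectable linkage (single/complete/average); it returns the merge sequence rather than a full dendrogram object, and the names are illustrative:

```python
import numpy as np

def agglomerative(X, linkage="single"):
    """Naive agglomerative clustering sketch; returns the sequence of merges."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = {i: [i] for i in range(len(X))}     # each object starts as its own cluster
    merges = []
    agg = {"single": np.min, "complete": np.max, "average": np.mean}[linkage]
    while len(clusters) > 1:
        # Find the closest pair of current clusters under the chosen linkage
        keys = list(clusters)
        best = None
        for a_idx, a in enumerate(keys):
            for b in keys[a_idx + 1:]:
                d = agg(dist[np.ix_(clusters[a], clusters[b])])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((a, b, d))                   # record the merge and its height
        clusters[a] = clusters[a] + clusters[b]    # C = A ∪ B
        del clusters[b]
    return merges
```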
What is the general approach for the top down method of hierarchical clustering?
Initially all objects form one cluster
Repeat until all clusters are singletons:
Choose a cluster to split
Split the chosen cluster into two and replace it with the resulting sub-clusters
What is the DIANA algorithm?
Top down divisive clustering method
How does Agglomerative vs Divisive hierarchical clustering algorithms compare?
Both need n-1 steps
Agg has to consider (n choose 2) pairwise combinations in the first step
Div has to potentially consider 2^(n-1) -1 possibilities to split the data
Div more complex since it needs a second “flat” clustering algorithm to split
Agg decides based on local patterns
Div uses global data distribution
What is the main idea behind OPTICS, and what does it stand for?
Ordering Points To Identify the Clustering Structure
Density based hierarchical clustering algorithm
Main idea:
Visit each point
Always make a shortest jump
Maintain two data structures
seedList: stores all objects with shortest reachability distance seen so far (‘distance of a jump to that point’) in ascending order; organized as a heap
clusterOrder: resulting cluster order is constructed sequentially (order of objects + reachability distances)
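The “distance of a jump” is not spelled out on these cards; the standard OPTICS definitions are:

$$\text{core-dist}_{\varepsilon, MinPts}(o) = \begin{cases} \text{undefined} & \text{if } \lvert N_{\varepsilon}(o)\rvert < MinPts \\ MinPts\text{-distance}(o) & \text{otherwise} \end{cases}$$

$$\text{reach-dist}_{\varepsilon, MinPts}(p, o) = \max\{\text{core-dist}_{\varepsilon, MinPts}(o),\ dist(o, p)\}$$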
In OPTICS, after the seed list and the reach distances have been recorded, what happens?
Plot the points together with their reachability distances, in the order in which they were returned by the algorithm. This plot:
represents the density-based clustering structure
is easy to analyse
independent of the dimensionality of the data
significant valleys represent clusters
What effect does parameter sensitivity have on OPTICS?
Relative insensitivity to parameter settings
Good result if parameters are just “large enough”
What are the disadvantages of hierarchical clustering?
May not scale well
Runtime O(n^2 log n) for standard methods, O(n^2) for OPTICS
User has to choose final clustering
What are the advantages of hierarchical clustering?
Does not require the number of clusters to be known in advance
No (standard methods) or very robust parameters (OPTICS)
Computes a complete hierarchy of clusters
Good result visualizations integrated into the methods
A “flat” partition can be derived afterwards (via cut through dendrogram or reachability plot)
What is the general approach of the solution to the clustering evaluation problem?
Given a dataset D, a clustering C = C1,…,Ck and a ground truth G = G1,…Gl
Instead of comparing cluster and ground-truth labels directly, consider all pairs of objects and check whether they have the same label in G and whether they have the same label in C
In cluster evaluation how are TP, FP, TN, FN formalized?
TP: pairs of objects with the same label in C and the same label in G
FP: pairs with the same label in C but different labels in G
FN: pairs with different labels in C but the same label in G
TN: pairs with different labels in both C and G
In cluster evaluation, how are recall, precision, and F1 evaluated?
In cluster evaluation, how are the Rand Index, Adjusted Rand Index, and Jaccard Coefficient evaluated?
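The formulas are missing on these cards; the standard pair-counting definitions (using TP, FP, FN, TN as above) are given below. The adjusted Rand index additionally corrects RI for chance agreement; its longer formula is omitted here:

$$\text{Prec} = \frac{TP}{TP+FP},\quad \text{Rec} = \frac{TP}{TP+FN},\quad F_1 = \frac{2\cdot\text{Prec}\cdot\text{Rec}}{\text{Prec}+\text{Rec}},\quad RI = \frac{TP+TN}{TP+FP+FN+TN},\quad J = \frac{TP}{TP+FP+FN}$$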
In cluster evaluation, how is cohesion measured?
In cluster evaluation, how is separation measured?
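No formulas are given here; one common centroid-based instantiation (an assumption about which variant the course uses) measures cohesion as the within-cluster sum of squares and separation as the between-cluster sum of squares:

$$\text{Cohesion: } WSS = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2, \qquad \text{Separation: } BSS = \sum_{j=1}^{k} \lvert C_j\rvert\, \lVert \mu_j - \mu \rVert^2$$

where $\mu_j$ is the centroid of cluster $C_j$ and $\mu$ the overall mean of the data.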
What is the problem with the cohesion and separation internal measures?
They are suitable for convex clusters, but not for stretched clusters