What is the difference between clustering and classification?

In clustering the class labels of training data are unknown

Given a set of measurements, observations with the aim of establishing the existence of classes or clusters in the data

Classification is supervision for training data are accompanied by labels indicating the class of the observations

Classes are known in advance

Data is classified based on information extracted from the training set

What is clustering?

Grouping a set of data objects into clusters which are collections of data objects

Similar to one another within the same cluster

This is similar to the objects in other clusters

What are typical usages of clustering?

As a stand-alone tool to get insight into data distribution

As a preprocessing step for other algorithms

What are general applications of clustering?

Preprocessing - as a data reduction

Pattern recognition

Image processing

Spatial data analysis (creating thematic maps)

Business intelligence

Biology e.g. clustering of gene expression data

What are major clustering approaches?

Partitioning algorithms

Probabilistic Model-Based Clustering

Density-based (connectivity and density functions)

Hierarchical algorithms

What is the basic concept of partitioning algorithms?

What is the popular heurisitc method for partitioning?

Choose k representatives for clusters e.g. randomly

Improve these representatives randomly iteratively:

assign each object to the cluster it “fits best” in the current clustering

Compute new cluster representatives based on these assignments

Repeat until the change in the objective function from one iteration to the next drops below a threshold

In k-means what is each cluster represented by? How is this different to k-medoid?

By the center of the cluster in k-means

By one of its objects in k-medoid

What is the basic idea/ objective behind k-means Clustering?

Idea

Find a clustering such that the intra-cluster variation of each cluster is small and use the center of the centroid as representative

Objective

For a given k, form k groups so that the sum of the (squared) distances between the mean of the groups and their elements is minimal

How are objects and the measure for compactness in k-means Clustering mathematically expressed?

Objects p = (p1,…,pd) are points in a d-dimensional vector space

Why are efficient heuristic algorithms needed for k-means Clustering?

Optimal partitioning: argmin SSE(C) is NP-hard

What is Lloyd’s algorithm?

What is a Voronoi Diagram?

For a given set of points P = p1,…,pk, a Voronoi diagram partitions the data space into Voronoi cells, one cell per point

The cell of a point p covers all data points in the data space for which p is the nearest neighbor among the points from P

What are properties of Voronoi cells?

The Voronoi cells of two neighbouring points pi, pj are separated by the perpendicular hyperplane between pi and pj

Voronoi cells are intersections of half spaces and this convex regions

What is the strength of k-Means?

Relatively efficient: O(tkn

n: # objects

k: # clusters

t # iterations

Easy implementation

What are the weaknesses of k-means?

Applicable only when the mean is defined

Need to specify k, the number of clusters, in advance

Sensitive to noisy data and outliers

Clusters are forced to convex space partitions (Voronoi cells)

Result and runtime strongly depend on the intial partition; often terminates at a local optmum: however methods for a good intialization exist

What are alternatives to mean representatives of k-means? Why have alternatives?

Median: (artificial) representative object “in the middle”

Mode: value that appears most often

Medoid: representative object “in the middle”

Find k representatives so that the sum of total distances (TD) between objects and their closest representative is minimal (more robust against outliers)

What is the idea behind k-median?

If there is an ordering on the data use median instead of mean

Compute median separately per dimension

What is the naive partitioning clustering method?

Choose sample A of the dataset

Cluster A and use centers as initialization

What is the k-means++ initialization clustering method?

Select first center uniformly at random

Choose next point with probability proportional to the squared distance to the nearest center already chosen

Repeat until k centers have been selected

Guarantees an approximation ratio of O(logk)

What is the general method of initializing clustering method partitioning?

Repeat with different initial centers and choose result with lowest clustering error

What is the Silhoutte-Coefficient?

Quality measure for k-means or k-medoid clusterings that is not monotonic over k

Monotonic means that an increase in k does not guarantee an increase or decrease in the quality of the clustering

What is the basic idea behind the Silhouette Coefficient?

How good is the clustering = how appropriate is the mapping of objects to clusters

Elements in cluster should be “similar to their representative

measure the average distance of objects to their representative

Elements in different clusters should be “dissimilar

measure the average distance of objects to alternative clusters (i.e. second close cluster)

Silhoutte Coefficient mathematical expression

How can one interpret the Silhoutte Coefficient?

What is expectation maximization?

Statistical approach for finding maximum likelihood estimates of parameters in probabilistic models

EM can be used as a clustering algorithm

What is the main idea behind using EM as a clustering method?

Define clusters as probability distributions -> each object has a certain probability of belonging to each cluster

Iteratively improve the parameters of each distribution until some qiu

What is a Gaussian distribution?

How is EM clustering represented?

What is the goal of the maximum likelihood estimator (MLE)?

What is the mathematical approach to MLE?

What are the problems of finding the optimal parameters for MLE?

Non-linear mutual dependencies

Optimizing the Gaussian of cluster j depends on all other Gaussians

There is no closed-form solution

Approximation through iterative optimization procedures

Break the mutual dependencies by optimizing the cluster means and s.d. independently

What does the iterative optimization of EM look like?

What does EM clustering achieve?

EM obtains a soft clustering (each object belongs to each cluster with a certain probability) reflecting the uncertainty of the most appropriate assignment

It is modified to obtain a partitioning variant: assign each object to the cluster to which is belongs with the highest probability

How does EM compare to k-Means?

Superior to k-Means for clusters of varying size or clusters having differing variances

more accurate data representation

What is the result and runtime of EM clustering dependent on?

Computational effort for t iterations = O(tnk), t is quite high in many cases

Strongly dependent on the initial assignment

do multiple random starts and choose the final estimate with highest likelihood

initialize with clustering algorithms

local maxima and initialization issues have been addressed in EM extensions

Also depend on a proper choice of k

What is the k tradeoff problem in EM model selection?

If k is too high, the mixture may overfit the data

If k is too love, the mixture may not be flexible enough to approximate the data

How can EM model quality be measured?

What is the solution for model selection for determining the parameter k?

What is the idea behind the iterative mode search clustering algorithm?

Find modes in the point density

What is the algorithm behind the iterative mode search clustering algorithm?

Select a window size epsilon and a starting position m

Calculate the mean of all points inside the window W(m)

Shift the window to that position

Repeat until convergence

Apply iterative mode search for each data point. Group those that converge to the same mode (basin of attraction)

What is the disadvantage of the mean shift clustering algorithm?

It has a relatively high complexity

What are the advantages of the mean shift clustering algorithm?

Clusters can have arbitrary shape/size -> no restriction to convex shapes

Number of clusters is determined automatically

Robust to outliers

Easy implementation and parallelization

Single parameter - epsilon

Suppose by spatial indexing to reduce complexity

What is the basic idea behind density-based clustering?

Clusters are dense regions in the data space, separated by regions of lower density

How is the local point density at a point q defined?

By 2 parameters:

epsilon radius for the neighbourhood of point q

N_eps(q) = {p in D | dist(p,q) <= eps}

MinPts: minimum number of points in the given neighbourhood N_eps(q)

In density-based clustering what is meant by a core point, a density-reachable point, and density-connected?

Core point:

q is called a core point w.r.t eps, MinPts if |N_eps(q)| >= MinPts

Density-reachable:

p is directly density-reachable from q w.r.t. eps, MinPts if:

p is in N_eps(q) and

q is a core object w.r.t eps, MinPts

Density-connected:

p is density-connected to a point q w.r.t. eps, MinPts if there is a point o such that both, p and q are density-reachable from o w.r.t eps, MinPts

How are density-based clusters defined? (2 key concepts)

Maximality:

if q is in C and p is density-reachable from q then p is in C

Connectivity:

Each object in C is density-connected to all other objects in C

In density-based clustering, how are clusters and noise defined?

What is the DBSCAN algorithm and what does it stand for?

Density-Based Spatial Clustering of Applications with Noise

What is a heuristic used in determining eps and MinPts in DBSCAN?

Fix a value for MinPts (default 2d-1)

Compute the k-distance for all points p in D (distance from p to its nearest k-neighbour), with k=MinPts

Create a k-distance plot, showing the k-distances of all objects, in decreasing order

The user selects “border object” o from the MinPts-distance plot: eps is set to MinPts-distance(o)

How can databases support DBSCAN?

DBSCAN can be efficiently built on top of similarity join operations

For very large DB, efficient join techniques are available

What are the disadvantages of the DBSCAN algorithm?

Input parameters may be difficult to determine

In some situations very sensitive to input parameters setting

What are the advantages of the DBSCAN algorithm?

Clusters can have arbitrary shape and size -> no restriction to convex shapes

Can separate clusters from surrounding noise

Good complexity O(n^2)

Can be supported by spatial index structures -> O(nlogn)

What are the general steps for spectral clustering?

Construct fully-connected KNN graph with eps neigbourhood

Created adjacency, degree, laplacian matrix

Compute eigenvalues, eigenvalues of L matrix, select top eigenvectors

Apply k-means on k eigenvectors

Map results back to original data

What are the advantages of spectral clustering?

Easy to implement (cap)

No assumptions on the shape of the clusters

What are the disadvantages of spectral clustering?

May be sensitive to the construction of the similarity graph

Slow runtime - k smallest eigenvectors can be computed in O(n^3)

What are the basic notions behind hierarchical clustering?

Hierarchical decomposition of the dataset into a set of nested clusters, given a set of nested clusters

Results represented by dendrogram

Nodes in the dendrogram represent possible clusters

Dendrograms can be constructed bottom up (agglomerative approach) or top down (divisive aproach)

How can one interpret a dendrogram?

The root represents the whole dataset

A lead represents a single object in the data set

An internal node represents the union of all objects in its sub tree

The height of an internal node represents the distance between its two child nodes

What is the generic algorithm behind Agglomerative (Bottom up) Hierarchical Clustering?

Initially each object forms its own cluster

Consider all pairwise distances between initial clusters

Merge the closest pair (A,B) in the set of the current clusters into a new cluster C = A U B

Remove A and B from the set of current clusrters, insert C

If the set of current clusters contains only C, STOP

Else: determine the distance between C and all other clusters in the set of current clusters and go to step 3

What is agglomerative hierarchical clustering and what does it require?

Agglomerative hierarchical clustering begins by treating each individual data point as its own cluster. It then iteratively merges the most similar clusters until all the data points belong to a single, large cluster.

It requires a distance function for clusters

What different types of distance functions are there for clusters?

Single-link: min distance between 2 elements in diff clusters

Complete-link: max distance between 2 elements in diff clusters

Average-link: the average distance between all points in diff clusters

What is the general approach for the top down method of hierarchical clustering?

Initially all objects form one cluster

Repeat until all clusters are singletons:

Choose a cluster to split

Replace the chose cluster with the sub-clusters and split into two

What is the DIANA algorithm?

Top down divisive clustering method

How does Agglomerative vs Divisive hierarchical clustering algorithms compare?

Both need n-1 steps

Agg has to consider (n,2) combinations in the first step

Div has to potentially consider 2^(n-1) -1 possibilities to split the data

Div more complex since it needs a second “flat” clustering algorithm to split

Agg decides based on local patterns

Div uses global data distribution

What is the main idea behind OPTICS, and what does it stand for?

Ordering Points To Identify the Clustering Structure

Density based hierarchical clustering algorithm

Main idea:

Visit each point

Always make a shortest jump

Maintain two data structures

seedList: stores all objects with shortest reachability distance seen so far (‘distance of a jump to that point’) in ascending order; organized as a heap

clusterOrder: resulting cluster order is constructed sequentially (order of objects + reachability distances)

In OPTICS, after the seed list and the reach distances have been recorded, what happens?

Plot the points together with their reachability distances. Use the order in whih they were returned by the algorithm, this plot:

represents the density-based clustering structure

is easy to analyse

independent of the dimensionality of the data

significant valleys represent clusters

What effect does parameter sensitivity have on OPTICS?

Relative insensitivity to parameter settings

Good result if parameters are just “large enough”

What are the disadvantages of hierarchical clustering?

May not scale well

Runtime O(n^2logn^2) for standard methods, O(n^2) for OPTICS

User has to choose final clustering

What are the advantages of hierarchical clustering?

Does not require the number of clusters to be known in advance

No (standard methods) or very robust parameters (OPTICS)

Computes a complete hierarchy of clusters

Good result visualizations integrated into the methods

A “flat” partition can be derived afterwards (via cut through dendrogram or reachability plot)

What is the general approach of the solution to the clustering evaluation problem?

Given a dataset D, a clustering C = C1,…,Ck and a ground truth G = G1,…Gl

Instead of comparing cluster and ground truth labels directly, consider all pairs of objects. Check whether they have the same label in G and if they have the same in C.

In cluster evaluation how are TP, FP, TN, FN formalized?

In cluster evaluation, how are recall, precision, and F1 evaluated?

In cluster evaluation, how are the Rand Index, Adjusted Rand Index, and Jaccard Coefficient evaluated?

In cluster evaluation, how is cohesion measured?

In cluster evaluation, how is separation measured?

What is the problem with the cohesion and separation internal measures?

They are suitable for convex clusters, but not for stretched clusters

Last changed22 days ago