What is the difference between clustering and classification?
In clustering the class labels of training data are unknown
Given a set of measurements/observations, the aim is to establish the existence of classes or clusters in the data
Classification is supervised: the training data are accompanied by labels indicating the class of the observations
Classes are known in advance
Data is classified based on information extracted from the training set
What is clustering?
Grouping a set of data objects into clusters which are collections of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
What are typical usages of clustering?
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
What are general applications of clustering?
Preprocessing, e.g. as a data reduction step
Pattern recognition
Image processing
Spatial data analysis (creating thematic maps)
Business intelligence
Biology e.g. clustering of gene expression data
What are major clustering approaches?
Partitioning algorithms
Probabilistic Model-Based Clustering
Density-based (connectivity and density functions)
Hierarchical algorithms
What is the basic concept of partitioning algorithms?
Partition the data into k clusters such that an objective function (e.g. the sum of squared distances to the cluster representatives) is minimized
What is the popular heuristic method for partitioning?
Choose k representatives for clusters e.g. randomly
Improve these representatives iteratively:
assign each object to the cluster it “fits best” in the current clustering
Compute new cluster representatives based on these assignments
Repeat until the change in the objective function from one iteration to the next drops below a threshold
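A minimal sketch of this iterative heuristic (Lloyd's algorithm) in Python with NumPy; the function name, toy data handling, and tolerance are illustrative assumptions, not part of the course material:

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Iteratively refine k cluster centers on data X (n x d)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # choose k representatives randomly
    prev_sse = np.inf
    for _ in range(max_iter):
        # Assignment step: each object goes to the cluster it "fits best" (closest center)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each center as the mean of its assigned objects
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
        # Objective function: sum of squared distances to the assigned centers (SSE)
        sse = ((X - centers[labels]) ** 2).sum()
        if prev_sse - sse < tol:  # stop when the objective barely changes
            break
        prev_sse = sse
    return centers, labels, sse
```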
In k-means what is each cluster represented by? How is this different to k-medoid?
By the center of the cluster in k-means
By one of its objects in k-medoid
What is the basic idea/ objective behind k-means Clustering?
Idea
Find a clustering such that the intra-cluster variation of each cluster is small and use the centroid (center of the cluster) as representative
Objective
For a given k, form k groups so that the sum of the (squared) distances between the mean of the groups and their elements is minimal
How are objects and the measure for compactness in k-means Clustering mathematically expressed?
Objects p = (p1,…,pd) are points in a d-dimensional vector space
The centroid mu_C of a cluster C is the mean vector of all points in C
Measure for the compactness of a cluster C_i: SSE(C_i) = sum_{p in C_i} dist(p, mu_{C_i})^2
Measure for the compactness of a clustering C: SSE(C) = sum_{C_i in C} SSE(C_i)
Why are efficient heuristic algorithms needed for k-means Clustering?
Optimal partitioning: argmin SSE(C) is NP-hard
What is Lloyd’s algorithm?
The standard iterative heuristic for k-means described above: alternately assign each object to its closest center and recompute each center as the mean of its assigned objects, until the change in SSE drops below a threshold
What is a Voronoi Diagram?
For a given set of points P = p1,…,pk, a Voronoi diagram partitions the data space into Voronoi cells, one cell per point
The cell of a point p covers all data points in the data space for which p is the nearest neighbor among the points from P
What are properties of Voronoi cells?
The Voronoi cells of two neighbouring points pi, pj are separated by the perpendicular hyperplane between pi and pj
Voronoi cells are intersections of half spaces and thus convex regions
What is the strength of k-Means?
Relatively efficient: O(tkn), where
n: # objects
k: # clusters
t: # iterations
Easy implementation
What are the weaknesses of k-means?
Applicable only when the mean is defined
Need to specify k, the number of clusters, in advance
Sensitive to noisy data and outliers
Clusters are forced to convex space partitions (Voronoi cells)
Result and runtime strongly depend on the initial partition; often terminates at a local optimum; however, methods for a good initialization exist
What are alternatives to mean representatives of k-means? Why have alternatives?
Median: (artificial) representative object “in the middle”
Mode: value that appears most often
Medoid: representative object “in the middle”
Find k representatives so that the sum of total distances (TD) between objects and their closest representative is minimal (more robust against outliers)
What is the idea behind k-median?
If there is an ordering on the data use median instead of mean
Compute median separately per dimension
What is the naive sampling-based initialization method for partitioning clustering?
Choose sample A of the dataset
Cluster A and use centers as initialization
What is the k-means++ initialization clustering method?
Select first center uniformly at random
Choose next point with probability proportional to the squared distance to the nearest center already chosen
Repeat until k centers have been selected
Guarantees an approximation ratio of O(log k)
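A small sketch of this seeding strategy, assuming a NumPy array X of shape n x d; the function name and random seed handling are illustrative:

```python
import numpy as np

def kmeans_plusplus_init(X, k, seed=0):
    """Pick k initial centers: first uniformly, then proportional to squared distance."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]          # first center uniformly at random
    for _ in range(k - 1):
        # squared distance of every point to its nearest already-chosen center
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()                    # sampling probability proportional to d^2
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```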
What is the general method of initializing clustering method partitioning?
Repeat with different initial centers and choose result with lowest clustering error
What is the Silhouette Coefficient?
Quality measure for k-means or k-medoid clusterings that is not monotonic over k
Unlike measures such as SSE, which automatically improve as k increases, the silhouette coefficient does not, so it can be used to compare clusterings with different values of k
What is the basic idea behind the Silhouette Coefficient?
How good is the clustering = how appropriate is the mapping of objects to clusters
Elements within a cluster should be “similar” to their representative
measure the average distance of objects to their representative
Elements in different clusters should be “dissimilar”
measure the average distance of objects to alternative clusters (i.e. the second-closest cluster)
Silhouette Coefficient mathematical expression
a(o): average distance between object o and the objects in its own cluster
b(o): average distance between object o and the objects in its second-closest cluster
s(o) = (b(o) - a(o)) / max{a(o), b(o)} (with s(o) = 0 if o is the only object in its cluster)
Silhouette coefficient of a clustering: the average of s(o) over all objects
How can one interpret the Silhouette Coefficient?
s(o) ranges from -1 to 1: values near 1 mean a(o) << b(o), i.e. a good assignment; values near 0 mean the object lies between two clusters; negative values indicate a bad assignment
Rule of thumb for the clustering average: > 0.7 strong structure, 0.5–0.7 medium structure, 0.25–0.5 weak structure, <= 0.25 no substantial structure
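A short sketch computing the coefficient with scikit-learn's silhouette_samples/silhouette_score helpers; the toy data and the k-means call are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])  # toy data

for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    s = silhouette_samples(X, labels)                      # s(o) for every object
    print(k, silhouette_score(X, labels), s.min(), s.max())  # average s(o) and its range
```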
What is expectation maximization?
Statistical approach for finding maximum likelihood estimates of parameters in probabilistic models
EM can be used as a clustering algorithm
What is the main idea behind using EM as a clustering method?
Define clusters as probability distributions -> each object has a certain probability of belonging to each cluster
Iteratively improve the parameters of each distribution until a convergence criterion is met (e.g. the log-likelihood no longer improves)
What is a Gaussian distribution?
A probability distribution with density p(x | mu, sigma^2) = 1/sqrt(2*pi*sigma^2) * exp(-(x - mu)^2 / (2*sigma^2)) in one dimension (parameters: mean mu and variance sigma^2); in d dimensions the parameters are a mean vector mu and a covariance matrix Sigma
How is EM clustering represented?
As a mixture model: each cluster is represented by a probability distribution, typically a Gaussian with mean mu_j and covariance Sigma_j, together with a weight (mixing proportion) per cluster
What is the goal of the maximum likelihood estimator (MLE)?
Find the model parameters (means, covariances, cluster weights) that maximize the probability (likelihood) of the observed data under the mixture model
What is the mathematical approach to MLE?
Maximize the log-likelihood of the data: choose the parameters theta that maximize sum over all objects x of log p(x | theta); the logarithm turns the product of probabilities into a sum, which is easier to optimize
What are the problems of finding the optimal parameters for MLE?
Non-linear mutual dependencies
Optimizing the Gaussian of cluster j depends on all other Gaussians
There is no closed-form solution
Approximation through iterative optimization procedures
Break the mutual dependencies by optimizing the cluster means and s.d. independently
What does the iterative optimization of EM look like?
E-step (expectation): given the current parameters, compute for each object the probability that it belongs to each cluster
M-step (maximization): given these assignment probabilities, re-estimate the parameters (mean, covariance, weight) of each cluster
Repeat both steps until the log-likelihood (or the parameters) converge
What does EM clustering achieve?
EM obtains a soft clustering (each object belongs to each cluster with a certain probability) reflecting the uncertainty of the most appropriate assignment
It can be modified to obtain a partitioning variant: assign each object to the cluster to which it belongs with the highest probability
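A small sketch of EM-based soft clustering using scikit-learn's GaussianMixture (a standard EM implementation for Gaussian mixtures); the toy data and parameter choices are illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 2, (50, 2))])  # toy data

gmm = GaussianMixture(n_components=2, covariance_type='full', n_init=5, random_state=0)
gmm.fit(X)                       # EM: alternate E-step and M-step until convergence

soft = gmm.predict_proba(X)      # soft clustering: P(cluster j | object)
hard = gmm.predict(X)            # partitioning variant: most probable cluster
print(gmm.means_, gmm.weights_, soft[:3], hard[:3])
```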
How does EM compare to k-Means?
Superior to k-Means for clusters of varying size or clusters having differing variances
more accurate data representation
What is the result and runtime of EM clustering dependent on?
Computational effort for t iterations = O(tnk), t is quite high in many cases
Strongly dependent on the initial assignment
do multiple random starts and choose the final estimate with highest likelihood
initialize with clustering algorithms
local maxima and initialization issues have been addressed in EM extensions
Also depend on a proper choice of k
What is the k tradeoff problem in EM model selection?
If k is too high, the mixture may overfit the data
If k is too low, the mixture may not be flexible enough to approximate the data well
How can EM model quality be measured?
By the (log-)likelihood of the data under the fitted mixture model: the higher the likelihood, the better the model fits the data
What is the solution for model selection for determining the parameter k?
Fit models for different values of k and select the k that optimizes a quality criterion which trades the likelihood off against the model complexity (i.e. penalizes a large number of parameters), avoiding both overfitting and underfitting
What is the idea behind the iterative mode search clustering algorithm?
Find modes in the point density
What is the algorithm behind the iterative mode search clustering algorithm?
Select a window size epsilon and a starting position m
Calculate the mean of all points inside the window W(m)
Shift the window to that position
Repeat until convergence
Apply iterative mode search for each data point. Group those that converge to the same mode (basin of attraction)
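A compact sketch of this iterative mode search (mean shift) with a flat window of radius epsilon; the function names and the grouping tolerance of eps/2 for merging modes are illustrative assumptions:

```python
import numpy as np

def mode_search(X, start, eps, max_iter=100, tol=1e-5):
    """Shift a window of radius eps to the mean of the points it contains until convergence."""
    m = np.asarray(start, dtype=float)
    for _ in range(max_iter):
        window = X[np.linalg.norm(X - m, axis=1) <= eps]   # points inside W(m)
        new_m = window.mean(axis=0)                        # mean of the window
        if np.linalg.norm(new_m - m) < tol:                # converged to a mode
            break
        m = new_m                                          # shift the window to the mean
    return m

def mean_shift_clustering(X, eps):
    """Run mode search from every point and group points converging to the same mode."""
    modes = np.array([mode_search(X, x, eps) for x in X])
    labels, found = np.full(len(X), -1), []
    for i, m in enumerate(modes):
        for j, f in enumerate(found):
            if np.linalg.norm(m - f) <= eps / 2:           # same basin of attraction
                labels[i] = j
                break
        else:
            found.append(m)
            labels[i] = len(found) - 1
    return labels
```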
What is the disadvantage of the mean shift clustering algorithm?
It has a relatively high complexity
What are the advantages of the mean shift clustering algorithm?
Clusters can have arbitrary shape/size -> no restriction to convex shapes
Number of clusters is determined automatically
Robust to outliers
Easy implementation and parallelization
Single parameter - epsilon
Can be supported by spatial index structures to reduce the complexity
What is the basic idea behind density-based clustering?
Clusters are dense regions in the data space, separated by regions of lower density
How is the local point density at a point q defined?
By 2 parameters:
epsilon radius for the neighbourhood of point q
N_eps(q) = {p in D | dist(p,q) <= eps}
MinPts: minimum number of points in the given neighbourhood N_eps(q)
In density-based clustering what is meant by a core point, a density-reachable point, and density-connected?
Core point:
q is called a core point w.r.t eps, MinPts if |N_eps(q)| >= MinPts
Density-reachable:
p is directly density-reachable from q w.r.t. eps, MinPts if:
p is in N_eps(q) and
q is a core object w.r.t eps, MinPts
Density-connected:
p is density-connected to a point q w.r.t. eps, MinPts if there is a point o such that both, p and q are density-reachable from o w.r.t eps, MinPts
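These definitions translate directly into small predicates; a sketch assuming a NumPy array D of data points and Euclidean distance, with illustrative function names:

```python
import numpy as np

def eps_neighborhood(D, q, eps):
    """N_eps(q): all points of D within distance eps of q."""
    return D[np.linalg.norm(D - q, axis=1) <= eps]

def is_core_point(D, q, eps, min_pts):
    """q is a core point iff its eps-neighborhood contains at least MinPts points."""
    return len(eps_neighborhood(D, q, eps)) >= min_pts

def directly_density_reachable(D, p, q, eps, min_pts):
    """p is directly density-reachable from q iff p lies in N_eps(q) and q is a core point."""
    return np.linalg.norm(p - q) <= eps and is_core_point(D, q, eps, min_pts)
```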
How are density-based clusters defined? (2 key concepts)
Maximality:
if q is in C and p is density-reachable from q then p is in C
Connectivity:
Each object in C is density-connected to all other objects in C
In density-based clustering, how are clusters and noise defined?
A cluster is a non-empty subset C of the dataset D that satisfies maximality and connectivity (as defined above)
Noise is the set of objects in D that do not belong to any cluster
What is the DBSCAN algorithm and what does it stand for?
Density-Based Spatial Clustering of Applications with Noise
What is a heuristic used in determining eps and MinPts in DBSCAN?
Fix a value for MinPts (default: 2·d − 1, where d is the dimensionality of the data space)
Compute the k-distance for all points p in D (distance from p to its k-th nearest neighbour), with k = MinPts
Create a k-distance plot, showing the k-distances of all objects, sorted in decreasing order
The user selects a “border object” o from the MinPts-distance plot; eps is set to MinPts-distance(o)
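A sketch of this heuristic followed by a DBSCAN run, using scikit-learn's NearestNeighbors and DBSCAN classes; the toy data and the eps value read off the plot are illustrative assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(3, 0.3, (100, 2)),
               rng.uniform(-2, 5, (20, 2))])                 # two clusters plus noise

min_pts = 2 * X.shape[1] - 1                                 # heuristic: 2*d - 1
# +1 neighbour because the query point itself is returned at distance 0
dists, _ = NearestNeighbors(n_neighbors=min_pts + 1).fit(X).kneighbors(X)
k_dist = np.sort(dists[:, -1])[::-1]                         # k-distances, decreasing
plt.plot(k_dist); plt.ylabel(f'{min_pts}-distance'); plt.show()  # pick eps at the "knee"

labels = DBSCAN(eps=0.4, min_samples=min_pts).fit_predict(X)  # label -1 marks noise
print(np.unique(labels, return_counts=True))
```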
How can databases support DBSCAN?
DBSCAN can be efficiently built on top of similarity join operations
For very large DB, efficient join techniques are available
What are the disadvantages of the DBSCAN algorithm?
Input parameters may be difficult to determine
In some situations very sensitive to the input parameter settings
What are the advantages of the DBSCAN algorithm?
Clusters can have arbitrary shape and size -> no restriction to convex shapes
Can separate clusters from surrounding noise
Good complexity O(n^2)
Can be supported by spatial index structures -> O(nlogn)
What are the general steps for spectral clustering?
Construct a similarity graph, e.g. a fully connected graph, a k-nearest-neighbour graph, or an eps-neighbourhood graph
Create the adjacency matrix W, the degree matrix D, and the graph Laplacian L = D - W
Compute the eigenvalues and eigenvectors of L and select the eigenvectors belonging to the k smallest eigenvalues
Apply k-means to the rows of the matrix formed by these k eigenvectors
Map results back to original data
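A compact sketch of these steps using an unnormalized graph Laplacian; the Gaussian similarity, the eps threshold, and the function name are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0, eps=1.5):
    """Cluster X via the k smallest eigenvectors of the graph Laplacian."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # pairwise squared distances
    W = np.exp(-d2 / (2 * sigma ** 2))                        # Gaussian similarities
    W[np.sqrt(d2) > eps] = 0.0                                # keep only eps-neighbourhood edges
    np.fill_diagonal(W, 0.0)
    D = np.diag(W.sum(axis=1))                                # degree matrix
    L = D - W                                                 # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(L)                               # eigenvalues in ascending order
    U = vecs[:, :k]                                           # k smallest eigenvectors as columns
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)     # k-means on the embedded rows
```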
What are the advantages of spectral clustering?
Easy to implement
No assumptions on the shape of the clusters
What are the disadvantages of spectral clustering?
May be sensitive to the construction of the similarity graph
Slow runtime: computing the k smallest eigenvectors can take O(n^3)
What are the basic notions behind hierarchical clustering?
Hierarchical decomposition of the dataset into a set of nested clusters
Results represented by dendrogram
Nodes in the dendrogram represent possible clusters
Dendrograms can be constructed bottom up (agglomerative approach) or top down (divisive approach)
How can one interpret a dendrogram?
The root represents the whole dataset
A leaf represents a single object in the data set
An internal node represents the union of all objects in its sub tree
The height of an internal node represents the distance between its two child nodes
What is the generic algorithm behind Agglomerative (Bottom up) Hierarchical Clustering?
Initially each object forms its own cluster
Consider all pairwise distances between initial clusters
Merge the closest pair (A,B) in the set of the current clusters into a new cluster C = A U B
Remove A and B from the set of current clusters, insert C
If the set of current clusters contains only C, STOP
Else: determine the distance between C and all other clusters in the set of current clusters and go to step 3
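A small sketch using SciPy's standard hierarchical clustering helpers (linkage, dendrogram, fcluster); the toy data and the single-link choice are illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(4, 0.5, (20, 2))])  # toy data

Z = linkage(X, method='single')        # agglomerative merges; also 'complete', 'average'
dendrogram(Z)                          # nodes = possible clusters, height = merge distance
plt.show()

labels = fcluster(Z, t=2, criterion='maxclust')   # derive a flat partition with 2 clusters
print(labels)
```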
What is agglomerative hierarchical clustering and what does it require?
Agglomerative hierarchical clustering begins by treating each individual data point as its own cluster. It then iteratively merges the most similar clusters until all the data points belong to a single, large cluster.
It requires a distance function for clusters
What different types of distance functions are there for clusters?
Single-link: min distance between 2 elements in diff clusters
Complete-link: max distance between 2 elements in diff clusters
Average-link: the average distance between all points in diff clusters
What is the general approach for the top down method of hierarchical clustering?
Initially all objects form one cluster
Repeat until all clusters are singletons:
Choose a cluster to split
Split the chosen cluster into two sub-clusters and replace it with them in the set of current clusters
What is the DIANA algorithm?
DIvisive ANAlysis: a top-down divisive hierarchical clustering method
How does Agglomerative vs Divisive hierarchical clustering algorithms compare?
Both need n-1 steps
Agg has to consider n(n-1)/2 pairwise combinations in the first step
Div has to potentially consider 2^(n-1) -1 possibilities to split the data
Div more complex since it needs a second “flat” clustering algorithm to split
Agg decides based on local patterns
Div uses global data distribution
What is the main idea behind OPTICS, and what does it stand for?
Ordering Points To Identify the Clustering Structure
Density based hierarchical clustering algorithm
Main idea:
Visit each point
Always make the shortest jump
Maintain two data structures
seedList: stores all objects with shortest reachability distance seen so far (‘distance of a jump to that point’) in ascending order; organized as a heap
clusterOrder: resulting cluster order is constructed sequentially (order of objects + reachability distances)
In OPTICS, after the seed list and the reach distances have been recorded, what happens?
Plot the points together with their reachability distances, using the order in which they were returned by the algorithm; this plot:
represents the density-based clustering structure
is easy to analyse
independent of the dimensionality of the data
significant valleys represent clusters
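A sketch of producing such a reachability plot with scikit-learn's OPTICS implementation, whose ordering_ and reachability_ attributes correspond to the cluster order and reachability distances; the toy data and min_samples value are illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import OPTICS

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(3, 0.6, (150, 2)),
               rng.uniform(-2, 5, (30, 2))])          # clusters of different density + noise

optics = OPTICS(min_samples=5).fit(X)
reach = optics.reachability_[optics.ordering_]        # reachability distances in cluster order
plt.plot(reach)                                       # significant valleys represent clusters
plt.ylabel('reachability distance')
plt.show()
```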
What effect does parameter sensitivity have on OPTICS?
Relative insensitivity to parameter settings
Good result if parameters are just “large enough”
What are the disadvantages of hierarchical clustering?
May not scale well
Runtime O(n^2 log n) for standard methods, O(n^2) for OPTICS
User has to choose final clustering
What are the advantages of hierarchical clustering?
Does not require the number of clusters to be known in advance
No (standard methods) or very robust parameters (OPTICS)
Computes a complete hierarchy of clusters
Good result visualizations integrated into the methods
A “flat” partition can be derived afterwards (via cut through dendrogram or reachability plot)
What is the general approach of the solution to the clustering evaluation problem?
Given a dataset D, a clustering C = C1,…,Ck and a ground truth G = G1,…,Gl
Instead of comparing cluster and ground-truth labels directly, consider all pairs of objects and check whether they have the same label in G and whether they have the same label in C
In cluster evaluation how are TP, FP, TN, FN formalized?
TP: pairs of objects with the same label in G and the same cluster in C
FP: pairs with different labels in G but the same cluster in C
FN: pairs with the same label in G but different clusters in C
TN: pairs with different labels in G and different clusters in C
In cluster evaluation, how are recall, precision, and F1 evaluated?
precision = TP / (TP + FP)
recall = TP / (TP + FN)
F1 = 2 * precision * recall / (precision + recall)
In cluster evaluation, how are the Rand Index, Adjusted Rand Index, and Jaccard Coefficient evaluated?
Rand Index = (TP + TN) / (TP + FP + FN + TN)
Adjusted Rand Index: the Rand Index corrected for chance agreement, i.e. (Index - ExpectedIndex) / (MaxIndex - ExpectedIndex)
Jaccard Coefficient = TP / (TP + FP + FN)
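A minimal sketch of this pair-counting evaluation; the helper name is illustrative, and scikit-learn's adjusted_rand_score is used for the chance-corrected variant:

```python
from itertools import combinations
from sklearn.metrics import adjusted_rand_score

def pair_counting_scores(ground_truth, clustering):
    """Count object pairs by agreement in G and C, then derive pair-based measures."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(ground_truth)), 2):
        same_g = ground_truth[i] == ground_truth[j]
        same_c = clustering[i] == clustering[j]
        if same_g and same_c:       tp += 1
        elif not same_g and same_c: fp += 1
        elif same_g and not same_c: fn += 1
        else:                       tn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        'precision': precision,
        'recall': recall,
        'f1': 2 * precision * recall / (precision + recall),
        'rand': (tp + tn) / (tp + fp + fn + tn),
        'jaccard': tp / (tp + fp + fn),
        'adjusted_rand': adjusted_rand_score(ground_truth, clustering),
    }

print(pair_counting_scores([0, 0, 1, 1, 2, 2], [0, 0, 1, 2, 2, 2]))
```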
In cluster evaluation, how is cohesion measured?
Cohesion measures how closely related the objects within a cluster are, e.g. as the sum of (squared) distances between the objects of a cluster and its centroid (within-cluster SSE)
In cluster evaluation, how is separation measured?
Separation measures how distinct a cluster is from the other clusters, e.g. via the distances between cluster centroids or the sum of (squared) distances between objects of different clusters
What is the problem with the cohesion and separation internal measures?
They are suitable for convex clusters, but not for stretched clusters