What is the essence of clustering?
deals with identifying similarities and differences between different data points
and dividing data into groups
Contrast classification and clustering?
clustering usually starts with all data present (without labels)
-> then divides the data into different groups…
=> does not result in a model but in a clustering of the data…
What is an advantageous property of clustering for data processing?
can be used for pre-processing (e.g. cluster lidar points) to deal with individual clusters rather than all available points…
What is point-cloud clustering used for?
take lidar points
-> cluster them
-> create hulls representing surrounding objects / bodies…
What else can we cluster w.r.t. e.g. autonomous driving?
camera images
-> segmentation…
Difference clustering and segmentation?
both terms largely interchangeable…
-> statistical background: clustering
business background: segment
clustering produces segments and vice versa…
What is the formal definition of clustering?
elements e ∈ E
cluster c ∈ C with c ⊆ E
Union: ⋃_{c ∈ C} c = E
Intersection: c_i ∩ c_j = ∅ for all c_i ≠ c_j
=> we have data points e
=> we have clusters that are subsets of these elements
=> the union of all clusters results in the set of data points
=> the intersection of any two clusters yields the empty set (each data point can only belong to a single cluster…)
Representative r_c = mean(c)
Cluster Center
Variability(c) = ∑_{e ∈ c} distance(r_c, e)^2
=> sum over the squared distances of all data points in a cluster to the cluster center…
-> how spread out are the points around the center in each cluster…
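The variability formula above as a minimal plain-Python sketch (function and variable names are illustrative):

```python
import math

def variability(cluster):
    """Variability(c) = sum of squared distances of all points
    in the cluster to the cluster center r_c = mean(c)."""
    n = len(cluster)
    r_c = tuple(sum(axis) / n for axis in zip(*cluster))  # mean point
    return sum(math.dist(r_c, e) ** 2 for e in cluster)

cluster = [(0.0, 0.0), (2.0, 0.0), (1.0, 2.0)]
print(variability(cluster))  # 14/3 ≈ 4.667
```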
What distance types are there in clustering?
Manhattan
|x1 - x2| + |y1 - y2|
=> sum of absolute x and y distances…
Euclidean
sqrt((x1 - x2)^2 + (y1 - y2)^2)
-> basically the true (straight-line) distance
Chebyshev
max(|x1 - x2|, |y1 - y2|)
=> maximum of x distance and y distance
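The three distances as one-liners (a plain-Python sketch, generalized to any number of dimensions):

```python
def manhattan(p, q):
    """|x1 - x2| + |y1 - y2| : sum of absolute coordinate distances."""
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    """sqrt((x1 - x2)^2 + (y1 - y2)^2) : the 'true' distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def chebyshev(p, q):
    """max(|x1 - x2|, |y1 - y2|) : largest single-axis distance."""
    return max(abs(a - b) for a, b in zip(p, q))

p, q = (1, 2), (4, 6)
print(manhattan(p, q), euclidean(p, q), chebyshev(p, q))  # 7 5.0 4
```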
What to note about the distance metric w.r.t. importance and the data it is applied to?
most important metric in clustering
-> not only spatial distance but also any numerically representable property (e.g. color, size, weight,…)
What to do in clustering?
grouping a set of data objects into clusters
-> cluster: a collection of elements
-> similar to one another within the same cluster
-> dissimilar to the objects in other clusters
Difference clustering classification?
no classes (labels) given
unsupervised learning
Application of clustering?
get insights in large datasets
pre-processing for other algorithms
What are quality measurements of clusters?
distance to representatives
similarity
dissimilarity
silhouette coefficient
What is the distance to representatives?
highly dependent on number of clusters
e.g. only two clusters -> high distances
almost as many clusters as data points -> small distances…
What is the similarity in clustering?
individual datapoint based
average distance to all elements within same cluster
What is the dissimilarity in clustering?
again an individual-element-based metric
average distance to all elements of the second closest cluster
-> thus, like similarity but w.r.t. the closest other cluster…
Again image: similarity and dissimilarity
What is the silhouette coefficient?
s(e) = (dissimilarity(e) - similarity(e)) / max(similarity(e), dissimilarity(e))
=> ranges from -1 to 1; close to 1: well clustered, negative: likely assigned to the wrong cluster…
How does hierarchical clustering work?
start with one cluster per element
combine two closest (most similar) clusters
repeat until all elements are in one cluster…
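The three steps above as a plain-Python sketch, using single link (smallest distance between any two points of different clusters) as the merge criterion and stopping once a desired number of clusters is reached (names are mine):

```python
import math

def single_link_clustering(points, target_k):
    """Agglomerative clustering: start with one cluster per point,
    repeatedly merge the two closest clusters until target_k remain."""
    clusters = [[p] for p in points]

    def single_link(c1, c2):
        # smallest distance between any point of c1 and any point of c2
        return min(math.dist(p, q) for p in c1 for q in c2)

    while len(clusters) > target_k:
        # find the pair of clusters with minimal single-link distance
        i, j = min(
            ((a, b) for a in range(len(clusters))
             for b in range(a + 1, len(clusters))),
            key=lambda ab: single_link(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[i].extend(clusters.pop(j))  # merge j into i
    return clusters

pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
print(single_link_clustering(pts, 2))
```

Running the loop all the way to one cluster and recording each merge distance would give exactly the dendrogram described below.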
What types of distances between clusters exist?
single link
smallest distance between two points of different clusters (distance between the nearest points of the clusters…)
complete link
largest distance between two points of different clusters
average link
average distance between all points of one cluster to all points of a different cluster
=> basically calculate the distances over all point combinations and divide by the number of combinations…
What is a dendrogram?
“map” to represent clustering…
root: cluster with all points
leaf: cluster with one point
edges: combine two clusters
depth: distance between two combined clusters
=> gives insight into distribution of datapoints in space…
How can one use the silhouette coefficient graphically for clustering?
see positive and negative values…
=> e.g. negative values are not reasonable -> the point is similar to the second cluster as well…
=> e.g. happens when evenly distributed data is forced into two clusters…
What can cause problems with regular hierarchical clustering?
clusters within clusters
variable densities within dataset
=> idea
cluster hierarchies
noise filtering
Pros hierarchical clustering?
generic:
no cluster number or other parameters must be defined
visualization:
dendrogram shows the hierarchy
hierarchy:
relationship between clusters
deterministic:
always generates the same clusters
What are cons of hierarchical clustering?
scalability
runtime: O(n^3)
choice:
final clustering result (number of clusters) must be chosen from hierarchy
=> by cutting the dendrogram at some depth level…
What is the basic idea of k-means?
given desired number of clusters k
-> minimize cluster variability…
=> large sum of cluster variabilities:
poor clustering
or bad choice of k
e.g. choosing a single cluster…
=> minimal sum of cluster variabilities:
optimal clustering
choose one cluster for each dense region…
How does the k-means algorithm work (Lloyd)?
input:
number of desired clusters k
dataset
initialization:
choose k arbitrary representatives
repeat until stable:
assign objects to the nearest representative
compute the center of each cluster as new representative (compute it anew, do not use an existing data point…)
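Lloyd's algorithm as a plain-Python sketch (helper names and the simple random initialization are mine):

```python
import math
import random

def k_means(points, k, max_iter=100, seed=0):
    """Lloyd's algorithm: repeat until stable --
    1. assign each point to its nearest representative,
    2. recompute each representative as the mean of its cluster
       (a newly computed point, not an existing data point)."""
    rng = random.Random(seed)
    reps = rng.sample(points, k)  # arbitrary initial representatives
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, reps[i]))
            clusters[nearest].append(p)
        new_reps = [
            tuple(sum(axis) / len(cl) for axis in zip(*cl)) if cl else reps[i]
            for i, cl in enumerate(clusters)
        ]
        if new_reps == reps:  # stable: representatives unchanged
            break
        reps = new_reps
    return reps, clusters

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
reps, clusters = k_means(pts, 2)
print(sorted(reps))  # [(0.0, 0.5), (10.0, 10.5)]
```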
How to choose k?
a priori knowledge of an expert
e.g. “there are five different types of bacteria” -> k = 5
search for a good k
naive approach: brute force, iterating k = 2,3,4,5,…
run hierarchical clustering on subset of data beforehand…
How to do bruteforce for finding k?
calculate the average silhouette coefficient for each k
-> choose k with best value…
(can e.g. also see negative coefficients -> bad clustering…)
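The brute-force search needs an average silhouette coefficient; a plain-Python sketch built from the similarity/dissimilarity definitions above (names are mine):

```python
import math

def avg_silhouette(clusters):
    """Average silhouette coefficient over all points:
    a = avg distance to own cluster (similarity),
    b = avg distance to the closest other cluster (dissimilarity),
    s = (b - a) / max(a, b)."""
    scores = []
    for ci, cluster in enumerate(clusters):
        for pi, p in enumerate(cluster):
            others = cluster[:pi] + cluster[pi + 1:]
            if not others:
                continue  # skip clusters with a single point
            a = sum(math.dist(p, q) for q in others) / len(others)
            b = min(
                sum(math.dist(p, q) for q in other) / len(other)
                for cj, other in enumerate(clusters)
                if cj != ci
            )
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# well-separated clusters -> coefficient close to 1
good = [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]
print(avg_silhouette(good))
```

In the brute-force search, one would run the clustering for k = 2, 3, 4, … and keep the k with the highest average coefficient.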
What is a good target value for the silhouette coefficient?
0.7
How are initial representatives chosen?
randomly
What is a problem with randomly choosing the initial representatives?
k-means sensitive to initialization
-> bad initialization can cause bad results
=> not really good to initialize randomly…
How can one handle the initialization problem? (naive approach) Problem?
naive approach:
use a small random subset D of E
cluster D and use the found representatives for initialization
still, the arbitrary choice could yield a bad result…
How can one handle the randomness of the initialization (improved approach)?
given dataset E, draw m small random subsets
cluster the subsets and save the representatives of each clustering
cluster the merged set of the subsets
m times, once with the representatives from each subset as initialization
select the representatives which yielded the best result in the previous step and use them for the whole dataset
What does k-means yield w.r.t. cluster shape?
convex clusters
-> compare to the Voronoi model
What is the Voronoi model?
partitions space into convex Voronoi cells, one for each point
=> the Voronoi cell of a point covers the area which is nearest to this point
=> the Voronoi diagram of the representatives is the result of k-means…
What are pros of k-means?
efficiency:
O(tkn)
n = #objects
k = #clusters
t = #iterations
implementation: easy to use
What are cons of k-means?
applicability: a mean must exist
noise: sensitive to outliers
specification: k must be defined
initialization: might run into local optimum
cluster form: convex space partitions
no other cluster shapes can be found…
What are some variants of k-means? When are they useful?
k-medoids
k-median clustering
=> use a different center than the mean -> useful when a mean does not exist (which is not always the case…)
How does k-medoids vary from k-means?
medoid: the representative is not an arbitrary point in space but the actual data point that is closest to the cluster center…
How does k-median differ?
use the median of the (sorted) cluster
=> half of the points on the left side, half on the right side…
-> less sensitive to outliers
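A quick numeric illustration of why the median is less sensitive to outliers than the mean:

```python
data = [1, 2, 3, 4, 100]               # 100 is an outlier
mean = sum(data) / len(data)           # dragged towards the outlier
median = sorted(data)[len(data) // 2]  # barely affected
print(mean, median)  # 22.0 3
```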
How can one reduce the influence of outliers in k-means?
influence increases if the squared distance is used -> use the normal (non-squared) distance…
How do k-means, k-medoids and k-median compare?
What are overall pros of the k-type clustering methods?
What are the overall cons of k-type clustering?
What is DBSCAN?
density-based spatial clustering of applications with noise
How many parameters does DBSCAN have?
two
epsilon-radius neighborhood
minimum points (MP)
What point classes are there in DBSCAN?
core
border
outlier
How does DBSCAN work (roughly)?
specify epsilon radius neighborhood
specify MP
assign each data point a class
Rules DBSCAN?
if there are at least MP data points in the epsilon-radius neighborhood
-> core point
if there are fewer points than MP in the epsilon-radius but a core point is in the neighborhood
-> border point
if there are fewer than MP points in the neighborhood and no core point in the neighborhood
-> outlier
!! the point in question itself counts towards MP…
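The three rules above as a plain-Python sketch (function name is mine; note the neighborhood includes the point itself, matching the last rule):

```python
import math

def classify_points(points, eps, min_pts):
    """Assign each point its DBSCAN class: core, border, or outlier."""
    n = len(points)
    nbrs = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]  # each eps-neighborhood includes the point itself
    core = {i for i in range(n) if len(nbrs[i]) >= min_pts}
    labels = []
    for i in range(n):
        if i in core:
            labels.append("core")          # >= MP points in neighborhood
        elif any(j in core for j in nbrs[i]):
            labels.append("border")        # < MP, but a core point nearby
        else:
            labels.append("outlier")       # < MP and no core point nearby
    return labels

pts = [(0, 0), (0, 1), (1, 0), (0, 2), (20, 20)]
print(classify_points(pts, eps=1.5, min_pts=3))
# ['core', 'core', 'core', 'border', 'outlier']
```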
How do we create clusters from DBSCAN?
all points in a cluster are reachable from each other
-> meaning: we can create a path between any two points that has only core points in between…
=> the endpoints can be border points…
=> reachable: within the epsilon-neighborhood (else: not assigned as border…)
What are pros of DBSCAN?
cluster form: arbitrary (convex and non-convex) space partitions
specification: k determined automatically
noise: separates clusters from noise
efficiency: O(n^2) worst case (O(n log n) with a spatial index)
What are cons of DBSCAN?
specification: parameters difficult to determine
sensitivity: very sensitive to parameter changes
What are applications of clustering?
grouping news articles based on key words
genome patterns -> groups of people
computing clusters (which to put close together to compute same thing)
clustering people on social media
online shopping -> cluster to segment customers and provide customized product suggestions
netflix -> movie suggestions
What are applications of clustering in automotive (city planning)?
traffic analysis
collect mobility data of cars or the density of certain regions
use clustering algorithms to identify different groups
e.g. commuters, points of interest
extract generalization of trajectories and traffic flow
use knowledge for city planning and to identify bottlenecks
What are applications of clustering in automotive (image)?
cluster lidar datapoints
classify clustered object from image
=> high level fusion: fuse both to segment image
=> yields an object list (position, velocity, classification)
What is low-level fusion?
overlay lidar and camera image directly
find clusters in augmented point-cloud
object detection and classification based on fused raw data
Image high level fusion?
Image low level fusion?
What is the object detection and classification pipeline?
raw image -> pixel segmentation
raw point cloud
=> fusion
-> detection & classification
=> give clusters meaning afterwards
What can bad initialization in k-means lead to?
bad clustering
bad convergence speed
=> sensitive to initialization
Time complexity DBSCAN?
O(n log n) on average (with a spatial index); O(n^2) in the worst case