What is the essence of clustering?
deals with identifying similarities and differences between different data points
and dividing data into groups
Contrast classification and clustering?
clustering usually starts with all data present (without label)
-> then divides data into differetn groups…
=> does not result in model but clustering of data…
What is an advantageous property of clustering for data processing?
can be used for pre-processing (e.g. cluster lidar points) to deal with individual clusters rather than all available points…
What is point-cloud clustering used for?
take lidar points
-> cluster them
-> create hulls representing surrounding obejcts / bodies…
What else can we cluster w.r.t. e.g. autonomous driving?
camera images
-> segmentation…
Difference clustering and segmentation?
both terms interchangeable…
-> statistical background: clusetering
business background: segment
clustering produces segments and vice versa…
What is the formal definition of clustering?
elements e ∈ E
cluster c ∈ C with c ⊆ E
Union U_{c ∈ C} = E
Intersectoin ⋂_{c ∈ C} = ∅
=> we have datapoints e
=> we have clusters that are subset of these elements
=> the union of all cluseters results in the set of datapoints
=> the intersection of clusters yields the empty set (datapoints can only belong to a single cluster…)
Representative r_{c} = mean(c)
Cluster Center
Variablility(c) = ∑_{e∈C} distance(r_{c}, e)^2
=> Sum over the squared distances of all datapoints in cluseter to the cluster center…
-> how distributed are the points from the center in each cluster…
What distance types are there in clustering?
Manhattan
|x1 - x2| + |y1 - y2|
=> sum of absolute x and y distances…
Euclidian
sqrt((x1 - x2)^2 + (y1 - y2)^2)
-> basically the true distance
cebyshev
max(|x1 - x2|, |y1 - y2|)
maximum (x distance, y distance)
Whta to note about the distance w.r.t. importancen and on which data applied?
most important metric in clustering
-> not only spatial distance but also any numerically representable property (e.g. color, size, weight,…)
What to do in clustering?
grouping a set of data objects into clusters
-> cluster: a collection of elements
-> similar to one another within the same cluster
-> dissimilar to the objects in other clusters
Difference clustering classification?
no classes (labels) given
unsupervised learning
Applicaiotn of clustering?
get insights in large datasets
pre-processing for other algorithms
What are quality measurements of clusters?
distance to representatives
similarity
dissimilarity
silhouette coefficient
What is the distance to representatives?
highly dependent on number of clusters
e.g. only two clusters -> high distances
almost as many clusters as data points -> small distances…
What is the simiilairty in clustering?
individual datapoint based
average distance to all elements within same cluster
What is the dissimiliarty in clustering
again individual element based metric
average distance to all elements of second closest cluster
-> thus, like similarity but to closest other cluster…
Again immage similarity and dissimilarity
What is the shilouette coefficient?
How does hierarchical clustering work?
start with one cluster per element
combine two closest (most similar) clusters
repeat until all elements are in one cluster…
What types of distances betwee clusters exist?
single link
smallest distance between two point of differetn clusters (distance between nearest points of clusters…)
complete link
largest distance between two points of different clusters
average link
average distance between all points of one cluster to all points of a differetn cluster
=> basically calcualte distances to all point combinations and divide by number of combinations…
What is a dendogram?
“map” to represent clustering…
root: cluster with all points
leaf: cluster with one point
edges: combine two clusters
depth: distance between two combined clusters
=> gives insight into distribution of datapoints in space…
How can one use the shilouette coefficient graphically for clustering?
see positive and negative …
=> e.g. negative values not reaonale -> similar to second cluster as well….
=> e.g. happens when evenly distributed stuff tried to cluster in two clusters…
What can cause problems with regular hierarchical clustering?
clusters within clusters
variable densities within dataset
=> idea
cluster hierarchies
noise filtering
Pros hierarchical clustering?
generic
no cluste rnumber or other parameter must be defined
visualization
dendogram shows hierarchy
heirarchy:
relationship between clusters
deterministic:
generates always same clusters
What are cons of hierarchical clustering?
scalability
runtime: O(n^3)
choice:
final clustering result (number of clusters) must be chosen from hierarchy
=> by cutting dendogram at some depth level…
What is the basic idea of k-means?
given desired number of clusters
-> minimize cluser variability…
=> larege sum of cluster variablilites:
poor clustering
or bad choice for k
e.g. choose a single cluster…
=> minimal sum of cluster variabiliteis
optimal clustering
choose one cluster for each dense region…
How does the k-means algorhtim work (Lloyd)?
input:
number of desired clusters k
dataset
initializatoin:
choose k arbitrary representatives
repeat until stable
assign objects to nearest representatives
compute center of each cluster as new representative (compute new, do not use existing datapoint…)
How to choose k?
a priori knowlege of an expert
e.g. “there are five different types of bacteria” -> k = 5
search for a good k
naive approach: brute force, iterating k = 2,3,4,5,…
run hierarchical clustering on subset of data beforehand…
How to do bruteforce for finding k?
calculate average silhouette coefficietn for each k
-> choose k with best value…
(can e.g. also see negative coefficients -> bad clustering…)
What is a good goal of sihlouette coefficient?
0.7
How are initial representatives chosen?
randomly
What is a problem with randomly choosing the initial representatives?
k-means sensitive to initialization
-> bad initialization can cause bad results
=> not really good to initialize randomly…
How can one handle the initialization problem? (naive approach) Problem?
naive approach:
use small random subset D from E
cluster D and user found representatives in intialization
still arbitrary choice could yield bad result…
How can one handle randomness initialization problem (improved approach)?
given dataset E, get m small random subsets
cluster the subsets and save representatives of each cluster
cluster the merged set of the subsets
m times with the different representatives from each subset
select representation from the different representatives which yielded best result on the previous step and use for whole dataset
What does k-means yield w.r.t. cluster shape?
convex clusters
-> compare to vronoi model
What is the vronoi model?
partitions space in convex vronoi cells for each point
=> vronoi cell for a point covers area which is nearest to this data point
=> vronoi is result from k-means for its representatives…
What are pros of k-means?
efficiency:
O(tkn)
n = #objects
k = #clusters
t = #iterations
implementation: easy to use
What are cons of k-means?
applicabilty: mean must exist
noise: sensitive to outliers
specification: k must be defined
initialization: might run into local optimum
cluster form: convex space partitions
no other space shape found…
What are some variants to k-means? When is it useful?
k medoios
k-median clustering
=> different center than mean -> useful when mean not exists (which is not always the case…)
How does k-medoids vary from k-means?
medoid: representative is not point in space but actual datapoint that is closest to that point…
How does k-median differ?
use the median of (sorted) cluster
=> half on left side, half on right side…
-> less sensitive to outliers
How can one reduce the influcence of outlers in k-means?
influence increases if squared distance is used -> use normal distance…
How does k-menas, k-medoid and k-median compare?
What are overall pros of the k-type clusterign methdos?
What are the overall cons of k-type clustering?=
What is DBSCAN?
density based clustering application with noise
How many parameters does DBSCAN have?
two
epsilon-radius neighborhood
minimum points (MP)
What point classes are there in DBSCAN?
core
border
outlier
How does DBSCAN work (roughly)?
specify epsilon radius neighborhood
specify MP
assign each data point a class
Rules DBSCAN?
if there are at least MP number of datapoints in the epsiloin-radious neighborhood
-> core point
if there are less points than MP in e-radius but a core is there
-> border
if less than MP in neighborhood and no core in neighborhood
-> outlier
!! count the point in question to the number of MP…
How do we create clusters from DBSCAN?
all points in cluster are reachable from eachother
-> meaning: we can create path from each point that has only cores in between…
=> endpoints can be border…
=> reachable: epsilon-neighborhood (else: not assigned as border…)
What are pros of DBSCAN?
cluster form: arbitrary (convex and non-convex) space partitions
specification: k determined automatically
noise: separates clusters from noise
efficiency O(n^2)
What are cons of DBSCAN?
specification: parameters difficult to determine
sensitivity: very sensitive to parameter changes
What are applications of clustering?
news articles based on key words?
genome patterns -> groups of people
computing clusters (which to put close together to compute same thing)
clustering people on social media
online shopping -> cluster to segment customers and provide customized product suggestions
netflix -> movie suggestions
What are applications of clustering in automotive (city planning)
traffic analysis
collect mobility data of cars or density of certain reginos
use cluseter algos to identify different gorups
e.g. commuters, points of interests
extract generalization of trajectories and traffic flow
use knowledge for city planning and to identify bottlenecks
What are applications of clustering in automotive (image)
cluster lidar datapoints
classify clustered object from image
=> high level fusion: fuse both to segment image
=> yields object list (position, velocity, classification=)
What is low-level fusion?=
overlay lidar and camera image directly
find clusters in augmented point-cloud
object detecion and classification based on fused raw data
Image high level fusion?
Image low level fusion
What is the object detecion and classification pipeline?
raw imate -> pixel segmentation
raw point cloud
=> fusion
-> detection & classification
=> give clusters meaning afterwards
What can bad initialization in k-measn lead to?
bad clustering
bad convergence speed
=> sensitive to initialization
Time complexity DBSCAN?
O(n log n)
Zuletzt geändertvor 2 Jahren