What is k-NN?
supervised learning technique for both
regression
classification
can also be used for unsupervised anomaly detection
a single hyperparameter k -> number of neighbors
What is the goal of k-NN (high level)?
find decision surfaces in space
-> so that new point can be classified
How does the k-NN algo work?
classify a new point based on its neighborhood
-> no training phase
-> the (training) datapoints are the model themselves
algo:
for new datapoint
compute distance to all other datapoints
select k nearest instances
assign class by majority vote
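a minimal sketch of this procedure (assuming NumPy arrays of numeric features and Euclidean distance; the function name is illustrative):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=3):
    # compute distance from the new point to all training points
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # select the k nearest instances
    nearest = np.argsort(dists)[:k]
    # assign class by majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]
```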
How to calculate the p-distance?
sum the absolute differences of all vector elements, each raised to the power p
take the p-th root of the sum
-> the higher p, the more the distance is dominated by the largest coordinate difference (the unit "circle" of the metric looks more like a square)
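a small sketch of this Minkowski (p-)distance, assuming NumPy arrays:

```python
import numpy as np

def p_distance(x, y, p=2):
    # sum the absolute element-wise differences raised to the power p, then take the p-th root
    return np.sum(np.abs(x - y) ** p) ** (1 / p)

# p=1 -> Manhattan, p=2 -> Euclidean; for large p it approaches max(|x_i - y_i|)
```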
How to adjust variance and bias in k-NN?
high k -> higher bias -> follow the herd (smoother decision boundary)
low k -> higher variance -> respect locality (sensitive to noise)
What is the complexity of k-NN?
training O(1) (we only have the datapoints… -> no actual training phase…)
classifying O(n) (compute distance to all n datapoints…)
How can there be a voting tie with an odd k?
with more than two classes, e.g. 2 vs 2 vs 1…
How can we use k-NN for regression?
we have training data with features (x1, x2) and a continuous target y
-> predict the new instance's value by averaging the y values of the k nearest points…
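a minimal sketch of k-NN regression under the same assumptions as the classifier sketch above:

```python
import numpy as np

def knn_regress(X_train, y_train, x_new, k=3):
    # average the y values of the k nearest training points
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()
```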
How can one use k-NN for anomaly detection?
given: dataset without labels (no y value)
for every xi: calculate li: the mean distance to its k nearest neighbors
define some threshold C
all xi for which li < C are normal
while all points for which li >= C are anomalous (GREATER EQUAL)
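a possible sketch of this procedure (assuming a NumPy array X of unlabeled points; threshold C and k are left to the user):

```python
import numpy as np

def knn_anomaly_flags(X, k=3, C=1.0):
    # pairwise distances between all points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)                  # a point is not its own neighbor
    # l_i: mean distance of x_i to its k nearest neighbors
    l = np.sort(dists, axis=1)[:, :k].mean(axis=1)
    return l >= C                                    # True -> anomalous, False -> normal
```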
What can be a cause of error in the exam when calculating the distance for anomaly detection?
points lying far from the bulk of the data still count towards each other's k nearest neighbors
-> i.e. an outlying cluster of 4 points would be classified as normal, as its members have each other as k near neighbors…
What is DBSCAN?
density-based spatial clustering of applications with noise
-> unsupervised anomaly detection technique
What is the general idea of DBSCAN?
group together points that are closely packed
marking points as outliers that lie alone in low-density regions
How does DBSCAN work?
given:
distance epsilon (similar to C -> threshold)
integer m (similar to k)
mark different point classes:
core point
directly reachable point
outliers
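a minimal usage sketch with scikit-learn's DBSCAN (eps plays the role of epsilon, min_samples the role of m; note scikit-learn counts the point itself towards min_samples; the toy data is made up):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],   # dense group
              [8.0, 8.1], [8.2, 7.9], [7.9, 8.0],   # second dense group
              [4.0, 15.0]])                          # lone point
labels = DBSCAN(eps=1.0, min_samples=3).fit_predict(X)
print(labels)   # cluster index per point, -1 marks outliers (noise)
```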
What are core points in DBSCAN?
point p
where at least m points are within distance epsilon
What are directly reachable points in DBSCAN?
point q is directly reachable
if it is within distance epsilon of core point
What is an outlier in DBSCAN?
points that are not reachable from any core point
What are advantages of DBSCAN?
no need to specify number of clusters beforehand (contrary to e.g. k-means)
can find arbitrarily shaped, non-linearly separable clusters
requires only two parameters
m
epsilon
What are disadvantages of DBSCAN?
not entirely deterministic
border points reachable from two clusters may be part of either cluster
depending on order of data processing
curse of dimensionality
How do the values of DBSCAN's parameters influence the clustering?
m:
the higher m
the fewer and smaller the clusters
the more outliers
distance epsilon:
if too small
data will not be clustered
if too large
no outliers
How to decide on m in DBSCAN?
rule of thumb:
m = 2*dim
How to decide on epsilon in DBSCAN?
use an algorithm
find the distance to the nearest neighbor (not neighbors) for all points and sort
plot results: instances on x, epsilon on y axis
select epsilon where “change is most pronounced”
=> meaning of (x,y) in plot:
e.g. (200, 0.19)
199 points have their nearest neighbor closer than 0.19
How can one create a knee-plot?
calculate distance to m nearest neighbors
sum up and divide by m
-> create list of distances and sort
-> plot with y-axis being the distance
-> x-axis being the instances of the (sorted) list…
-> again, choose point with most pronounced change as good distance choice
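a sketch of such a knee plot with scikit-learn and matplotlib (m is the chosen neighbor count; setting m=1 gives the nearest-neighbor plot from the previous card):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

def knee_plot(X, m=4):
    # distances of every point to its m nearest neighbors (first column is the point itself)
    dists, _ = NearestNeighbors(n_neighbors=m + 1).fit(X).kneighbors(X)
    mean_dists = np.sort(dists[:, 1:].mean(axis=1))   # sum up, divide by m, then sort
    plt.plot(mean_dists)                               # x: instances (sorted), y: distance
    plt.xlabel("instances (sorted)")
    plt.ylabel("mean distance to m nearest neighbors")
    plt.show()
```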
What are some effects of the curse of dimensionality in methods such as DBSCAN or k-NN?
increasing sparsity
distance measure loses meaning
increased computational complexity
overfitting
What is the problem of increased sparsity in high dimensions?
volume of space increases exponentially
data points become more spread out
harder to identify meaningful clusters or neighborhoods
What is the problem of distance measure losing meaning in high dimensions? What can one still infer?
in high dimensional spaces
-> distances between pairs of points become increasingly similar
one can still tell in which direction we get closer…
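a small demonstration of this concentration effect on random uniform data (the setup and numbers are illustrative, not from the source):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((500, d))     # 500 uniform random points in [0, 1]^d
    dists = pdist(X)             # all pairwise Euclidean distances
    # the relative spread of distances shrinks as d grows -> distances look alike
    print(d, round(dists.std() / dists.mean(), 3))
```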
What is the conclusion of curse of dimensionality w.r.t. the usability of distance metrics?
DBSCAN and k-NN cannot meaningfully use L2 (or any Lp norm) for large d
-> fractional metrics (Lp with 0 < p < 1) are also not really good:
not true metrics, triangle inequality not satisfied
also susceptible to the curse of dimensionality
What are solutions to the curse of dimensionality?
L1 dampens the effect but does not remove it
cosine similarity (see sketch below)
disadvantage: no triangle inequality
practical solution:
dimensionality reduction
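a one-line sketch of the cosine similarity mentioned above (assuming NumPy vectors):

```python
import numpy as np

def cosine_similarity(a, b):
    # angle-based similarity in [-1, 1]; ignores vector magnitudes
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```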
What methods can be used for dimensionality reduction?
autoencoder
PCA
feature selection