What is k-NN?
supervised learning technique for both
regression
classification
can also be used for unsupervised anomaly detection
a single hyperparameter k -> number of neighbors
What is the goal of k-NN (high level)?
find decision surfaces in space
-> so that new point can be classified
How does the k-NN algo work?
classify a new point based on its neighborhood
-> no training phase
-> the (training) datapoints are the model themselves
algo:
for new datapoint
compute distance to all other datapoints
select k nearest instances
assign class by majority vote
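minimal sketch of this procedure (illustrative function name and toy data, Euclidean distance assumed):

```python
import numpy as np

def knn_classify(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # compute distance from x_new to all training points (Euclidean, i.e. p = 2)
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # select the k nearest instances
    nearest = np.argsort(distances)[:k]
    # assign class by majority vote
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# toy example: two classes in 2D
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_classify(X_train, y_train, np.array([0.2, 0.1]), k=3))  # -> 0
```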
How to calculate the p-distance?
sum the absolute differences of all vector elements, each raised to the power p
take the p-th root of the sum
-> the higher p, the more the unit "circle" of the distance approaches a square (p -> infinity gives the max/Chebyshev distance)
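worked sketch of the p-distance (variable names are mine):

```python
import numpy as np

def p_distance(x, y, p=2):
    """Minkowski / p-distance: p-th root of the summed absolute differences to the power p."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(p_distance(a, b, p=1))  # 7.0 -> Manhattan distance
print(p_distance(a, b, p=2))  # 5.0 -> Euclidean distance
# as p grows, the value approaches max(|3|, |4|) = 4 (Chebyshev distance)
```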
How to adjust variance and bias in k-NN?
high k -> higher bias -> follows the herd (smooths over local structure)
low k -> higher variance -> respects locality (sensitive to individual points / noise)
What is the complexity of k-NN?
training O(1) (we only have the datapoints… -> no actual training phase…)
classifying O(n) (compute distance to all n datapoints…)
How can there be a voting tie with an odd k?
with more than two classes, e.g. 2 vs 2 vs 1 votes among k = 5 neighbors…
How can we use k-NN for regression?
we have training data with features (x1, x2) and target value y
-> predict the new instance by averaging over the y values of the k nearest points…
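minimal regression sketch, same assumptions as the classification example above (illustrative names, Euclidean distance):

```python
import numpy as np

def knn_regress(X_train, y_train, x_new, k=3):
    """Predict y for x_new as the mean y value of its k nearest training points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()

X_train = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [10.0, 10.0]])
y_train = np.array([1.0, 2.0, 3.0, 10.0])
print(knn_regress(X_train, y_train, np.array([2.1, 2.0]), k=3))  # -> 2.0 (mean of 1, 2, 3)
```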
How can one use k-NN for anomaly detection?
given: dataset without labels (no y value)
for all xi: calculate li: mean distance to its k nearest neighbors
define some threshold C
all xi for which li < C are normal
while all points for which li >= C are anomalous (GREATER EQUAL)
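small sketch of this scheme (pairwise-distance computation, toy data and threshold C are illustrative assumptions):

```python
import numpy as np

def knn_anomaly_scores(X, k=3):
    """For every x_i return l_i: the mean distance to its k nearest neighbors (itself excluded)."""
    # full pairwise distance matrix
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # for each point: sort its row, drop the 0 distance to itself, average the k smallest
    return np.array([np.sort(row)[1:k + 1].mean() for row in D])

X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0], [5.0, 5.0]])
l = knn_anomaly_scores(X, k=2)
C = 1.0                 # threshold chosen by hand for this toy example
print(l >= C)           # [False False False  True] -> only the last point is anomalous
```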
What can be a cause of error in the exam when calculating the distance for anomaly detection?
points that lie far away still count towards the k nearest neighbors
-> i.e. an outlying cluster of 4 points would be classified as normal, as its points have each other as near neighbors…
What is DBSCAN?
Density-Based Spatial Clustering of Applications with Noise
-> unsupervised anomaly detection technique
What is the general idea of DBSCAN?
group together points that are closely packed
mark points that lie alone in low-density regions as outliers
How does DBSCAN work?
given:
distance epsilon (similar to C -> threshold)
integer m (similar to K)
mark different point classes:
core point
directly reachable points
outliers
What are core points in DBSCAN?
point p
where at least m points (including p itself) are within distance epsilon
What are directly reachable points in DBSCAN?
point q is directly reachable
if it is within distance epsilon of a core point
What is an outlier in DBSCAN?
points which are not reachable from any core point
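putting the definitions above together: scikit-learn's DBSCAN is one ready-made implementation; its eps and min_samples parameters correspond to epsilon and m (toy data chosen for illustration):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# two dense groups of points plus one isolated point
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.0, 0.2],
              [5.0, 5.0], [5.1, 5.1], [5.0, 5.2],
              [10.0, 0.0]])

db = DBSCAN(eps=0.5, min_samples=2).fit(X)  # eps ~ epsilon, min_samples ~ m
print(db.labels_)                           # [0 0 0 1 1 1 -1]; label -1 marks outliers/noise
```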
What are advantages of DBSCAN?
no need to specify number of clusters beforehand (contrary to e.g. k-means)
can find arbitrarily shaped, non-linearly-separable clusters
requires only two parameters
m
epsilon
What are disadvantages of DBSCAN?
not entirely deterministic
border points reachable from two clusters may be part of either cluster
depending on order of data processing
curse of dimensionality
How do the values of DBSCAN's parameters influence the clustering?
m:
the higher m
the fewer and smaller the clusters
the more outliers
distance epsilon:
if too small
data will not be clustered
if too large
no outliers
How to decide on m in DBSCAN?
rule of thumb:
m = 2*dim
How to decide on epsilon in DBSCAN?
use an algorithm
find distance to nearest neighbor (not neighbors) for all points and sort
plot results: instances on x, epsilon on y axis
select epsilon where “change is most pronounced”
=> meaning of (x,y) in plot:
e.g. (200, 0.19)
199 points have their nearest neighbors closer than 0.19
How can one create a knee-plot?
calculate distance to m nearest neighbors
sum up and divide by m
-> create list of distances and sort
-> plot with y-axis being the distance
-> x-axis being the instances of the (sorted) list…
-> again, choose point with most pronounced change as good distance choice
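sketch of such a knee plot (function and variable names are mine; with m = 1 this also reproduces the nearest-neighbor plot from the previous card):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

def knee_plot(X, m):
    """Plot the sorted mean distance of every point to its m nearest neighbors."""
    nn = NearestNeighbors(n_neighbors=m + 1).fit(X)
    dist, _ = nn.kneighbors(X)            # first column is the distance 0 to the point itself
    mean_dist = dist[:, 1:].mean(axis=1)  # sum over the m neighbors, divided by m
    plt.plot(np.sort(mean_dist))          # x-axis: instances (sorted), y-axis: distance
    plt.xlabel("instances (sorted)")
    plt.ylabel("mean distance to m nearest neighbors")
    plt.show()                            # pick epsilon where the change is most pronounced
```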
What are some effects of the curse of dimensionality in methods such as DBSCAN or k-NN?
increasing sparsity
distance measure loses meaning
increased computational complexity
overfitting
What is the problem of increased sparsity in high dimensions?
volume of space increases exponentially
data points become more spread out
harder to identify meaningful clusters or neighborhoods
What is the problem of distance measure losing meaning in high dimensions? What can one still infer?
in high dimensional spaces
-> distances between pairs of points become increasingly similar
one can still say in which direction we get closer…
What is the conclusion of curse of dimensionality w.r.t. the usability of distance metrics?
DBSCAN and k-NN cannot use L2 (or any Lp) for large d
-> using fractional metrics (Lp with 0 < p < 1) is also not really good:
not real metrics, triangle inequality not satisfied
also susceptible to curse of dimensionality
What are solutions to the curse of dimensionality?
L1 dampens effect but does not remove it
cosine similarity (see sketch after this list)
disadvantage: no triangle inequality
practical solution:
dimensionality reduction
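minimal cosine similarity sketch (illustrative names and data):

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between x and y; depends on direction only, not on vector length."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

a = np.array([1.0, 0.0, 1.0])
b = np.array([2.0, 0.0, 2.0])
print(cosine_similarity(a, b))  # 1.0 -> same direction, despite different lengths
```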
What methods can be used for dimensionality reduction?
autoencoder
PCA
feature selection
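e.g. PCA via scikit-learn could project the data into a lower-dimensional space before running k-NN or DBSCAN (dimensions and component count below are arbitrary assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 50)                   # 100 points in a 50-dimensional space
X_low = PCA(n_components=5).fit_transform(X)  # keep 5 principal components
print(X_low.shape)                            # (100, 5)
```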