What is kNN?
supervised learning technique
can be used for classification and regression
can also be used for unsupervised anomaly detection
single hyperparameter k
How does the kNN algorithm work?
classify a new point x_new based on its neighborhood
-> no training phase
-> compute distance to ALL OTHER POINTS
select k nearest instances and assign x_new by majority vote…
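A minimal sketch of this procedure with NumPy (function and variable names such as knn_classify and X_train are illustrative, not from the card):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=5, p=2):
    # no training phase: compute the L_p distance from x_new to ALL stored points
    dists = np.sum(np.abs(X_train - x_new) ** p, axis=1) ** (1 / p)
    nearest = np.argsort(dists)[:k]                        # indices of the k nearest instances
    return Counter(y_train[nearest]).most_common(1)[0][0]  # majority vote over their labels
```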
Formula for the L_p distance?
given vector pairs
a = [a1, a2, a3, …, an]
b = [b1, b2, b3, …, bn]
L_p(a, b) = \left( \sum_{i=1}^{n} |a_i - b_i|^p \right)^{1/p}
=> p-th root of the sum of the absolute differences raised to the power p
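Assuming the formula above, a small helper (hypothetical name lp_distance) could look like this:

```python
import numpy as np

def lp_distance(a, b, p=2):
    # p-th root of the sum of |a_i - b_i|^p; p=2 -> Euclidean, p=1 -> Manhattan
    return np.sum(np.abs(np.asarray(a) - np.asarray(b)) ** p) ** (1 / p)
```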
Training and inference complexity of kNN?
training -> O(1) (simply store the data…)
inference -> O(n) -> compute the distance to all n stored datapoints…
How can one use kNN for regression?
each point has a target value y
-> for a new x, average the y values of the k nearest neighbors…
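A sketch of kNN regression under the same assumptions as above (Euclidean distance, illustrative names):

```python
import numpy as np

def knn_regress(X_train, y_train, x_new, k=5):
    dists = np.linalg.norm(X_train - x_new, axis=1)  # distance to all stored points
    nearest = np.argsort(dists)[:k]                  # k nearest neighbors
    return y_train[nearest].mean()                   # prediction = average of their y values
```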
How to use kNN for unsupervised anomaly detection?
given: unlabeled dataset (xi) -> no labels yi
for all points xi -> calculate MEAN distance to nearest k neighbors
define some threshold C based on this…
=> all xi whose mean distance is smaller than C are normal
=> all points whose mean distance is >= C are anomalies…
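A possible sketch of this scheme; the generated data and the 95th-percentile choice for C are assumed examples, not from the card:

```python
import numpy as np

def mean_knn_distance(X, k=5):
    # pairwise distances between all unlabeled points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    sorted_d = np.sort(dists, axis=1)
    # skip column 0 (distance to itself), average over the k nearest neighbors
    return sorted_d[:, 1:k + 1].mean(axis=1)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), [[8.0, 8.0]]])  # 100 normal points + 1 outlier
scores = mean_knn_distance(X, k=5)
C = np.percentile(scores, 95)   # assumed threshold choice
is_anomaly = scores >= C        # mean distance >= C -> anomaly
```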
What is DBSCAN?
density-based spatial clustering of applications with noise
=> unsupervised anomaly detection
=> group points in space that lie close to each other
-> points in sparse regions are outliers…
How does DBSCAN work?
given distance epsilon (similar to the threshold C in kNN)
integer m (similar to k in kNN)
core points: at least m other points lie within distance epsilon
directly reachable points: within distance epsilon of core point
outliers: points which are not reachable from any core point…
=> cluster all data into these groups to find the outliers… (which thus do not belong to any cluster…)
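One way to run this in practice is scikit-learn's DBSCAN, where eps corresponds to epsilon and min_samples to m; the data below is made up for illustration:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),   # dense cluster 1
               rng.normal(5, 0.3, (50, 2)),   # dense cluster 2
               [[10.0, 10.0]]])               # point in a sparse region

db = DBSCAN(eps=1.0, min_samples=4).fit(X)
labels = db.labels_                        # cluster id per point, -1 = outlier/noise
core_points = X[db.core_sample_indices_]   # the core points
outliers = X[labels == -1]
```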
What are advantages and disadvantages of DBSCAN?
Advantages:
no need to specify the number of clusters beforehand (as in k-means)
can find arbitrary non-linearly separable clusters
requires only two parameters (epsilon, m)
Disadvantages:
not entirely deterministic
border points reachable from two clusters may end up in either cluster… -> depending on the order of data processing…
curse of dimensionality
What are the effects of m and epsilon in DBSCAN?
m
the higher m, the fewer and smaller the clusters
-> the more outliers…
epsilon
too small -> data will not be clustered
too high -> no outliers…
Rule of thumb:
m = 2 * dimensionality (2 times the number of features…)
What is a good way to estimate epsilon in DBSCAN?
find the distance to the nearest neighbor for all points and sort these distances
-> plot the results -> instances on x, distances on y
=> select epsilon where the “change” (elbow) is most pronounced…
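A sketch of this distance plot, using the distance to the single nearest neighbor as described above (dataset and variable names are assumed):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (200, 2))               # placeholder dataset

nn = NearestNeighbors(n_neighbors=2).fit(X)  # n_neighbors=2: the closest "neighbor" is the point itself
dists, _ = nn.kneighbors(X)
nearest_dist = np.sort(dists[:, 1])          # sorted distance to the nearest real neighbor

plt.plot(nearest_dist)
plt.xlabel("instances (sorted)")
plt.ylabel("distance to nearest neighbor")
plt.show()   # pick epsilon near the elbow, where the change is most pronounced
```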
What is a problem of methods relying on distance measures?
L_p distances break down for high-dimensional data
-> distances between all points become nearly equal, so “nearest” neighbors lose their meaning…
What solutions are there for the curse of dimensionality?
L1 only dampens but does not remove the curse of dimensionality
cosine similarity is not a real alternative, as it satisfies no triangle inequality…!
practical solution:
dimensionality reduction via
autoencoders
PCA
feature selection…
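A minimal sketch of the PCA option, assuming scikit-learn and made-up data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_high = rng.normal(0, 1, (500, 100))               # high-dimensional data where L_p distances degrade

X_low = PCA(n_components=10).fit_transform(X_high)  # project onto 10 principal components
# X_low can then be used with kNN or DBSCAN, where distances are more meaningful
```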