What is the definition of anomaly detection?
identification of rare items
-> which raise suspicion by differing significantly from the majority of the data…
What are point anomalies?
individual data instance
that can be considered anomalous with respect to the rest of the data
What are collective anomalies?
collection of related instances is anomalous with respect to the entire data set
What are contextual anomalies?
data instance is anomalous in a specific context
but not otherwise
Categorize different anomaly types w.r.t. context and number of data points
How do we have to process features to detect collective anomalies?
aggregate features over time
i.e. sliding window
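A minimal sketch of sliding-window aggregation, assuming the mean as the aggregate and a hypothetical series of per-minute measurements (neither is given in the notes). Each value on its own stays within the usual range, but the window mean exposes the sustained run of high values:

```python
from collections import deque

def sliding_window_means(values, window_size):
    """Return the mean of every full window of `window_size` consecutive values."""
    window = deque(maxlen=window_size)  # oldest value drops out automatically
    means = []
    for v in values:
        window.append(v)
        if len(window) == window_size:
            means.append(sum(window) / window_size)
    return means

# Hypothetical measurements: single values range from 8 to 15 throughout,
# but positions 5 to 8 form a sustained run of high values.
measurements = [8, 14, 9, 15, 8, 14, 15, 15, 14, 9, 8, 13]
window_means = sliding_window_means(measurements, window_size=3)

for start, mean in enumerate(window_means):
    print(f"window starting at {start}: mean = {mean:.2f}")
```

The aggregated feature (here the window mean) is then what gets modeled as normal or anomalous, instead of the raw values.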
How can we determine what is normal w.r.t. features?
threshold
choose some distribution and fit the distribution to the data
use kernel density estimation
What are drawbacks of threshold based modeling?
arbitrary
yields only binary yes / no answer
we would rather like a probabilistic result
What are drawbacks of fitting a probability distribution to the data?
how to choose one, and which?
often not a good fit…
What is the difference between a probability density function, a probability function and a cumulative distribution function?
PDF:
continuous representation of the distribution of a random variable
integral over the whole range equals 1
probability of any single value equals 0
non-negative everywhere
PF:
probability of any single value is still 0
P(X = x) = 0
the integral of the PDF between two points gives the probability of that interval and is less than or equal to 1
CDF:
accumulate probability from left to right until we reach 1…
-> F(x) = P(X <= x) = integral of the PDF from -infinity to x; always <= 1
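The three notions can be checked numerically. The sketch below assumes a standard normal distribution as the example (the notes do not fix one) and uses a simple midpoint rule for integration:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for the normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

total_area = integrate(normal_pdf, -10.0, 10.0)      # PDF integrates to (almost exactly) 1
prob_interval = integrate(normal_pdf, -1.0, 1.0)     # P(-1 <= X <= 1) as an integral of the PDF
prob_from_cdf = normal_cdf(1.0) - normal_cdf(-1.0)   # the same probability via the CDF
prob_single_point = integrate(normal_pdf, 0.5, 0.5)  # zero-width interval, so probability 0

print(total_area, prob_interval, prob_from_cdf, prob_single_point)
```

Note that the density itself can exceed 1 for narrow distributions; only the integrals are bounded by 1.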
What are the two characteristics of KDE?
non parametric
we do not explicitly specify which probability distribution to use
density estimation
but we still use a probability distribution (instead of a more naive approach such as “remembering” what is normal)
How can we use kernels to approximate an unknown distribution?
draw n univariate (no vectors) samples (independent and identically distributed)
use some kernel function and sum these kernels, each shifted by one of the sample values, then normalize by the number of samples
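The recipe above can be sketched in a few lines. This assumes a Gaussian kernel and a bandwidth parameter `h` (with `h=1.0` the kernel is used unscaled); the sample values are made up for illustration:

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used as the kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, samples, h=1.0):
    """Density estimate at x: one kernel centered on each sample, averaged over all samples."""
    n = len(samples)
    return sum(gaussian_kernel((x - xi) / h) for xi in samples) / (n * h)

# Hypothetical univariate samples: a cluster around 1.0 plus one stray value.
samples = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0]

density_near_cluster = kde(1.0, samples, h=0.5)
density_far_away = kde(10.0, samples, h=0.5)
print(density_near_cluster, density_far_away)
```

Because each kernel integrates to 1 and the sum is divided by n, the estimate is itself a valid density.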
What is the bandwidth used for in kernels?
to smooth them
-> large bandwidth -> high degree of smoothing -> potential underfitting
-> small bandwidth -> low degree of smoothing -> jagged -> overfitting
What is the formula for the usual kernel we use (normal distribution) with and without the bandwidth factor?
with bandwidth
without bandwidth
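For reference, the standard textbook form of the Gaussian kernel, first without and then with the bandwidth factor. The symbols K, u and h are assumed notation, not taken from the notes:

```latex
% without bandwidth: the standard normal density
K(u) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{u^{2}}{2}\right)

% with bandwidth h > 0: stretch by h and rescale so the integral stays 1
K_h(u) = \frac{1}{h}\, K\!\left(\frac{u}{h}\right)
       = \frac{1}{h\sqrt{2\pi}} \exp\!\left(-\frac{u^{2}}{2h^{2}}\right)
```

With h = 1 both forms coincide.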
What is the formula for the estimator?
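The usual form of the kernel density estimator, assuming samples x_1, …, x_n, a kernel K and a bandwidth h (notation chosen here for illustration):

```latex
\hat{p}_h(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - x_i)
             = \frac{1}{n h} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)
```

Each term is one kernel shifted to a sample value; dividing by n normalizes the sum so that the estimate integrates to 1.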
How does cross validation work?
split the training data into n parts
use n-1 parts for training and 1 part for validation
change the validation part each round
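A small sketch of the splitting step. The helper `k_fold_splits` is a hypothetical name, and the strided assignment of items to parts assumes the data order carries no meaning (otherwise shuffle first):

```python
def k_fold_splits(data, n_parts):
    """Yield (train, validation) pairs; every part serves as validation exactly once."""
    parts = [data[i::n_parts] for i in range(n_parts)]  # strided split into n_parts
    for i in range(n_parts):
        validation = parts[i]
        train = [x for j, part in enumerate(parts) if j != i for x in part]
        yield train, validation

data = [0.8, 0.9, 1.0, 1.1, 1.2, 5.0]
splits = list(k_fold_splits(data, n_parts=3))

for round_no, (train, validation) in enumerate(splits):
    print(f"round {round_no}: train={train} validation={validation}")
```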
How can we use cross validation to find the best h?
we have list of hyperparameters h
we have our cross validation splits
for all hyperparameters
for all cross validation splits
fit model on data without current validation split
eval on split
return h where average validation score is best
-> do cross validation for each hyperparameter
-> return the h with the best average validation score (across the different cross validation runs)
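The loop above could look as follows. The notes do not say which validation score is used; this sketch assumes the mean log-likelihood of the held-out points under a Gaussian-kernel KDE, and the data and candidate list are made up:

```python
import math

def kde(x, samples, h):
    """Gaussian-kernel density estimate at x with bandwidth h."""
    total = sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in samples)
    return total / (len(samples) * h * math.sqrt(2.0 * math.pi))

def mean_log_likelihood(train, validation, h):
    """Assumed validation score: average log density of held-out points (higher is better)."""
    # The tiny floor avoids log(0) when the density underflows to zero.
    return sum(math.log(max(kde(x, train, h), 1e-300)) for x in validation) / len(validation)

def best_bandwidth(data, candidates, n_parts):
    parts = [data[i::n_parts] for i in range(n_parts)]
    best_h, best_score = None, -math.inf
    for h in candidates:                      # for all hyperparameters
        scores = []
        for i in range(n_parts):              # for all cross validation splits
            train = [x for j, p in enumerate(parts) if j != i for x in p]
            scores.append(mean_log_likelihood(train, parts[i], h))
        average = sum(scores) / len(scores)
        if average > best_score:
            best_h, best_score = h, average
    return best_h

data = [-1.3, -0.9, -0.6, -0.4, -0.2, -0.1, 0.0, 0.1, 0.3, 0.5, 0.8, 1.4]
candidates = [0.001, 0.05, 0.2, 0.5, 1.0, 100.0]
chosen_h = best_bandwidth(data, candidates, n_parts=3)
print("chosen bandwidth:", chosen_h)
```

The extreme candidates lose for the reasons given earlier: a tiny h assigns almost no density to unseen points (overfitting), a huge h flattens the estimate (underfitting).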
How long does KDE with cross validation take?
O(h*k), where h is the number of candidate bandwidths and k the number of cross validation splits
-> for each of the h candidates, run k validation runs…
How do we use KDE to evaluate a new instance w.r.t. anomaly?
we evaluate our estimator at the new value
if the estimated density lies below a threshold C
=> it is an anomaly…
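A minimal sketch of this decision rule, assuming a Gaussian-kernel KDE; the samples, the bandwidth `h` and the threshold `C` are illustrative values, not ones from the notes:

```python
import math

def kde(x, samples, h):
    """Gaussian-kernel density estimate at x with bandwidth h."""
    total = sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in samples)
    return total / (len(samples) * h * math.sqrt(2.0 * math.pi))

def is_anomaly(x, samples, h, threshold):
    """Flag x when its estimated density falls below the threshold C."""
    return kde(x, samples, h) < threshold

normal_samples = [0.8, 0.9, 1.0, 1.1, 1.2, 1.0, 0.95, 1.05]  # hypothetical "normal" data
h = 0.2    # assumed bandwidth
C = 0.01   # assumed threshold

typical_flagged = is_anomaly(1.0, normal_samples, h, C)   # inside the cluster
outlier_flagged = is_anomaly(3.0, normal_samples, h, C)   # far from every sample
print(typical_flagged, outlier_flagged)
```

The estimated density can also be kept as a graded score instead of being reduced to a yes/no answer.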
How can one improve visualization of KDE anomaly detection?
plot it on a logarithmic scale (natural logarithm ln)
-> as the threshold is usually very small…
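To see why the log scale helps, the sketch below compares raw and log densities, using the standard normal density as a stand-in for a fitted estimator (an assumption for illustration). On the raw scale the tail values are indistinguishable from 0; on the ln scale they remain clearly separated:

```python
import math

def standard_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

points = [0.0, 2.0, 4.0, 6.0, 8.0]
densities = [standard_normal_pdf(x) for x in points]
log_densities = [math.log(d) for d in densities]

for x, d, ld in zip(points, densities, log_densities):
    print(f"x = {x:3.1f}   density = {d:.3e}   ln(density) = {ld:8.3f}")
```

Since ln is monotonic, comparing the density against C gives the same decision as comparing the log density against ln(C), so the threshold can be drawn directly into the log-scale plot.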