What is an outlier?
“An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.”
Statistics-based intuition:
Normal data objects follow a “generating mechanism”, e.g. some given statistical process
Abnormal objects deviate from this generating mechanism
Applications
fraud detection
medicine (abnormal tests …)
Detecting measurement errors
Important Properties of Outlier Models
Systematic Approaches
Find clusters and then assess the degree to which a point belongs to any cluster
E.g. for k-Means, use distance to the centroid
If eliminating a point results in substantial improvement of the objective function, we could classify it as an outlier
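A minimal sketch of the centroid-distance idea, assuming the centroids are already available from an earlier k-Means run. The function name, data and centroid values are made up for illustration.

```python
import math

def centroid_distance_scores(points, centroids):
    """Outlier score of each point = Euclidean distance to its nearest centroid."""
    return [min(math.dist(p, c) for c in centroids) for p in points]

# Toy data: two tight groups plus one point far from both (illustrative values).
points = [(0.0, 0.0), (0.1, 0.2), (-0.2, 0.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2),
          (2.5, 9.0)]
# Assumed result of a previous k-Means run with k = 2 (hypothetical values).
centroids = [(0.0, 0.1), (5.0, 5.0)]

scores = centroid_distance_scores(points, centroids)
# Index of the point that belongs least to any cluster.
most_suspicious = max(range(len(points)), key=scores.__getitem__)
```

The score is only a ranking. Where to cut it off (a fixed distance, a quantile, ...) is a separate choice that the notes leave open.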
Statistical Tests
Normal data objects follow a (known) distribution and occur in a high probability region of this model
Outliers deviate strongly from this distribution (e.g., deviate more than 3 times the standard deviation from the mean)
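The 3-standard-deviation rule can be sketched as follows, assuming one-dimensional, roughly normally distributed data. The function name and the readings are illustrative only.

```python
import statistics

def three_sigma_outliers(values, k=3.0):
    """Return the values lying more than k standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # standard deviation of the whole sample
    if std == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mean) > k * std]

# Illustrative data: 20 readings close to 10 plus one reading of 25.
readings = [10.0 + 0.1 * (i % 5 - 2) for i in range(20)] + [25.0]
flagged = three_sigma_outliers(readings)
```

Note that mean and standard deviation are estimated from the same data that is being tested, so an extreme value also inflates the threshold it is compared against.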
Statistical Outliers
Statistical Outliers: Problems
Curse of dimensionality: The more degrees of freedom (dimensions), the more similar the MDist (Mahalanobis distance) values become for all points
Robustness:
Mean and standard deviation are very sensitive to outliers (they are computed on the whole dataset, including the outliers)
The MDist is used to determine outliers although the MDist values are influenced by these outliers
Data distribution is fixed
Low flexibility (if no mixture models)
Global method
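To make the robustness problem concrete, here is a small sketch that computes MDist for 2-D points, reading MDist as the Mahalanobis distance (an assumption about the abbreviation). Mean and covariance are estimated on the whole dataset, outliers included. The function name and the data are illustrative.

```python
import math

def mahalanobis_2d(points):
    """MDist of each 2-D point to the sample mean, using the sample covariance."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Covariance entries estimated on the whole dataset (outliers included).
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    dists = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Quadratic form (dx, dy) S^-1 (dx, dy)^T with the closed-form 2x2 inverse.
        q = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        dists.append(math.sqrt(q))
    return dists

# Five points roughly on the line y = x plus one point off that line.
data = [(0, 0.2), (1, 0.9), (2, 2.1), (3, 2.9), (4, 4.1), (1, 4.0)]
mdist = mahalanobis_2d(data)
farthest = max(range(len(data)), key=mdist.__getitem__)
```

The last point is unremarkable in each coordinate on its own but has the largest MDist because it violates the correlation. Its own presence still shifts the mean and widens the covariance used to judge it.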
Distance-Based Approaches
Distance-Based Approaches: kNN
kNN: Problems
Density-Based Approaches and problems
Density-Based Approaches - relative density
Consider the relative density w.r.t. the neighbourhood.
Local Outlier Factor (LOF) of p: the average ratio of the local densities (ld) of the kNNs of p to the local density of p
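A compact sketch of the LOF computation. The notes do not say how the local density is defined, so this assumes the usual reachability-distance definition, distinct points, and exactly k neighbours per point (ties broken by index). Function and variable names are illustrative.

```python
import math

def lof_scores(points, k):
    """Local Outlier Factor of every point, brute-force kNN search."""
    n = len(points)
    # k nearest neighbours of each point as (distance, index) pairs.
    knn = []
    for i in range(n):
        dists = sorted((math.dist(points[i], points[j]), j)
                       for j in range(n) if j != i)
        knn.append(dists[:k])
    # k-distance: distance to the k-th nearest neighbour.
    k_dist = [neighbours[-1][0] for neighbours in knn]
    # Local density: inverse of the mean reachability distance to the kNNs.
    ld = []
    for i in range(n):
        reach = [max(k_dist[j], d) for d, j in knn[i]]
        ld.append(len(reach) / sum(reach))
    # LOF: average ratio of the neighbours' local density to the point's own.
    return [sum(ld[j] for _, j in knn[i]) / (len(knn[i]) * ld[i])
            for i in range(n)]

# 3x3 grid of inliers plus one distant point (illustrative data).
grid = [(x, y) for x in range(3) for y in range(3)]
scores = lof_scores(grid + [(6, 6)], k=3)
top = max(range(len(scores)), key=scores.__getitem__)
```

Values close to 1 mean the point is about as dense as its neighbours. Values clearly above 1 indicate a point in a sparser region than its neighbours.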
Angle-Based Approach
For an outlier, the angles it forms with pairs of other points vary only slightly; for a point inside a cluster they vary widely
ABOD
Angle-based Outlier Detection
Variance weighted by the corresponding distances
Small ABOD = outlier
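A brute-force sketch of an angle-based score. The exact weighting is not spelled out in these notes, so this assumes one common formulation: the variance over all pairs (a, b) of <pa, pb> / (|pa|^2 |pb|^2), weighted by 1 / (|pa| |pb|). Points are assumed to be distinct, and the names are illustrative.

```python
import math
from itertools import combinations

def abod_scores(points):
    """Distance-weighted variance of angles per point (O(n^3), small data only)."""
    scores = []
    for i, p in enumerate(points):
        others = [q for j, q in enumerate(points) if j != i]
        w_sum = wf_sum = wf2_sum = 0.0
        for a, b in combinations(others, 2):
            pa = [ai - pi for ai, pi in zip(a, p)]
            pb = [bi - pi for bi, pi in zip(b, p)]
            na, nb = math.hypot(*pa), math.hypot(*pb)
            dot = sum(x * y for x, y in zip(pa, pb))
            f = dot / (na ** 2 * nb ** 2)   # angle term scaled by the distances
            w = 1.0 / (na * nb)             # closer pairs get more weight
            w_sum += w
            wf_sum += w * f
            wf2_sum += w * f * f
        mean = wf_sum / w_sum
        scores.append(wf2_sum / w_sum - mean ** 2)  # weighted variance
    return scores

# 3x3 grid of inliers plus one distant point (illustrative data).
grid = [(x, y) for x in range(3) for y in range(3)]
scores = abod_scores(grid + [(6, 6)])
lowest = min(range(len(scores)), key=scores.__getitem__)
```

The distant point sees all other points within a narrow cone and at large distances, so its score is the smallest.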
Tree-Based Approaches: Isolation Forest
Outlierness = how easy it is to separate a point from the rest by random space splitting?
Only detects global outliers
Define the anomaly score of a point x as s(x) ∈ [0, 1]
Usually, s = 0.5 is used as the threshold, i.e. the score of a point whose expected path length equals the average path length
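The notes do not give the formula for s(x). The sketch below assumes the standard Isolation Forest definition s(x) = 2^(-E[h(x)] / c(n)), where E[h(x)] is the mean path length of x over all trees and c(n) is the average path length for n points. The function names are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def average_path_length(n):
    """c(n): average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """s(x) in [0, 1]; shorter paths give scores closer to 1."""
    return 2.0 ** (-mean_path_length / average_path_length(n))

n = 256  # assumed subsample size, for illustration
score_at_average = anomaly_score(average_path_length(n), n)
score_short_path = anomaly_score(2.0, n)
```

A point whose mean path length equals c(n) gets exactly s = 0.5. Points that are isolated after fewer splits score above 0.5.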
Isolation Tree - Training
Isolation Forest - Training
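A hedged sketch of the training step, following the random space splitting idea: each tree picks a random attribute and a random split value until a point is isolated or a height limit is reached. Simplifications assumed here: every tree is trained on the full toy dataset (no subsampling), a node becomes a leaf if the chosen attribute is constant, and leaf sizes are not used to adjust the path length. All names and the data are illustrative.

```python
import random

def build_itree(points, height, max_height, rng):
    """Recursively split the points at random until isolated or max_height is hit."""
    if height >= max_height or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))       # random attribute
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return {"size": len(points)}          # simplification: stop splitting here
    split = rng.uniform(lo, hi)               # random split value within the range
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_itree(left, height + 1, max_height, rng),
            "right": build_itree(right, height + 1, max_height, rng)}

def path_length(x, node, depth=0):
    """Number of splits passed before x ends up in a leaf."""
    if "split" not in node:
        return depth
    child = node["left"] if x[node["dim"]] < node["split"] else node["right"]
    return path_length(x, child, depth + 1)

def mean_path_length(x, forest):
    return sum(path_length(x, tree) for tree in forest) / len(forest)

rng = random.Random(0)
cluster = [(rng.random(), rng.random()) for _ in range(50)]
outlier = (100.0, 100.0)
data = cluster + [outlier]
forest = [build_itree(data, 0, 8, rng) for _ in range(50)]

outlier_depth = mean_path_length(outlier, forest)
inlier_depth = mean_path_length(cluster[0], forest)
```

The distant point is usually cut off by the very first split, so its mean path length is much shorter than that of a point inside the cluster.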