Classsifier - Formal setup, examples
predict categorical class label
examples
Bayes classifier
Linear classifier
decision tree
k-NN classifier
Numerical Prediction
Regression (analysis)
Determine the numerical value of an object
models continuous-valued functions
Quality Measures for Classifiers
Classifier - Accuracy and Error
For a data set with classes of di↵erent size accuracy is misleading.—> need other measurements
External Measures
for a asingle class
Cross Validation
Overfitting
The classifier adapts too closely to the training dataset and may therefore fail to accurately predict class labels for test objects unseen during training.
Underfitting
Occurs when the classifiers model is too simple, e.g. trying to separate classes linearly that can only be separated by a quadratic surface
Bayes Classification
Key theorem
Naive Bayes Classifier
Independence Assumption
For any given class cj the attribute values xi are distributed independently
Computationally easy for categorical and contonous attributes
Bayesian Classifiers: Discussion
Linear Discriminant Function Classifier
Separate points of two classes by a hyperplane
I.e., classification model is a hyperplane
Points of one class in one half space, points of second class in the other half space
multi class extention
Algebraic notions for vector space
Discussion (Linear Discriminant Function)
support vector machines
with: xi = datapoints, +1: yi = +1, if first class, else yi = 1., normal vector w of unit length , b distance from origin
Soft Margin Optimization
SVM: Discussion
Non-Linearly Separable Data Sets
Linear classifier//SVM can be easily extended to non-linear space
—> Extension of Linear Discriminant Function Classifierc
Extension of the Hypotheses Space
Parabola Example
Circular Example
-
Non-linear Classification: Discussion
Kernel Trick
Kernel Function
Implicit Mappings
used in SVMs
The basic idea behind the kernel trick is to map the data into a higher-dimensional space where it is more easily separable. However, rather than actually computing the transformed data, the kernel trick makes use of a kernel function that calculates the inner product between pairs of data points in the higher-dimensional space without actually transforming them.
every kernel K can be seen as a scalar product in some feature space H.
Mercer Kernel
Definition: Scalar Product and Hilbert space
Non-numeric Kernels
Discussion Kernels
Decision Tree Classifiers
Approximating discrete-valued target function
Learned function is represented as a tree:
A flow-chart-like tree structure
Internal node denotes a test on an attribute
Branch represents an outcome of the test
Leaf nodes represent class labels or class
distribution
Tree can be transformed into decision rules:
Advantages
Decision trees represent explicit knowledge
Decision trees are intuitive to most users#
Goal:
Find splits which lead to as homogeneous groups as possible.
Usage:
Classifying an unknown sample by Traversing the tree
Decision tree generation (training phase)
1. Tree construction
At start, all the training examples are at the root
Partition examples recursively based on selected attributes
2. Tree pruning
Identify and remove branches that reflect noise or outliers
Basic algorithm (a greedy algorithm): ID3 (Iterative Dichotomiser 3)
top-down recursive divide-and-conquer manner
Attributes may be categorical or continuous-valued
At the start, all the training examples are assigned to the root node
Recursively partition examples at each node and push them down to the new nodes
Select test attributes and determine split points or split sets for the respective values
based on a heuristic or statistical measure (split strategy, e.g., information gain)
Conditions for stopping partitioning for ID3
All samples for a given node belong to the same class I
There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaF
There are no samples left
Split Strategies: Quality of Splits
Attribute Selection Measures: Information Gain
take attribute with highest information gain
Attribute Selection Measures: Gini Index
Attribute Selection Measures: Misclassification Error
Split Strategies: Types of Splits
Avoid Overfitting in Classification
overfitting due to
too many branches, some may reflect anomalies due to noise
outliers
avoid:
1. Post-pruning = pruning of overspecialized branches
2. Pre-pruning = halt tree construction early
Regularization
The L1-norm and L2-norm, respectively, are commonly used for regularizing the model’s parameter
used to fine-tune the model’s complexity —> Prevents overfitting of a model
Decision Tree Classifiers: Summary
Nearest Neighbor Classifiers
good for data in a non-vector representation: graphs, forms, XML-files, etc.
—> Direct usage of (dis-)similarity functions for objects
Eager vs Lazy evaluation
Decision Rules knn
Plain Decision Rules
Nearest Neighbor Rule (k = 1)
Just inherit the class label of the nearest training object: K(xq) = C(x1)
Majority Vote (k >= 1)
Choose majority class, i.e. the class with the most representatives in the decision set
Weighted Decision Rules
Distance-weighted majority vote
Give more emphasis to closer objects within decision set, e.g.:
Class-weighted majority vote
Use inverse frequency of classes in the training set (a-priori probabilities)
NN Classifier: Parameter k
Choosing an appropriate k: Tradeoff between overfitting and generalization:
k too small: High sensitivity against outliers (overfitting)
k too large: Decision set contains many objects from other classes (generalization)
NN Classifier: Variants
NN Classifier: Discussion
Ensemble Classification
Classification: Summary
Zuletzt geändertvor 2 Jahren