Classifier - Formal setup, examples
predict categorical class label
examples
Bayes classifier
Linear classifier
decision tree
k-NN classifier
Numerical Prediction
Regression (analysis)
Determine the numerical value of an object
models continuous-valued functions
Quality Measures for Classifiers
Classifier - Accuracy and Error
For a data set with classes of different sizes, accuracy is misleading —> need other measures
External Measures
for a single class
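A minimal sketch of such per-class external measures (precision, recall, F1); the class name "spam" and the label lists are made-up illustration data, not from the original notes:

```python
# Hedged sketch: per-class precision, recall and F1 from true vs. predicted labels.
def per_class_measures(y_true, y_pred, c):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall    = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = ["spam", "ham", "spam", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "ham", "ham"]
print(per_class_measures(y_true, y_pred, "spam"))   # (0.5, 0.5, 0.5)
```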
Cross Validation
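The notes only name cross validation; a minimal k-fold sketch, where `train` and `predict` are assumed placeholders for an arbitrary classifier's training and prediction routines:

```python
# Hedged sketch: k-fold cross validation, averaging accuracy over the folds.
def k_fold_cv(data, labels, k, train, predict):
    n = len(data)
    fold_size = n // k
    accuracies = []
    for i in range(k):
        lo = i * fold_size
        hi = (i + 1) * fold_size if i < k - 1 else n
        test_X, test_y = data[lo:hi], labels[lo:hi]          # held-out fold
        train_X = data[:lo] + data[hi:]                       # remaining folds
        train_y = labels[:lo] + labels[hi:]
        model = train(train_X, train_y)
        correct = sum(1 for x, y in zip(test_X, test_y) if predict(model, x) == y)
        accuracies.append(correct / len(test_X))
    return sum(accuracies) / k                                # mean accuracy
```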
Overfitting
The classifier adapts too closely to the training dataset and may therefore fail to accurately predict class labels for test objects unseen during training.
Underfitting
Occurs when the classifier's model is too simple, e.g. trying to linearly separate classes that can only be separated by a quadratic surface
Bayes Classification
Key theorem
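The key theorem referred to here is Bayes' rule; written in the c_j / x notation used below, together with the maximum a posteriori decision rule that follows from it:

```latex
P(c_j \mid x) \;=\; \frac{P(x \mid c_j)\, P(c_j)}{P(x)}
\qquad\Longrightarrow\qquad
\hat c(x) \;=\; \arg\max_{c_j}\, P(x \mid c_j)\, P(c_j)
```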
Naive Bayes Classifier
Independence Assumption
For any given class cj the attribute values xi are distributed independently
Computationally easy for categorical and continuous attributes
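A minimal sketch of a naive Bayes classifier for categorical attributes under the independence assumption; Laplace smoothing and the toy data are assumed illustration choices, not details from the original slides:

```python
# Hedged sketch: naive Bayes for categorical attributes with Laplace smoothing.
import math
from collections import Counter, defaultdict

def train_nb(X, y):
    priors = Counter(y)                          # class frequencies
    counts = defaultdict(Counter)                # (class, attribute index) -> value counts
    vocab = defaultdict(set)                     # attribute index -> distinct values seen
    for xs, c in zip(X, y):
        for i, v in enumerate(xs):
            counts[(c, i)][v] += 1
            vocab[i].add(v)
    return priors, counts, vocab, len(y)

def predict_nb(model, xs):
    priors, counts, vocab, n = model
    best, best_score = None, float("-inf")
    for c, nc in priors.items():
        score = math.log(nc / n)                 # log prior P(c)
        for i, v in enumerate(xs):
            # P(x_i = v | c), estimated independently per attribute (naive assumption)
            score += math.log((counts[(c, i)][v] + 1) / (nc + len(vocab[i])))
        if score > best_score:
            best, best_score = c, score
    return best

X = [("sunny", "hot"), ("rainy", "cool"), ("sunny", "cool")]
y = ["no", "yes", "yes"]
print(predict_nb(train_nb(X, y), ("sunny", "cool")))   # "yes"
```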
Bayesian Classifiers: Discussion
Linear Discriminant Function Classifier
Separate points of two classes by a hyperplane
I.e., classification model is a hyperplane
Points of one class in one half space, points of second class in the other half space
Multi-class extension
Algebraic notions for vector space
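A minimal sketch of how such a hyperplane model classifies a point; the weight vector w and threshold b are assumed to come from some training procedure not shown here:

```python
# Hedged sketch: classify with a learned hyperplane H = {x : w . x = b}.
def linear_classify(w, b, x):
    score = sum(wi * xi for wi, xi in zip(w, x)) - b   # which half space x falls into
    return +1 if score >= 0 else -1

print(linear_classify([1.0, -2.0], 0.5, [3.0, 1.0]))   # w.x - b = 0.5 -> +1
```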
Discussion (Linear Discriminant Function)
support vector machines
with: xi = data points; yi = +1 if xi belongs to the first class, yi = -1 otherwise; w = normal vector of unit length; b = distance of the hyperplane from the origin
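Under these conventions, a common way to write the maximum-margin problem is the following (the exact formulation on the original slides may differ, e.g. maximizing the margin directly under a unit-length constraint on w):

```latex
\min_{w,\,b}\ \tfrac{1}{2}\,\lVert w \rVert^2
\quad\text{s.t.}\quad
y_i\,\bigl(\langle w, x_i\rangle - b\bigr) \;\ge\; 1 \qquad \forall i
```

Soft margin optimization (next) relaxes these constraints with slack variables ξi ≥ 0 and adds a penalty term C·Σi ξi to the objective.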
Soft Margin Optimization
SVM: Discussion
Non-Linearly Separable Data Sets
Linear classifier / SVM can be easily extended to non-linear spaces
—> Extension of the Linear Discriminant Function Classifier
Extension of the Hypotheses Space
Parabola Example
Circular Example
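A minimal sketch of the idea behind the circular example: map each 2-d point into an extended feature space in which a circular boundary becomes a hyperplane; the concrete mapping and radius are assumed illustration choices:

```python
# Hedged sketch: feature map phi for the circular example.
# A circle x1^2 + x2^2 = r^2 in input space becomes the linear
# boundary z3 = r^2 in the extended space (z1, z2, z3).
def phi(x):
    x1, x2 = x
    return (x1, x2, x1 * x1 + x2 * x2)

r2 = 1.0                                      # assumed squared radius of the decision circle
def classify(x):
    return +1 if phi(x)[2] <= r2 else -1      # inside circle -> +1, outside -> -1

print(classify((0.5, 0.5)), classify((2.0, 0.0)))   # +1 -1
```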
Non-linear Classification: Discussion
Kernel Trick
Kernel Function
Implicit Mappings
used in SVMs
The basic idea behind the kernel trick is to map the data into a higher-dimensional space where it is more easily separable. However, rather than actually computing the transformed data, the kernel trick makes use of a kernel function that calculates the inner product between pairs of data points in the higher-dimensional space without actually transforming them.
every kernel K can be seen as a scalar product in some feature space H.
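A small numeric check of this statement for the standard polynomial kernel K(x, y) = (x·y)² in 2-d (an assumed example kernel), whose feature space contains the degree-2 monomials:

```python
# Hedged sketch: K(x, y) = (x . y)^2 equals <phi(x), phi(y)>
# with phi(x) = (x1^2, x2^2, sqrt(2) * x1 * x2).
import math

def kernel(x, y):
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    return (x[0] ** 2, x[1] ** 2, math.sqrt(2) * x[0] * x[1])

x, y = (1.0, 2.0), (3.0, 0.5)
explicit = sum(a * b for a, b in zip(phi(x), phi(y)))
print(kernel(x, y), explicit)   # both are 16.0 -> no explicit mapping needed
```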
Mercer Kernel
Definition: Scalar Product and Hilbert space
Non-numeric Kernels
Discussion Kernels
Decision Tree Classifiers
Approximating discrete-valued target function
Learned function is represented as a tree:
A flow-chart-like tree structure
Internal node denotes a test on an attribute
Branch represents an outcome of the test
Leaf nodes represent class labels or class distribution
Tree can be transformed into decision rules:
Advantages
Decision trees represent explicit knowledge
Decision trees are intuitive to most users
Goal:
Find splits that lead to groups which are as homogeneous as possible.
Usage:
Classify an unknown sample by traversing the tree
Decision tree generation (training phase)
1. Tree construction
At start, all the training examples are at the root
Partition examples recursively based on selected attributes
2. Tree pruning
Identify and remove branches that reflect noise or outliers
Basic algorithm (a greedy algorithm): ID3 (Iterative Dichotomiser 3)
top-down recursive divide-and-conquer manner
Attributes may be categorical or continuous-valued
At the start, all the training examples are assigned to the root node
Recursively partition examples at each node and push them down to the new nodes
Select test attributes and determine split points or split sets for the respective values, based on a heuristic or statistical measure (split strategy, e.g., information gain)
Conditions for stopping partitioning for ID3
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
There are no samples left
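A compact, hedged sketch of the ID3 scheme described above for categorical attributes, with entropy-based information gain as the split criterion; the row format (attribute dict plus class label) is an assumed illustration, not the pseudo-code of the original slides:

```python
# Hedged ID3-style sketch; rows are (attribute_dict, class_label) pairs.
import math
from collections import Counter

def entropy(rows):
    counts = Counter(label for _, label in rows)
    n = len(rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def info_gain(rows, attr):
    n, split = len(rows), {}
    for row in rows:
        split.setdefault(row[0][attr], []).append(row)
    remainder = sum(len(part) / n * entropy(part) for part in split.values())
    return entropy(rows) - remainder

def id3(rows, attributes):
    labels = [label for _, label in rows]
    if len(set(labels)) == 1:                    # all samples in one class -> leaf
        return labels[0]
    if not attributes:                           # no attributes left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(rows, a))
    children = {}
    for value in {row[0][best] for row in rows}:
        subset = [r for r in rows if r[0][best] == value]
        children[value] = id3(subset, [a for a in attributes if a != best])
    return (best, children)                      # internal node: test attribute + branches
```

Classifying a sample then means following the branch matching its value of the test attribute at each internal node until a leaf label is reached.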
Split Strategies: Quality of Splits
Attribute Selection Measures: Information Gain
take attribute with highest information gain
Attribute Selection Measures: Gini Index
Attribute Selection Measures: Misclassification Error
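For a set T with relative class frequencies p1, …, pk (notation assumed here; it may differ slightly from the slides), the three measures and the resulting information gain of a split of T by attribute A into parts T_j are commonly written as:

```latex
\text{entropy}(T) = -\sum_{i=1}^{k} p_i \log_2 p_i
\qquad
\text{gini}(T) = 1 - \sum_{i=1}^{k} p_i^2
\qquad
\text{error}(T) = 1 - \max_i p_i
```

```latex
\text{InformationGain}(T, A) \;=\; \text{entropy}(T) \;-\; \sum_{j} \frac{|T_j|}{|T|}\,\text{entropy}(T_j)
```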
Split Strategies: Types of Splits
Avoid Overfitting in Classification
overfitting due to
too many branches, some may reflect anomalies due to noise
outliers
avoid:
1. Post-pruning = pruning of overspecialized branches
2. Pre-pruning = halt tree construction early
Regularization
The L1-norm and the L2-norm are commonly used for regularizing the model's parameters
Used to fine-tune the model's complexity —> prevents overfitting of the model
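A minimal sketch of how an L1 or L2 penalty is added to a training loss; the squared-error loss, the data and the value of lambda are assumed illustration choices:

```python
# Hedged sketch: regularized objective  loss(w) + lambda * penalty(w).
def l1(w):
    return sum(abs(wi) for wi in w)

def l2(w):
    return sum(wi * wi for wi in w)

def regularized_loss(w, data, lam, penalty=l2):
    # squared error of a linear model, used here only as an example loss
    loss = sum((sum(wi * xi for wi, xi in zip(w, x)) - y) ** 2 for x, y in data)
    return loss + lam * penalty(w)

data = [([1.0, 2.0], 1.0), ([2.0, 0.0], 0.0)]
print(regularized_loss([0.2, 0.3], data, lam=0.1))   # larger lambda -> stronger shrinkage
```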
Decision Tree Classifiers: Summary
Nearest Neighbor Classifiers
good for data in a non-vector representation: graphs, forms, XML-files, etc.
—> Direct usage of (dis-)similarity functions for objects
Eager vs Lazy evaluation
Decision Rules knn
Plain Decision Rules
Nearest Neighbor Rule (k = 1)
Just inherit the class label of the nearest training object: K(xq) = C(x1)
Majority Vote (k >= 1)
Choose majority class, i.e. the class with the most representatives in the decision set
Weighted Decision Rules
Distance-weighted majority vote
Give more emphasis to closer objects within decision set, e.g.:
Class-weighted majority vote
Use inverse frequency of classes in the training set (a-priori probabilities)
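A minimal sketch of the plain and distance-weighted majority vote over the decision set of the k nearest training objects; Euclidean distance and the 1/d² weight are assumed illustration choices:

```python
# Hedged sketch: k-NN decision rules (plain and distance-weighted majority vote).
import math
from collections import defaultdict

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_vote(train, query, k, weighted=False):
    # decision set: the k training objects closest to the query
    neighbors = sorted(train, key=lambda t: euclidean(t[0], query))[:k]
    votes = defaultdict(float)
    for x, label in neighbors:
        d = euclidean(x, query)
        votes[label] += 1.0 / (d * d + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)

train = [((0.0, 0.0), "A"), ((0.2, 0.1), "A"), ((1.0, 1.0), "B"), ((1.1, 0.9), "B")]
print(knn_vote(train, (0.9, 0.8), k=3))                  # "B": 2 of the 3 nearest
print(knn_vote(train, (0.9, 0.8), k=3, weighted=True))   # "B": closer objects count more
```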
NN Classifier: Parameter k
Choosing an appropriate k: Tradeoff between overfitting and generalization:
k too small: High sensitivity against outliers (overfitting)
k too large: Decision set contains many objects from other classes (generalization)
NN Classifier: Variants
NN Classifier: Discussion
Ensemble Classification
Classification: Summary