What is classification? Informal
The systematic assignment of new observations to known categories according to criteria learned from a training set
What is a classifier? Formal
A classifier K for a model M(θ) is a function K_M(θ): D -> Y, where
D: data space
often d-dim space with attributes a1,…,ad
some other metric space
Y = {y1,…,yk} : set of k distinct class labels
O ⊆ D: set of training objects o with known class labels y∈Y
What is classification (formal) and what is meant by M(θ) and what is supervised learning?
Classification
Application of classifier K on objects from D
M(θ)
The ‘type’ of classifier, and θ are the model parameters
Supervised Learning
Finding/learning optimal parameters θ for M(θ) given O (set of training objects)
What is numerical prediction? include a method and example
Determining the numerical value of an object
Method example: regression analysis
Example: Prediction of flight delays i.e. given the wind speed, what is the predicted delay of flight
How is numerical prediction different and similar to classification?
Different
Classification refers to predict categorical class labels
Numerical prediction models continuous-valued function
Similar
Model constructed
Model used to predict unknown value
Regression is a major method of both
What are the quality measures for classifiers?
Accuracy/error (complementary)
Compactness of model (tree size, #rules)
Interpretability of the model
Efficiency
Scalability for large databases
Robustness
How is accuracy and error of a classifier K measured?
Let C(o) denote the correct class label of an object o
Predict the class label for each object o from a data set T⊆O
Determine the fraction of correctly predicted class labels
Note: for a data set with classes of different sizes, accuracy is misleading
What is true positive?
It is true that the object is in the predicted class A
What is false positive?
It is false that the object is in predicted class A
What is true negative?
It is true that the object is in the predicted class !A
What is false negative?
It is false that the object is in the predicted class !A
How is recall measured?
[0.1]
The larger the better
= TP / (TP + FN)
How is precision measured?
[0,1]
= TP / (TP + FP)
How is F_1 measured?
= 2 rec prec / (rec + prec)
= 2TP / (2TP + FN + FP)
What is recall?
The fraction of test objects of class i, which have been correctly identified.
What is precision?
Fraction of test objects assigned to class i, which have been identified correctly
What is the train-and-test evaluation of classifiers?
Decomposition of labeled data set O into 2 partitions
training data is used to train the classifier
test data is use to evaluate the classifier
When is the train-and-test evaluation not applicable?
If the set of objects for which the class label is known is very small
What is the process of m-fold cross validation?
Decompose data set evenly into m subsets of (nearly) equal size
Iteratively use (m-1) partitions for training data and the remaining single partition as test data
Combine the m classification accuracy values to an overall classification accuracy
What is the process of leave-one-out cross validation? What is this method especially good for?
For each of the objects o in the data set O:
use set O / {o} as training set
use the singleton set {o} as the test set
Compute classification accuracy by dividing the number of correct predictions through the database size |O|
Especially good for nearest-neighbour classifiers
What is the resubsitution error of classifier K?
The error that a classifier would make if it were used to predict the class of the training data it was trained on
This error can be calculated by training the classifier on a subset of the training data and then evaluating its performance on the rest of the training data
What is the (true) classification error of classifier K?
error that a classifier makes when it is used to predict the class of new data that it has never seen before.
This error is typically estimated using a separate test set that the classifier has not been trained on.
What is the characterization of overfitting?
The classifier adapts too closely to the training dataset and may therefore fail to accurately predict class labels for test objects unseen during training
What are potential reasons for overfitting?
Bad quality of training data (noise, missing values, wrong values)
Different statistical characteristics of training data and test data
How can one avoid overfitting?
Removal of noisy/erroneous/contradicting training data
Choice of an appropriate size of the training set
not too small, not too large
Choice of appropriate sample
sample should describe all aspects of the domain and not only parts of it
What is underfitting?
Occurs when the classifier’s model is too simple, e.g. trying to separate classes linearly that can only be separated by a quadratic surface
What is the basic idea behind Bayes Classification?
Probability based classification: given a query object q, assign to q the class with the highest probability:
The conditional probabilities P(cj | q) are hard to estimate, so turn the rule by applying Bayes’ theorem to a formular based on P(q | cj).
Estimate the required probability density functions by using distribution models learned from the training data
How can conditional probabilities be expressed in terms of joint probabilities?
P(A|B) = P(A ^ B) / P(B)
What is the key theorem of Bayes’?
When applying Bayes’ rule to a classifier, what are we interested in?
How can one estimate the apriori probabilities P(cj) and the conditional probabilities P( x | cj)?
Apriori Probabilities P(c_j):
Use the observed relative frequency of the individual class labels c_j in the training set, i.e.,
P(cj) = N_c_j / N
Conditional Probabilities P(x | c_j):
Non-parametric methods: Kernel methods - Parzen’s window, Gaussian kernels, etc.
Parametric methods:
Single Gaussian Distribution: Computed using maximum likelihood estimators
Mixture models: e.g. Gaussian Mixture Model computed by EM algorithm
What are the problems with Bayes classifiers? What are their solutions?
Problems
Correlations of attributes are hardly available if training sets are small
Curse of dimensionality may cause problems in high-dimensional data
Solutions
Dimensionality reduction
Assume statistical independence of single attributes -> naive Bayes
What is the independence assumption of the Naive Bayes Classifier?
For any given class cj the attribute values xi are distributed independently, i.e.
What is the decision rule for the Naive Bayes Classifier?
The Maximum A Posteriori (MAP) decision rule. This rule states that for a given data point x, the classifier should assign it to the class c that maximizes the posterior probability P(c | x).
In other words, the classifier should predict the class that is most likely given the observed data.
In the context of a Naive Bayes Classifier, how can a categorical attribute x_i be estimated?
P(xi | cj) can be estimated as the relative frequency of samples having value vi as the ith attribute in class cj in the training set
Easy to compute
In the context of a Naive Bayes Classifier, how can a continuous attribute x_i be estimated?
P(xi, cj) can be estimated through a Gaussian distribution determined by μij,σij
Computationally easy
Naive Bayes Classifier Example
What are the pros and cons of Bayesian Classifiers?
Pros
High classification accuracy for many applications if density function defined properly
Incremental computation: many models can be updated to new training objects by updating densities
Incorporation of expert knowledge about the application in the prior P(C_i)
Cons
Limited applicability: often required conditional probabilities are not available
Lack of efficient computation: in case of a high number of attributes
With Bayesian Classifiers what is the problem with the independence hypothesis and how can one overcome them?
It is seldom satisfied in practice, as attributes are often correlated
Overcome
Bayesian networks: combine Bayesian reasoning with causal relationships between attributes
Decision trees: that reason one 1 attribute at the time, considering most important attributes first
What is the idea behind linear discriminant function classifiers?
Separate points of two classes by a hyperplane
i.e. classification model is a hyperplane
points of one class in one half space, points of second class in the other half space
What does ⟨x,y⟩ mean?
The inner product of two vectors x,y ∈ V
e.g. the scalar product ⟨x,y⟩= x^Ty = sigma^d{i=1}x_iy_i
What does H(w, w_0) denote?
A hyperplane with normal vector w and constant term w_0
x ∈ H <-> ⟨x , w ⟩ + w_0 = 0
The normal vector w may be normalized to w’
What is the classification rule in Linear Discriminant Function Classifiers and what does it mean?
A hyperplane with normal vector w and constant term can be used to classify data points into two classes
if w^T x + w_0 > 0 then x is in class 1 else x is in class 2
In linear discriminant function classifiers how can one estimate w and w_0?
Define an objective/loss function L(.) that assigns a value (i.e error of the training set) to each parameter configuration
Optimal parameters minimize/maximize the objective function
In linear discriminant function classifiers, how does an objective function look like?
Different choices possible
Most intuitive: each misclassified object contributes a constant (loss) value
i.e. 0-1 loss
What is the 0-1 loss objective for a linear classifier?
With linear discriminant function classes, what could we do if we have more than two (k>2) classes?
One-vs-the-rest (one vs all) = k classifiers
One-vs-one (majority vote of classifiers) = k(k-1) / 2 classifiers
Multiclass linear classifiers = k classifiers
What is the idea of a multiclass linear classifier? What is the advantage of this?
Advantage: no ambiguous regions except for points on decision hyperplanes
What are the advantages of using linear discriminant functions?
Simple approach
Closed form solution for parameters
Easily extendable to non-linear spaces
What are the disadvantages of linear discriminant functions?
Sensitive to outliers - depending on loss function - not a stable classifier
Only good results for linearly separatable data
Expensive computation of selected hyperplanes
What is the basic idea behind Support Vector Machines?
Linear separation with the Maximum Margin Hyperplane (MMH)
Distance to points from any of the two sets is maximal i.e. at least ξ
Minimal probability that the separating hyperplane has to be moved due to an insertion
What are support vectors?
Maximum Margin Hyperplane only depends on points p_i whose distance to the hyperplane is exactly ξ.
These pi are called support vectors
What are the formalization requirements needed for SVMs?
Let xi ∈ R^d denote the data points, and yi = +1 if first class, else y_i = -1
A hyperplane in Hesse normal form is represented by a normal vector w ∈ R^d of unit length (i.e. ||w||_2 = 1), and a (signed) distance from the origin b ∈ R
What are the requirements of the maximum margin hyperplane?
What are the pros of SVMs?
generate classifiers with a high classification accuracy
relatively weak tendency to overfitting (generalization theory)
efficient classification of new objects due to often small number of support vectors
compact models
What are the cons of SVMs?
training times may be long (feature space may be very high-dimensional)
expensive implementation
What is the core idea behind non-linear classification?
Map to linearly discriminant functions (phi) by using non-linear transformation into an (extended) space
Separate the data in the new space by linear methods
i.e.
What are the pros of non-linear classification?
Explicit feature transformation yields a richer hypotheses space
Simple extension of existing techniques
Efficient evaluation, if transformed feature space not too high-dimensional
What are the cons of non-linear discussion?
Explicit mapping may run into problems (efficiency, high dimensions)
Meaningful transformation is usually not known a-priori
Complex data distributions may require very high-dimensional features spaces -> high memory consumption, high computational costs
What is the premise and what occurs in the kernel trick?
We often need the scalar product of mapped objects only, K_{phi}(x,y) = <phi(x), phi(y)>
If K_{phi}(x,y) is represented in the original domain, the mapping phi remains implicit only, and the problems of mapping explicitly are avoided
Kernel allows for mappings with infinite dimensions
Why are scalar products useful?
Kernels correspond to scalar products in the corresponding feature space
Scalar products and, thus, kernels are used in various definitions:
What is the definition of a kernel function?
A kernel function K: chi x chi -> R on an input space chi is a symmetric function which maps pairs of objects x,y ∈ chi to real numbers.
What is the definition of a Mercer kernel?
A kernel function K is called Mercer kernel, valid kernel, admissible kernel, or positive semi-definitive kernel, if for all finite subsets X = {x1, … , xn} ⊆ chi, the n x n matrix M_K(X) with M_K(X)_{i,j} is positive semi-definite, i.e. for all c ∈ R^n, it holds
c^T · M_K(X) · c >= 0
What is the definition of a scalar product?
What is a Hilbert space?
A vector spacew H endowed with a scalar product <·,·>: H x H -> R for which the induced norm gives a complete metric space, is called a Hilbert space.
What is the theorem which demonstrates the connection between Hilbert space and kernels?
every kernel K can be seen as a scalar product in some feature space H
What are the advantages of using Kernel functions?
Feature space H can be infinite-dimensional
Not really necessary to know which feature space H we have
Computation of kernel is done in original domain chi
What are examples of Mercer kernels?
Linear
Polynomial
Radial basis function (RBF) / Gaussian Kernel
What are general kernels?
The kernel trick may be applied to non-numerical, structured objects as well.
Kernels K(x,y) are defined in terms of object properties
K(x,y) = <phi(x), phi(y)> remains a scalar product in R^d but phi : chi -> R^d is not needed explicitly
Examples:
Graph Kernels
Fisher Kernels: digest complex objects by pooling features
What are the pros of using Kernel methods?
Kernel methods provide a great method for dealing with non-linearity
Implicit mapping allows for spaces of arbitrary dimensions (even infinite)
Computational effort of training Kernel machines depends on the number of training examples, but not on the feature space dimension
What are the cons of Kernel methods?
Resulting methods may be hard to intuitively explain
Choice of kernel may be difficult
How do decision tree classifiers work?
Approximating discrete-valued target function
Learned function represented as a tree:
flow-chart-like tree structure
Internal node denotes a test on an attribute
Branch represents outcome of the test
Leaf nodes represent class labels or class distribution
Tree can be transformed into decision rules
What are the pros of decision tree classifiers?
Relatively fast learning speed
Fast classification speed
Convertible to simple and easy to understand classification rules
Often comparable classification accuracy with other classification models
Decision trees represent explicit knowledge
Decision trees are intuitive to most users
What are the cons of decision tree classifiers?
Not very stable, small changes of the data can lead to large changes of the tree
What is the goal of decision tree classifier splits?
Each tree node defines an axis-parallel (d-1) dimensional hyperplane, that splits the data space
Find such splits which lead to as homogenous groups as possible
What are the two phases of decision tree generation/ training phase?
Tree construction
At start, all the training examples are at the root
Partition examples recursively based on selected attributes
Tree pruning
Identify and remove branches that reflect noise or outliers
How does one use decision trees?
Traverse the tree and test the attribute values of the sample against the decision tree
Assign the class label of the respective leaf to the query object
What is the algorithm for decision tree construction?
ID3 (Iterative Dichotomiser 3)
Tree is created in a top-down recursive divide-and-conquer manner
Attributes may be categorical or continuous-valued
At the start, all the training examples assigned to root node
Recursively partition examples at each node and push them down to the new nodes
Select test attributes and determine split points or split sets for the respective values based on a heuristic or statistical measure (split stategy e.g. information gain)
Conditions for stopping partitioning
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning
No samples left
How is entropy used for node tree splits?
Minimum amount of bits to encode a message that contains the class label of a random training object
The entropy of a set T of training objects is defined as
for k classes with frequencies p_i.
entropy(T) = 0 is pi = 1 for any class ci
entropy(T) = 1 if pi = 1/k for all classes ci
How is information gain mathematically defined? Example as well
How is the Gini index defined as what do the values mean?
What is the misclassification error for a set T of objects defined as and for an attribute A?
Attribute selection measures comparison graph?
What is the tree split strategy with categorical attributes?
Split criteria based on equality “attribute = a”
Based on subset relationships “attributes ∈ set“
many possible choices (subset)
choose best split according to e.g. gini index
What is the split strategy of numerical attributes?
Split criteria of the form “attribute < a“
many possible choices for split point
approach idea: order test samples w.r.t their attribute value, consider every mean value between two adjacent samples as possible split point
partition the attribute value into discrete set of intervals (binning)
How may a generated tree overfit the training data?
Too many branches, some may reflect anomalies due to noise or outliers
Results has poor accuracy for unseen data
What are the two main approaches to avoid overfitting for decision trees?
Post-pruning = pruning of overspecialized branches
Pre-pruning = halt tree construction early
How can overspecialized branches be pruned?
Remove branches from a ‘fully grown’ tree and get a sequence of progressively pruned trees
Use a set of data different from the training data to decide which is the ‘best pruned tree’
How does pre-pruning trees work?
Halt tree construction early, do not split a threshold if this would result in the goodness measure falling below a threshold.
Choice of an appropriate value for minimum support
= minimum number of data objects a leaf node contains, in general >> 1
Choice of an appropriate value for minimum confidence
= minimum fraction of the majority class in a leaf node, in general << 100%
Leaf nodes can absorb errors or noise in data records
How can pre-pruning be difficult?
Difficult to choose appropriate thresholds
Pre-pruning has less information for the pruning decisions than post-pruning -> can be expected to produce trees with lower classification quality
Tradeoff: tree construction time vs classification quality
What is the mathematical basis/goal behind regularization?
Simply, what is the goal of regularization?
To prevent overfitting of a model
to make the model generalize better
What is involved in the minimal cost complexity pruning algorithm?
Recursively evaluates and pruens the weakest links in the decision tree by looking at the sub trees.
This is done until the cost complexity of the tree is minimized
What are the advantages of the minimal cost complexity pruning algorithm?
Reduces overfitting
Improves generalization
Enchances interpretability
What is the motivation behind using Nearest Neighbor Classifiers?
Assume data is in a non-vector representation: graphs, forms, XML-files, etc.
No simple way to use linear classifiers or decision trees
What solutions do nearest neighbor classifiers use?
Use appropriate kernel function for kernel machines (e.g. kernel SVM)
Embedding of objects into some vector space (representation learning)
Direct usage of (dis-)similiarity functions for objects
What procedure to nearest neighbor classifiers use for training?
Assign query object q to the class c_j of the closes training object x ∈ D
class(q) = class(NN(q))
NN(q) = {x ∈ D | ∀x′ ∈ D: d(q,x) <= d(q, x’)}
What is eager evaluation?
Where the prediction for a new instance is computed immediately when presented to the model
What are examples of algorithms that use eager evaluation?
Decision tree
Bayes classifier
SVM
What occurs in the training and test phase in eager evaluation?
Training phase: learn parameters for chosen model from training data
Test phase:evaluate parametrized model for arriving query objects
What is a typical approach for a lazy evaluation algorithm?
k-nearest neighbor classifiers
Derive labels from individual training objects: instance based learning
What is ‘lazy’ about lazy evaluation algorithms?
No training required
What is highly recommended to do when using lazy evaluation algorithms?
Put data into efficient index structures (e.g. R-Tree)
What are the notions for nearest neigbor classifiers?
Distance function: Defines the (dis-)similarity for pairs of objects
Decision set: The set of k nearest neighbouring objects used in the decision rule
What is the decision rule?
Given the class labels of the objects from the decision set, how can one derive the class labels for the query object
What are different decision rules that nearest neighbour classifiers can use? (names)
Nearest neighbour rule (k=1)
Majority vote (k>=1)
Distance-weighted majority vote
Class-weighted majority vote
What are different decision rules that nearest neighbour classifiers can use? (in depth)
In a k nearest-neighbour classifier what is the influence of k if it is too small or too large?
Too small: high sensitivity against outliers
Too large: decision set contains many objects from other classes
What is the rule of thumb when choosing the parameter k in a nearest neighbour classifier?
Based on theoretical considerations: choose k, such that it grows slowly with n, e.g. k = sqrt(n), or k = log n
1 << k < 10 yields a high classification accuracy in many cases
What are variants of the nearest neighbour classifier and briefly what do they do?
k-NN classifier: consider the k nearest neighbours for the class assignment decision
Weighted k-NN classifier: use weights for the classes of the k nearest neighbours
Mean-based NN classifier: determine mean vector mi for each class cj (in training phase); assign query object to the class cj of the nearest mean vector mi
Generalization: Representative-based NN mean classifier; use more than one representative per class - no longer just instance-based
What are the pros of nearest neighbor classifiers?
Applicability: training data and distance function required only
High classification accuracy in many applications
Easy incremental adaptation to new training objects useful also for prediction
Robust to noisy data by averaging k-nearest neighbours
What are the contras of nearest neighbour classifiers?
Naive implementation is inefficient: requires k-nearest neighbour query processing but with database support techniques to goes from O(n) to O(log n)
Does not produce explicit knowledge about classes, but provides explanations
Curst of dimensionality: distance between neighbours could be dominated by irrelevant attributes, apply dimensionality reduction first.
What is the problem that ensemble classification tackles?
No single classifier performs good on every problem
For some techniques, small changes in the training set lead to very different classifiers
What is the idea behind ensemble classification?
Improve performance by combining different classifiers = ensemble classification
Different possibilities exist
bagging (bootstrap aggregation)
boosting
How does bagging work?
Randomly select m different subsets from the training set
On each subset, independently train a classifier K_i (i = 1,…,m)
Overall decision:
How does boosting work?
Linear combination of several weak lerners (different classifiers)
Given m weak learners Ki and weights alphai for i=1,…,m
Important difference: classifiers trained in sequence
Repeatedly misclassifies points are weighted stronger
Classification models: summary table
Classification: Conclusion
Extensively studied problem
Widely used data mining techniques with a lot of extensions
Scalability important issue
Many research directions
Results can be improved by ensemble classification
Last changeda year ago