Numeric Attributes (before: only nominal attributes)
standard method: binary splits
NEW: unlike nominal attributes, numeric attributes have many possible split points
basic approach:
place split points halfway between adjacent values
extended approach:
evaluate every possible split point of the attribute
choose the best split point
the best split point creates 2 subsets with the maximal purity measure (e.g., information gain)
=> computationally demanding (see the sketch below)
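To make the split-point search concrete, here is a minimal sketch in plain Python (the helper names entropy and best_binary_split are mine, not from the lecture): candidate thresholds are placed halfway between adjacent sorted values, the information gain of each binary split is computed, and the best one is kept.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_binary_split(values, labels):
    """Try a split point halfway between each pair of adjacent distinct
    values and return the threshold with the highest information gain."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    base = entropy(labels)
    best_gain, best_threshold = -1.0, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                              # no candidate between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v <= threshold]
        right = [l for v, l in pairs if v > threshold]
        gain = base - (len(left) / len(pairs)) * entropy(left) \
                    - (len(right) / len(pairs)) * entropy(right)
        if gain > best_gain:                      # keep the split with maximal gain
            best_gain, best_threshold = gain, threshold
    return best_threshold, best_gain
```

For example, best_binary_split([65, 70, 75, 80], ["yes", "yes", "no", "no"]) returns (72.5, 1.0); the quadratic re-scan of the data is what makes the exhaustive search computationally demanding.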
Binary vs. Multi-way Splits
Splitting on a nominal attribute exhausts all information in that attribute
a nominal attribute is tested at most once on any path in the tree
C4.5 uses binary splits for numeric attributes; this is useful, but not in every case
in some cases a single split will not increase the information
What kind of split does C4.5 use?
Missing Values
ID3 cannot handle cases with missing values / NA
C4.5 allows missing values in the form of "?"
Information gain works as before - unknown values are not included in the calculations (sketch below)
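A minimal sketch of that idea, reusing entropy() from the sketch above (gain_with_missing and the split argument are hypothetical names): cases marked "?" are excluded from the gain calculation, and the result is scaled by the fraction of cases with a known value, as C4.5 does.

```python
def gain_with_missing(values, labels, split):
    """Information gain of a split, computed only on cases whose value is
    known; the result is scaled by the fraction of known cases."""
    known = [(v, l) for v, l in zip(values, labels) if v != "?"]
    known_labels = [l for _, l in known]
    frac_known = len(known) / len(values)
    branches = {}
    for v, l in known:
        branches.setdefault(split(v), []).append(l)   # group labels by branch
    gain = entropy(known_labels) - sum(
        (len(ls) / len(known)) * entropy(ls) for ls in branches.values())
    return frac_known * gain    # attributes with many unknowns are penalised
```

Here split is any function mapping a value to a branch, e.g. lambda v: v <= 2.5 for a binary numeric split.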
Pruning
When the decision tree is built, many of the branches will reflect anomalies in training data due to noise or outliers
Goal: prevent overfitting
2 Strategies
Prepruning - stop growing branches based on unreliable information (see the sketch below)
stop creating a subtree when
number of samples is below a threshold
information gain is below a threshold
depth of tree is beyond a threshold
or based on a statistical significance test
Postpruning - grow the full tree -> discard unreliable parts
2 operations
subtree replacement
subtree raising
What is the preferred Pruning method?
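As a rough illustration with scikit-learn (not C4.5 itself): the constructor arguments below correspond to the prepruning thresholds listed above, and ccp_alpha enables cost-complexity postpruning, which shares the goal of discarding unreliable parts of a fully grown tree but is not the same as subtree replacement/raising.

```python
from sklearn.tree import DecisionTreeClassifier

# Prepruning: stop growing a branch as soon as a threshold is violated.
prepruned = DecisionTreeClassifier(
    criterion="entropy",         # split quality measured by information gain
    min_samples_split=20,        # number of samples below threshold -> stop
    min_impurity_decrease=0.01,  # information gain below threshold -> stop
    max_depth=8,                 # depth of tree beyond threshold -> stop
)

# Postpruning: grow the full tree, then prune it back (cost-complexity pruning).
postpruned = DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.005)
```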
C5.0
C5.0: The Successor
Variable misclassification costs (see the sketch below)
Case weight attribute that quantifies the importance of each observation (case)
Winnowing (feature selection) function integrated for high-dimensional data
Allows for several interesting data types such as dates, times, timestamps
Compared with C4.5, C5.0's trees are noticeably smaller and C5.0 is faster (by factors of 3, 5, and 15 on different benchmark datasets).
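C5.0 itself is distributed as the See5/C5.0 program and the R package C50; as a loose scikit-learn analogue (an assumption, not C5.0's interface), class_weight can mimic variable misclassification costs and sample_weight plays the role of a case weight attribute.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data; the weights below are illustrative values, not from the lecture.
X = np.array([[0.0], [0.2], [0.4], [0.6], [0.8], [1.0]])
y = np.array([0, 0, 0, 1, 1, 1])
case_weights = np.array([1.0, 1.0, 1.0, 3.0, 3.0, 3.0])      # "case weight attribute"

clf = DecisionTreeClassifier(class_weight={0: 1.0, 1: 5.0})   # misclassifying class 1 costs more
clf.fit(X, y, sample_weight=case_weights)
```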
Boosting
technique for generating and combining multiple classifiers to improve predictive accuracy (example below)
but boosting can reduce classification accuracy on noisy data
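A minimal boosting sketch with scikit-learn's AdaBoostClassifier (used here as a stand-in; C5.0 implements its own boosting variant). The default base learner is a depth-1 decision tree, and each round re-weights the training cases the previous classifiers got wrong before the weighted votes are combined.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic data just to make the sketch runnable.
X, y = make_classification(n_samples=500, random_state=0)

boosted = AdaBoostClassifier(n_estimators=50, random_state=0)  # 50 boosting rounds
boosted.fit(X, y)
print(boosted.score(X, y))
```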
Ensemble Learning
Problem: classifiers with low bias tend to have high variance
approach: use several classifiers
selection: each classifier is a local expert in some local neighborhood of the feature space
fusion: all classifiers are trained over the entire feature space, and then combined to obtain a composite classifier with lower variance and lower error
combine several smaller models into one composite model
Steps of Ensemble Methods
Data sampling and selection
completely random
following a strategy
training of the component classifiers
mechanism to combine the classifiers
discrete predictions: simple or weighted majority voting
continuous predictions: mean rule, weighted average, min, max, median (see the sketch below)
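A minimal sketch of the combination step using scikit-learn's VotingClassifier: voting="hard" is simple majority voting, voting="soft" averages predicted probabilities, and the weights parameter turns either into a weighted vote; for continuous predictions, VotingRegressor implements the mean rule.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Three component classifiers trained over the entire feature space (fusion).
ensemble = VotingClassifier(
    estimators=[("tree", DecisionTreeClassifier()),
                ("lr", LogisticRegression(max_iter=1000)),
                ("nb", GaussianNB())],
    voting="hard",          # simple majority voting; "soft" averages probabilities
    # weights=[2, 1, 1],    # optional: weighted majority voting
)
ensemble.fit(X, y)
```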
Random Forests
each tree depends on a collection of random variables
each tree is fit to a bootstrap sample from the original data (bagging)
the best split is found over a randomly selected subset of the predictor variables at each node
combination
discrete: unweighted voting
continuous: unweighted avg
parameters to choose for a random forest
number of randomly selected predictors per split
discrete (classification): square root of the total number of predictors
continuous (regression): number of predictors / 3
number of trees in the forest
choose rather large; the error rate usually converges with an increasing number of trees
tree size -> smallest node size for splitting or maximum number of terminal nodes (sketch below)
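A minimal sketch of these parameter choices with scikit-learn's RandomForestClassifier; for regression, RandomForestRegressor with max_features set to roughly one third of the predictors would be the counterpart (the sqrt and p/3 rules above are conventions, not hard requirements).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=500,       # rather large; error rate converges as trees are added
    max_features="sqrt",    # sqrt(number of predictors) tried at each split (classification)
    min_samples_split=5,    # smallest node size for splitting
    bootstrap=True,         # each tree is fit to a bootstrap sample (bagging)
    random_state=0,
)
forest.fit(X, y)            # unweighted voting over all trees at prediction time
```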
Decision Trees: Evaluation
Pros
Decision trees provide understandable decision rules
Fast classification
Continuous & discrete variables can be processed
Attributes providing the most classification power can be identified
Easy extension: random forests provide better results (ensemble learning with several trees)
Cons
Not well suited for estimation (regression) tasks
Error-prone with many classes and small datasets
High computational effort for model building