Decision Trees
non-parametric
supervised
used for classification and regression
CART
variant of decision tree algorithm
predictive algorithm in ML
classification and regression
learns from labelled data to predict unseen data
Goal: create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features
Nodes are split into subnodes by searching for the split with the best homogeneity
the root node contains the full training set and is split into two by considering the best attribute and threshold value
splitting criterion: evaluate all possible splits and select the one that best reduces the impurity of the resulting subsets
splitting stops when the impurity falls below a certain threshold
prevent overfitting by pruning -> removes nodes that contribute little to the model's accuracy (see the sketch below)
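A minimal sketch of these ideas with scikit-learn's CART-based DecisionTreeClassifier; the dataset and all parameter values are illustrative assumptions, not part of the notes:

# minimal sketch: impurity-based splitting and pruning with a CART-style tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(
    criterion="gini",            # impurity measure used to evaluate candidate splits
    min_impurity_decrease=0.01,  # stop splitting when a split reduces impurity too little
    ccp_alpha=0.005,             # cost-complexity pruning: drop nodes that add little accuracy
    random_state=0,
)
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))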
Which model can handle classification and regression?
How does the CART algorithm work?
the best split point for each input feature is obtained
based on the best split points from step 1, the overall best split is identified
split the data on the chosen input at its best split point
continue splitting until a stopping rule is satisfied or no further split is available (see the sketch below)
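A compact sketch of this greedy split search in plain Python; the Gini helper, toy data, and names are my own assumptions for illustration:

# greedy CART-style split search: try every feature and threshold,
# keep the split with the lowest weighted Gini impurity
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels):
    best = None  # (feature_index, threshold, weighted_gini)
    for f in range(len(rows[0])):                    # step 1: best split point per input
        for t in sorted(set(r[f] for r in rows)):
            left = [l for r, l in zip(rows, labels) if r[f] <= t]
            right = [l for r, l in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[2]:      # step 2: overall best split so far
                best = (f, t, score)
    return best

# steps 3/4: split on the returned feature/threshold and recurse until a stopping rule fires
print(best_split([[2.0, 1.0], [3.0, 1.5], [10.0, 1.2], [11.0, 0.8]], ["a", "a", "b", "b"]))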
What are the pros and cons of CART?
Pros
simple, interpretable results
non-parametric and non-linear
implicitly performs feature selection
robust to outliers
requires minimal supervision and produces an easy-to-understand model
Cons
prone to overfitting
greedy algorithm -> may not find the globally optimal tree
Trees might be biased if a specific class dominates the dataset
ID3 Algorithm
decision tree from a fixed set of examples
top down, greedy
classify future samples
Explain the 5 steps
calculate the Information Gain of each feature
split dataset S into subsets using the feature for which the Information Gain is maximal
make a decision tree node using the feature with maximum Information Gain
if all rows belong to the same class, make the current node a leaf node labelled with that class
repeat with the remaining features until no features are left or the tree has only leaf nodes (see the sketch below)
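A rough ID3 sketch in Python following these steps; the dataset format, attribute names, and helpers are assumptions for illustration:

# rough ID3 sketch: rows are dicts of nominal attributes, labels are class values
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    gain = entropy(labels)
    for value in set(r[attr] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr] == value]
        gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain

def id3(rows, labels, attrs):
    if len(set(labels)) == 1:                 # all rows share one class -> leaf node
        return labels[0]
    if not attrs:                             # no features left -> majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))   # max Information Gain
    node = {best: {}}
    for value in set(r[best] for r in rows):  # split S into subsets and recurse
        sub_rows = [r for r in rows if r[best] == value]
        sub_labels = [l for r, l in zip(rows, labels) if r[best] == value]
        node[best][value] = id3(sub_rows, sub_labels, [a for a in attrs if a != best])
    return node

rows = [{"outlook": "sunny", "wind": "weak"}, {"outlook": "rain", "wind": "strong"},
        {"outlook": "sunny", "wind": "strong"}, {"outlook": "rain", "wind": "weak"}]
print(id3(rows, ["no", "yes", "no", "yes"], ["outlook", "wind"]))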
What are the 2 criteria of ID3?
Entropy (overall level of uncertainty in the dataset)
Information Gain (which feature provides the most useful information for classification)
ID3
Entropy and Information Gain
Calculations
Hints:
Entropy: H(S) = - Σ p_i · log2(p_i), where p_i is the proportion of class i in S
Information Gain: Gain(S, A) = H(S) - Σ_v (|S_v| / |S|) · H(S_v), where S_v is the subset of S with value v for attribute A
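A small worked example; the 14-sample toy dataset and the attribute Wind are assumptions for illustration:

Entropy: with 9 positive and 5 negative examples, H(S) = -(9/14)·log2(9/14) - (5/14)·log2(5/14) ≈ 0.940
Information Gain: for Wind with values weak (6+, 2-) and strong (3+, 3-), Gain(S, Wind) = 0.940 - (8/14)·0.811 - (6/14)·1.000 ≈ 0.048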
advantages
simple and inexpensive
fast at classifying unknown records
easy to interpret for small-sized trees
robust to noise
handles irrelevant attributes easily
disadvantages
the space of possible trees is exponentially large -> greedy search is often unable to find the best tree
produces a small tree, but not necessarily the smallest possible tree
can only handle nominal attributes
each decision boundary involves only a single attribute
C4.5
based on ID3
gain ratio for attribute selection
permits numeric attributes
new: sensible handling of missing values
pruning to deal with noise
one of the best known and most widely used learning algorithms
What is the problem with the information gain criterion?
Attributes with more possible values will often have the highest information gain
=> attributes get selected because they have many values, not because of a real relation to the class
Gain Ratio
2 Criteria
Information Gain (how much useful information an attribute provides for classification)
Split Info (measures the potential information generated by splitting data based on an attribute)
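For reference, the standard formulas behind these two criteria (notation added here, not from the original notes):

SplitInfo(S, A) = - Σ_v (|S_v| / |S|) · log2(|S_v| / |S|)
GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A)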
C4.5
Steps
2 Steps
Calculate Split Information
Compute Gain Ratio
Information Gain
determines the best attribute for splitting the data based on how well it separates the classes
Split Info
helps penalize splits that divide the data too finely by considering the information content of the split
Gain Ratio combines both to select the attribute that splits the data well and avoids overfitting (see the sketch below)
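A small sketch of both steps in Python; the toy labels, the ID-like attribute, and all helper names are assumptions for illustration:

# sketch: split information and gain ratio for C4.5-style attribute selection
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_info(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def info_gain(values, labels):
    gain = entropy(labels)
    for v in set(values):
        sub = [l for x, l in zip(values, labels) if x == v]
        gain -= (len(sub) / len(labels)) * entropy(sub)
    return gain

def gain_ratio(values, labels):
    si = split_info(values)                                     # step 1: calculate split information
    return info_gain(values, labels) / si if si > 0 else 0.0    # step 2: compute gain ratio

labels = ["yes", "yes", "no", "no"]
row_id = ["1", "2", "3", "4"]                 # unique per row: maximal information gain, useless split
wind = ["weak", "weak", "strong", "strong"]   # genuinely informative attribute
print(gain_ratio(row_id, labels))             # 0.5: gain of 1.0 divided by split info of 2.0
print(gain_ratio(wind, labels))               # 1.0: same gain, small split info -> preferred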
What problem of ID3 is fixed by C4.5?
Classification Problem
ID3 uses Information Gain -> problem: attributes with more values are rated higher
=> C4.5 normalizes the information gain of each candidate attribute, taking into account the number and sizes of the branches
C4.5 uses the gain ratio -> a better measure of how useful an attribute is for the classification problem
Which model can handle missing values within the calculation?