SVM
Main idea
binary classification algorithm (discriminative)
can be extended to problems with a continuous outcome (regression) or to clustering problems
useful for multi-class classification
supervised
useful if the data is not linearly separable
Steps
Use optimization to find the solution (hyperplane)
maximize the margin to improve generalization (margin)
use the kernel trick to handle non-linear feature spaces efficiently (kernel)
why important?
robust to a very large number of variables and small samples
can learn both simple and highly complex classification models
employ sophisticated mathematical principles to avoid overfitting
steps
hyperplane building -> split the data into classes
maximize the margin -> find the biggest gap between the classes; a larger gap means better generalization (max margin -> min ‖w‖)
support vectors -> the points lying on or closest to the margin are the support vectors
kernel -> if a linear split is not possible -> transform the data into a higher-dimensional space (e.g. 3D) where a linear split becomes possible (see the sketch below)
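A minimal sketch of these steps in code, assuming scikit-learn and NumPy are available (the toy data and parameter values are made up for illustration):

```python
# Sketch only: fit a maximum-margin classifier on a made-up 2D toy problem.
import numpy as np
from sklearn.svm import SVC

# Two small, well-separated clusters as a toy binary classification problem
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],    # class 0
              [6.0, 6.5], [7.0, 6.0], [6.5, 7.5]])   # class 1
y = np.array([0, 0, 0, 1, 1, 1])

# 'linear' kernel: the separating hyperplane lives in the original input space;
# switching to kernel='rbf' would apply the kernel trick for non-linear data.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.support_vectors_)                    # the points that define the margin
print(clf.predict([[2.0, 2.0], [6.0, 6.0]]))   # -> [0 1]
```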
Basic operations on vectors
Multiplication by a scalar -> stretches the vector in the same or opposite direction, depending on whether the scalar is positive or negative
addition
subtraction
L2 norm -> the typical way to measure the length of a vector (alternative: the L1 / Manhattan norm)
Dot product: a · b = a1·b1 + a2·b2 + … + an·bn
Addition: a + b = (a1 + b1, …, an + bn)
Subtraction: a − b = (a1 − b1, …, an − bn)
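A short NumPy sketch of these operations (the example vectors are made up):

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

print(2 * a)                 # scalar multiplication: stretches a -> [6. 8.]
print(-1 * a)                # negative scalar: flips the direction -> [-3. -4.]
print(a + b)                 # addition -> [4. 6.]
print(a - b)                 # subtraction -> [2. 2.]
print(np.linalg.norm(a))     # L2 norm (length) -> 5.0
print(np.linalg.norm(a, 1))  # L1 / Manhattan norm -> 7.0
print(np.dot(a, b))          # dot product -> 11.0
```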
Hyperplane
linear decision surface that splits the space into two parts
binary classifier
a hyperplane in ℝ^n is an (n − 1)-dimensional subspace
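In the usual notation (the symbols w, b, x are assumed here; they are not introduced in the notes above), the hyperplane and the resulting binary decision rule are:

```latex
w \cdot x + b = 0 , \qquad f(x) = \operatorname{sign}(w \cdot x + b)
```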
Convex functions
convex = the function lies below the straight line connecting any two points on its graph
any local minimum is a global minimum
what will minimize ‖w‖?
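Formally (standard definition, added here for reference): a function f is convex if for all x, y and all λ in [0, 1]

```latex
f\bigl(\lambda x + (1-\lambda)\, y\bigr) \;\le\; \lambda f(x) + (1-\lambda)\, f(y)
```

The SVM objective ½‖w‖² is convex in w, so any minimum the solver finds is the global one.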
Quadratic Programming (QP)
special optimization problem -> the function to optimize is quadratic, subject to linear constraints
convex objective function
can be solved by greedy algorithms, since convexity guarantees that local improvements lead to the global optimum
finding the optimal margin is a convex QP problem (see the formulation below)
What is a greedy algorithm?
"A greedy algorithm always makes the choice that looks best at the moment. That is, it makes a locally optimal choice in the hope that this choice will lead to a globally optimal solution."
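As a concrete instance (standard textbook formulation, assuming training points x_i with labels y_i in {−1, +1}), the hard-margin maximum-margin problem is the convex QP:

```latex
\min_{w,\,b} \ \tfrac{1}{2}\lVert w \rVert^2
\quad \text{s.t.} \quad y_i \,(w \cdot x_i + b) \ge 1, \qquad i = 1, \dots, n
```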
Soft margins
modified max margin that allows for mislabeled data points
choose hyperplane that splits data points as cleanly as possible
slack variables measure the degree of misclassification
objective function is increased by a term that penalizes non-zero slack variables ξi (see the formulation below)
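In the standard soft-margin formulation (with slack variables ξ_i and penalty parameter C, notation assumed here), the optimization problem becomes:

```latex
\min_{w,\,b,\,\xi} \ \tfrac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad y_i \,(w \cdot x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0
```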
SVM for non-linear classification
using the kernel trick -> every dot product is replaced by a non-linear kernel function
this transforms the non-linearly separable data into a space where it becomes linearly separable
then: maximum-margin hyperplanes are constructed to optimally separate the classes based on the training dataset (see below)
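Written out (standard form, with an implicit feature map φ and dual coefficients α_i assumed as usual): the kernel stands in for a dot product in the transformed space, and the decision function only needs kernel evaluations:

```latex
K(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j), \qquad
f(x) = \operatorname{sign}\Bigl(\sum_i \alpha_i\, y_i\, K(x_i, x) + b\Bigr)
```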
Typical kernels
once a kernel is selected, only the penalty parameter C and the kernel's parameters have to be chosen (see the sketch below)
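Typical kernels include the linear, polynomial, RBF (Gaussian) and sigmoid kernels (the concrete list from the original notes is not reproduced here). A sketch of selecting C and the RBF parameter gamma by cross-validation, assuming scikit-learn (grid values are made up):

```python
# Sketch: grid search over C and gamma for an RBF-kernel SVM (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

param_grid = {
    "C": [0.1, 1, 10, 100],        # penalty parameter (soft-margin trade-off)
    "gamma": [0.001, 0.01, 0.1],   # RBF kernel width parameter
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)   # chosen C and gamma
print(search.best_score_)    # cross-validated accuracy
```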
pros and cons
pros
Grounded on sound statistical learning theory
Does not suffer from getting stuck in local minima
("Simple") geometric interpretation
Less prone to overfitting
cons
Some parameters and the kernel have to be chosen
Model building involves time-demanding calculations
Extensive memory requirements for solving the QP