SVM
Main idea
binary classification algorithm (discriminative)
can be extended to regression problems with a continuous outcome (SVR) or to clustering problems
useful for multi-class classification (e.g., via one-vs-rest)
supervised
useful even when the data is not linearly separable
Steps
Use optimization to find the solution (hyperplane)
seek the maximum margin to improve generalization (margin)
use the kernel trick to handle non-linear spaces efficiently (kernel)
why important?
robust to a very large number of variables and small samples
can learn both simple and highly complex classification models
employ sophisticated mathematical principles to avoid overfitting
steps
hyperplane building -> split the data into classes
maximize the margin -> find the biggest gap between the classes; a larger gap means better generalization (max margin -> min ‖𝑤‖)
support vectors -> the points lying on the margin are the support vectors
kernel -> if a linear split is not possible, transform the data into a higher-dimensional space (e.g., 3D) where a linear split becomes possible again (see the sketch below)
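A minimal sketch of these steps with scikit-learn (assuming scikit-learn is available; the toy data and the C value are invented for illustration):

import numpy as np
from sklearn.svm import SVC

# toy 2D data, two linearly separable classes (invented for illustration)
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [6.0, 5.0], [7.0, 8.0], [8.0, 6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)  # linear kernel: a plain hyperplane
clf.fit(X, y)                      # the underlying QP is solved internally

w = clf.coef_[0]                   # normal vector of the hyperplane
b = clf.intercept_[0]              # offset
print("margin width:", 2 / np.linalg.norm(w))    # max margin <-> min ‖w‖
print("support vectors:", clf.support_vectors_)  # the points defining the margin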
Basic operations on vectors
Multiplication by scalar -> stretches the vector in the same or the opposite direction, depending on whether the scalar is positive or negative
addition
subtraction
L2 norm -> the typical way to measure the length of a vector (an alternative: the L1 / Manhattan norm)
Dot product: a · b = a₁b₁ + … + aₙbₙ = ‖a‖ ‖b‖ cos θ
Addition: a + b = (a₁ + b₁, …, aₙ + bₙ)
Subtraction: a − b = (a₁ − b₁, …, aₙ − bₙ) (all demonstrated in the sketch below)
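These operations in a few lines of NumPy (a sketch; the example vectors are arbitrary):

import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

print(2 * a)    # scalar multiplication: stretches a -> [6. 8.]
print(-1 * a)   # negative scalar: flips the direction -> [-3. -4.]
print(a + b)    # addition -> [4. 6.]
print(a - b)    # subtraction -> [2. 2.]
print(np.linalg.norm(a))         # L2 norm: sqrt(3^2 + 4^2) = 5.0
print(np.linalg.norm(a, ord=1))  # L1 / Manhattan norm: |3| + |4| = 7.0
print(np.dot(a, b))              # dot product: 3*1 + 4*2 = 11.0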
Hyperplane
linear decision surface that splits the space into two parts
binary classifier
a hyperplane in ℝ𝑛 is an (𝑛 − 1)-dimensional subspace (classification sketch below)
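As a sketch of how a hyperplane acts as a binary classifier, the sign of 𝑤 · 𝑥 + 𝑏 decides the class (the values of w and b below are arbitrary illustration values):

import numpy as np

w = np.array([1.0, -1.0])  # normal vector of the hyperplane (arbitrary example)
b = 0.5                    # offset (arbitrary example)

def classify(x):
    # which side of the hyperplane w.x + b = 0 the point falls on
    return 1 if np.dot(w, x) + b >= 0 else -1

print(classify(np.array([2.0, 1.0])))  # w.x + b = 1.5  -> class +1
print(classify(np.array([0.0, 2.0])))  # w.x + b = -1.5 -> class -1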
Convex functions
convex = the function lies below the straight line (chord) connecting any two points on its graph (formal definition below)
any local minimum is a global minimum
what will minimize ‖𝑤‖?
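Formally (standard definition, not from these notes): 𝑓 is convex iff 𝑓(𝜆𝑥 + (1 − 𝜆)𝑦) ≤ 𝜆𝑓(𝑥) + (1 − 𝜆)𝑓(𝑦) for all 𝑥, 𝑦 and 𝜆 ∈ [0, 1], i.e. the graph lies below the chord. Since ½‖𝑤‖² is convex, minimizing it is exactly such a problem, which leads to Quadratic Programming.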
Quadratic Programming (QP)
special optimization problem -> the function to optimize is quadratic, subject to linear constraints
convex objective functions
solved by greedy algorithms
finding the optimal margin -> a convex QP (standard primal form below)
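As a worked form (the standard hard-margin primal, with training points 𝑥ᵢ and labels 𝑦ᵢ ∈ {−1, +1}):

min over 𝑤, 𝑏:  ½‖𝑤‖²  subject to  𝑦ᵢ(𝑤 · 𝑥ᵢ + 𝑏) ≥ 1 for all 𝑖

The objective is quadratic in 𝑤 and the constraints are linear, so this is exactly a convex QP.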
What is a greedy algorithm?
"A greedy algorithm always makes the choice that looks best at the moment. That is, it makes a locally optimal choice in the hope that this choice will lead to a globally optimal solution."
Soft margins
modified max margin that allows for mislabeled data points
choose the hyperplane that splits the data points as cleanly as possible
slack variables 𝜉ᵢ measure the degree of misclassification
the objective function is increased by a term that penalizes non-zero 𝜉 (standard form below)
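The usual textbook form of the soft-margin objective (C is the penalty parameter):

min over 𝑤, 𝑏, 𝜉:  ½‖𝑤‖² + C Σᵢ 𝜉ᵢ  subject to  𝑦ᵢ(𝑤 · 𝑥ᵢ + 𝑏) ≥ 1 − 𝜉ᵢ and 𝜉ᵢ ≥ 0

A large C penalizes slack heavily (closer to a hard margin); a small C tolerates more misclassification.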
SVM for non-linear classification
using the kernel trick -> every dot product is replaced by a non-linear kernel function
transforms the non-linearly separable problem into a space where it is linearly separable
then: max margin hyperplanes are constructed to optimally separate different classes from each other based on the training dataset
Typical kernels: linear, polynomial, radial basis function (RBF / Gaussian), sigmoid
once a kernel is selected, only the penalty parameter C and the kernel's parameters have to be chosen (see the sketch below)
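A sketch of how the kernel choice plays out on non-linearly separable data (assuming scikit-learn; the concentric-circle data and the C/gamma values are illustrative):

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# two concentric circles: no straight line can separate the classes
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear", C=1.0).fit(X, y)
rbf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)  # gamma is the RBF kernel parameter

print("linear kernel accuracy:", linear.score(X, y))  # near chance: no linear split exists
print("RBF kernel accuracy:", rbf.score(X, y))        # high: separable after the implicit mapping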
pros and cons
pros
Grounded on sound statistical learning theory
Does not suffer from only finding local minima
("Simple") geometric interpretation
Suffers less from the overfitting problem
cons
Some parameters and the kernel have to be chosen
Model building involves time-demanding calculations
Extensive memory requirements for solving the QP