Are NNs convex and what does the answer mean w.r.t. finding the minimum loss?
NN loss functions are non-convex
many different local minima, difficult to say which one is the global minimum
-> gradient descent only converges to some local minimum, not necessarily the global one
What is a plateau?
flat region where the gradient is close to zero, so the update steps (learning rate * gradient) become very small
slow/no convergence
Gradient descent steps for a single training sample
initialize theta_1 with random values
theta_k+1 = theta_k - alpha * loss_gradient(theta_k)
iterate until convergence
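A minimal numpy sketch of these steps, assuming a single sample (x, y), a linear model theta * x, and a squared loss:

import numpy as np

x, y = 2.0, 6.0                      # single training sample (assumed values)
theta = np.random.randn()            # initialize theta with a random value
alpha = 0.1                          # learning rate (assumed value)

for k in range(100):
    grad = 2 * (theta * x - y) * x   # gradient of (theta * x - y)^2 w.r.t. theta
    theta = theta - alpha * grad     # theta_k+1 = theta_k - alpha * loss_gradient
    if abs(grad) < 1e-8:             # crude convergence check
        break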
Gradient descent steps for multiple training samples
initialize theta with random values
compute loss as average over all samples
update using gradient of average loss
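The same sketch for several samples, stepping with the gradient of the average loss (toy data assumed):

import numpy as np

X = np.array([1.0, 2.0, 3.0])
Y = np.array([2.0, 4.0, 6.0])
theta, alpha = np.random.randn(), 0.05

for k in range(200):
    grads = 2 * (theta * X - Y) * X   # per-sample gradients of the squared loss
    theta -= alpha * grads.mean()     # update with the gradient of the average loss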
Line search
compute gradient
find alpha* = argmin_alpha L(theta_k - alpha * loss_gradient(theta_k))
update theta
-> impractical because it is expensive (the full loss must be evaluated repeatedly); solution: SGD
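A sketch of one line-search step with scipy, reusing the toy data from above (minimize_scalar solves the 1-D problem over alpha numerically):

import numpy as np
from scipy.optimize import minimize_scalar

X = np.array([1.0, 2.0, 3.0]); Y = np.array([2.0, 4.0, 6.0])
loss = lambda th: np.mean((th * X - Y) ** 2)
theta = np.random.randn()

grad = np.mean(2 * (theta * X - Y) * X)                       # compute gradient
alpha = minimize_scalar(lambda a: loss(theta - a * grad)).x   # find argmin_alpha
theta = theta - alpha * grad                                  # update theta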
SGD: general idea + justification
compute each update on only a random portion of the data
justification: the loss is an expectation over all training samples -> it can be approximated by an average over a random subset
Minibatch
small subset of data
hyperparameter m
typically a power of 2
different sample for each iteration
Epoch
complete pass through training set (forward + backward)
Iteration
a single update step using one minibatch of m samples (an epoch consists of n/m iterations for n training samples)
SGD steps
for each iteration: new minibatch
compute gradient + update weights
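A minimal minibatch SGD loop as a sketch (toy regression data, m = 32 and alpha assumed):

import numpy as np

X = np.random.randn(1000)
Y = 3 * X + 0.1 * np.random.randn(1000)
theta, alpha, m = np.random.randn(), 0.1, 32               # m = minibatch size

for it in range(500):                                      # iterations
    idx = np.random.choice(len(X), size=m, replace=False)  # new minibatch each iteration
    grad = np.mean(2 * (theta * X[idx] - Y[idx]) * X[idx]) # gradient on the minibatch
    theta -= alpha * grad                                  # update weights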
SGD: problems (2)
the gradient is scaled by the same learning rate in every dimension -> the learning rate must be conservatively small (set for the steepest dimension)
finding a good learning rate is hard
SGD with momentum: idea
based on the gradient history (accumulated gradients), change the "velocity" of the updates
velocity increases if many steps are taken in the same direction
SGD with momentum: parameters
alpha: learning rate
beta: accumulation/momentum rate, usually 0.9
SGD with momentum: formula
v_k+1 = beta * v_k - alpha * loss_gradient(theta_k)
theta_k+1 = theta_k + v_k+1
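A minimal sketch of this update on a toy quadratic loss (one common sign convention; values assumed):

# toy loss L(theta) = (theta - 3)^2, gradient 2 * (theta - 3)
theta, v = 0.0, 0.0
alpha, beta = 0.1, 0.9

for k in range(100):
    grad = 2 * (theta - 3)
    v = beta * v - alpha * grad   # velocity accumulates the gradient history
    theta = theta + v             # step along the velocity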
SGD with momentum: problem
may overshoot minima
Nesterov Momentum: idea
look-ahead momentum: evaluate the gradient at the position the accumulated velocity would take us
still carries a risk of overshooting
Nesterov Momentum: formula
v_k+1 = beta * v_k - alpha * loss_gradient(theta_lookahead)
theta_k+1 = theta_k + v_k+1
where theta_lookahead = theta_k + beta * v_k is the look-ahead position
RMSProp: name + idea
Root Mean Square Propagation
divide learning rate by exponentially-decaying average of squared gradients
dampens dimensions with large or strongly fluctuating gradients (prevents divergence)
RMSProp: formula
s_k+1 = beta * s_k + (1 - beta) * loss_gradient(theta_k)^2 (element-wise)
theta_k+1 = theta_k - alpha / (sqrt(s_k+1) + epsilon) * loss_gradient(theta_k)
where s_k+1 is the exponentially decaying second moment of the gradients and epsilon avoids division by zero
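Sketch of the update on a toy quadratic loss (typical values beta = 0.9, small epsilon assumed):

# toy loss L(theta) = (theta - 3)^2, gradient 2 * (theta - 3)
theta, s = 0.0, 0.0
alpha, beta, eps = 0.01, 0.9, 1e-8

for k in range(1000):
    grad = 2 * (theta - 3)
    s = beta * s + (1 - beta) * grad ** 2      # decaying average of squared gradients
    theta -= alpha / (s ** 0.5 + eps) * grad   # step scaled by the accumulated magnitude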
RMSProp: justification
dividing by the root of the accumulated squared gradients adapts the step size per dimension:
large gradients -> large divisor -> smaller effective learning rate
small gradients -> small divisor -> larger effective learning rate
Adam: name + idea
Adaptive Moment Estimation
use momentum + RMSProp
Adam: formula + intuition for 1st order moment
m_k+1 = beta_1 * m_k + (1 - beta_1) * loss_gradient(theta_k)
-> exponentially decaying mean of the gradients (momentum part)
Adam: formula + intuition for 2nd order moment
v_k+1 = beta_2 * v_k + (1 - beta_2) * loss_gradient(theta_k)^2
-> exponentially decaying, uncentered variance of the gradients (RMSProp part)
Adam: correct initialization bias
m_hat = m_k+1 / (1 - beta_1^(k+1)), v_hat = v_k+1 / (1 - beta_2^(k+1))
(compensates for m and v being initialized to zero)
Adam: update rule
theta_k+1 = theta_k - alpha * m_hat / (sqrt(v_hat) + epsilon)
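Putting the pieces together as a sketch on a toy quadratic loss (typical defaults beta_1 = 0.9, beta_2 = 0.999 assumed):

# toy loss L(theta) = (theta - 3)^2, gradient 2 * (theta - 3)
theta, m, v = 0.0, 0.0, 0.0
alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for k in range(1, 501):
    grad = 2 * (theta - 3)
    m = beta1 * m + (1 - beta1) * grad             # 1st moment: mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2        # 2nd moment: uncentered variance
    m_hat = m / (1 - beta1 ** k)                   # bias correction
    v_hat = v / (1 - beta2 ** k)
    theta -= alpha * m_hat / (v_hat ** 0.5 + eps)  # update rule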
Newton’s method: idea
approximate loss function by 2nd order Taylor series expansion
complexity: O(k^3) for k parameters (solving with / inverting the Hessian)
Newton’s method: advantages (2)
exploits curvature to take a more direct route
no learning rate (instead: inverse Hessian)
Newton’s method: disadvantages
inverse Hessian is expensive to compute
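A sketch of a single Newton step on a toy quadratic loss (gradient and Hessian written out analytically; np.linalg.solve is used instead of forming the inverse Hessian explicitly):

import numpy as np

# toy loss L(theta) = 0.5 * theta^T A theta - b^T theta
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, 2.0])
theta = np.zeros(2)

grad = A @ theta - b                          # gradient of the loss
hess = A                                      # Hessian of the loss
theta = theta - np.linalg.solve(hess, grad)   # Newton step: theta - H^-1 * grad
# for a quadratic loss this lands exactly on the minimum in one step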
Broyden-Fletcher-Goldfarb-Shanno: idea
quasi-Newton
approximate inverse of Hessian
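A quick usage sketch with scipy's BFGS implementation (the Rosenbrock function is just an example objective):

import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.zeros(5)
# quasi-Newton: builds an approximation of the inverse Hessian from gradient evaluations
res = minimize(rosen, x0, jac=rosen_der, method='BFGS')
print(res.x)   # close to the minimum at [1, 1, 1, 1, 1]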
Gauss-Newton: idea
approximate the Hessian (second derivatives are harder to get) by J^T J for least-squares losses, using only first derivatives
Levenberg: idea
damped Gauss-Newton
factor lambda ensures f(x_k) > f(x_k+1)
Levenberg-Marquardt: idea
scale components according to curvature
Optimization methods: usage
standard: Adam
fallback: SGD + momentum
full batch updates (not minibatches): Newton, L-BFGS, GN, LM
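In PyTorch these choices map to the following (a minimal sketch; the model and learning rates are assumed):

import torch

model = torch.nn.Linear(10, 1)

# standard choice
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# fallback
opt = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

# full-batch quasi-Newton (step() expects a closure that re-evaluates the loss)
opt = torch.optim.LBFGS(model.parameters())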