Are NNs convex and what does the answer mean w.r.t. finding the minimum loss?
NNs are non-convex
many different local minima, difficult to say which is global
What is a plateau?
flat region of the loss surface where the gradient is (near) zero, so gradient steps become very small
slow/no convergence
Gradient descent steps for a single training sample
initialize theta_1 with random values
th_k+1 = th_k - alpha * loss_gradient
iterate until convergence
Gradient descent steps for multiple training samples
initialize theta with random values
compute loss as average over all samples
update using gradient of average loss
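A minimal sketch of these steps on a toy 1-D least-squares problem (the data and the loss_grad helper are illustrative assumptions, not from the original notes):

```python
import numpy as np

# toy data: fit y = theta * x with a squared-error loss (illustrative)
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

def loss_grad(theta):
    # gradient of the average squared error over all samples
    return np.mean(2 * (theta * x - y) * x)

theta = np.random.randn()      # initialize theta with random values
alpha = 0.05                   # learning rate
for _ in range(100):           # iterate until convergence (fixed count here)
    theta -= alpha * loss_grad(theta)
```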
Line search
compute gradient
find argmin_alpha
update theta
-> impractical because each candidate alpha requires evaluating the full loss; solution: SGD
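A rough sketch of exact line search along the negative gradient, assuming a toy quadratic loss and a simple grid over alpha (both are assumptions for illustration):

```python
import numpy as np

def loss(theta):                 # toy quadratic loss (illustrative)
    return np.sum((theta - 3.0) ** 2)

def grad(theta):
    return 2 * (theta - 3.0)

theta = np.random.randn(2)
for _ in range(10):
    g = grad(theta)                                        # compute gradient
    alphas = np.linspace(1e-4, 1.0, 100)                   # candidate step sizes
    best = min(alphas, key=lambda a: loss(theta - a * g))  # argmin_alpha
    theta = theta - best * g                               # update theta
```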
SGD: general idea + justification
compute each update on a random subset of the data
the loss is an expectation over all samples -> it can be approximated by a sample average
Minibatch
small subset of data
hyperparameter m
typically a power of 2
different sample for each iteration
Epoch
complete pass through training set (forward + backward)
Iteration
a single pass (gradient update) using one minibatch of m samples; the number of iterations counts these passes
SGD steps
for each iteration: new minibatch
compute gradient + update weights
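A minimal minibatch-SGD loop, assuming a toy linear-regression problem with a squared-error loss (data, shapes, and hyperparameter values are illustrative):

```python
import numpy as np

# toy linear-regression data (illustrative)
X = np.random.randn(1000, 5)
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + 0.1 * np.random.randn(1000)

theta = np.zeros(5)
alpha, m = 0.01, 32                  # learning rate, minibatch size (power of 2)
n_iterations = 500

for _ in range(n_iterations):
    idx = np.random.choice(len(X), m, replace=False)  # new minibatch each iteration
    Xb, yb = X[idx], y[idx]
    grad = 2 / m * Xb.T @ (Xb @ theta - yb)           # gradient of the minibatch loss
    theta -= alpha * grad                             # update weights
```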
SGD: problems (2)
the same learning rate scales the gradient in every dimension -> it must be chosen conservatively (small enough for the steepest direction)
finding a good learning rate is hard
SGD with momentum: idea
accumulate the gradient history (past gradients) into a "velocity" that drives the update
velocity increases if many steps are taken in the same direction
SGD with momentum: parameters
alpha: learning rate
beta: accumulation/momentum rate, usually 0.9
SGD with momentum: formula
v_k+1 = beta * v_k - alpha * loss_gradient(th_k)
th_k+1 = th_k + v_k+1
SGD with momentum: problem
may overshoot minima
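A sketch of one momentum update following the formula above; the function name and default hyperparameters are illustrative:

```python
import numpy as np

def sgd_momentum_step(theta, v, grad, alpha=0.01, beta=0.9):
    """One SGD-with-momentum update: accumulate velocity, then move."""
    v = beta * v - alpha * grad      # velocity built from the gradient history
    theta = theta + v                # step along the accumulated direction
    return theta, v
```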
Nesterov Momentum: idea
look-ahead momentum: evaluate the gradient at the position the momentum step would reach
same risk of overshooting
Nesterov Momentum: formula
v_k+1 = beta * v_k - alpha * loss_gradient(th_k + beta * v_k)
th_k+1 = th_k + v_k+1
where th_k + beta * v_k is the look-ahead point
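A sketch of one Nesterov update, evaluating the gradient at the look-ahead point; grad_fn and the default values are illustrative assumptions:

```python
import numpy as np

def nesterov_step(theta, v, grad_fn, alpha=0.01, beta=0.9):
    """One Nesterov-momentum update: gradient is taken at the look-ahead point."""
    lookahead = theta + beta * v               # where momentum alone would take us
    v = beta * v - alpha * grad_fn(lookahead)  # correct using the look-ahead gradient
    return theta + v, v
```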
RMSProp: name + idea
Root Mean Square Propagation
divide the gradient step by the square root of an exponentially-decaying average of squared gradients
punish fluctuating gradients / divergence
RMSProp: formula
s_k+1 = beta * s_k + (1 - beta) * loss_gradient^2
th_k+1 = th_k - alpha * loss_gradient / (sqrt(s_k+1) + epsilon)
where s_k+1 is the (exponentially-decaying) second moment of the gradients
RMSProp: justification
dividing by the root of the averaged squared gradients:
dimensions with large/fluctuating gradients -> large divisor -> smaller effective steps
dimensions with small gradients -> small divisor -> relatively larger effective steps
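A sketch of one RMSProp update following the formula above; the epsilon term (for numerical stability) and the defaults are the commonly used values, shown for illustration:

```python
import numpy as np

def rmsprop_step(theta, s, grad, alpha=0.001, beta=0.9, eps=1e-8):
    """One RMSProp update: per-dimension step scaled by the running second moment."""
    s = beta * s + (1 - beta) * grad ** 2          # decaying average of squared gradients
    theta = theta - alpha * grad / (np.sqrt(s) + eps)
    return theta, s
```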
Adam: name + idea
Adaptive Moment Estimation
use momentum + RMSProp
Adam: formula + intuition for 1st order moment
m_k+1 = beta_1 * m_k + (1 - beta_1) * loss_gradient
mean of gradients
Adam: formula + intuition for 2nd order moment
v_k+1 = beta_2 * v_k + (1 - beta_2) * loss_gradient^2
uncentered variance of gradients
Adam: correct initialization bias
m_hat = m_k+1 / (1 - beta_1^(k+1)), v_hat = v_k+1 / (1 - beta_2^(k+1))
(moments start at 0, so early estimates are biased towards 0)
Adam: update rule
th_k+1 = th_k - alpha * m_hat / (sqrt(v_hat) + epsilon)
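A sketch of one Adam update combining both moments with bias correction; the function name and default hyperparameters are the commonly used ones, shown for illustration:

```python
import numpy as np

def adam_step(theta, m, v, grad, k, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (k is the 1-based iteration count for bias correction)."""
    m = beta1 * m + (1 - beta1) * grad           # 1st moment: mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2      # 2nd moment: uncentered variance
    m_hat = m / (1 - beta1 ** k)                 # bias correction (moments start at 0)
    v_hat = v / (1 - beta2 ** k)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```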
Newton’s method: idea
approximate loss function by 2nd order Taylor series expansion
complexity: O(k^3) for inverting the k x k Hessian
Newton’s method: advantages (2)
exploits curvature to take a more direct route
no learning rate (instead: inverse Hessian)
Newton’s method: disadvantages
inverse Hessian is expensive to compute
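A sketch of Newton's method on a toy quadratic, solving a linear system instead of forming the inverse Hessian explicitly (the quadratic and its constant Hessian are illustrative assumptions):

```python
import numpy as np

# toy quadratic loss f(theta) = 0.5 * theta^T A theta - b^T theta (illustrative)
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

def grad(theta):
    return A @ theta - b

def hessian(theta):
    return A                                     # constant for a quadratic

theta = np.random.randn(2)
for _ in range(5):
    # Newton step: solve H * d = grad rather than computing H^-1 explicitly
    d = np.linalg.solve(hessian(theta), grad(theta))
    theta = theta - d                            # no learning rate needed
```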
Broyden-Fletcher-Goldfarb-Shanno: idea
quasi-Newton
approximate inverse of Hessian
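In practice a quasi-Newton method is usually taken from a library; a sketch using SciPy's BFGS on the Rosenbrock function (the test function and the SciPy usage are assumptions, not part of the original notes):

```python
import numpy as np
from scipy.optimize import minimize

def rosenbrock(x):
    return (1 - x[0]) ** 2 + 100 * (x[1] - x[0] ** 2) ** 2

x0 = np.zeros(2)
# BFGS builds up an approximation of the inverse Hessian from gradient information
result = minimize(rosenbrock, x0, method="BFGS")
print(result.x)   # close to [1, 1]
```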
Gauss-Newton: idea
for least-squares losses: approximate the Hessian with J^T J, since exact second derivatives are harder to get
Levenberg: idea
damped Gauss-Newton
factor lambda ensures f(x_k) > f(x_k+1)
Levenberg-Marquardt: idea
scale components according to curvature
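A sketch of Levenberg-Marquardt on a toy least-squares curve fit via SciPy, whose method="lm" uses a Levenberg-Marquardt implementation; the model and data are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import least_squares

# toy curve fit y = a * exp(b * t) with Levenberg-Marquardt (illustrative data)
t = np.linspace(0, 1, 20)
y = 2.0 * np.exp(1.5 * t) + 0.01 * np.random.randn(20)

def residuals(params):
    a, b = params
    return a * np.exp(b * t) - y

fit = least_squares(residuals, x0=[1.0, 1.0], method="lm")
print(fit.x)   # approximately [2.0, 1.5]
```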
Optimization methods: usage
standard: Adam
fallback: SGD + momentum
full batch updates (not minibatches): Newton, L-BFGS, GN, LM