Forward pass
computes the function / the layer output
caches values needed for gradient computation
Backward pass
takes the upstream derivative
returns all partial derivatives (local gradients chained with the upstream derivative)
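A minimal sketch of both passes for a fully connected layer (NumPy; the class and variable names are illustrative, not from the cards): forward computes y = x @ W + b and caches x, backward chains the cached input with the upstream derivative and returns all partial derivatives.

```python
import numpy as np

class Linear:
    """Fully connected layer y = x @ W + b (illustrative sketch)."""

    def __init__(self, in_features, out_features):
        self.W = np.random.randn(in_features, out_features) * 0.01
        self.b = np.zeros(out_features)
        self.cache = None

    def forward(self, x):
        # Forward pass: compute the function and cache what the
        # gradient computation needs later (here: the input x).
        self.cache = x
        return x @ self.W + self.b

    def backward(self, dy):
        # Backward pass: take the upstream derivative dy = dL/dy and
        # return all partial derivatives via the chain rule.
        x = self.cache
        dW = x.T @ dy          # dL/dW
        db = dy.sum(axis=0)    # dL/db
        dx = dy @ self.W.T     # dL/dx, passed further upstream
        return dx, dW, db
```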
Gradient update rule
wrt W: W <- W - learning_rate * dL/dW
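A minimal sketch of that update step (plain/vanilla gradient descent assumed; the learning rate and dummy shapes are made up):

```python
import numpy as np

learning_rate = 0.01                              # hyperparameter (assumed value)
W = np.ones((3, 2))                               # dummy weights
dW = np.full((3, 2), 0.5)                         # dummy gradient dL/dW from the backward pass

# Gradient update rule: step against the gradient direction.
W = W - learning_rate * dW
```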
Number of weights in a layer
#neurons * #input channels + #biases (one bias per neuron)
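Worked example for a fully connected layer with 64 input channels and 128 neurons; the PyTorch cross-check is only an assumption about the framework being used.

```python
import torch.nn as nn

neurons, input_channels = 128, 64
expected = neurons * input_channels + neurons     # weights + one bias per neuron = 8320

layer = nn.Linear(in_features=input_channels, out_features=neurons)
actual = sum(p.numel() for p in layer.parameters())
assert actual == expected                         # 8320 parameters
```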
What happens if the learning rate is too high / too low?
too high: training may diverge / oscillate and never converge, or it converges too fast to a suboptimum
too low: model converges very slowly
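Illustrative sketch: gradient descent on f(w) = w^2 (minimum at w = 0); the learning-rate values are made up to show the three regimes.

```python
def minimize_quadratic(lr, steps=20, w=1.0):
    """Gradient descent on f(w) = w**2 with gradient df/dw = 2*w."""
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

print(minimize_quadratic(lr=1.5))    # too high: w blows up instead of converging
print(minimize_quadratic(lr=0.001))  # too low: barely moves towards 0 after 20 steps
print(minimize_quadratic(lr=0.3))    # reasonable: ends up close to the minimum at 0
```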
Gradients for large training set?
compute gradients for individual training pairs / mini-batches -> take the average (stochastic / mini-batch gradient descent)
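A mini-batch sketch (NumPy, dummy data, mean-squared-error gradient of a linear model as a stand-in loss): gradients are computed per batch and then averaged instead of using the whole training set at once.

```python
import numpy as np

def grad_mse(w, x_batch, y_batch):
    """Gradient of the mean squared error for a linear model y_hat = x @ w."""
    err = x_batch @ w - y_batch
    return 2 * x_batch.T @ err / len(x_batch)

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))                  # large training set (dummy data)
y = rng.normal(size=10_000)
w = np.zeros(5)

# Instead of one gradient over all 10k pairs, average the per-batch gradients.
batch_size = 256
grads = [grad_mse(w, X[i:i + batch_size], y[i:i + batch_size])
         for i in range(0, len(X), batch_size)]
avg_grad = np.mean(grads, axis=0)
```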
Additive regularization: general goal
prevent overfitting / hopefully better generalization
make training harder so model learns better features
lower validation error / increase training error
add a penalty term to the loss, weighted with a factor lambda
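General pattern as a sketch (function and parameter names are hypothetical): the penalty on the weights is added to the task loss, scaled by lambda.

```python
import numpy as np

def regularized_loss(task_loss, weights, penalty_fn, lam=1e-3):
    """Additive regularization: task loss + lambda * penalty on the weights."""
    return task_loss + lam * penalty_fn(weights)

w = np.array([0.5, -2.0, 0.0, 3.0])
loss = regularized_loss(task_loss=1.25, weights=w, penalty_fn=lambda w: np.sum(w ** 2))
```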
Additive regularization: L1 characteristics
enforce sparsity (many weights driven to exactly zero)
focus on a few key features
Additive regularization: L2 characteristics
enforce small weights of similar magnitude
consider all features with small contributions instead of relying on a few
Additive regularization: L1 term
lambda * sum_i |w_i|
Additive regularization: L2 term
lambda * sum_i w_i^2
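Both penalty terms as a sketch (NumPy); the comments summarize why L1 leads to sparsity and L2 to small, similar weights.

```python
import numpy as np

def l1_term(w, lam):
    # L1: lambda * sum_i |w_i| -> constant-magnitude gradient, pushes weights to exactly 0
    return lam * np.sum(np.abs(w))

def l2_term(w, lam):
    # L2: lambda * sum_i w_i^2 -> gradient proportional to w, shrinks all weights evenly
    return lam * np.sum(w ** 2)

w = np.array([0.01, -0.5, 2.0])
print(l1_term(w, lam=1e-2), l2_term(w, lam=1e-2))
```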