Purpose of activation functions
introduce non-linearity for higher expressiveness/capacity and better learning
linear layers stacked on top of each other are just linear
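A minimal NumPy sketch (sizes and random weights are arbitrary choices for illustration) showing that stacked linear layers collapse to a single linear map unless a non-linearity sits in between:

```python
import numpy as np

rng = np.random.default_rng(0)
x  = rng.normal(size=(4,))        # input vector (size is an arbitrary choice)
W1 = rng.normal(size=(5, 4))      # first "layer" weight matrix
W2 = rng.normal(size=(3, 5))      # second "layer" weight matrix

stacked   = W2 @ (W1 @ x)         # two linear layers, no activation in between
collapsed = (W2 @ W1) @ x         # one equivalent single linear layer
print(np.allclose(stacked, collapsed))   # True: stacking linear layers stays linear

relu = lambda z: np.maximum(z, 0.0)
nonlinear = W2 @ relu(W1 @ x)     # with a non-linearity this no longer collapses
```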
Purpose of NN layers
easily viewed as matrices -> matrix calculations
helps with abstracting calculations (e.g. gradient descent)
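A small sketch (batch size and layer sizes are assumptions) of a fully connected layer as a single matrix calculation:

```python
import numpy as np

# A dense layer is just y = xW + b, so a whole batch is one matrix multiplication.
rng = np.random.default_rng(1)
X = rng.normal(size=(32, 10))   # batch of 32 inputs with 10 features each (assumed sizes)
W = rng.normal(size=(10, 16))   # weight matrix: 10 inputs -> 16 outputs
b = np.zeros(16)                # bias vector

Y = X @ W + b                   # forward pass for the whole batch at once
print(Y.shape)                  # (32, 16)
```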
Examples of activation functions
sigmoid
tanh
ReLU
Leaky ReLU
parametric ReLU
maxout
ELU
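A sketch of the listed activations in NumPy; the slope/alpha defaults are common choices, not taken from the notes, and maxout is shown over two linear pieces:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(x, 0.0)

def leaky_relu(x, slope=0.01):          # small fixed slope for x < 0
    return np.where(x > 0, x, slope * x)

def parametric_relu(x, a):              # like Leaky ReLU, but the slope a is learned
    return np.where(x > 0, x, a * x)

def elu(x, alpha=1.0):                  # smooth, saturating negative part
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def maxout(x, W1, b1, W2, b2):          # maximum over (here) two linear pieces
    return np.maximum(x @ W1 + b1, x @ W2 + b2)
```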
Computational graphs
compute nodes: perform the operations (e.g. matrix operations)
vertices: variables and operators
directed edges: flow of data (inputs and intermediate results) through the graph
convolutional layers: extract useful features with shared weights
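A toy 1D example (made-up signal and kernel) of weight sharing: the same small kernel is applied at every position to extract a local feature:

```python
import numpy as np

def conv1d(signal, kernel):
    """Valid 1D convolution (cross-correlation): the same kernel weights
    are reused (shared) at every position of the input."""
    k = len(kernel)
    return np.array([signal[i:i + k] @ kernel
                     for i in range(len(signal) - k + 1)])

x = np.array([1.0, 2.0, 4.0, 7.0, 11.0, 16.0])
edge_kernel = np.array([-1.0, 1.0])      # one shared "feature detector"
print(conv1d(x, edge_kernel))            # [1. 2. 3. 4. 5.] -> local differences
```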
L1 Loss: formula
sum of absolute differences: $L_1 = \sum_i |y_i - \hat{y}_i|$
L1 Loss: characteristics
robust to outliers, but costly to optimize (non-differentiable at 0, constant gradient)
optimum of a constant prediction = median
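A small sketch with made-up data (including one outlier) showing the L1 loss and that the best constant prediction is the median:

```python
import numpy as np

def l1_loss(y_true, y_pred):
    return np.sum(np.abs(y_true - y_pred))

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])        # made-up data with one outlier
candidates = np.arange(0.0, 100.5, 0.5)          # brute-force search over constants
best = candidates[np.argmin([l1_loss(y, c) for c in candidates])]
print(best, np.median(y))                        # both 3.0: the optimum sits at the median
```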
Mean-squared error (L2 Loss): formula
sum of squared differences: $L_2 = \sum_i (y_i - \hat{y}_i)^2$ (MSE additionally averages over $n$)
L2 loss: characteristics
prone to outliers (squaring amplifies large errors)
efficient optimization (smooth gradient), optimum of a constant prediction = mean
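The same toy data as above, sketched for the L2 loss; setting the derivative of $\sum_i (y_i - c)^2$ to zero gives $c = \text{mean}(y)$, which the outlier pulls upwards:

```python
import numpy as np

def l2_loss(y_true, y_pred):
    return np.sum((y_true - y_pred) ** 2)

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])                   # same outlier as above
# d/dc sum (y_i - c)^2 = -2 * sum(y_i - c) = 0  =>  c = mean(y)
print(np.mean(y))                                           # 22.0: pulled towards the outlier
print(l2_loss(y, np.mean(y)) <= l2_loss(y, np.median(y)))   # True: mean beats median under L2
```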
BCE: binary cross-entropy for two-class problems
CE for multi-class: cross-entropy summed over all classes
one-hot encoding: target vector with a 1 at the true class and 0 everywhere else
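A sketch of both losses (the example probabilities are made up; a softmax output is assumed for the multi-class case):

```python
import numpy as np

def bce(y, p, eps=1e-12):
    """Binary cross-entropy for a label y in {0, 1} and predicted probability p."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def cross_entropy(one_hot, probs, eps=1e-12):
    """Multi-class cross-entropy: only the log-probability of the true class contributes."""
    probs = np.clip(probs, eps, 1.0)
    return -np.sum(one_hot * np.log(probs))

target = np.array([0, 0, 1])                  # one-hot: class 2 is the true class
probs  = np.array([0.1, 0.2, 0.7])            # assumed softmax output
print(cross_entropy(target, probs))           # -log(0.7) ~ 0.357
print(bce(1, 0.9), bce(1, 0.1))               # confident & correct is cheap, confident & wrong is expensive
```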
Why does the cross-entropy loss have a minus?
log(p) with p in [0,1] gives values <= 0, so the sign is flipped to obtain a non-negative loss that can be minimized
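Written out (standard cross-entropy with one-hot targets $y_c$ and predicted probabilities $p_c$):

```latex
\mathcal{L}_{\mathrm{CE}} = -\sum_{c} y_c \log p_c \;\ge\; 0,
\qquad \text{since } \log p_c \le 0 \text{ for } p_c \in [0, 1]
```

Minimizing this loss therefore pushes the predicted probability of the true class towards 1.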
Gradient Descent: learning formula
$\theta_{t+1} = \theta_t - \eta \,\nabla_\theta L(\theta_t)$ with learning rate $\eta$
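A minimal gradient-descent sketch on a toy quadratic; the function, learning rate, and iteration count are arbitrary choices for illustration:

```python
# Minimize f(theta) = (theta - 3)^2 with plain gradient descent.
def grad(theta):
    return 2.0 * (theta - 3.0)          # analytic gradient of f

theta = 0.0                             # arbitrary starting point
lr = 0.1                                # learning rate eta
for _ in range(100):
    theta = theta - lr * grad(theta)    # theta <- theta - eta * grad L(theta)

print(theta)                            # converges towards 3.0
```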
Backpropagation
start at last layer and use chain rule
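A tiny hand-worked sketch (scalar toy network, made-up numbers) of starting at the loss and multiplying local derivatives backwards via the chain rule:

```python
import numpy as np

# Tiny graph: x -> z = w*x + b -> a = sigmoid(z) -> L = (a - y)^2
x, y = 2.0, 1.0
w, b = 0.5, 0.1

# forward pass (intermediates are stored for the backward pass)
z = w * x + b
a = 1.0 / (1.0 + np.exp(-z))
L = (a - y) ** 2

# backward pass, last layer first
dL_da = 2.0 * (a - y)          # dL/da
da_dz = a * (1.0 - a)          # local sigmoid derivative
dL_dz = dL_da * da_dz          # chain rule
dL_dw = dL_dz * x              # dz/dw = x
dL_db = dL_dz * 1.0            # dz/db = 1

print(dL_dw, dL_db)
```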
Advantage of gradient descent/backprop
complex functions decompose into simple modules; each module only needs a local forward and backward, and the local gradients combine via the chain rule
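A sketch of this modularity under assumed shapes and a squared-error gradient; each module implements only its own forward and backward, and backprop just walks the chain in reverse:

```python
import numpy as np

class Linear:
    """A layer only needs a local forward pass and a local chain-rule step."""
    def __init__(self, n_in, n_out, rng):
        self.W = rng.normal(scale=0.1, size=(n_in, n_out))
    def forward(self, x):
        self.x = x
        return x @ self.W
    def backward(self, grad_out):          # returns dL/dx, stores dL/dW
        self.dW = self.x.T @ grad_out
        return grad_out @ self.W.T

class ReLU:
    def forward(self, x):
        self.mask = x > 0
        return x * self.mask
    def backward(self, grad_out):
        return grad_out * self.mask

rng = np.random.default_rng(0)
layers = [Linear(4, 8, rng), ReLU(), Linear(8, 1, rng)]   # arbitrary architecture
x = rng.normal(size=(2, 4))

out = x
for layer in layers:                       # forward through the chain
    out = layer.forward(out)

grad = 2 * (out - 1.0)                     # gradient of a squared-error loss vs target 1
for layer in reversed(layers):             # backward through the chain
    grad = layer.backward(grad)
```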
higher-level purpose of the stacked (non-linear) layers: progressively select key features from the input