How does backpropagation work?
go upstream in the computational graph
-> basically, multiply the local (piecewise) derivatives between the basic operation blocks in the computation graph (chain rule; see the worked example after the addition card below)
How does the gradient behave when having a multiplication
switches to the other factor
-> v = qz
-> dv/dq = z… -> switch…
How does the gradient behave when having an addition?
gradient remains the same…
e.g. q = x+y
dq/dy = 1…
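a minimal numeric sketch of both rules in Python (the values x = -2, y = 5, z = -4 are made-up example numbers, not from the lecture):

x, y, z = -2.0, 5.0, -4.0

# forward pass
q = x + y        # q = 3.0
v = q * z        # v = -12.0

# backward pass (upstream gradient dv/dv = 1)
dv_dq = z        # multiplication: gradient "switches" to the other factor
dv_dz = q
dq_dx = 1.0      # addition: gradient passes through unchanged
dq_dy = 1.0

# chain rule: multiply the local gradients along the path
dv_dx = dv_dq * dq_dx   # = z = -4.0
dv_dy = dv_dq * dq_dy   # = z = -4.0
print(dv_dx, dv_dy, dv_dz)   # -4.0 -4.0 3.0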
What has to be considered during forward pass?
save the local gradients during the forward pass…
-> can efficiently be reused during backpropagation…
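a hedged Python sketch of the idea (the class name MulGate is made up for illustration, not the lecture's code):

class MulGate:
    def forward(self, q, z):
        self.q, self.z = q, z           # cache the inputs during the forward pass
        return q * z

    def backward(self, upstream):
        # reuse the cached values instead of recomputing anything
        return upstream * self.z, upstream * self.q   # gradients w.r.t. q and z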
How are local gradients graphically indicated (in this lecture)?
top right -> function
bottom left -> gradient (derivative)
gradient of sigmoid?
with y = sigmoid(x):
-> dy/dx = y(1-y)
How to handle multiple upstream gradients?
sum of both…
=> happens when a node has two outputs (its value is used in two places)…
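written out (standard chain-rule form, not from the card itself): for a value x that feeds into two outputs y1 and y2,
dL/dx = dL/dy1 * dy1/dx + dL/dy2 * dy2/dx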
Difference between CPU and GPU for training and inference in NNs?
CPU:
complex logic control
low compute density
large cache
low latency tolerance
GPU:
optimized for parallel computing
high compute density
many calculations per memory access
high latency tolerance
Sigmoid function (graph, formula, derivative)?
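for reference (standard definition; the card's own figure is not reproduced here):
sigmoid(x) = 1 / (1 + e^-x)
d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))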
problems sigmoid?
causes vanishing gradients
gradients nearly 0 for large or small x
kills the gradient -> network stops learning
output isn't zero-centered
-> gradients are always all positive or all negative -> inefficient weight updates…
Tanh function (graph, function and derivative)
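for reference (standard definition; the lecture's figure may use different notation):
tanh(x) = (e^x - e^-x) / (e^x + e^-x)
d/dx tanh(x) = 1 - tanh(x)^2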
Comparison of tanh to sigmoid?
better than sigmoid
-> output is zero-centered
but still causes vanishing gradient
ReLU (graph, formula and gradient)
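for reference (standard definition):
ReLU(x) = max(0, x)
d/dx ReLU(x) = 1 for x > 0, 0 for x < 0 (undefined at x = 0, usually set to 0)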
Properties of ReLU?
most common activation function
computationally efficient
converges very fast
Does not activate all neurons at the same time
Problems ReLU?
Gradient is zero for x<0 and can cause vanishing gradient → dead ReLUs may happen
Not zero-centered
Usage ReLU?
mostly in hidden layers
Leaky ReLU (graph, function and derivative)?
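for reference (standard definition; a is the small slope, often around 0.01, the lecture's value may differ):
LeakyReLU(x) = x for x > 0, a*x for x <= 0
d/dx LeakyReLU(x) = 1 for x > 0, a for x <= 0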
How does Leaky ReLU improve regular ReLU?
Removes the zero-gradient part of ReLU by adding a small slope for x < 0. More stable than ReLU, but adds another parameter
Computationally efficient
Converges very fast
Doesn't die
Parameter a can also be learned by the network
Exponential Linear Unit (ELU) (graph, function and derivative)
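for reference (standard definition; alpha is a hyperparameter, often 1):
ELU(x) = x for x > 0, alpha*(e^x - 1) for x <= 0
d/dx ELU(x) = 1 for x > 0, alpha*e^x for x <= 0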
Properties of ELU?
Benefits of ReLU and Leaky ReLU
Computation requires e^x
Softmax (Graph, function)
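for reference (standard definition; x_i is the i-th raw output, the sum runs over all outputs):
softmax(x)_i = e^(x_i) / sum_j e^(x_j)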
What is softmax and when to use?
generalization of the sigmoid function (to several outputs)
handy for classification problems
divides by the sum of all outputs
-> allows for percentage representation…
What is the rule of thumb for classifiers?
classifiers: Sigmoid/ReLu + Softmax
Sigmoid and tanh are sometimes avoided due to vanishing gradients
ReLU mostly used today
start with ReLU. If you don't get optimal results -> go for Leaky ReLU or ELU…
What does a fully connected layer compute?
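the answer text of this card is missing here; the usual answer is a weighted sum plus bias (y = Wx + b), often followed by an activation. Minimal NumPy sketch (all shapes and names are made-up examples):

import numpy as np

W = np.random.randn(4, 3)   # weight matrix: 4 inputs -> 3 outputs
b = np.zeros(3)             # bias vector
x = np.random.randn(4)      # input vector (e.g. a flattened image)

h = x @ W + b               # fully connected layer: weighted sum plus bias
out = np.maximum(0, h)      # optional activation, here ReLU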
What is a problem with Neural Networks w.r.t. putting images into it?
fixed input size
-> have to e.g. reshape / resample…
=> flatten image…
How can we perform gradient descent?
batch
all data at once, several times
stochastic
use one random data point at a time
mini-batch gradient descent
split the dataset into batches -> use them one after another
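minimal NumPy sketch of the loop (toy linear model; lr, batch_size and the squared-error loss are made-up examples, not the lecture's setup); batch_size = 1 gives stochastic GD, batch_size = len(X) gives batch GD:

import numpy as np

X, y = np.random.randn(1000, 5), np.random.randn(1000)   # toy dataset
w = np.zeros(5)                                          # model parameters
lr, batch_size = 0.01, 32

def grad(Xb, yb, w):
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)            # gradient of the batch's mean squared error

for epoch in range(10):
    idx = np.random.permutation(len(X))                  # shuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        w -= lr * grad(X[batch], y[batch], w)            # step on the local (batch) cost function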
What are the effects of stochastic gradient descent?
reduces compute time per optimization step
finding local minima takes longer
How does mini-batch gradient descent try to minimize the loss?
with the local cost function (of the batch it currently has…)
Why do we do mini batch?
usually not possible to train over the whole dataset at once (memory and computing power limitations)
stochastic vs mini-batch?
stochastic:
can miss local minima because of the randomness of each input
training takes longer in general
mini-batches
approach finds minimum quickly
but may need more computational power
How can we initialize weights? What are the results?
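the answer side of this card is missing here; as a hedged reminder of common options (not necessarily the lecture's list): all-zero init keeps every neuron identical, unscaled random init can make activations explode or vanish, scaled schemes such as Xavier or He keep them in a usable range. NumPy sketch:

import numpy as np
fan_in, fan_out = 256, 128

w_zero = np.zeros((fan_in, fan_out))                                # symmetric -> neurons never differentiate
w_rand = np.random.randn(fan_in, fan_out)                           # unscaled -> activations may explode/vanish
w_xavier = np.random.randn(fan_in, fan_out) * np.sqrt(1 / fan_in)   # Xavier: suited to tanh/sigmoid
w_he = np.random.randn(fan_in, fan_out) * np.sqrt(2 / fan_in)       # He: suited to ReLU-family activations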
What is the pipeline to train a neural network?
prepare input data
Prepare the data-pipeline to get the data into the network
Define the network
Initialize all variables
Train the Network
Supervise the Training