How does backpropagation work?
go upstream in the computational graph
-> basically, multiply the local (piecewise) derivatives between the basic operation blocks in the computation graph (chain rule; see the worked example after the addition card below)
How does the gradient behave when having a multiplication
switches to the other factor
-> v = qz
-> dv/dq = z… -> switch…
How does the gradient behave when having an addition?
gradient remains the same…
e.g. q = x+y
dq/dy = 1…
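a minimal numeric sketch of both rules in Python (the values x = -2, y = 5, z = -4 are made-up example numbers, not from the lecture):

x, y, z = -2.0, 5.0, -4.0

# forward pass
q = x + y        # q = 3.0
v = q * z        # v = -12.0

# backward pass (upstream gradient dv/dv = 1)
dv_dq = z        # multiplication: gradient "switches" to the other factor
dv_dz = q
dq_dx = 1.0      # addition: gradient passes through unchanged
dq_dy = 1.0

# chain rule: multiply the local gradients along the path
dv_dx = dv_dq * dq_dx   # = z = -4.0
dv_dy = dv_dq * dq_dy   # = z = -4.0
print(dv_dx, dv_dy, dv_dz)   # -4.0 -4.0 3.0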
What has to be considered during forward pass?
save the local gradients during the forward pass…
-> can efficiently be reused during backpropagation…
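a hedged Python sketch of the idea (the class name MulGate is made up for illustration, not the lecture's code):

class MulGate:
    def forward(self, q, z):
        self.q, self.z = q, z           # cache the inputs during the forward pass
        return q * z

    def backward(self, upstream):
        # reuse the cached values instead of recomputing anything
        return upstream * self.z, upstream * self.q   # gradients w.r.t. q and z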
How are local gradients graphically indicated (in this lecture)?
top right -> function
bottom left -> gradient (derivative)
gradient of sigmoid?
with y = sigmoid(x):
-> dy/dx = y(1-y)
How to handle multiple upstream gradients?
sum of both…
=> happens when a node has two outputs (its value is used in two places)…
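written out (standard chain-rule form, not from the card itself): for a value x that feeds into two outputs y1 and y2,
dL/dx = dL/dy1 * dy1/dx + dL/dy2 * dy2/dx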
Difference between CPU and GPU for training and inference in NNs?
CPU:
complex logic control
low compute density
large cache
low latency tolerance
GPU:
optimized for parallel computing
high compute density
many calculations per memory access
high latency tolerance
Sigmoid function (graph, formula, derivative)?
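for reference (standard definition; the card's own figure is not reproduced here):
sigmoid(x) = 1 / (1 + e^-x)
d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))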
problems sigmoid?
causes vanishing gradients
gradients nearly 0 for large or small x
kills the gradient -> network stops learning
output isn't zero-centered
-> gradients are always all positive or all negative -> inefficient weight updates…
Tanh function (graph, function and derivative)
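for reference (standard definition; the lecture's figure may use different notation):
tanh(x) = (e^x - e^-x) / (e^x + e^-x)
d/dx tanh(x) = 1 - tanh(x)^2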
Comparison of tanh to sigmoid?
better than sigmoid
-> output is zero-centered
but still causes vanishing gradient
ReLU (graph, formula and gradient)
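for reference (standard definition):
ReLU(x) = max(0, x)
d/dx ReLU(x) = 1 for x > 0, 0 for x < 0 (undefined at x = 0, usually set to 0)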
Properties of ReLU?
most common activation function
computationally efficient
converges very fast
Does not activate all neurons at the same time
Problems ReLU?
Gradient is zero for x<0 and can cause vanishing gradient → dead ReLUs may happen
Not zero-centered
Usage ReLU?
mostly in hidden layers
Leaky ReLU (graph, function and derivative)?
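for reference (standard definition; a is the small slope, often around 0.01, the lecture's value may differ):
LeakyReLU(x) = x for x > 0, a*x for x <= 0
d/dx LeakyReLU(x) = 1 for x > 0, a for x <= 0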
How does Leaky ReLU improve regular ReLU?
Removes the zero-gradient part of ReLU by adding a small slope for x < 0. More stable than ReLU, but adds another parameter
Computationally efficient
Converges very fast
Doesn't die
Parameter a can also be learned by the network
Exponential Linear Unit (ELU) (graph, function and derivative)
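for reference (standard definition; alpha is a hyperparameter, often 1):
ELU(x) = x for x > 0, alpha*(e^x - 1) for x <= 0
d/dx ELU(x) = 1 for x > 0, alpha*e^x for x <= 0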
Properties of ELU?
Benefits of ReLU and Leaky ReLU
Computation requires e^x
Softmax (Graph, function)
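for reference (standard definition; x_i is the i-th raw output, the sum runs over all outputs):
softmax(x)_i = e^(x_i) / sum_j e^(x_j)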
What is softmax and when to use?
generalization of the sigmoid function (to several outputs)
handy for classification problems
divides by the sum of all outputs
-> allows for percentage representation…
What is the rule of thumb for classifiers?
classifiers: Sigmoid/ReLu + Softmax
Sigmoid and tanh are sometimes avoided due to vanishing gradients
ReLU mostly used today
start with ReLU. If you don't get optimal results -> go for Leaky ReLU or ELU…
What does a fully connected layer compute?
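the answer text of this card is missing here; the usual answer is a weighted sum plus bias (y = Wx + b), often followed by an activation. Minimal NumPy sketch (all shapes and names are made-up examples):

import numpy as np

W = np.random.randn(4, 3)   # weight matrix: 4 inputs -> 3 outputs
b = np.zeros(3)             # bias vector
x = np.random.randn(4)      # input vector (e.g. a flattened image)

h = x @ W + b               # fully connected layer: weighted sum plus bias
out = np.maximum(0, h)      # optional activation, here ReLU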
What is a problem with Neural Networks w.r.t. putting images into it?
fixed input size
-> have to e.g. reshape / resample…
=> flatten image…
How can we perform gradient descent?
batch
all data at once, several times
stochastic
use one random data point at a time
mini-batch gradient descent
split the dataset into batches -> use them one after another
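minimal NumPy sketch of the loop (toy linear model; lr, batch_size and the squared-error loss are made-up examples, not the lecture's setup); batch_size = 1 gives stochastic GD, batch_size = len(X) gives batch GD:

import numpy as np

X, y = np.random.randn(1000, 5), np.random.randn(1000)   # toy dataset
w = np.zeros(5)                                          # model parameters
lr, batch_size = 0.01, 32

def grad(Xb, yb, w):
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)            # gradient of the batch's mean squared error

for epoch in range(10):
    idx = np.random.permutation(len(X))                  # shuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        w -= lr * grad(X[batch], y[batch], w)            # step on the local (batch) cost function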
What are the effects of stochastic gradient descent?
reduces compute time per optimization step
finding local minima takes longer
How does mini-batch gradient descent try to minimize the loss?
with the local cost function (of the batch it currently has…)
Why do we do mini batch?
usually not possible to train over the whole dataset at once (memory and computing power limitations)
stochastic vs mini-batch?
stochastic:
can miss local minima because of the randomness of each input
training takes longer in general
mini-batches
approach finds minimum quickly
but may need more computational power
How can we initialize weights? What are the results?
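the answer side of this card is missing here; as a hedged reminder of common options (not necessarily the lecture's list): all-zero init keeps every neuron identical, unscaled random init can make activations explode or vanish, scaled schemes such as Xavier or He keep them in a usable range. NumPy sketch:

import numpy as np
fan_in, fan_out = 256, 128

w_zero = np.zeros((fan_in, fan_out))                                # symmetric -> neurons never differentiate
w_rand = np.random.randn(fan_in, fan_out)                           # unscaled -> activations may explode/vanish
w_xavier = np.random.randn(fan_in, fan_out) * np.sqrt(1 / fan_in)   # Xavier: suited to tanh/sigmoid
w_he = np.random.randn(fan_in, fan_out) * np.sqrt(2 / fan_in)       # He: suited to ReLU-family activations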
What is the pipeline to train a neural network?
prepare input data
Prepare the data-pipeline to get the data into the network
Define the network
Initialize all variables
Train the Network
Supervise the Training