What is the motivation to use RNN?
use on sequential data
-> takes the relation of the current data to previous data into consideration…
-> e.g. trajectory estimation…
How could we model a neural network with sequential input?
e.g. sliding window
input x1, x2, x3 -> y3
input x2, x3, x4 -> y4
input x3, x4, x5 -> y5…
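A minimal sketch of this sliding-window approach, assuming a window length of 3 and a simple feedforward readout (all names and sizes are illustrative, not from the original card):

import numpy as np

def sliding_windows(x, window=3):
    # stack consecutive windows: (x1,x2,x3), (x2,x3,x4), ...
    return np.stack([x[i:i + window] for i in range(len(x) - window + 1)])

rng = np.random.default_rng(0)
x = rng.normal(size=10)            # toy 1-D sequence x1..x10
X = sliding_windows(x, window=3)   # shape (8, 3): one row per window
W = rng.normal(size=(3, 1))        # illustrative feedforward weights
y = np.tanh(X @ W)                 # y3, y4, ..., y10: one output per window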
How can we model a RNN?
have hidden states
-> output depends on hidden state
=> hidden states summarize the past sequence…
What are some use cases of RNNs?
Speech recognition and generation
Music recognition and generation
Translation
Image captioning
Video captioning
Prediction of movement of other traffic participants
Modeling dynamics of physical systems
How are dynamical systems connected to RNNs?
new state of the system
depends on previous state…
=> model with RNN…
Why RNN and not NN?
taking the whole sequence as input to a regular NN requires too many parameters…
=> RNNs share parameters over time and add memory to capture important features of the past
How can RNN be interpreted?
state space model
-> with free parameters we want to learn…
What are the basic formulas of RNN?
hidden state at t depends on function of previous hidden state (t-1) and current input x(t)
output solely depends on the hidden state (function of hidden state)
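In formulas (standard notation; f and g are the learned transition and output functions):

h_t = f(h_{t-1}, x_t), \qquad \hat{y}_t = g(h_t)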
What is the loss function in RNN?
sum over time-steps
-> basically sum over the losses for each time-step input-output relation…
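As a formula, with a per-step loss L_t summed over a sequence of length T:

L = \sum_{t=1}^{T} L_t(\hat{y}_t, y_t)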
What are the actual functions for hidden states and output in RNN?
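A common concrete choice is the vanilla (Elman) RNN with a tanh transition and an affine (or softmax) readout; the exact activations may differ from the original card:

h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h), \qquad \hat{y}_t = W_{hy} h_t + b_y

As a minimal sketch in code (names and shapes are illustrative):

import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    # vanilla RNN: new hidden state from previous hidden state and current input
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)
    y_t = W_hy @ h_t + b_y   # output depends only on the hidden state
    return h_t, y_t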
How does backpropagation work in RNN?
we backpropagate through time
-> e.g. dL_t / dW
-> chain rule over previous hidden states…
dL_t/dW = (dL_t/dh_t) * (dh_t/dW) + (dL_t/dh_t) * (dh_t/dh_(t-1)) * (dh_(t-1)/dW) + …
-> continue expanding until we reach h_0…
What is the general formula for BackProp in RNN?
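One standard compact form, summing the per-step losses and all paths through the hidden states (here ∂h_k/∂W denotes the immediate, per-step derivative):

\frac{\partial L}{\partial W} = \sum_{t} \sum_{k=0}^{t} \frac{\partial L_t}{\partial h_t} \left( \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} \right) \frac{\partial h_k}{\partial W}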
Are there different weight matrices for the different layers?
no -> the same weight matrices are shared across all time steps…
What is truncated backpropagation through time?
do not use full sequence for backprop
-> use chunk of unfolded graph (-> h(t) -> h(t+1) -> … h(t+tau)) of length tau
similar to minibatches in supervised learning…
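A minimal PyTorch-style sketch of the idea, assuming a simple nn.RNN model and a chunk length tau (all names, sizes, and the loss are illustrative):

import torch
import torch.nn as nn

tau = 20
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
readout = nn.Linear(16, 1)
opt = torch.optim.SGD(list(rnn.parameters()) + list(readout.parameters()), lr=0.01)

x = torch.randn(1, 200, 8)   # one long input sequence
y = torch.randn(1, 200, 1)   # toy targets per time step
h = torch.zeros(1, 1, 16)    # initial hidden state

for start in range(0, x.size(1), tau):
    out, h = rnn(x[:, start:start + tau], h)
    loss = ((readout(out) - y[:, start:start + tau]) ** 2).mean()
    opt.zero_grad()
    loss.backward()          # backprop only through this chunk of length tau
    opt.step()
    h = h.detach()           # cut the graph here: truncated BPTT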
What is a problem with truncated backprop through time?
biased…
-> the truncated gradient can differ from the true gradient
in e.g. a regular NN -> different gradients average out over minibatches
=> not given for the truncated RNN gradient!
Why is sequence truncation biased?
dependencies longer than tau are cut off completely -> their gradient contribution is always missing, so the error does not average out…
Why do we truncate sequences?
otherwise backprop would require lots and lots of memory
-> the gradient expression grows with sequence length (the number of terms grows quadratically)
the optimization is also not well conditioned for very long sequences
What are methods to choose the initial hidden state?
initialize as zero
most widely used
noisy zero mean
make robust against different initial values… (e.g. Xavier)
treat as a parameter to learn
use second neural network
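A sketch of three of these options in PyTorch (hidden size and names are illustrative):

import torch
import torch.nn as nn

hidden_size = 16

# option 1: zero initialization (most widely used)
h0_zero = torch.zeros(1, 1, hidden_size)

# option 2: noisy zero-mean initialization (robustness against the start state)
h0_noisy = 0.1 * torch.randn(1, 1, hidden_size)

# option 3: treat h0 as a parameter learned together with the rest of the model
h0_learned = nn.Parameter(torch.zeros(1, 1, hidden_size))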
What is teacher forcing?
at inference: the RNN feeds back its own output -> errors can accumulate
-> teacher forcing: during training, feed the model the ground-truth output instead of its own previous output…
-> the ground-truth label should be similar to the model's output once the model converges
=> due to this, we can simply do supervised learning with inputs x(t) and the ground truth y(t-1) and target y(t)… (see the sketch below)
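A minimal sketch of teacher forcing with an RNN cell that also receives the previous output as input (names and sizes are illustrative):

import torch
import torch.nn as nn

cell = nn.RNNCell(input_size=2, hidden_size=16)   # input: current x_t and previous y
readout = nn.Linear(16, 1)

x = torch.randn(50, 1)       # inputs x_1..x_T
y = torch.randn(50, 1)       # ground-truth outputs y_1..y_T
h = torch.zeros(1, 16)
y_prev = torch.zeros(1, 1)   # initial "previous output"

loss = 0.0
for t in range(x.size(0)):
    inp = torch.cat([x[t].view(1, 1), y_prev], dim=1)
    h = cell(inp, h)
    y_hat = readout(h)
    loss = loss + ((y_hat - y[t].view(1, 1)) ** 2).mean()
    y_prev = y[t].view(1, 1)   # teacher forcing: feed the ground truth,
                               # not the model's own prediction y_hat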
Advantages of teacher forcing?
avoid error accumulation
-> no need to backpropagate through time when the recurrence goes through the output (the time steps decouple during training)
Can we do regularization using dropout on RNN?
limited
-> only on non-recurrent weights
otherwise problematic, as it breaks the continuity of the hidden state and the network loses its ability to remember things
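In PyTorch, for example, the dropout argument of the stacked recurrent modules follows exactly this idea: it is applied between layers (non-recurrent connections) and not on the hidden-to-hidden transitions:

import torch.nn as nn

# dropout=0.5 is applied to the outputs of each layer except the last,
# i.e. only on the non-recurrent (layer-to-layer) connections
rnn = nn.LSTM(input_size=8, hidden_size=16, num_layers=2, dropout=0.5)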
What is the vanishing and exploding gradient problem?
e.g. the gradient w.r.t. an early hidden state is a product over many factors dh_k/dh_(k-1)
=> if the factor is exactly 1 -> the limit of the product is 1
if exactly -1 -> the limit is not defined (the product oscillates)
if > 1 -> explodes
if < -1 -> explodes (with alternating sign)
if between 0 and 1 in magnitude -> vanishes, as the product goes to zero in the limit
=> problem, as we do not learn anymore
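The effect comes from the product of Jacobians that appears in the gradient:

\frac{\partial h_t}{\partial h_0} = \prod_{k=1}^{t} \frac{\partial h_k}{\partial h_{k-1}}

-> if the factors are larger than 1 in magnitude, the product explodes with t; if smaller than 1, it vanishes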
How can we avoid vanishing and exploding gradients?
initialize W at the beginning with a good distribution
-> e.g. Xavier N(0, 1/n)
gradient clipping (see the sketch at the end of this list)
if ||g|| > v -> g <- (v/||g||) * g
skip connections
allow gradient to travel over e.g. hidden states
=> gradient can bypass node…
use different RNN structures (e.g. with gates)
LSTM
GRU
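A minimal sketch of gradient clipping by norm, matching the rule above (the threshold v is illustrative):

import numpy as np

def clip_by_norm(g, v):
    # rescale the gradient if its norm exceeds the threshold v
    norm = np.linalg.norm(g)
    if norm > v:
        g = (v / norm) * g
    return g

grad = np.array([3.0, 4.0])        # ||grad|| = 5
print(clip_by_norm(grad, v=1.0))   # rescaled to norm 1: [0.6 0.8]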