What is Deep Learning and what is the difference from Machine Learning?
Deep Learning is a subset of Machine Learning which uses artificial neural networks to automatically learn complex patterns from large datasets. In Machine Learning, some steps like Feature Extraction need to be done manually, while in Deep Learning the model learns to extract features by itself.
List some application fields for Deep Learning.
Wind Turbine Detection / Car Detection in Aerial Images
Semantic Segmentation of Aerial Images (segmentation into water/road/vegetation/building etc.)
Building Footprints from Aerial Images
3D City Models from Point Clouds
Wind speed from GNSS Reflectometry
What are the types of Machine Learning algorithms? Explain them shortly.
supervised learning: labelled training data is used to train the algorithm, which creates a model (hypothesis). This model can then be used to predict the labels of unlabelled data.
unsupervised learning: The algorithm puts unlabelled data into clusters. It can’t predict the labels without training data, but can cluster the data by similarity. To label the data, it has to be interpreted manually.
reinforcement learning: With reinforcement learning, the algorithm learns from rewards it receives when previous decisions were correct. That way, the algorithm develops a strategy to fulfill a goal.
What are the types of Machine Learning Problems? Explain them shortly. Which machine learning algorithms can be used to solve them?
Regression: Regression is a supervised learning problem. It uses the input data to fit a hyperplane in space (2D case: a line) which predicts values based on these inputs. The answer is therefore a continuous value (e.g. y = 2x + 3). It estimates the relationship between two or more variables. Example: house price vs. size. If the fitted function is a straight line (2D case), this is called Linear Regression.
Classification: Classification is a supervised learning problem. It uses input data to fit a hyperplane in space which separates two or more classes, using a sigmoid logistic function. With this model, the class of new, unlabelled data can be predicted, e.g. benign vs. malignant tumors or the soil type. The answer is discrete (x is either in class 1 or 2).
Segmentation: Segmentation is an unsupervised learning problem which provides a set of clusters. It groups similar data into clusters (the number of clusters can be set by the user or determined by the algorithm itself). Properties of new data are predicted by their closeness to the cluster centroids.
Write the sigmoid logistic function. What is it used for?
The sigmoid logistic function is:
σ(x) = 1 / (1 + e^(−x))
Its graph looks like a sideways S.
It is used for classification problems, for example for binary classification: it squashes any input score to a value between 0 and 1, which can be interpreted as a class probability.
What is a neuron (in the context of Deep Learning)? How is it structured?
A neuron in the context of Deep Learning is the basic computational building block of a neural network and is inspired by biological neurons.
A neuron consists of (often several) inputs, which are multiplied with weights according to their importance. These weighted inputs are then summed up in the "cell body" (together with a bias). This sum is then passed through an activation function to produce the neuron's output.
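A minimal NumPy sketch of this computation (weights, bias and inputs are made-up example values):
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.8, 0.1, -0.4])   # one weight per input
b = 0.2                          # bias

z = np.dot(w, x) + b             # weighted sum in the "cell body"
y = sigmoid(z)                   # activation function produces the output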
What is an activation function?
Name a few.
An activation function is a nonlinear function inside the cell body of a neuron that introduces nonlinearity to the neural network, meaning that it allows the network to model non-linear as well as linear relationships in the data. It also controls the output range of a neuron (e.g. to between 0 and 1).
The activation function is chosen by the designer of the network before training (so it is a hyperparameter, essentially).
Some activation functions are:
sigmoid function
tanh
ReLU
Linear
Softmax
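A minimal NumPy sketch of these functions (own illustration, not from the lecture):
import numpy as np

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
def tanh(x):    return np.tanh(x)
def relu(x):    return np.maximum(0.0, x)
def linear(x):  return x
def softmax(x):
    e = np.exp(x - np.max(x))    # shift by the max for numerical stability
    return e / e.sum()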
What is the difference between a neuron and a perceptron?
A perceptron is a simplified type of neuron that is designed for binary classification problems (like the first perceptron algorithm for letter recognition: Is it the letter B, or not the letter B?).
It uses the Heaviside step function H as activation function which outputs either 0 or 1.
It cannot model complex, nonlinear patterns on its own.
How is a neural network structured?
There are 3 types of layers: input layer, hidden layer(s) and output layer.
The input layer receives the raw input data and passes these forward.
The hidden layers lie between the input layer and the output layer and thus receive the inputs. There can be none, one or many hidden layers (a network with very many hidden layers is called a deep network). They consist of neurons which apply weights, a bias and an activation function to their inputs and pass their outputs on to the output layer (if there is only 1 hidden layer) or to the next hidden layer.
The output layer produces the network’s prediction or output.
What is a multilayer perceptron? Why is it used?
A multilayer perceptron is a type of neural network that consists of multiple hidden layers of neurons (not perceptrons!!) where every neuron of one layer is connected with all neurons of the next layer (= fully connected).
This helps with modeling nonlinear relationships as hidden layers can calculate complex functions better. An example of that would be the XOR problem.
How does a multiclass classifier with neurons work?
In a multiclass classifier, several neurons are grouped together in one layer so that the output is not a scalar but a vector, and each output stands for the score of one class (therefore there is one neuron for each class to be predicted). Then, the neuron with the highest score defines the predicted class.
What is the XOR problem?
XOR means that the output must be 1 if exactly one of the 2 inputs is 1. So (1|1) = 0, (0|0) = 0, but (1|0) = 1 and (0|1) = 1.
This is not linearly separable with one-layer neural networks, because it has no linear decision boundary. Therefore, several neurons need to be used to model a more complex relationship, and there needs to be at least one hidden layer (see the sketch below).
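As an illustration (a hand-crafted sketch, not from the lecture), a tiny two-layer network with fixed weights solves XOR:
def step(z):                      # Heaviside-like step activation
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)      # hidden neuron 1: acts like OR
    h2 = step(x1 + x2 - 1.5)      # hidden neuron 2: acts like AND
    return step(h1 - h2 - 0.5)    # output: OR and not AND = XOR

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor_net(a, b))    # prints 0, 1, 1, 0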
Which different structures can be used in neural networks? List three.
feedforward networks: every node feeds its output to the next. Examples: Sequential NN, CNN, simple MLP
residual networks: use skip connections, which helps train very deep networks.
Recurrent Networks: Data can loop back to earlier layers, so the network has a “memory” of previous steps. Used for sequences like text or time series
Which steps are performed in the Neural Network training process?
weight initialization (e.g. with random Gaussian noise)
forward pass (input goes through -> network gives prediction)
evaluation on how well the network performs (with a loss function)
backward pass (backpropagation; computes how each weight contributes to the error)
(stepwise) improvement of the weights (e.g. with gradient descent)
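A minimal PyTorch sketch of these steps (model, data and hyperparameters are made up for illustration):
import torch, torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))  # weights get initialized here
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

X = torch.randn(64, 10)            # dummy inputs
y = torch.randint(0, 2, (64,))     # dummy labels

for epoch in range(100):
    pred = model(X)                # forward pass
    loss = loss_fn(pred, y)        # evaluation with the loss function
    opt.zero_grad()
    loss.backward()                # backward pass (backpropagation)
    opt.step()                     # stepwise improvement of the weights (gradient descent)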
Is it a good idea to initialize the weights at 0? Why or why not?
If all weights are set to 0, every neuron in a layer will do the same thing.
This is called the symmetry problem.
If all weights are initialized to 0, every neuron in a layer computes the same weighted sum and therefore the same activation output. Then, the gradients computed during backpropagation will also be the same for every neuron, so all weights receive identical updates. The network therefore behaves as if there was only one neuron.
So, it’s not a good idea; it’s better to initialize the weights randomly.
Which loss functions have we learned about?
Regression tasks:
Mean Square Error MSE
Root Mean Square Error RMSE
-> these two quantify how far the predicted value is from the true value
Classification tasks:
Softmax Loss Function (also Categorical Cross-Entropy loss) -> for multi-class classification, computes the probability of all K classes and then compares these probabilities to the true class; penalizes the model heavily when the correct class has a low probability
How does the Softmax loss function work? Give the equations.
The Softmax loss function consists of a softmax activation plus a cross-entropy loss.
The softmax activation function computes the probabilities of all K classes:
p_k = e^(s_k) / Σ_j e^(s_j)   (s_k = raw score of class k)
These probabilities are then put into the cross-entropy loss:
CE = −Σ_k t_k · log(p_k)   (t_k = 1 for the correct class, 0 otherwise)
This function penalizes the model heavily when the probability for the correct class is low, due to the −log function: −log(p) is 0 at p = 1 and grows towards infinity as p approaches 0.
Therefore, if the predicted probability for the correct class is 1, the log term becomes 0 -> zero loss. The probabilities of the incorrect classes do not contribute directly, since their target values are 0. However, when the probability of the correct class is < 1, the function penalizes the model more the lower this probability is.
Putting both equations together, the softmax loss function looks like this:
L_i = −log( e^(s_y) / Σ_j e^(s_j) )   (s_y = score of the correct class)
Since the target values for the incorrect classes are 0, only the logarithm remains from the summation. And since the target value for the correct class is 1, it’s also no longer needed in the equation.
The overall loss for the algorithm is the mean of the losses of all inputs.
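A NumPy sketch of the softmax loss for a single input (scores and true class are example values):
import numpy as np

scores = np.array([2.0, 1.0, 0.1])    # raw class scores from the network
true_class = 0

p = np.exp(scores - scores.max())     # softmax (shifted for numerical stability)
p = p / p.sum()

loss = -np.log(p[true_class])         # cross-entropy: only the correct class term remains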
What does the Gradient Descent algorithm do?
The Gradient Descent algorithm finds optimal weights by tweaking the model parameters in order to minimize the loss function.
How does the Gradient Descent algorithm work?
The Gradient Descent algorithm measures the local gradient of the loss function and steps in the direction of the descending gradient with a user-determined step size (learning rate) until a minimum is reached:
w ← w − η · ∇_w L(w)   (η = learning rate)
To measure the gradient vector, backpropagation is used for neural networks. This gradient vector is made from the partial derivatives of the loss function with respect to each model parameter.
How does Backpropagation work? Give the equations as well.
Backpropagation is used to compute how each weight contributes to the overall loss of a neural network.
This is done by taking the (partial) derivative of the loss function with respect to each weight/model parameter, starting with the last layer and moving backwards.
This derivative can be decomposed into several computations using the chain rule, since the local function of each node is just a small part of the larger loss function:
L = Loss
f(x,w) = local function
Using the chain rule:
dL/dw = dL/df(x,w) · df(x,w)/dw
The result of f(x,w) is known from the forward pass which saves all the outputs.
The same equation in a different notation:
dz/dx = dz/dy · dy/dx
with z = loss function, x = input/weight we want to compute the derivative of and y the local function f.
The first part, dz/dy, is also called the upstream gradient (partial derivative of the loss function with respect to the local function), while the second part, dy/dx, is called the local gradient (partial derivative of the local function with respect to its input).
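A tiny hand-computed example of this chain rule (values are made up): let f(x, w) = w · x and L = (f − t)².
x, w, t = 2.0, 0.5, 3.0

f = w * x                  # forward pass: f = 1.0
L = (f - t) ** 2           # loss: 4.0

dL_df = 2 * (f - t)        # upstream gradient: -4.0
df_dw = x                  # local gradient:     2.0
dL_dw = dL_df * df_dw      # chain rule:         -8.0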
Which value does the upstream gradient at the loss function have?
At the loss, the upstream gradient is 1, because it is the derivative of the loss function with respect to itself: the upstream gradient is the partial derivative of the loss function with respect to the local function, and at the loss node the local function is the loss function itself.
What influence does the learning rate have?
The learning rate is a hyperparameter which determines how big the step size for gradient descent is. If it is too small, the algorithm takes a very long time to converge and might get stuck in a local minimum or at a plateau. If the learning rate is too big, the algorithm might overshoot the minimum, which leads to a diverging algorithm instead of a converging one.
To find a good learning rate, an optimization algorithm can be used.
Why is using a Momentum for a Descent Algorithm useful?
For stochastic gradient descent, the algorithm might get stuck in a local minimum or a plateau, and if the loss value has a steep slope in one direction, but a shallow slope in the other direction, stochastic gradient descent shows slow progress.
Using Mini-batches also causes noisy gradients.
A momentum is used to overcome local minima and saddle points by building up “velocity”, therefore “smoothing” the gradient updates over time and balancing out noisy gradients.
A new step is calculated like:
m ← β · m + ∇_w L
w ← w − η · m
with m = momentum, β = friction and η = learning rate.
Here, the step size is multiplied with the momentum, which itself is calculated using the past momentum as well as the gradient.
Gradient descent with momentum goes faster and faster downhill until it reaches the bottom, then may overshoot the optimum a bit, comes back, overshoots again, and oscillates like this many times before stabilizing at the minimum. Friction gets rid of this oscillation and thus speeds up convergence
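A small Python sketch of this update rule (same convention as the equations above; hyperparameter values are illustrative):
def momentum_step(w, m, grad, lr=0.01, beta=0.9):
    m = beta * m + grad    # build up "velocity" from past gradients
    w = w - lr * m         # step size is multiplied with the momentum
    return w, m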
What is the Nesterov Momentum?
The Nesterov Momentum, also called Nesterov Accelerated Gradient (NAG), measures the gradient of the cost function not at the local position, but slightly ahead in the direction of the momentum.
This means that the Nesterov momentum computes the gradients after the momentum step, while regular momentum calculates momentum after calculating the gradient. This regulates overshooting (less oscillation around minima) and results in a faster convergence (significantly faster than using regular momentum!)
What is the Vanishing Gradient Problem?
The Vanishing Gradient problem occurs when the gradients become extremely small as they are propagated backwards through many layers.
Since many derivatives are multiplied during backpropagation, if these derivatives are less than 1, repeatedly multiplying them causes the overall gradient to shrink exponentially towards the input layer.
As a result, the earlier layers learn very slowly or may stop learning entirely.
What can be done about the Vanishing Gradient Problem?
use different activation functions: ReLU helps with the problem, since its derivative equals 1 for positive inputs, unlike sigmoid and tanh activations, whose derivatives are much smaller
Better weight initialization (e.g. Xavier or He initialization)
Batch normalization reduces internal covariate shift (every layer constantly receives different data statistics (means, variances) from the layer before -> unstable and slower training) and stabilizes gradients
Residual connections allow gradients to skip layers, helping them flow backward more effectively
What is Covariate Shift?
When the input data distribution changes between training and testing, e.g. when training on images of indoor cats and then testing on images of cats outside.
What optimization methods did we learn about?
using momentum like Nesterov momentum
AdaGrad
RMSProp
Adam
What is AdaGrad and what does it do?
AdaGrad (Adaptive Gradient) is an optimization algorithm, meaning that it helps adjust the learning rate.
Gradient Descent quickly goes down the steepest slope, but then continues only slowly towards the bottom. AdaGrad scales down the gradient vector along the steepest dimension. The key idea is: Parameters with frequent large gradients get smaller updates, while rare or sparse features get larger updates.
It does that by adapting the learning rate for each feature based on the historical gradients - it accumulates all past squared gradients.
It automatically decreases the learning rate over time and handles sparse data well (e.g., NLP tasks, text, or one-hot encoded features), but doesn't handle dense data that well; there, RMSProp or Adam work better.
AdaGrad often stops too early when training neural networks, because the learning rate gets scaled down close to zero before reaching the global optimum.
What is RMSProp and what does it do?
RMSProp (Root Mean Square Propagation) is an optimization algorithm which is an improved version of AdaGrad, since AdaGrad’s learning rate may get scaled to zero before reaching the global minimum.
RMSProp is designed in a way so that only recent gradients influence the learning rate, preventing it from shrinking too much by using an exponential decay.
It differs from AdaGrad only by the exponential decay of the accumulated squared gradients, but almost always performs much better than AdaGrad.
What is Adam and what does it do?
Adam (=Adaptive moment estimation) is the most popular optimization algorithm and combines RMSProp and momentum.
Adam computes adaptive learning rates for each parameter, like RMSProp, but also includes momentum, which smooths updates using past gradients.
Adam requires less tuning of the learning rate.
Variants are: AdaMax (more stable on some datasets, but Adam generally performs better) and Nadam (Adam plus the Nesterov trick; converges slightly faster and generally outperforms Adam).
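A NumPy sketch of the three adaptive update rules discussed above (simplified, per parameter; default hyperparameter values are assumptions):
import numpy as np

def adagrad_step(w, s, grad, lr=0.01, eps=1e-8):
    s = s + grad**2                        # accumulate ALL past squared gradients
    return w - lr * grad / (np.sqrt(s) + eps), s

def rmsprop_step(w, s, grad, lr=0.001, rho=0.9, eps=1e-8):
    s = rho * s + (1 - rho) * grad**2      # exponential decay -> only recent gradients count
    return w - lr * grad / (np.sqrt(s) + eps), s

def adam_step(w, m, v, grad, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad           # momentum (1st moment)
    v = b2 * v + (1 - b2) * grad**2        # RMSProp-like scaling (2nd moment)
    m_hat = m / (1 - b1**t)                # bias correction for the early steps
    v_hat = v / (1 - b2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v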
What is a CNN?
A CNN (Convolutional Neural Network) is a type of deep learning model designed to process grid-like data, typically images, but also other spatially structured data (like satellite imagery, elevation grids, or rasterized maps).
Traditional neural networks treat each input feature separately, but CNNs consider the relation between nearby pixels -> they preserve spatial relationships. Additionally, CNNs can also handle multi-channel data, for example spectral bands from Sentinel-2, not just RGB.
What does a CNN consist of?
A Convolutional Neural network consists of
convolutional layers (apply a set of filters to the input; each filter extracts a particular pattern and outputs a feature map which shows where that pattern appears)
activation function
pooling layer (reduces spatial size and keeps only most important information)
fully connected layer (flattens the extracted features and feeds them into a normal neural network layer for classification/regression)
What is a fully connected layer?
A fully connected layer connects every input neuron to every output neuron with a learnable weight: y = W*x + b.
x is a 1D vector here, which means that for images, the spatial structure is lost because the layer treats the input as a flat vector.
Since every input neuron is connected to every output neuron, there are many parameters involved, meaning for large inputs such as high resolution images (or even just 256x256 images) the network is very slow.
Why are CNNs better for images than fully connected networks?
Fully connected neural networks connect every input node to every output node, and each connection adds a weight. This results in a very large number of weights, so the network can be very slow for images, which have a large number of pixels.
Example: Image of size 256x256 = 65k neurons.
CNNs don’t look at the whole image at once, making it more efficient.
Additionally, fully connected neural networks take a 1D vector as input, meaning that the spatial structure is lost, while for CNNs, the original shape is kept.
What is a Convolutional Layer and what does it do?
A Convolutional Layer is used in CNNs to detect local patterns in the data (like edges, corners, textures, shapes etc.) using filters (=kernels), which slide over the image.
The filter size is configurable (chosen by the network architect), but filters always extend over the full depth of the input volume (over all channels). This means that if an image is in RGB, the image has a depth of 3 and so does the filter.
As the filter slides over the image, it calculates the dot product between the part of the image overlaid by the filter and the filter weights. The filter weights depend on what the filter is meant to detect, so an edge detector filter has different weights than another filter. Each filter activates strongly for a certain type of input pattern, resulting in high values in its output channel at the locations where that pattern is detected.
Mathematically:
y(i, j) = Σ_c Σ_m Σ_n w(m, n, c) · x(i+m, j+n, c) + b
The convolution operator effectively computes a weighted linear combination of the inputs 𝑥 that are overlaid by the kernel filter having filter weights w. The output activation map has a depth of 1 as all channels are combined.
For an image with size 32x32x3, what would be the output size for filters of size F = 1, 3, 7, or 11?
What is the effect of a filter with size F = 32?
(With stride=1)
For a filter with size 1, the size of the output activation map would stay the same as no boundary elements are left out.
For a filter with size 3 (3x3), a boundary of size 1 is left out, therefore the output activation map has a size of 30x30.
For a filter with size 7, a boundary of size 3 is left out, therefore the output activation map has a size of 26x26.
For a filter with size 11, a boundary of size 5 is left out, therefore the output activation map has a size of 22x22.
A filter with size 32 would produce an output activation map of 1x1.
The relationship is: output height = input height - (filter height -1)
What is an Activation Map? How big is it?
An Activation map is the output of a filter in a convolutional layer.
The size of the output activation map depends on the input and filter size; for a 7x7 input with a 3x3 filter, the output activation map has a size of 5x5 -> this is because filters with size > 1 can't be centered on the boundary elements. However, there are methods to keep the output size.
If there are n filters, there are n (output) activation maps.
n filters produce n activation maps, which are stacked together into an "activation volume" with depth n (e.g. 6 filters produce an activation volume of depth 6).
What is ReLU? Write down the equation.
ReLU (Rectified Linear Unit) is nowadays the most commonly used activation function for neural networks. It clamps negative input values to 0 and keeps positive values the same.
ReLU(x) = max(0, x)
What is the stride of a filter?
The stride of a filter in a convolutional layer of a CNN determines how far (how many input cells) the filter is moved at each step (both in column and row direction).
Therefore, if the stride parameter is increasing, the size of the output activation map is decreasing.
output height = (input height - filter height)/stride + 1
(for stride > 1)
The step size and the filter size must be defined in such a way that the filter completely fits the input without omitting the last column or row -> the filter has to “see” everything -> the equation above must result in an int, not a decimal number
How can the size of the input be preserved in a CNN?
Normally, applying a filter decreases the size of the output activation map in a CNN.
However, with zero padding, the size of the input can be preserved.
To each side, (F−1)/2 pixels with value 0 are added (F = filter size); for F = 3, this means one row/column of zeros on each border.
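A small helper (own sketch) for the general output-size formula, combining filter size F, stride S and padding P:
def conv_output_size(input_size, F, S=1, P=0):
    # output = (input - F + 2*P) / S + 1 ; must come out as an integer
    return (input_size - F + 2 * P) // S + 1

conv_output_size(32, 7)         # 26, matches the 32x32 example above
conv_output_size(32, 3, P=1)    # 32, zero padding preserves the input size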
Imagine 10 filters with size 5x5x3 are applied to an image. How many parameters do all 10 filters together have?
5x5x3 + 1 = 76 (+1 for bias term)
then, 10 filters x 76 = 760 parameters
What is a 1x1 Convolution and why is it useful?
A 1x1 Convolution means that the filter has a size of 1x1xn (n = depth of the input volume). Therefore it performs an n-dimensional dot product.
The filter doesn't look at neighboring pixels; it only looks at the same pixel location across the different feature channels. 1x1 convolutions act as a bottleneck layer, compressing many channels into fewer ones (they combine feature information instead of spatial information), reducing computational cost while retaining important information.
This is used in Google’s Inception Networks, which use 1×1 convolutions to shrink data before expensive 3×3 or 5×5 convolutions.
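A PyTorch sketch of a 1x1 convolution used as such a bottleneck (channel numbers are made up):
import torch, torch.nn as nn

x = torch.randn(1, 256, 28, 28)                 # 256 feature channels, 28x28 spatial size
bottleneck = nn.Conv2d(256, 64, kernel_size=1)  # 1x1 conv mixes channels, keeps the spatial size
y = bottleneck(x)                               # shape: (1, 64, 28, 28)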
What is a Pooling Layer and how does it work? Why is it used?
A Pooling Layer in a CNN is a layer which decreases the size of the activation volumes. It is applied after activation volumes are created -> after a convolutional layer.
It decreases the size of the activation volume by operating over each activation map independently. It has no trainable parameters.
It is used to
reduce dimensionality -> makes network more efficient
reduce overfitting by summarizing information -> performs better on new, unseen data
extract translational invariant features -> network learns to recognize important features regardless of exact position in the input image
summarize information
What are the two most common pooling methods?
max pooling (takes the maximum of each pooling window) -> Helps to capture the most prominent features and reduces noise
average pooling (takes the average value from each pooling window) -> Smooths the features, which can be useful in certain applications
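A short PyTorch illustration of both pooling methods (2x2 window, stride 2; shapes are example values):
import torch, torch.nn as nn

x = torch.randn(1, 6, 32, 32)        # activation volume with 6 maps of size 32x32
max_pooled = nn.MaxPool2d(2)(x)      # -> (1, 6, 16, 16), keeps the strongest responses
avg_pooled = nn.AvgPool2d(2)(x)      # -> (1, 6, 16, 16), smooths the responses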
What is the receptive field in CNNs?
The receptive field of a neuron in a CNN is the region of the input image that can influence the value of that neuron.
So, for example, when the input image has a size of 32x32 and in the first convolutional layer, a 3x3 filter is applied, the receptive field is of size 3x3.
Now, what area of the input image does a second convolutional layer with filter size 3x3 see?
The formula (for stride 1) is: receptive field of layer l = receptive field of layer l−1 + (F − 1). So the second 3x3 convolutional layer sees a 5x5 area of the input image.
So with every layer, the receptive field expands.
Example: A stack of three 3x3 convolutional layers has an effective receptive field of 7x7.
List some common CNN architectures and say 1-2 things about them.
LeNet-5: The earliest CNN architecture, created for digit recognition
AlexNet: kicked off the modern deep learning revolution, used ReLU activation for the first time; produces a large network with 60 million trainable parameters
VGG: deep architecture with 16-19 layers & more non-linearities than AlexNet, but fewer parameters
GoogLeNet/Inception: deeper than VGG with 22 layers with subnetworks (inception modules) that are stacked on top of each other
ResNet: very deep network (up to 152 layers), uses residual connections
What is the typical architecture of a CNN?
stack a few convolutional layers, each followed by ReLU activation layer
then a pooling layer to shrink the images
then a few fully connected layers with ReLU activation layers
finally, a softmax layer that outputs the class probabilities
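A minimal PyTorch sketch of such a typical architecture (layer and channel sizes are arbitrary example values, assuming 3x32x32 inputs and 10 classes):
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),   # a few conv layers, each followed by ReLU
    nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                             # pooling layer shrinks the feature maps
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 64), nn.ReLU(),      # fully connected layers with ReLU
    nn.Linear(64, 10),
    nn.Softmax(dim=1),                           # softmax outputs the class probabilities
)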
What is Padding in the context of CNNs?
When you apply a convolution (for example a 3×3 filter) to an image, the filter cannot be centered on pixels at the edge without "falling off" the image. So normally, the output becomes smaller after each convolution.
This shrinking is a problem because too much shrinking can lead to information loss and very deep networks would shrink the image too much.
Padding adds extra pixels around the border of the input before applying the convolution so that the filter can slide over the edge of the “actual” input.
Zero Padding is Padding with pixel value 0. Example: Since most images have pixel values between 0 and 255, adding 0 around the edges is equivalent to adding a black border.
What is LeNet-5? How is it structured?
LeNet-5 is a CNN architecture that was created for handwritten digit recognition in 1998. Almost all modern architectures follow in its footsteps today.
The first layer is a convolutional layer with 6 filters and a tanh activation. Then average pooling follows, which reduces spatial resolution and noise, also using a tanh activation function. Then another convolutional layer with 16 filters follows, and then again average pooling. The last convolutional layer comes after this and acts like a fully connected layer, since the filter spans the entire spatial dimension (the input is 5x5 and so is the filter).
After that, a fully connected layer is implemented.
What is AlexNet? Why was it revolutionary/special?
AlexNet was the first CNN to massively outperform classical methods on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012.
AlexNet was the first major model that used the ReLU activation function instead of tanh or sigmoid. This allowed for deeper networks and faster training as it solved the problem of vanishing gradients.
It consists of 8 layers (5 convolutional and 3 fully connected layers) with 60 million parameters. This required a huge computational effort for that time and therefore used GPUs for training, which was new.
How is AlexNet structured?
AlexNet consists of 8 layers (not counting the input and output layers).
Conv1 with 96 filters, a ReLU activation, max pooling and Local Response Normalization (LRN)
Conv2 with 256 filters, followed by pooling layer
3 Convolutional layers, followed by 1 pooling layer
3 fully connected layers
output layer (softmax)
What is VGG? Why is it special?
The VGG family includes VGG16 and VGG19 — named after the number of weight layers.
These networks were 2nd place in ImageNet 2014 but became much more influential than the winning model (GoogLeNet) because of their simplicity.
VGG demonstrated that CNNs become better by going deeper if consistent architecture choices are used, and that there is no need to use big filters (VGG uses only small 3x3 filters with stride 1 and padding 1). This leads to more non-linearities, as there are more activation functions when stacking three 3x3 layers compared to using just one 7x7 layer. It also leads to fewer parameters, meaning the network can be deeper for the same computational cost.
After every 2-3 convolutional layers, a 2x2 max pooling layer with stride 2 is implemented, which halves the spatial resolution.
For its time, VGG was extremely deep, with VGG19 having 19 weight layers while AlexNet had 8.
The fully connected layers of VGG are very large, making VGG heavier overall than e.g. AlexNet (~ 140 million parameters), however, transfer learning is quite good with VGG.
What is GoogLeNet? Why is it special?
GoogLeNet was the winner of ImageNet 2014. It introduced the Inception module, a brand-new idea that changed how networks were constructed.
VGG proved that deeper networks work, but at a cost - a lot of parameters, very slow, etc. GoogLeNet solved this.
It is even deeper than VGG with 22 layers, but uses only about 5 million parameters (since it uses almost no fully connected layers).
The Inception module processes the same input in parallel through multiple filter sizes and concatenates all filter outputs. These inception modules are then stacked with some pooling layers in between.
GoogLeNet also replaced large fully connected layers with global average pooling, which takes each feature map, averages it to a single number and feeds these numbers directly into softmax, which reduces parameters and is spatially interpretable compared to a fully connected layer.
What is Transfer Learning?
Transfer Learning is a technique where a model trained on one task (usually with lots of data) is partially or fully reused for a different but related task.
This is done by reusing most of the network and retraining only the last layers. For example, VGG's original fc8 (the last fully connected layer) outputs 1000 ImageNet classes, but the new task might only have 20 classes. Therefore, fc8 gets replaced accordingly and the rest stays the same, since e.g. the convolutional layers have already learned general features such as edges and shapes.
This is useful because with Transfer Learning, the training dataset can be much smaller, lots of parameters are frozen and, since most of the network is already trained, the computational effort is much lower.
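A hedged torchvision sketch of exactly this VGG example (assumes a recent torchvision version and a new task with 20 classes):
import torch.nn as nn
from torchvision import models

model = models.vgg16(weights="IMAGENET1K_V1")   # pretrained on ImageNet (1000 classes)

for p in model.features.parameters():           # freeze the convolutional feature extractor
    p.requires_grad = False

model.classifier[6] = nn.Linear(4096, 20)       # replace "fc8": 1000 ImageNet classes -> 20 new classes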
What is the Inception module? How is it structured?
The Inception module is an architectural block first used in GoogLeNet.
The idea is: Instead of choosing a single filter size, the network uses all of them in parallel.
It has 4 branches with 3 different filters (1x1, 3x3, 5x5): The 1x1 learns simple linear combinations, 3x3 learns medium-size features and 5x5 even larger features (for 5x5, the channels are reduced first by using 1x1 before). The 4th branch uses max pooling.
All 4 outputs (feature maps) are then concatenated along the channel dimension (so that there’s one feature map left with lots of channels).
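A simplified PyTorch sketch of such an Inception-style module (channel counts are illustrative, not the original GoogLeNet values):
import torch, torch.nn as nn

class InceptionBlock(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 32, 1)                                                       # 1x1 branch
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, 32, 1), nn.Conv2d(32, 64, 3, padding=1))       # 1x1 -> 3x3
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 16, 1), nn.Conv2d(16, 32, 5, padding=2))       # 1x1 -> 5x5
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1), nn.Conv2d(in_ch, 32, 1))  # pooling branch

    def forward(self, x):
        # every branch keeps the spatial size; outputs are concatenated along the channel dimension
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)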
What is Global Average Pooling?
Global Average Pooling GAP averages the values of each feature map to produce one scalar value per feature map. It is used in GoogLeNet.
Since the last convolutional layer outputs a feature map for each output category, GAP then produces one scalar value per output category; these feature maps can be interpreted as class confidence maps/class activation maps (where the model "sees" the class in the picture, like a heatmap). The averaged values are then taken directly as input for the softmax function to calculate the class probabilities.
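In code, GAP is just a mean over the spatial dimensions (PyTorch sketch with example shapes):
import torch

feature_maps = torch.randn(1, 10, 7, 7)    # one 7x7 feature map per class (here 10 classes)
gap = feature_maps.mean(dim=(2, 3))        # -> shape (1, 10): one scalar per feature map
probs = torch.softmax(gap, dim=1)          # fed directly into softmax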
What is ResNet? Why is it special and how is it structured?
ResNet was introduced in 2015 and uses residual connections (= shortcut connections).
Residual connections between layers feed the (input) signal both into the next layer and also add it to the output of a layer located further on in the network. This means that the network learns the residual between the input and output rather than the output directly. This allows layers to be effectively skipped (= change nothing when the output is already good).
Example: When earlier layers already extracted strong edges and textures, the residual block only needs to tweak the features lightly, not learn everything from scratch. Without residuals, each layer is forced to transform its input. This also helps with vanishing gradients.
ResNet uses a large convolutional layer in the beginning to reduce image size, then stacks several residual blocks. Periodically, the number of filters is doubled while a stride of 2 is used to spatially downsample (more filters, more features can be seen at the same spatial area). In the end, global average pooling and a fully connected layer are used to output class scores. For very deep ResNets, a “bottleneck” layer is used to improve efficiency (by lowering the number of filters with a 1x1 convolution, then using more filters again, but with a bigger filter).
Is a deeper network always better?
No.
Stacking more and more convolutional layers in the conventional way doesn't lead to better network performance, e.g., a 56-layer network performs worse on training and test error than a 20-layer model. However, this is not caused by overfitting, but is an optimization problem, since deeper models are harder to optimize. During backpropagation, the gradient gets multiplied over many layers in a deep network, leading to vanishing/exploding gradients. Also, layers could destroy good features if the filters are not perfect yet.
Using ResNet solved this problem.
Compare VGG, ResNet, GoogLeNet and AlexNet. What are their strengths?
AlexNet has fewer computations, but uses much memory and has a lower accuracy.
VGG has the highest memory demands and does the most operations, but is more accurate than AlexNet.
GoogLeNet is the most efficient.
ResNet has moderate efficiency depending on the model, but the highest accuracy.
What is Batch Normalization?
Batch normalization is a technique used in deep learning to speed up and stabilize neural network training by normalizing the inputs to each layer within a mini-batch.
For each mini-batch, the mean and variance of the inputs to a layer are computed. These values are then used to normalize the inputs. A small value (𝜀) is added to the variance for numerical stability.
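A NumPy sketch of this normalization step (the learnable scale and shift parameters gamma and beta from standard batch norm are included as an assumption):
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    mean = x.mean(axis=0)                    # mean over the mini-batch
    var = x.var(axis=0)                      # variance over the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # normalize; eps for numerical stability
    return gamma * x_hat + beta              # learnable scale and shift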
How do Residual Connections help build a better, deep network?
Imagine Y is the output, X is the input. F(X) is the correction the residual block applies to X.
In a conventional CNN, a filter is applied to an input, which then computes the output. During backpropagation, the weights of the filter get updated, so that during the next forward pass the transformed input matches the desired output better.
In a ResNet, what the network learns is the residual between the input and the output. So the filter slides over the input, but its feature map gets added to the input again -> output. When the filter weights are adjusted during backpropagation, only the residual changes. Therefore, if the input is already very good, only small or no changes (=identity mapping) need to be made (similar to a painter only refining the parts of the painting that aren’t good yet, while in conventional CNNs, it would mean to redraw completely every time).
At the same time, during backpropagation, since gradients are computed using the chain rule, there are a lot of multiplications in very deep networks, making the gradient potentially very small by the time it reaches the early layers (vanishing gradient problem, slower learning). A residual block changes this by multiplying with F'(X) + 1 (the derivative of F(X) + X) -> the gradient flows directly also to earlier layers, even when F'(X) is small, allowing better learning even for early layers.
Why does stacking layers help with extracting features?
A small convolutional layer has a small receptive field. If several small conv. layers are stacked, the receptive field gets bigger and bigger, allowing the network to “see” object parts like e.g. wheels instead of just corners.
What does a Residual Block consist of in ResNet?
A residual block consists of two 3x3 convolutional layers with ReLU activation functions.
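A PyTorch sketch of such a block (simplified; no batch norm and no downsampling):
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)    # residual connection: the input is added back onto the output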
What are the 2 initialization methods we learned about?
Xavier initialization: designed for tanh/sigmoid activations. Each weight is drawn uniformly from the range ±√(6 / (n_in + n_out)) (n_in = number of input neurons, n_out = number of output neurons). It balances input and output and has a larger range.
He (Kaiming) initialization: designed for ReLU activations. Each weight is drawn from a normal distribution with mean 0 and variance 2/n_in. The weights need to be a bit bigger than for Xavier initialization, since ReLU zeros out all negative inputs, which shrinks the variance -> vanishing gradient problem.
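A NumPy sketch of both initializations for a weight matrix of shape (n_out, n_in):
import numpy as np

def xavier_init(n_in, n_out):
    limit = np.sqrt(6.0 / (n_in + n_out))                  # uniform in [-limit, +limit]
    return np.random.uniform(-limit, limit, (n_out, n_in))

def he_init(n_in, n_out):
    std = np.sqrt(2.0 / n_in)                              # normal with variance 2/n_in
    return np.random.normal(0.0, std, (n_out, n_in))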