What is Deep Learning and what is the difference to Machine Learning?
Deep Learning is a subset of Machine Learning which uses artificial neural networks to automatically learn complex patterns from large datasets. In Machine Learning, some steps like feature extraction need to be done manually, while in Deep Learning the model learns to extract features by itself.
List some application fields for Deep Learning.
Wind Turbine Detection / Car Detection in Aerial Images
Semantic Segmentation of Aerial Images (segmentation into water/road/vegetation/building etc.)
Building Footprints from Aerial Images
3D City Models from Point Clouds
Wind speed from GNSS Reflectometry
What are the types of Machine Learning algorithms? Explain them shortly.
supervised learning: usage of training data which is labelled to train the algorithm which creates a model (hypothesis). This model can then be used to predict the labels of unlabelled data.
unsupervised learning: The algorithm puts unlabelled data into clusters. It can’t predict the labels without training data, but can cluster the data by similarity. To label the data, it has to be interpreted manually.
reinforcement learning: The algorithm learns from rewards it receives for previous decisions, being rewarded when they were correct. That way, the algorithm develops a strategy to fulfill a goal.
What are the types of Machine Learning Problems? Explain them shortly. Which machine learning algorithms can be used to solve them?
Regression: Regression is a supervised learning problem. It uses the input data to fit a hyperplane in space (2D case: a line) which predicts values based on these input values. Therefore, the answer is a continuous value (e.g. y = 2x + 3). It estimates the relationship between two or more variables. Example: house price vs. size. When, in the 2D case, the fitted function is a straight line, this is called Linear Regression.
Classification: Classification is a supervised learning problem. It uses input data to fit a hyperplane in space which separates two or more classes using a sigmoid logistic function. With this model, the class of new, unlabelled data can be predicted, like benign vs. malignant tumors, predicting soil type from data etc. The answer is discrete (x is either in class 1 or 2).
Segmentation: Segmentation is an unsupervised learning problem which provides a set of clusters. It groups similar data into clusters (the number of clusters can be determined or the algorithm determines them itself). It predicts properties by closeness to the centroids of the clusters.
Write the sigmoid logistic function. What is it used for?
The sigmoid logistic function is σ(x) = 1 / (1 + e^(-x))
and its graph looks like a sideways S.
It is used for classification problems, for example for binary classification, where its output can be interpreted as the probability of belonging to the positive class.
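As a small illustration (a minimal NumPy sketch, not from the lecture; the function name is my own), the formula above translates directly into code:

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: squashes any real value into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Large negative inputs go towards 0, large positive inputs towards 1,
# and sigmoid(0) = 0.5 -- the "sideways S" shape described above.
print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # approx. [0.000045, 0.5, 0.999955]
```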
What is a neuron (in the context of Deep Learning)? How is it structured?
A neuron in the context of Deep Learning is the basic computational building block of a neural network and is inspired by biological neurons.
A neuron consists of (often several) inputs, which are multiplied with weights according to their importance. These weighted inputs are then summed up in the "cell body" (together with a bias). This sum is then passed through an activation function to produce the neuron's output.
What is an activation function?
Name a few.
An activation function is a nonlinear function inside the cell body of a neuron that introduces nonlinearity to the neural network, meaning that it allows said neural network to model non-linear relationships as well as linear relationships between the data. It also controls the output range of a neuron (like e.g. between 0 and 1).
The activation function is chosen by the designer of the network before training (so it is a hyperparameter, essentially).
Some activation functions are:
sigmoid function
tanh
ReLU
Linear
Softmax
What is the difference between a neuron and a perceptron?
A perceptron is a simplified type of neuron that is designed for binary classification problems (like the first perceptron algorithm for letter recognition: Is it the letter B, or not the letter B?).
It uses the Heaviside step function H as activation function which outputs either 0 or 1.
It cannot model complex, nonlinear patterns on its own.
How is a neural network structured?
There are 3 types of layers: the input layer, hidden layers and the output layer.
The input layer receives the raw input data and passes these forward.
The hidden layers are between the input layer and the output layer and thus receive the inputs. There can be none or many hidden layers (a network with very many hidden layers is called a deep network). They consist of neurons which apply weights, a bias and an activation function to the inputs and pass their outputs on to the output layer (if there is only 1 hidden layer) or to the next hidden layer.
The output layer produces the network’s prediction or output.
What is a multilayer perceptron? Why is it used?
A multilayer perceptron is a type of neural network that consists of multiple hidden layers of neurons (not perceptrons!!) where every neuron of one layer is connected with all neurons of the next layer (= fully connected).
This helps with modeling nonlinear relationships as hidden layers can calculate complex functions better. An example of that would be the XOR problem.
How does a multiclass classifier with neurons work?
In a multiclass classifier, several neurons are grouped together in one layer so that the output is not a scalar but a vector, and each output stands for the score of one class (therefore there is also one neuron for each class to be predicted). Then, the neuron with the highest score defines the predicted class.
What is the XOR problem?
XOR means that the output must be 1 if exactly one of the 2 inputs is 1. So (1|1) = 0, (0|0) = 0, but (1|0) = 1 and (0|1) = 1.
This is not linearly separable with one-layer neural networks, because it has no linear decision boundary. Therefore, several neurons need to be used to model a more complex relationship, and there needs to be at least one hidden layer.
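A minimal sketch with hand-picked (not learned) weights showing that a single hidden layer is enough to represent XOR; the helper names are illustrative:

```python
import numpy as np

def step(x):
    """Heaviside step activation: 1 for x > 0, else 0."""
    return (np.asarray(x) > 0).astype(float)

def xor_net(a, b):
    # Hidden layer: h1 fires if at least one input is 1 (OR),
    #               h2 fires only if both inputs are 1 (AND).
    h1 = step(a + b - 0.5)
    h2 = step(a + b - 1.5)
    # Output neuron: fires if OR is true but AND is not -> XOR.
    return step(h1 - h2 - 0.5)

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, int(xor_net(a, b)))   # 0, 1, 1, 0 -- no single line can separate these
```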
Which different structures can be used in neural networks? List three.
feedforward networks: every node feeds its output to the next. Examples: Sequential NN, CNN, simple MLP
residual networks: use skip connections, which helps train very deep networks.
Recurrent Networks: Data can loop back to earlier layers, so the network has a “memory” of previous steps. Used for sequences like text or time series
Which steps are performed in the Neural Network training process?
weights initialization (e.g. with random gaussian noise)
forward pass (input goes through -> network gives prediction)
evaluation on how well the network performs (with a loss function)
backward pass (backpropagation; computes how each weight contributes to the error)
(stepwise) improvement of the weights (e.g. with gradient descent)
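A minimal NumPy sketch of these steps for a single linear neuron trained with MSE and plain gradient descent (all names and the toy data are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = 2x + 3 plus a little noise
X = rng.uniform(-1, 1, size=100)
y = 2 * X + 3 + 0.1 * rng.normal(size=100)

# 1) weight initialization (small random value, not zero)
w, b = rng.normal(scale=0.1), 0.0
lr = 0.1   # learning rate (step size)

for epoch in range(500):
    y_pred = w * X + b                 # 2) forward pass
    err = y_pred - y
    loss = np.mean(err ** 2)           # 3) evaluation with a loss function (MSE)
    grad_w = 2 * np.mean(err * X)      # 4) backward pass: dLoss/dw
    grad_b = 2 * np.mean(err)          #    and dLoss/db
    w -= lr * grad_w                   # 5) stepwise improvement of the weights
    b -= lr * grad_b

print(round(w, 2), round(b, 2))        # close to 2 and 3
```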
Is it a good idea to initialize the weights at 0? Why or why not?
If all weights are set to 0, every neuron in a layer will do the same thing.
This is called the symmetry problem.
If the weights are initialized to 0, the weighted sum in every neuron of a layer will be the same, and so will be the output of the activation function. Then, the gradients computed during backpropagation will also be the same, so all neurons receive identical updates. The network therefore behaves as if there was only one neuron per layer.
So, it’s not a good idea; it’s better to initialize the weights randomly.
What loss functions have we learned about?
Regression tasks:
Mean Square Error MSE
Root Mean Square Error RMSE
-> these two quantify how far the predicted value is from the true value
Classification tasks:
Softmax Loss Function (also Categorical Cross-Entropy loss) -> for multi-class classification, computes the probabilities of all K classes and then compares these probabilities to the true class; penalizes the model heavily when the correct class has a low probability
How does the Softmax loss function work? Give the equations.
The Softmax loss function consists of a softmax activation plus a cross-entropy loss.
The softmax activation function computes the probabilities of all K classes: p_i = e^(s_i) / Σ_j e^(s_j), where s_i is the raw score for class i.
These probabilities are then put into the cross-entropy loss: CE = -Σ_i t_i · log(p_i), where t_i is the target value (1 for the correct class, 0 otherwise).
This function penalizes the model heavily when the probability for the correct class is low, due to the -log function.
Therefore, if the predicted probability for the correct class is 1, -log(1) = 0 -> zero loss. The lower the probability of the correct class, the more the function penalizes the model; as that probability approaches 0, the loss grows towards infinity.
Putting both equations together, the softmax loss function looks like this: L = -log( e^(s_y) / Σ_j e^(s_j) ), where y is the correct class.
Since the target values for the incorrect classes are 0, only the logarithm for the correct class remains from the summation. And since the target value for the correct class is 1, it's also no longer needed in the equation.
The overall loss for the algorithm is the mean of the losses of all inputs.
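A minimal NumPy sketch of the two parts for a single sample (three illustrative class scores; the max-subtraction is a standard numerical-stability trick, not part of the formula):

```python
import numpy as np

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    exp = np.exp(scores - np.max(scores))   # subtract max for numerical stability
    return exp / np.sum(exp)

def cross_entropy(probs, correct_class):
    """-log of the probability assigned to the correct class."""
    return -np.log(probs[correct_class])

scores = np.array([2.0, 1.0, 0.1])           # raw network outputs for 3 classes
probs = softmax(scores)                      # approx. [0.66, 0.24, 0.10]
print(cross_entropy(probs, correct_class=0)) # ~0.42: correct class is likely -> small loss
print(cross_entropy(probs, correct_class=2)) # ~2.32: correct class is unlikely -> large loss
```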
What does the Gradient Descent algorithm do?
The Gradient Descent algorithm finds optimal weights by tweaking the model parameters in order to minimize the loss function.
How does the Gradient Descent algorithm work?
The Gradient Descent algorithm measures the local gradient of the loss function and goes in the direction of the descending gradient with a user-determined step size (learning rate) until a minimum is reached: w ← w − η · ∇L(w), where η is the learning rate.
To measure the gradient vector, backpropagation is used for neural networks. This gradient vector is made from the partial derivatives of the loss function with respect to each model parameter.
How does Backpropagation work? Give the equations as well.
Backpropagation is used to compute how each weight contributes to the overall loss of a neural network.
This is done by taking the (partial) derivative of the loss function with respect to each weight/model parameter, starting with the last layer and moving backwards.
This derivative can be decomposed in several computations using the chain rule, since the local function of each node is just a small part of the larger loss function:
L = Loss
f(x,w) = local function
Using the chain rule: ∂L/∂w = ∂L/∂f · ∂f/∂w
The result of f(x,w) is known from the forward pass which saves all the outputs.
The same equation in a different notation: ∂z/∂x = ∂z/∂y · ∂y/∂x,
with z = loss function, x = input/weight we want to compute the derivative of and y the local function f.
The first part dz/dy is also called the upstream gradient (partial derivative of the loss function with respect to the local function), while the second part dy/dx is called the local gradient (partial derivative of the local function with respect to its input).
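A minimal worked example on a tiny computational graph (local function f(x, w) = x · w followed by a squared-error loss; the numbers are illustrative):

```python
# Forward pass: compute and save all intermediate results
x, w, t = 3.0, 0.5, 2.0
f = x * w              # local function f(x, w) = x * w   -> 1.5
L = (f - t) ** 2       # loss                              -> 0.25

# Backward pass: chain rule, starting at the loss and moving backwards
dL_dL = 1.0                   # upstream gradient at the loss itself is 1
dL_df = 2 * (f - t) * dL_dL   # upstream gradient arriving at node f  -> -1.0
df_dw = x                     # local gradient of f with respect to w -> 3.0
dL_dw = dL_df * df_dw         # chain rule: upstream * local          -> -3.0
print(dL_dw)
```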
Which value does the upstream gradient at the loss function have?
At the loss, the upstream gradient is 1, because it's the derivative of the loss function with respect to itself: the upstream gradient is the partial derivative of the loss function with respect to the local function, which, in the case of the loss node, is the loss function itself.
What influence does the learning rate have?
The learning rate is a hyperparameter which determines how big the step size for gradient descent is. If it is too small, the algorithm takes a very long time to converge and might get stuck in a local minimum or at a plateau. If the learning rate is too big, the algorithm might jump over the minimum, which leads to a diverging algorithm instead of a converging one.
To find a good learning rate, an optimization algorithm can be used.
Why is using a Momentum for a Descent Algorithm useful?
For stochastic gradient descent, the algorithm might get stuck in a local minimum or a plateau, and if the loss value has a steep slope in one direction, but a shallow slope in the other direction, stochastic gradient descent shows slow progress.
Using Mini-batches also causes noisy gradients.
A momentum is used to overcome local minima and saddle points by building up “velocity”, therefore “smoothing” the gradient updates over time and balancing out noisy gradients.
A new step is calculated like: m ← β · m + ∇L(w), then w ← w − η · m,
with m = momentum, β = friction (momentum decay) and η = learning rate.
Here, the step size is multiplied with the momentum, which itself is calculated using the past momentum as well as the current gradient.
Gradient descent with momentum goes faster and faster downhill until it reaches the bottom, then may overshoot the optimum a bit, comes back, overshoots again, and oscillates like this many times before stabilizing at the minimum. Friction gets rid of this oscillation and thus speeds up convergence
What is the Nesterov Momentum?
The Nesterov Momentum, also called Nesterov Accelerated Gradient (NAG), measures the gradient of the cost function not at the local position, but slightly ahead in the direction of the momentum.
This means that the Nesterov momentum computes the gradients after the momentum step, while regular momentum calculates momentum after calculating the gradient. This regulates overshooting (less oscillation around minima) and results in a faster convergence (significantly faster than using regular momentum!)
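A minimal sketch of the two update rules side by side, using an illustrative quadratic loss L(w) = w² (so the gradient is simply 2w); this is one common way to write the Nesterov variant, not the only one:

```python
def grad(w):
    """Illustrative gradient of the toy loss L(w) = w^2."""
    return 2 * w

lr, beta = 0.1, 0.9            # learning rate and friction
w_mom = w_nag = 5.0            # start far away from the minimum at 0
m_mom = m_nag = 0.0            # momentum terms

for _ in range(100):
    # Classic momentum: gradient at the current position
    m_mom = beta * m_mom + grad(w_mom)
    w_mom -= lr * m_mom

    # Nesterov momentum: gradient slightly ahead, in the direction of the momentum
    lookahead = w_nag - lr * beta * m_nag
    m_nag = beta * m_nag + grad(lookahead)
    w_nag -= lr * m_nag

print(w_mom, w_nag)            # both approach 0; the Nesterov version oscillates less
```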
What is the Vanishing Gradient Problem?
The Vanishing Gradient Problem occurs when the gradients become extremely small as they are propagated backwards through many layers.
Since during backpropagation many derivatives are multiplied, if these derivatives are less than 1, repeatedly multiplying them causes the overall gradient to shrink exponentially towards the input layer.
As a result, the earlier layers learn very slowly or may stop learning entirely.
What can be done about the Vanishing Gradient Problem?
use different activation functions: ReLU helps with the problem, since its derivative equals 1 for positive inputs, unlike sigmoid and tanh activations, whose derivatives are much smaller
Better weight initialization (e.g. Xavier or He initialization)
Batch normalization reduces internal covariate shift (every layer constantly receives different data statistics (means, variances) from the layer before -> unstable and slower training) and stabilizes gradients
Residual connections allow gradients to skip layers, helping them flow backward more effectively
What is Covariate Shift?
When the input data distribution changes between training and testing, e.g. when training on images of indoor cats and then testing on images of cats outside.
What optimization methods did we learn about?
using momentum like Nesterov momentum
AdaGrad
RMSProp
Adam
What is AdaGrad and what does it do?
AdaGrad (Adaptive Gradient) is an optimization algorithm, meaning that it helps adjust the learning rate.
Gradient Descent quickly goes down the steepest slope, but then continues only slowly towards the bottom. AdaGrad scales down the gradient vector along the steepest dimension. The key idea is: Parameters with frequent large gradients get smaller updates, while rare or sparse features get larger updates.
It does that by adapting the learning rate for each feature based on the historical gradients - it accumulates all past squared gradients.
It automatically decreases the learning rate over time and handles sparse data well (e.g., NLP tasks, text, or one-hot encoded features), but doesn't handle dense data that well; there, RMSProp or Adam work better.
AdaGrad often stops too early when training neural networks, because the learning rate gets scaled down close to zero before reaching the global optimum.
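A minimal NumPy sketch of the AdaGrad update, with an illustrative quadratic loss that is steep in one dimension and shallow in the other:

```python
import numpy as np

def grad(w):
    """Illustrative gradient: steep in the first dimension, shallow in the second."""
    return np.array([2.0, 0.2]) * w

w = np.array([5.0, 5.0])
cache = np.zeros_like(w)          # accumulates ALL past squared gradients
lr, eps = 1.0, 1e-8

for _ in range(100):
    g = grad(w)
    cache += g ** 2                         # the history only ever grows ...
    w -= lr * g / (np.sqrt(cache) + eps)    # ... so the effective step size keeps shrinking

print(w)   # the steep dimension is damped; updates get smaller and smaller over time
```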
What is RMSProp and what does it do?
RMSProp (Root Mean Square Propagation) is an optimization algorithm which is an improved version of AdaGrad, since AdaGrad’s learning rate may get scaled to zero before reaching the global minimum.
RMSProp is designed in a way so that only recent gradients influence the learning rate, preventing it from shrinking too much by using an exponential decay.
It differs from AdaGrad only by the exponential decay of the accumulated squared gradients, but almost always performs much better than AdaGrad.
What is Adam and what does it do?
Adam (=Adaptive moment estimation) is the most popular optimization algorithm and combines RMSProp and momentum.
Adam computes adaptive learning rates for each parameter, like RMSProp, but also includes momentum, which smooths updates using past gradients.
Adam requires less tuning of the learning rate.
Variants are: AdaMax (more stable on some datasets, but Adam generally performs better) and Nadam (Adam plus the Nesterov trick; converges slightly faster and generally outperforms Adam).
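A minimal NumPy sketch of the Adam update with its usual default hyperparameters, again on an illustrative quadratic loss:

```python
import numpy as np

def grad(w):
    """Illustrative gradient of the toy loss L(w) = sum(w^2)."""
    return 2 * w

w = np.array([5.0, -3.0])
m = np.zeros_like(w)    # first moment: momentum-like running mean of the gradients
v = np.zeros_like(w)    # second moment: RMSProp-like running mean of squared gradients
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 301):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)    # bias correction: both moments start at zero
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)

print(w)   # close to the minimum at [0, 0]
```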
What is a CNN?
A CNN (Convolutional Neural Network) is a type of deep learning model designed to process grid-like data — typically images, but also other spatially structured data (like satellite imagery, elevation grids, or rasterized maps).
Traditional neural networks treat each input feature separately, but CNNs consider the relation between nearby pixels -> they preserve spatial relationships. Additionally, CNNs can also handle multi-channel data, for example spectral bands from Sentinel-2, not just RGB.
What does a CNN consist of?
A Convolutional Neural network consists of
convolutional layers (apply a set of filters to the input; each filter extracts a particular pattern and outputs a feature map which shows where that pattern appears)
activation function
pooling layer (reduces spatial size and keeps only the most important information)
fully connected layer (flattens the extracted features and feeds them into a normal neural network layer for classification/regression)
What is a fully connected layer?
A fully connected layer connects every input neuron to every output neuron with a learnable weight: y = W*x + b.
x is a 1D vector here, which means that for images, the spatial structure is lost because the layer treats the input as a flat vector.
Since every input neuron is connected to every output neuron, there are many parameters involved, meaning for large inputs such as high resolution images (or even just 256x256 images) the network is very slow.
Why are CNNs better for images than fully connected networks?
Fully connected neural networks connect every input node to every output node, and each connection adds a weight. The network therefore has a very large number of weights and can be very slow for images, which have a large number of pixels.
Example: Image of size 256x256 = 65k neurons.
CNNs don’t look at the whole image at once, making it more efficient.
Additionally, fully connected neural networks take a 1D vector as input, meaning that the spatial structure is lost, while for CNNs, the original shape is kept.
What is a Convolutional Layer and what does it do?
A Convolutional Layer is used in CNNs to detect local patterns in the data (like edges, corners, textures, shapes etc.) using filters (=kernels), which slide over the image.
The filter size is configurable (chosen by the network architect), but filters always extend over the full depth of the input volume (over all channels). This means that if an image is in RGB, the image has a depth of 3 and so does the filter.
As the filter slides over the image, it calculates the dot product between the part of the image overlaid by the filter and the filter weights. The filter weights depend on what the filter is meant to detect, so an edge detector filter has different weights than another filter. Each filter activates strongly for a certain type of input pattern, resulting in high values in its output channel at the locations where that pattern is detected.
Mathematically: y(i, j) = Σ_c Σ_m Σ_n w(c, m, n) · x(c, i+m, j+n) + b
The convolution operator effectively computes a weighted linear combination of the inputs x that are overlaid by the kernel filter having filter weights w. The output activation map has a depth of 1, as all channels are combined.
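A minimal NumPy sketch of one filter sliding over a single-channel input with stride 1 and no padding (an unoptimized loop, just to make the dot-product-per-position idea concrete):

```python
import numpy as np

def conv2d_single(x, w, b=0.0):
    """'Valid' 2D convolution of one channel with one filter (no padding, stride 1)."""
    H, W = x.shape
    F = w.shape[0]
    out = np.zeros((H - F + 1, W - F + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # dot product between the filter and the image patch it currently overlays
            out[i, j] = np.sum(x[i:i + F, j:j + F] * w) + b
    return out

x = np.arange(49, dtype=float).reshape(7, 7)      # 7x7 toy input
w = np.array([[1.0, 0.0, -1.0]] * 3)              # simple vertical-edge-like 3x3 filter
print(conv2d_single(x, w).shape)                  # (5, 5): 7 - 3 + 1 in each direction
```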
For an image with size 32x32x3, what would be the output size for filters of size F = 1, 3, 7, or 11?
What is the effect of a filter with size F = 32?
(With stride=1)
For a filter with size 1, the size of the output activation map would stay the same as no boundary elements are left out.
For a filter with size 3 (3x3), a boundary of size 1 is left out, therefore the output activation map has a size of 30x30.
For a filter with size 7, a boundary of size 3 is left out, therefore the output activation map has a size of 26x26.
For a filter with size 11, a boundary of size 5 is left out, therefore the output activation map has a size of 22x22.
A filter with size 32 would produce an output activation map of 1x1.
The relationship (for stride 1) is: output height = input height - filter height + 1
What is an Activation Map? How big is it?
An Activation map is the output of a filter in a convolutional layer.
The size of the output activation map depends on the input and filter size; for a 7x7 input with a 3x3 filter, the output activation map has a size of 5x5 -> this is because filters with size > 1 can't be centered on the boundary elements; however, there are methods (padding) to keep the output size.
If there are n filters, there are n (output) activation maps.
n filters produce n activation maps, which are stacked together into an "activation volume" with a depth of n (e.g. 6 filters produce an activation volume with a depth of 6).
What is ReLU? Write down the equation.
ReLU (Rectified Linear Unit) is nowadays the most commonly used activation function for neural networks. It clamps negative input values to 0 and keeps positive values the same.
ReLU(x) = max(0, x)
What is the stride of a filter?
The stride of a filter in a convolutional layer of a CNN determines how far (how many input cells) the filter is moved at each step (both in column and row direction).
Therefore, if the stride parameter is increasing, the size of the output activation map is decreasing.
output height = (input height - filter height) / stride + 1
(for stride ≥ 1; with stride 1 this reduces to the formula above)
The step size and the filter size must be defined in such a way that the filter completely fits the input without omitting the last column or row -> the filter has to "see" everything -> the equation above must result in an integer, not a decimal number.
How can the size of the input be preserved in a CNN?
Normally, applying a filter decreases the size of the output activation map in a CNN.
However, with zero padding, the size of the input can be preserved.
To each side, (F-1)/2 pixels with value 0 are added (F = filter size), e.g. 1 pixel on each side for F = 3.
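A small illustrative helper that combines the size rules from the last few questions (filter size, stride and zero padding):

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Output size = (input - filter + 2*padding) / stride + 1; must come out as an integer."""
    numerator = input_size - filter_size + 2 * padding
    assert numerator % stride == 0, "filter/stride combination does not fit the input cleanly"
    return numerator // stride + 1

print(conv_output_size(32, 3))              # 30: a plain 3x3 filter shrinks the map
print(conv_output_size(32, 3, padding=1))   # 32: zero padding of (F-1)/2 preserves the size
print(conv_output_size(7, 3, stride=2))     # 3:  a larger stride shrinks it faster
```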
Imagine 10 filters with size 5x5x3 are applied to an image. How many parameters do all 10 filters together have?
5x5x3 + 1 = 76 (+1 for bias term)
then, 10 filters x 76 = 760 parameters
What is a 1x1 Convolution and why is it useful?
A 1x1 Convolution means that the filter has a size of 1x1xn (n = depth of the input layer). Therefore it performs an n-dimensional dot product.
Therefore, the filter doesn’t look at neighboring pixels - it only looks within the same pixel location across different feature channels. They act as a bottleneck layer, compressing many channels into fewer ones (combines feature info instead of spatial info), reducing computational cost while retaining important information.
This is used in Google’s Inception Networks, which use 1×1 convolutions to shrink data before expensive 3×3 or 5×5 convolutions.
What is a Pooling Layer and how does it work? Why is it used?
A Pooling Layer in a CNN is a layer which decreases the size of the activation volumes. It is applied after activation volumes are created -> after a convolutional layer.
It decreases the size of the activation volume by operating over each activation map independently. It has no trainable parameters.
It is used to
reduce dimensionality -> makes network more efficient
reduce overfitting by summarizing information -> performs better on new, unseen data
extract translational invariant features -> network learns to recognize important features regardless of exact position in the input image
summarize information
What are the two most common pooling methods?
max pooling (takes the maximum of each pooling window) -> Helps to capture the most prominent features and reduces noise
average pooling (takes the average value from each pooling window) -> Smooths the features, which can be useful in certain applications
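A minimal NumPy sketch of 2x2 max and average pooling with stride 2 on a single activation map (assumes even side lengths; names are illustrative):

```python
import numpy as np

def pool2x2(a, mode="max"):
    """Non-overlapping 2x2 pooling with stride 2 on one activation map."""
    H, W = a.shape
    blocks = a.reshape(H // 2, 2, W // 2, 2)   # split the map into 2x2 blocks
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

a = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [0., 0., 1., 1.],
              [0., 9., 1., 1.]])
print(pool2x2(a, "max"))    # [[4. 8.] [9. 1.]]   -> keeps the most prominent activations
print(pool2x2(a, "avg"))    # [[2.5 6.5] [2.25 1.]] -> smooths the activations
```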
What is the receptive field in CNNs?
The receptive field of a neuron in a CNN is the region of the input image that can influence the value of that neuron.
So, for example, when the input image has a size of 32x32 and in the first convolutional layer, a 3x3 filter is applied, the receptive field is of size 3x3.
Now, what area of the input image does a second convolutional layer with filter size 3x3 see? Each of the 3x3 values it looks at was itself computed from a 3x3 region of the input, so it sees a 5x5 region of the input image.
The formula (for stride-1 layers) is: receptive field of layer l = receptive field of layer l-1 + (filter size of layer l - 1)
So with every layer, the receptive field expands.
Example: A stack of three 3x3 convolutional layers has an effective receptive field of 7x7.
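A small illustrative helper for that growth, for a stack of stride-1 convolutional layers:

```python
def receptive_field(filter_sizes):
    """Receptive field of a stack of stride-1 convolutional layers."""
    rf = 1
    for f in filter_sizes:
        rf += f - 1           # each layer adds (filter size - 1)
    return rf

print(receptive_field([3]))         # 3: the first 3x3 layer sees 3x3
print(receptive_field([3, 3]))      # 5: the second 3x3 layer sees 5x5
print(receptive_field([3, 3, 3]))   # 7: three stacked 3x3 layers see 7x7, like one 7x7 filter
```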
List some common CNN architectures and say 1-2 things about them.
LeNet-5: The earliest CNN architecture, created for digit recognition
AlexNet: kicked off the modern deep learning revolution, used ReLU activation for the first time; a large network with 60 million trainable parameters
VGG: deep architecture with 16-19 layers & more non-linearities than AlexNet, but fewer parameters
GoogLeNet/Inception: deeper than VGG with 22 layers with subnetworks (inception modules) that are stacked on top of each other
ResNet: very deep network (up to 152 layers), uses residual connections
What is the typical architecture of a CNN?
stack a few convolutional layers, each followed by ReLU activation layer
then a pooling layer to shrink the images
then a few fully connected layers with ReLU activation layers
finally, a softmax layer that outputs the class probabilities
What is Padding in the context of CNNs?
When you apply a convolution (for example a 3×3 filter) to an image, the filter cannot be centered on pixels at the edge without "falling off" the image. So normally, the output becomes smaller after each convolution.
This shrinking is a problem because too much shrinking can lead to information loss and very deep networks would shrink the image too much.
Padding adds extra pixels around the border of the input before applying the convolution so that the filter can slide over the edge of the “actual” input.
Zero Padding is Padding with pixel value 0. Example: Since most images have pixel values between 0 and 255, adding 0 around the edges is equivalent to adding a black border.
What is LeNet-5? How is it structured?
LeNet-5 is a CNN architecture that was created for handwritten digit recognition in 1998. Almost all modern architectures follow in its footsteps today.
The first layer is a convolutional layer with 6 filters and a tanh activation. Then average pooling follows, which reduces spatial resolution and noise, also using a tanh activation function. Then another convolutional layer with 16 filters follows, and then again average pooling. The last convolutional layer comes after this and acts like a fully connected layer, since the filter spans the entire spatial dimension (the input is 5x5 and so is the filter).
After that, a fully connected layer is implemented.
What is AlexNet? Why was it revolutionary/special?
AlexNet was the first CNN to massively outperform classical methods on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012.
AlexNet was the first major model that used the ReLU activation function instead of tanh or sigmoid. This allowed for deeper networks and faster trainings as it solved the problem of vanishing gradients.
It consists of 8 layers (5 convolutional and 3 fully connected layers) with 60 million parameters. This required a huge computational effort for that time and therefore used GPUs for training, which was new.
How is AlexNet structured?
AlexNet consists of 8 layers (not counting the input and output layers).
Conv1 with 96 filters, a ReLU activation, max pooling and Local Response Normalization (LRN)
Conv2 with 256 filters, followed by pooling layer
3 Convolutional layers, followed by 1 pooling layer
3 fully connected layers
output layer (softmax)
What is VGG? Why is it special?
The VGG family includes VGG16 and VGG19 — named after the number of weight layers.
These networks were 2nd place in ImageNet 2014 but became much more influential than the winning model (GoogLeNet) because of their simplicity.
VGG demonstrated that CNNs become better by going deeper if consistent architecture choices are used and that there is no need to use big filters (VGG uses only small 3x3 filters with stride 1 and padding 1). This leads to more non-linearities, as there are more activation functions when stacking three 3x3 layers compared to using just one 7x7 layer. It also leads to fewer parameters, meaning the network can be deeper for the same computational cost.
After every 2-3 convolutional layers, a 2x2 max pooling layer with stride 2 is implemented, which halves the spatial resolution.
For its time, VGG was extremely deep, with VGG19 having 19 weight layers while AlexNet had 8.
The fully connected layers of VGG are very large, making VGG heavier overall than e.g. AlexNet (~ 140 million parameters), however, transfer learning is quite good with VGG.
What is GoogLeNet? Why is it special?
GoogLeNet was the winner of ImageNet 2014. It introduced the Inception module, a brand-new idea that changed how networks were constructed.
VGG proved that deeper networks work, but at a cost - a lot of parameters, very slow, etc. GoogLeNet solved this.
It is even deeper than VGG with 22 layers, but uses only about 5 million parameters (since it uses almost no fully connected layers).
The Inception module processes the same input in parallel through multiple filter sizes and concatenates all filter outputs. These inception modules are then stacked with some pooling layers in between.
GoogLeNet also replaced large fully connected layers with global average pooling, which takes each feature map, averages it to a single number and feeds these numbers directly into softmax, which reduces parameters and is spatially interpretable compared to a fully connected layer.
What is Transfer Learning?
Transfer Learning is a technique where a model trained on one task (usually with lots of data) is partially or fully reused for a different but related task.
This is done by reusing most of the network and retraining only the last layers. For example, VGG's original fc8 (the last fully connected layer) outputs 1000 ImageNet classes, but the new task might only have 20 classes. Therefore, fc8 gets replaced accordingly and the rest stays the same, since e.g. the convolutional layers already learned general features such as edges and shapes.
This is useful because with Transfer Learning, the training dataset can be much smaller, lots of parameters are frozen, and since most of the network is already trained, the computational effort is much lower.
What is the Inception module? How is it structured?
The Inception module is an architectural block first used in GoogLeNet.
The idea is: Instead of choosing a single filter size, the network uses all of them in parallel.
It has 4 branches with 3 different filters (1x1, 3x3, 5x5): The 1x1 learns simple linear combinations, 3x3 learns medium-size features and 5x5 even larger features (for 5x5, the channels are reduced first by using 1x1 before). The 4th branch uses max pooling.
All 4 outputs (feature maps) are then concatenated along the channel dimension (so that there’s one feature map left with lots of channels).
What is Global Average Pooling?
Global Average Pooling GAP averages the values of each feature map to produce one scalar value per feature map. It is used in GoogLeNet.
Since the last convolutional layer outputs a feature map for each output category, GAP then produces one scalar value per output category, which can be interpreted as per-category confidence maps/class activation maps (where the model "sees" the class in the picture, like a heatmap). These are then taken directly as input for the softmax function to calculate the class probabilities.
What is ResNet? Why is it special and how is it structured?
ResNet was introduced in 2015 and uses residual connections (= shortcut connections).
Residual connections feed the (input) signal both into the next layer and add it to the output of a layer located further along in the network. This means that the network learns the residual between the input and the output rather than the output directly. This allows the network to effectively skip layers (= change little when the input is already good).
Example: When earlier layers already extracted strong edges and textures, the residual block only needs to tweak the features lightly, not learn everything from scratch. Without residuals, each layer is forced to transform its input. This also helps with vanishing gradients.
ResNet uses a large convolutional layer in the beginning to reduce image size, then stacks several residual blocks. Periodically, the number of filters is doubled while a stride of 2 is used to spatially downsample (more filters, more features can be seen at the same spatial area). In the end, global average pooling and a fully connected layer are used to output class scores. For very deep ResNets, a “bottleneck” layer is used to improve efficiency (by lowering the number of filters with a 1x1 convolution, then using more filters again, but with a bigger filter).
Is a deeper network always better?
No.
Stacking more and more convolutional layers in the conventional way doesn’t lead to better network performance, e.g., a 56-layer network performs worse on training and test error than a 20-layer-model. However, this is not caused by overfitting, but an optimization problem, since deeper models are harder to optimize. During backpropagation, the gradient gets multiplied over many layers for a deep network, leading to vanishing gradients/exploding gradients. Also, layers could destroy good features if the filters are not perfect yet.
Using ResNet solved this problem.
Compare VGG, ResNet, GoogLeNet an AlexNet. What are their strengths?
AlexNet has fewer computations, but uses much memory and has a lower accuracy.
VGG has the highest memory demands and does the most operations, but is more accurate than AlexNet.
GoogLeNet is the most efficient.
ResNet has moderate efficiency depending on the model, but the highest accuracy.
What is Batch Normalization?
Batch normalization is a technique used in deep learning to speed up and stabilize neural network training by normalizing the inputs to each layer within a mini-batch.
For each mini-batch, the mean and variance of the inputs to a layer are computed. These values are then used to normalize the inputs. A small value (𝜀) is added to the variance for numerical stability.
Batch Normalization lets the neural network learn the optimal scale (γ) and mean (β) of each of the layer’s inputs.
Batch Normalization is typically added just before the activation function of each hidden layer, but can also be added after activation function.
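A minimal NumPy sketch of the training-time computation (γ and β are learnable in a real network; here they are fixed arrays just to show the normalization):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """x: mini-batch of layer inputs with shape (batch_size, features)."""
    mean = x.mean(axis=0)                     # per-feature mean over the mini-batch
    var = x.var(axis=0)                       # per-feature variance over the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)   # normalize (eps for numerical stability)
    return gamma * x_hat + beta               # learned scale (gamma) and shift (beta)

x = np.random.randn(32, 4) * 10 + 5           # badly scaled activations
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))   # roughly 0 and 1 per feature
```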
How do Residual Connections help build a better, deep network?
Imagine Y is the output, X is the input. F(X) is the correction the residual block applies to X.
In a conventional CNN, a filter is applied to an input, which then computes the output. During backpropagation, the weights of the filter get updated, so that during the next forward pass, the input gets changed accordingly so that it matches the output better.
In a ResNet, what the network learns is the residual between the input and the output. So the filter slides over the input, but its feature map gets added to the input again -> output. When the filter weights are adjusted during backpropagation, only the residual changes. Therefore, if the input is already very good, only small or no changes (=identity mapping) need to be made (similar to a painter only refining the parts of the painting that aren’t good yet, while in conventional CNNs, it would mean to redraw completely every time).
At the same time, during backpropagation, since gradients are computed using the chain rule, there are a lot of multiplications in very deep networks, making the gradient potentially very small by the time it reaches early layers (vanishing gradient problem, slower learning). A residual block changes this: the local derivative becomes F'(X) + 1, so the gradient flows directly also to earlier layers, even when F'(X) is small, allowing better learning even for early layers.
Why does stacking layers help with extracting features?
A small convolutional layer has a small receptive field. If several small conv. layers are stacked, the receptive field gets bigger and bigger, allowing the network to “see” object parts like e.g. wheels instead of just corners.
What does a Residual Block consist of in ResNet?
A residual block consists of two 3x3 convolutional layers with ReLU activation functions.
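A sketch of such a block in PyTorch-style code. Assumptions for illustration: equal input and output channel counts, so the shortcut needs no 1x1 projection; the real ResNet blocks additionally use batch normalization.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions; the block's input is added back onto their output."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))   # first 3x3 convolution + ReLU
        out = self.conv2(out)            # second 3x3 convolution
        return self.relu(out + x)        # shortcut: the block only learns the residual F(x)

y = ResidualBlock(64)(torch.randn(1, 64, 32, 32))
print(y.shape)   # torch.Size([1, 64, 32, 32]): shape unchanged, so x can be added directly
```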
What are the 2 initialization methods we learned about?
Xavier initialization: designed for tanh/sigmoid activations. Formula: each weight in a filter is drawn uniformly from the range ±sqrt(6 / (n_in + n_out)) (n_in = number of input neurons, n_out = number of output neurons). It balances input and output sizes and has a larger range.
He (Kaiming) initialization: designed for ReLU activations. Formula: each weight in a filter is drawn from a normal distribution with mean 0 and variance 2/n_in. The weights need to be a bit bigger than for Xavier initialization, since ReLU zeroes all of the negative inputs, so the variance would otherwise shrink -> vanishing gradient problem.
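A minimal NumPy sketch of both rules for a weight matrix of shape (n_in, n_out):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    """Uniform in [-limit, +limit] with limit = sqrt(6 / (n_in + n_out))."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def he_init(n_in, n_out):
    """Normal with mean 0 and variance 2 / n_in (std = sqrt(2 / n_in))."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

print(xavier_init(256, 128).std().round(4))   # spread tuned to n_in + n_out
print(he_init(256, 128).std().round(4))       # ~0.088 = sqrt(2 / 256), a bit bigger
```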
What is Semantic Segmentation?
Semantic Segmentation is the task of labelling each pixel in the image with a category label; there is no distinction between object instances of the same class (e.g. if there are 2 cows in an image, they only get 1 label).
How does the Sliding Window approach work for Semantic Segmentation and is it useful?
With the Sliding Window approach, a window "slides" over the image, and the center pixel of each window position is classified using a CNN. This is very inefficient, as the CNN needs to be applied to a large number of patches & shared features between overlapping patches are not reused.
What is a Fully Convolutional Network?
A Fully Convolutional Network (FCN) is a type of CNN which is built entirely from convolutional layers without any fully connected layers (Traditional CNNs: Convolution -> Pooling -> … -> Fully connected layer -> class label).
An FCN removes the fully connected (dense) layers and replaces them with convolutions.
Instead of predicting one label per image, an FCN predicts one label per pixel (or per spatial location). The fully connected layer is replaced by a 1x1 convolution which acts like a per-pixel classifier & preserves spatial structure.
What is the advantage of a convolutional layer over a fully connected (dense) layer?
A convolutional layer can be applied on images and activation volumes of any input size, because it is not dependent on the input size, but only on (the size of the filter and) the number of input channels.
A dense layer, on the other side, is dependent on the input size, as each neuron has as many weights as inputs.
For Semantic Segmentation, how are FCNs designed?
FCNs don’t have fully connected layers in the end for image classification the way CNNs do. Instead, they have a 1x1 convolutional layer, which classifies each pixel based on all channels.
Since convolutional layers at the original image resolution are very expensive and also not necessary as objects often span more than one pixel, the image gets downsampled and then upsampled again within the network.
What methods did we learn about for Unpooling in a Fully Convolutional Network?
Unpooling achieves the opposite of pooling: It increases the spatial resolution of a feature map, roughly reversing a pooling operation (like max pooling). But: unpooling is not a true inverse of pooling — information was lost during pooling.
The methods we learned about:
nearest neighbor
“Bed of Nails”
Max Unpooling
What are the Upsampling methods we learned about?
Unpooling
Strided Transpose Convolution
What is strided transposed convolution?
Strided transposed convolution is an upsampling method.
While a convolution with stride > 1 decreases spatial size of the feature map (= downsampling), a transposed convolution with stride > 1 upsamples the feature map.
Strided transposed convolution spreads each input value over a larger output grid, summing up overlapping contributions.
Then, one boundary pixel needs to be cropped to produce an output that is double the input size.
After downsampling and upsampling, are all output pixels based on features that were originally computed at that same pixel location in the input image?
No, their values are interpolated or reconstructed from neighboring pixels, since downsampling reduces spatial resolution (individual pixel identities are lost) and upsampling increases resolution but new pixels are created by spreading or interpolating features. This is a problem because semantic segmentation needs precise localization and pixel-accurate boundaries.
What are the main strategies to fix the loss of spatial detail caused by pooling and downsampling?
Resolution Enhancement: uses upsampling methods which bring feature maps back to the original image size (restores the number of pixels)
Feature Transfer: feature maps from earlier layers are transferred & combined with upsampled features. This can be done with skip connections (restores precise boundaries)
Reconstruction: compresses and encodes the input & reconstructs it later as accurately as possible (general method)
Why is skipping connections useful in semantic segmentation?
Usually, the data is downsampled and later upsampled in the network (this saves computation time).
However, this means that the "where" information is lost during downsampling, while the "what is it" information is gained. The idea is to add skip connections from before downsampling to after upsampling, since early layers know the "where". This information and the upsampled data can then be combined.
This forms the U-shape, which is why this technique is called the U-Net.
When the upsampled features and the skipped features are combined, there will be a very high channel count & mixed feature types. Therefore, the network applies several convolutional layers to blend the information & reduce the number of channels.
What is a Feature Pyramid Network and why is it useful?
A Feature Pyramid Network is an alternative to U-Net and is used to combine features from different levels and scales.
It is commonly used in object detection.
How is the output of a semantic segmentation task generated after the network downsampled, upsampled & combined this with the skipped connection data?
The network outputs a feature map with as many channels as there are (predicted) categories/classes. Now, the softmax function is applied on each pixel individually to get the pixel-wise class probabilities.
How is the Loss function structured in a task where an object in an image is supposed to be classified and localized at the same time?
While image classification is a classification task, localization of an object in the image is a regression task. Therefore, a multitask loss is introduced: Two losses are computed and summed up in the end.
What happens when the bounding box values in a localization task are not predicted as normalized values, but the “real” values?
The network wouldn’t be able to predict the bounding box independent on image size. Likewise, commonly not the width and height are predicted, but the square root of these, so that the error won’t be much higher for a large bounding box vs. a small bounding box.
What is a common loss function used in Object Localization?
Intersection over Union (IoU), which is exactly what it sounds like.
First, the intersection between the correct (labelled) bounding box and the predicted bounding box is computed = Intersection.
Then, the union between the correct bounding box and the predicted bounding box is computed.
Then, the area of the intersection is divided by the area of the union. The more similar they are, the closer to 1 they get (ideal case). If the bounding boxes don’t overlap at all, the intersection is 0.
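A minimal sketch for axis-aligned boxes given as (x_min, y_min, x_max, y_max) (illustrative helper, not a specific library function):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)   # 0 if the boxes don't overlap at all
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))   # 1/7 ~ 0.14: small overlap
print(iou((0, 0, 2, 2), (0, 0, 2, 2)))   # 1.0: identical boxes (ideal case)
```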
What are possible use cases for object detection?
detect objects in aerial or satellite images, e.g. cars, solar parks, wind turbines…
detect healthy blood cells in a blood sample
What happens if for an object detection task, there are several objects in the image, e.g. several ducks?
Then, a different number of outputs is needed per image. It has to be prevented that the network predicts several objects at the same location. There are several different approaches to this, for example the Sliding Window approach.
What is the Sliding Window Approach and how does it work?
The Sliding Window Approach is used in Object Detection and Localization tasks where, for example, there are several objects in the image. The approach tries to minimize mistakes like e.g. predicting several objects at the same location.
The idea is to move a window of a specific size over the image that is split into a grid and apply a CNN to the different crops of the image. For this, an additional class “background” is needed.
What are the downsides of the Sliding Window Approach?
The Sliding Window approach is computationally expensive, since the CNN is applied to a large number of locations, resulting in many overlapping bounding boxes for the same objects. Also, for objects of varying size, windows with different sizes are needed; for example, if there is an object which is bigger than the window, it's not possible to predict the bounding box correctly.
To solve the issue with too many bounding boxes, the non-maximum-suppression algorithm can be used.
How does the Non-Maximum Suppression Algorithm work? Why is it used?
The non-maximum suppression algorithm takes a set of bounding boxes with their objectness scores (how confident the network is that it found an object here) as input and tries to reduce the set of bounding boxes so that there is only one bounding box per object left. This can be used to reduce the number of bounding boxes after applying the Sliding Window approach.
It works like this:
First, the algorithm finds the bounding box with the highest objectness score in the input set and moves it from the input set to the output set. Then, it removes all the bounding boxes from the input set that overlap the just-found bounding box by more than a certain degree, for example with an Intersection-over-Union value greater than 60%.
The algorithm repeats this until no bounding boxes are left in the input set.
There is also the option of soft non-maximum-suppression, which doesn’t remove the bounding boxes but instead lowers the objectness scores of them and only removes the bounding boxes that fall below a threshold. This can be useful if objects in the image overlap, for example two horses standing behind each other.
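A minimal sketch of the greedy (hard) NMS loop described above; box format and helper names are illustrative:

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    return inter / ((a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.6):
    """Keep only one box per object: repeatedly take the best box and drop its overlaps."""
    remaining = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while remaining:
        best = remaining.pop(0)            # highest objectness score left in the input set
        keep.append(best)
        remaining = [i for i in remaining
                     if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # [0, 2]: the near-duplicate box 1 is removed
```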
What is a Region Based CNN (R-CNN)?
A Region Based CNN can be used for object detection and localization.
The original R-CNN from 2014 has a pre-processing step in which image regions that are likely to contain an object are found (region proposal, e.g. with the selective search algorithm). Then, the cutout regions are reshaped (warped), since the objects may have different sizes & aspect ratios; this may distort the images. Then, R-CNN applies the CNN to those regions.
This saves some computation time in comparison to applying the CNN on every part of the image.
However, every cutout image is processed one at a time, meaning the network is still extremely slow and takes a lot of disk space. Additionally, the CNN is now used on potentially distorted images which may or may not contain objects (or only partially), which is not ideal.
How did R-CNNs evolve?
The first R-CNN from 2014 was still very slow and computationally expensive, since every cutout image was processed individually and one at a time.
The Fast R-CNN (2015) was faster since the CNN was run once on the full image before the region proposals, which were then mapped onto the feature map.
Faster R-CNN (2015) uses a Region Proposal Network, which means that region proposals are learned, not chosen by a separate algorithm.
How does Fast R-CNN work?
Fast R-CNN consists of 4 (5) main components:
(input image) -> CNN (outputs feature map) -> RoI (Region of Interest) Pooling -> fully connected layer -> softmax classifier/bounding box regression
In contrast to R-CNN, the input image is fed through a CNN first, which produces a spatial feature map for the whole image. Then, region proposal needs to be done, e.g. by the selective search algorithm (not learned, calculated outside the network!).
Since these regions of interest have different sizes, but fully connected layers require a fixed-size input, they need to be resized. This is what happens in the RoI pooling layer, which divides the Region of interest into a 7x7 grid and applies max pooling on each grid cell. (later, fully connected layers are avoided, e.g. in YOLO)
Then, the pooled features are fed through the fully connected layers.
The network is trained using a multi-task loss (sum of softmax loss for classification and smooth L1 loss for the bounding box*weight for balance) end-to-end (meaning that all model components are trained together in a single optimization process using one loss function).
Why is max pooling used in RoI pooling instead of interpolation?
RoI pooling is used on feature maps, not raw images. This means that high values mean “important feature detected here”.
Interpolating values would mean weakening these high activation areas, making object detection less precise. This is why max pooling is used.
What is the difference between Fast R-CNN and Faster R-CNN?
Faster R-CNN doesn’t require the region of interests to be calculated seperately anymore, but instead uses a Region Proposal network to learn the regions of interest. This means Faster R-CNN used 4 losses instead of 2 and is a 2-stage object detector: First, the RPN which predicts the bounding box and if there is an object or not (binary class.), while the rest the predicts the exact bounding box and which object specifically is located there.
How does the Region Proposal network work?
The Region Proposal Network is used to learn region of interests in an image where an object may be located, used in Faster R-CNN.
It slides a small network over the feature map, and predicts the objectness score and the bounding box offsets at every spatial location using anchors.
An anchor is a pre-defined bounding box centered at a feature map location with a specific scale and aspect ratio set by the network designer (e.g., if you have a satellite image and you're searching for cars, a smaller anchor would be reasonable; or, use different anchor boxes at each point). Basically, every cell is the central point of a bounding box with a set size.
Now, a sliding window slides over the feature map, and the convolutional layer decides for each anchor in this window if this is an object or the background. Additionally, it is predicted whether the anchor is a “good” bounding box for this object.
The Faster R-CNN with RPN is trained with 4 losses:
classify if object/ no object (RPN)
regression for box coordinates (RPN)
final classification score (object classes, RoI pooling)
final box coordinates (apply non-max. suppression to filter out overlapping bounding boxes, RoI pooling)
Therefore, the RPN simultaneously regresses region bounding boxes & objectness scores at each location on a regular grid, following a sliding window approach.
What is the YOLO architecture?
YOLO = You only look once.
Instead of proposing regions and then classifying them, YOLO predicts bounding boxes and classes directly from the image in one forward pass, which is why YOLO is also called a single-stage detector (Faster R-CNN: 2-stage object detector).
YOLO divides the image into a 7x7 grid, where each grid cell predicts bounding boxes & class probabilities. This results in an output of 7x7x(5*B + C), where B = number of "base boxes" centered at each cell and C = number of classes. Modern YOLO versions use pre-defined anchors like Faster R-CNN.
Later YOLO versions like YOLO 9000 (V2) also don’t use fully connected layers anymore and replace them by convolutional layers, so that the input doesn’t need to be fixed-size anymore. Therefore, images don’t need to be resized anymore which could e.g. distort the image. Additionally, convolutional layers preserve spatial information since they “look” at one spatial location over all channels, while fully connected layers connect each location with each channel.
The output is therefore an activation volume.
How is Instance Segmentation different to Semantic Segmentation or Object Detection?
Semantic Segmentation divides the image pixel-wise into different classes. However, if there are several objects of the same class and they overlap each other, there will be no distinction between them. Object Detection then separates these objects from each other, but draws only a bounding box. Instance Segmentation then does the pixel-wise prediction of every object, regardless of whether they belong to the same type of object or overlap each other. Therefore, in Semantic Segmentation there is a set number of classes (each pixel belongs either to background, cat or dog), while in Instance Segmentation each pixel belongs to an a priori unknown set of class instances (since there can be several dogs/cats).
What is Instance Segmentation?
Instance Segmentation combines two tasks: assigning a class label to every pixel, and distinguishing individual objects of the same class. It creates pixel-wise masks which are separated by object instance, not only by class.
Why is grouping required for Instance Segmentation?
Since each pixel belongs to an a priori unknown set of instances, the model must decide which pixels belong together as one object instance, since the model needs to identify object pixels and cluster them into distinct objects. This is not a pure pixel-wise prediction anymore.
However, frameworks like Mask R-CNN avoid this issue: instead of asking which pixels belong together, they ask where each object is and what its mask is.
What is Mask R-CNN and why is it used?
Mask R-CNN is an instance segmentation framework which, for each detected object, predicts what it is (its class), where it is (its bounding box) and which pixels belong to it (its mask). It works like Faster R-CNN with an additional mask prediction branch.
How is Mask R-CNN structured?
CNN Backbone (for example ResNet), which extracts deep feature maps
Region Proposal Network, which slides over feature maps and predicts object candidates
RoIAlign, which is an improvement of RoIPool: it keeps floating-point values & uses interpolation instead of max pooling
three parallel processes
Classification: outputs class probabilities
Bounding box regression: refines object location
mask prediction: small FCN which outputs a binary mask (pixel belongs to the object: yes/no) for every object proposal
Why are 1x1 convolutions useful and where are they used?
1x1 convolutions look at one spatial location in the image since they don’t consider neighboring pixels, but they mix all channels. This is a way to go from a lot of channels to less channels (for example, 100 channels -> 10 channels by applying one 1x1 convolution with 10 filters since per filter, one output channel is produced).
What are the challenges of using Deep Learning on 3D point clouds?
In images, data is on a regular grid and every pixel has fixed neighbors. In 3D point clouds, there is just a set of points in 3D space & therefore no grid (they are unstructured!). Since there is no fixed spatial structure, standard convolutions don’t apply here.
Additionally, a point cloud is an unordered set, so the same points in a different order represent the same set. But neural networks treat inputs as ordered tensors and will therefore give different outputs if the order changes. Therefore, the network must be permutation invariant (the output doesn't depend on point order).
Also, 3D point clouds have varying number of points and point density.
What is PointNet?
PointNet is a unified framework for various tasks, such as object classification, part segmentation or scene parsing. It is now widely used as a feature extraction module within more complex architectures. It learns directly from raw point sets by treating each point independently and then combining them with a permutation-invariant operation (output is independent from point order).
How is PointNet structured for classification?
First, each point is processed independently with MLP (multi-layered perceptrons). Then, max pooling (or mean pooling) is applied across points for each feature channel so that from all points, the highest values for each feature channel are selected (which gives the highest activation per dimension). This is symmetric since the order of the points does not matter. Then, another MLP is applied to output the class scores.
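A minimal NumPy sketch of the core idea (a shared per-point MLP followed by max pooling over the points). The weights are random, so the result is not a meaningful classification; it only demonstrates that the global feature is permutation invariant:

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_mlp(points, W1, W2):
    """Apply the same small MLP to every point independently (ReLU in between)."""
    h = np.maximum(points @ W1, 0)    # (N, 3)  -> (N, 64) per-point features
    return np.maximum(h @ W2, 0)      # (N, 64) -> (N, 128) per-point features

W1 = rng.normal(size=(3, 64))
W2 = rng.normal(size=(64, 128))

cloud = rng.normal(size=(1000, 3))             # 1000 points with (x, y, z) coordinates
feat = shared_mlp(cloud, W1, W2).max(axis=0)   # max pooling over points -> global 128-d feature

shuffled = rng.permutation(cloud)              # same points, different order
feat2 = shared_mlp(shuffled, W1, W2).max(axis=0)
print(np.allclose(feat, feat2))                # True: the output does not depend on point order
```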
Why does PointNet use MLP instead of convolutions?
Convolutions need a grid with a fixed notion of "neighbor", since the filters in them always look left, right, up, down etc. Point clouds don't have that, since they are scattered irregularly and have varying point densities, therefore PointNet needs to use something else. It uses point-wise MLPs, so it looks at one point at a time.
What are the downsides of PointNet?
The points don’t interact before max pooling (or mean pooling) is applied, so PointNet doesn’t model local neighborhood relationships. PointNet++ addresses this by building local neighborhoods and applying PointNet hierarchically within them.
What is a Spatial Transformer Network?
A Spatial Transformer Network learns how to translate & rotate the input into a normalized view, which makes the later prediction easier. Example: rotate a number that lies horizontally into its “normal” upright view.
It can be useful for processing point clouds, since in point clouds, objects that are the same but are rotated differently look very different to neural networks.
They are not used as standalone modules today, but their key idea of learned geometric normalization remains central to many modern architectures.
What happens in (the first) MLP of PointNet?
MLP is used in PointNet on every point of the input layer and processes each point individually.
One neuron computes the plane equation; the output is the distance to the plane. After ReLU, points on one side of the plane activate, and points on the other side are zeroed. Each neuron tests: Is this point on this side of the plane?
Since in MLP, the layers are fully connected, the outputs of all neurons in one layer are inputs for all neurons in the next layer.
Therefore, in deeper layers, the neurons activate when a point satisfies several geometric constraints (basically a combination of several plane equations). This builds edges, corners & convex regions.
That is why the MLP represents a learnable decomposition of 3D space into convex cells, and the points activate depending on how they are located with respect to these cells.
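A tiny NumPy illustration of the “plane test” view of one point-wise neuron (the weights and points are made-up values): w·x + b is the scaled signed distance to a plane, and ReLU keeps only points on one side.

```python
import numpy as np

w = np.array([0.0, 0.0, 1.0])   # plane normal (here: the z axis), illustrative values
b = -1.0                        # plane offset: the plane z = 1

points = np.array([[0.0, 0.0, 2.0],    # above the plane -> activates
                   [0.0, 0.0, 0.5]])   # below the plane -> zeroed by ReLU

activation = np.maximum(points @ w + b, 0.0)
print(activation)               # [1.  0.]
```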
Why doesn’t k nearest neighbors work anymore for geographical 3D point clouds?
Large, flat objects like roofs and roads may look the same locally with just k = 20, so the model won’t be able to distinguish between them. But, if the neighborhood is chosen to be bigger, the result isn’t that good anymore for smaller objects like poles, power lines, or cars.
What is Multi-Scale Feature Extraction and why is it used?
Multi-Scale Feature Extraction is a method that is used for 3D Point Cloud classification. It doesn’t ask what the k nearest neighboring points look like, but instead asks what the point looks like at multiple spatial scales.
This means that for every sampled point, Multi-Scale Feature Extraction collects its neighbors multiple times, but with either different radii (multi-scale grouping) or different resolutions (multi-resolution grouping).
What is the difference between Multi-Scale Grouping (MSG) and Multi-Resolution Grouping (MRG)?
Both are used in Multi-Scale Feature Extraction.
MSG takes the same center point and builds several neighborhoods, each with a different radius. It captures what this point looks like locally vs. globally and extracts features separately to concatenate them later.
MRG uses multiple layers with different resolutions and samples points at each layer, groups them and extracts features.
Give some information about Multi-Resolution Feature Extraction.
Where are the centers of the regions located from which the features are to be extracted?
How many regions are to be defined?
How large should they be?
How many points are to be included in each region?
In Multi-Resolution Feature Extraction, the centers of the regions are existing point positions. There are as many regions as there are points, but for efficiency reasons, it’s better to use a sampled subset of the points. The regions should be large enough for their totality to cover all points of the point cloud (& ideally have some overlaps too). There need to be as many points in each region as the MLP needs to extract features from.
How does the concept for Point Cloud Classification look like?
First, there are the (point) set abstraction layers (typically 3-4), where the number of points decreases, but the number of feature channels increases. This is the (feature) encoder. Next is the decoder with the feature propagation layers, where the number of points increases again and the feature channels are distributed to new points. There are also skip connections between the encoder and the decoder, similar to U-Net.
How do the Set Abstraction layers/Feature Encoder for Point Cloud Classification work?
The set abstraction layers are a combination of sampling & grouping as well as PointNet.
In the first step of a set abstraction layer, a (predefined) number of points is sampled for which the features are to be extracted - so, the spatial resolution is thinned out while the feature depth increases.
For every sampled point, a specified number of points within a defined neighborhood region are identified and collected into a group. This is called grouping.
The feature tensors from these groups are collected.
Then, for every group of neighbor points, features are extracted with the help of PointNet.
So: it consists of sampling, neighborhood grouping, and feature extraction.
How does Farthest Point Sampling work?
Farthest Point Sampling is used as a Sampling Method in Point Cloud Classification. It’s used to thin the spatial resolution of the point cloud.
First, insert the first point from the input set into the result set. Then, while the result set does not contain the requested number of points, identify the point with the largest minimum distance to all points which are already in the result set and insert it into the result set.
This would lead to 4 points which form a circle of sorts, then one in the middle, etc.
This results in a uniform point selection despite an uneven point density.
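A minimal (unoptimized) NumPy sketch of farthest point sampling, following the procedure described above:

```python
import numpy as np

def farthest_point_sampling(points, num_samples):
    """points: (N, 3) array; returns indices of the sampled points."""
    selected = [0]                                   # start with the first point
    # Distance of every point to the closest already-selected point
    min_dist = np.linalg.norm(points - points[0], axis=1)
    while len(selected) < num_samples:
        next_idx = int(np.argmax(min_dist))          # largest minimum distance
        selected.append(next_idx)
        dist_to_new = np.linalg.norm(points - points[next_idx], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return np.array(selected)

cloud = np.random.rand(1000, 3)
idx = farthest_point_sampling(cloud, 16)
print(idx.shape)   # (16,)
```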
How does Grouping work for Point Cloud Classification?
For every sampled point, identify a specified number of points within a defined neighborhood region and collect them into a group. This is done using a fixed ball radius query or k nearest neighbors. Each group is centered at the sampled point, resulting in a local coordinate system with the sampled point at the origin.
If a fixed-radius ball query is used, a fixed neighborhood size is enforced, which may include duplicate points when too few neighbors are found. Since the symmetric function yields the same output despite the presence of duplicate points, this doesn’t matter.
If k nearest neighbors is used, every neighborhood has k unique points, but the neighborhoods may be of different sizes.
In addition to the centered coordinates of the k neighboring points, the corresponding feature channels of these neighbors are also included for the subsequent feature extraction.
So, the centered coordinates are grouped as well as the features of the points. Then, they are concatenated before they are processed with PointNet.
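A small NumPy sketch of a ball-query grouping for one sampled center point (radius and group size are made-up values); it collects neighbors within a radius, duplicates points if too few are found, and expresses them in local (centered) coordinates:

```python
import numpy as np

def ball_query_group(points, center, radius=0.2, k=32):
    dist = np.linalg.norm(points - center, axis=1)
    neighbor_idx = np.where(dist < radius)[0]
    if len(neighbor_idx) == 0:
        neighbor_idx = np.array([np.argmin(dist)])   # fall back to the nearest point
    # Duplicate (or truncate) indices to enforce a fixed group size of k
    neighbor_idx = np.resize(neighbor_idx, k)
    return points[neighbor_idx] - center             # local coordinate system

cloud = np.random.rand(1000, 3)
group = ball_query_group(cloud, cloud[0])
print(group.shape)   # (32, 3), centered at the sampled point
```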
What is the Bottleneck of Point Cloud Classification?
The bottleneck is the area in the network where the set abstraction layers produce their output: few points with rich features that went through all abstraction layers, further points with intermediate features that went through a few layers but were not sampled for the following ones, as well as the remaining points for which no feature extraction was performed at all.
The feature propagation layers use this output as an input. They reintroduce points that were discarded by the set abstraction layers and propagate features from the feature-rich points to the reintroduced points, which combines all available feature information per point.
How does Feature Propagation work for Point Cloud Classification?
Feature propagation is performed by interpolating each feature channel using inverse distance weighting (IDW) from the three nearest neighboring points.
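A minimal NumPy sketch of this inverse-distance-weighted interpolation (loop-based for clarity, not efficiency):

```python
import numpy as np

def propagate_features(known_xyz, known_feat, query_xyz, k=3, eps=1e-8):
    out = np.zeros((query_xyz.shape[0], known_feat.shape[1]))
    for i, q in enumerate(query_xyz):
        dist = np.linalg.norm(known_xyz - q, axis=1)
        nn_idx = np.argsort(dist)[:k]                 # 3 nearest known points
        weights = 1.0 / (dist[nn_idx] + eps)          # inverse distance weights
        weights /= weights.sum()
        out[i] = weights @ known_feat[nn_idx]         # weighted feature average
    return out

known_xyz = np.random.rand(64, 3)
known_feat = np.random.rand(64, 128)
query_xyz = np.random.rand(1024, 3)
print(propagate_features(known_xyz, known_feat, query_xyz).shape)   # (1024, 128)
```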
What are the differences in the feature encoder for images vs. 3D point clouds?
For images, the features at a pixel location are extracted using a 2D convolution on k x k neighbor pixels. For 3D point clouds, features at a point location are extracted using PointNet operations on k neighbor points.
For images, the number of pixels is reduced by max pooling or by using convolutions with a (higher) stride.
For 3D point clouds, this is done by sampling.
What are the differences in the decoder for images vs. 3D point clouds?
For images, upsampling is done with learnable transposed 2D convolutions. For 3D point clouds, upsampling is done with inverse distance weighting.
Pixel features in images are combined by 2D convolutions.
Point features in 3D point clouds are combined using MLPs (point-wise convolutions).
What is the Exploding Gradient problem?
The exploding gradient problem occurs during the training of deep neural networks, especially recurrent neural networks (RNNs), when the gradients become extremely large as they are propagated backward through the network.
During backpropagation, gradients are computed using the chain rule. If the derivatives have values greater than 1, repeated multiplication can cause the gradient to grow exponentially as it moves backward through layers or time steps.
This can cause training to diverge and lead to numerical instability.
If this is a problem: use gradient clipping (scale gradient if its norm is too big)
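A minimal PyTorch sketch of gradient clipping in a training step (the tiny model and loss are only placeholders): the gradient is rescaled if its global norm exceeds max_norm.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss = model(torch.randn(4, 10)).pow(2).mean()

optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # clip before the update
optimizer.step()
```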
What activation functions did we learn about? Are they zero-centered or not?
sigmoid = 1/(1+e^-x), not zero-centered
tanh, zero-centered
ReLU = max(0, x), not zero-centered
Leaky ReLU = max(0.1x, x), doesn’t cause neurons to “die”
ELU (Exponential Linear Unit), all benefits of ReLU, slower to compute
Maxout, generalizes ReLU & Leaky ReLU
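NumPy definitions of these functions as a quick reference (ELU shown with alpha = 1; Maxout is omitted since it needs several learned linear maps per unit):

```python
import numpy as np

def sigmoid(x):     return 1.0 / (1.0 + np.exp(-x))           # range (0, 1), not zero-centered
def tanh(x):        return np.tanh(x)                          # range (-1, 1), zero-centered
def relu(x):        return np.maximum(0.0, x)                  # zeroes negative inputs
def leaky_relu(x):  return np.maximum(0.1 * x, x)              # small slope for negatives
def elu(x, a=1.0):  return np.where(x > 0, x, a * (np.exp(x) - 1.0))

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x), relu(x), leaky_relu(x))
```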
Does it matter whether the activation function outputs zero-centered or not zero-centered values?
Non-zero-centered activations:
Shift the mean activation away from zero
Create correlated gradients across neurons, since gradients tend to have the same sign
Make the loss surface harder to optimize
Zero-centered outputs reduce this correlation and make optimization smoother.
Which activation function to choose?
Use ReLU by default, but be careful with the learning rates
Leaky ReLU, ELU, and Maxout are worth trying out
Do not expect much improvement from tanh
Do not use sigmoid (unless your outputs need to be between 0 and 1)
There are many more activation functions to choose from
What kind of regularization methods did we learn about?
add regularization term to loss function, for example L1 or L2 or Elastic Net (L1 + L2)
dropout (In every training step, neurons are ignored during training with a certain probability (hyperparameter dropout rate), prevents co-adaptation - high dependency of neurons, forces active neurons to learn more robust features; usually results in much better models, but slows down convergence)
max-norm regularization (constrains the weights for each neuron)
data augmentation (Increase the size of the training dataset by generating variants of each training instance, e.g. flipping the image, cropping, scaling, randomizing contrast & brightness, rotation etc.; increases the size of the training data, but also increases the network’s tolerance to variations in position, orientation etc.)
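A minimal PyTorch sketch combining two of the listed methods, L2 regularization (via the optimizer’s weight_decay) and dropout between fully connected layers (the layer sizes are made up):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Dropout(p=0.5),                 # dropout rate is a hyperparameter
    nn.Linear(64, 10),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)  # L2 penalty

model.train()                          # dropout active during training
print(model(torch.randn(4, 128)).shape)
model.eval()                           # dropout disabled at inference time
```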
How can you schedule the learning rate?
Piecewise constant scheduling: Use a constant learning rate for a number of epochs, then drop it to a smaller value and use it again for a number of epochs
Performance scheduling: Measure the validation error every n epochs and reduce the learning rate by a given factor 𝜆 whenever the validation error stops going down
Power scheduling: the learning rate is a function of the iteration number t, e.g. η(t) = η0 / (1 + t/s)^c, which drops at each step and slows down more and more depending on s (a combination of time-based decay and polynomial decay)
Exponential scheduling: the learning rate drops by a factor of 10 every s steps, e.g. η(t) = η0 · 0.1^(t/s)
1cycle scheduling: increase the initial learning rate η0 linearly up to η1 in the first half of training, decrease it back down to η0 in the second half, then linearly drop the learning rate by several orders of magnitude in the last few epochs -> often able to considerably speed up training and reach better network performance
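PyTorch ships built-in counterparts for several of these schedules; a hedged sketch of their construction (in practice you would create and use only one of them per optimizer, and their exact decay formulas may differ slightly from the lecture’s definitions):

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Piecewise constant: drop the LR by gamma every 30 epochs
step = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

# Performance scheduling: reduce the LR when the validation metric stops improving
plateau = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=5)

# Exponential decay applied every epoch
expo = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

# 1cycle: LR rises to max_lr, then falls again over total_steps
onecycle = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=0.1, total_steps=1000)
```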
What is the difference between optimizers and Scheduling?
Adaptive optimizers like Adam, RMSProp or AdaGrad adjust the learning rate per parameter (local!) & based on gradient statistics.
Scheduling changes the learning rate globally over time.
Adaptive optimizers: ✔ fast initial convergence ✔ handle poorly scaled gradients ❌ can settle in sharp minima ❌ may generalize worse if the LR stays high
Scheduling: ✔ helps fine-tune weights later ✔ encourages convergence to flatter minima ✔ improves generalization
Modern deep networks:
Rarely get stuck in bad local minima
Much more often slow down at saddle points
Optimizers help by:
Momentum → escape flat regions
Adaptive steps → handle curvature differences
Schedulers help by:
Reducing noise as training progresses
Allowing convergence after exploration
In which cases is a feedforward network not sufficient? What can be used instead?
Feedforward networks assume independence between inputs. However, many problems, such as image -> words or time series, involve sequences where order matters. For these problems, Recurrent Neural Networks can be used instead, which introduce memory so the model can use past information.
What are the 3 typical input-output patterns of Recurrent Neural Networks?
many -> many (e.g. time series prediction) with loss function which includes the partial losses of all steps; the model is trained on all time steps
many -> one (e.g. sequence classification)
one -> many (e.g. sequence generation)
What are some examples of RNNs?
Polygon-RNN: input bounding box, output is outline of polygon which allows user correction
character-based language model: input word, output is predicted next character at each step; uses characters as vocabulary
code generator
How does Backpropagation work in Recurrent Neural Networks (RNN)?
There are 2 types of backpropagation (backpropagation through time):
full: forward pass over entire sequence, backpropagate through all time steps -> vanishing/exploding gradients, very expensive
truncated: backprop only over last k steps, trade-off between efficiency & long-term dependencies
What is the Long Short Term Memory LSTM?
A vanilla RNN forgets important information over long sequences, while an LSTM learns what to forget/store/expose.
A LSTM has two states: cell state (long-term memory) & hidden state (short-term memory -> output)
Each gate (input, forget, etc.) looks at previous memory or input & decides what to do with the information & what to expose as current output. Each gate is basically like a small NN. So the memory isn’t overwritten for every step like in a vanilla RNN, but instead the memory is updated additively.
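A minimal PyTorch sketch of an LSTM processing a sequence (sizes are made-up values); it carries a hidden state h (short-term memory) and a cell state c (long-term memory) across the time steps:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(4, 20, 8)                  # (batch, sequence length, features)

output, (h, c) = lstm(x)
print(output.shape)                        # (4, 20, 16): hidden state at every time step
print(h.shape, c.shape)                    # (1, 4, 16): final hidden and cell states
```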
On which input structures do these NN work?
MLPs
RNNs
CNNs
GNNs
MLP: fixed-size vectors
RNN: sequences
CNN: grids (images)
GNN: graphs
What is the challenge for working with a graph in Neural Networks?
Graphs are irregular -> no fixed number of neighbors, no natural grid.
How do Graph Neural Networks (GNN) work?
Graph Neural Networks use graphs as input.
A graph has nodes/vertices, where each node has a feature vector. The lines between the nodes are called edges. The connectivity between them is described by an adjacency matrix (tells you which node is connected with which).
At each layer of a GNN, a node updates its feature vector by aggregating the feature vectors (e.g. with mean) of its neighbors. This is called message passing. The updated feature vector is called node embedding.
After multiple (n) layers, nodes contain information from 1-hop & 2-hop & n-hop neighbors.
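A minimal NumPy sketch of one message-passing layer with mean aggregation (random weights, tiny 3-node graph): each node’s new embedding is a transform of the mean of its neighbors’ feature vectors.

```python
import numpy as np

A = np.array([[0, 1, 1],                   # adjacency matrix of a 3-node graph
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
X = np.random.rand(3, 4)                   # one 4-dimensional feature vector per node
W = np.random.rand(4, 8)                   # learnable weights (random here)

deg = A.sum(axis=1, keepdims=True)         # number of neighbors per node
H = np.maximum((A @ X / deg) @ W, 0.0)     # mean aggregation -> linear map -> ReLU
print(H.shape)                             # (3, 8): updated node embeddings
```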
What are tasks that can be solved with a GNN?
graph classification: one label per entire graph (e.g. this is a radial/rectangular drainage pattern -> geomorphology)
node classification: one label per node (e.g. for simplifying polygons (keep node/remove node))
node regression: predict continuous values per node (e.g. predict temperature at weather stations or predict traffic speed)
What frameworks did we learn about for Graph Neural Networks?
GraphSAGE (Graph Sample and Aggregation)
GAT (Graph Attention Network)
GCN (Graph Convolutional Network)
How does GraphSAGE work?
GraphSAGE is a Graph Neural Network framework.
When some nodes have many neighbors, aggregating all of them is computationally expensive. GraphSAGE solves this by neighbor sampling. For each node, a fixed number of neighbors is randomly sampled, which results in a subgraph containing the target node & its randomly selected neighbor nodes, whose features are then aggregated (e.g. with mean, which treats all neighbors the same).
Since the number of neighbors to be sampled is fixed, all sampled subgraphs have the same structure, which means they can be processed in parallel batches.
Due to the sampling, GraphSAGE is scalable to millions of nodes.
How does GAT (Graph Attention Network) work?
GAT introduces attention; GraphSAGE, for example, treats all the sampled neighbors the same, but not all neighbors are equally important. GAT therefore computes the importance of each neighbor to a specific node.
For every target node in the graph, GAT learns a node embedding (feature vector) by combining the immediate neighboring nodes, usually the 1-hop neighbors.
For each node pair, an attention score is computed by applying a shared attention mechanism to the concatenation of their (transformed) node embeddings.
This score is then normalized using the softmax function to get a probability distribution for how important a neighbor is to a node (how much the neighbor should contribute to the prediction of that node).
Graph Attention Network (GAT) is computationally efficient as computation over 𝑚 attention heads can be done independently and is therefore parallelizable
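A single-head, GAT-style attention sketch in NumPy for one target node (all weights random, the LeakyReLU slope of 0.2 is an assumption): scores come from a shared attention vector applied to concatenated embeddings, then softmax normalization.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

W = np.random.rand(4, 8)                      # shared linear transform
a = np.random.rand(16)                        # shared attention vector (2 * 8)

h_target = np.random.rand(4) @ W              # transformed target node embedding
h_neigh = np.random.rand(5, 4) @ W            # 5 transformed neighbor embeddings

# Attention score per neighbor: LeakyReLU(a^T [h_target || h_neighbor])
scores = np.array([np.concatenate([h_target, hn]) @ a for hn in h_neigh])
scores = np.where(scores > 0, scores, 0.2 * scores)   # LeakyReLU
alpha = softmax(scores)                       # importance of each neighbor (sums to 1)
new_embedding = alpha @ h_neigh               # weighted combination of neighbors
print(alpha, new_embedding.shape)             # (5,) weights, (8,) embedding
```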
What Aggregation functions are used for Graph Neural Networks?
Mean: computes the mean feature vector of the neighbor feature vectors
LSTM (Long short term memory): Since LSTM cells process input data sequentially, the neighbor nodes are randomly shuffled, and their feature vectors inputted to the LSTM in this random order
Pooling: The neighbors’ feature vectors are fed to a feedforward neural network (MLP), and a pooling operation (min, max) is applied to the result
alternative: include the target node itself in the feature aggregation as well
How is a GNN structured?
Input (graph), then a feature aggregation, then some other function like an activation function. This is repeated several times; at the end, class label probabilities can be derived using the softmax function.
How do GCNs (Graph Convolutional Networks) work?
GCNs connect the concept of GNNs to convolution. For a GNN, a convolution isn’t the weighted sum of neighboring pixels, but the weighted sum of neighboring nodes.
For this, the (normalized) adjacency matrix is used (normalized so that nodes with many neighbors don’t get larger values than nodes with few neighbors).
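A minimal NumPy sketch of one GCN layer using the common symmetric normalization with self-loops, D^-1/2 (A + I) D^-1/2, followed by a linear map and ReLU (random weights, tiny graph):

```python
import numpy as np

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
X = np.random.rand(3, 4)                           # node features
W = np.random.rand(4, 8)                           # learnable weights (random here)

A_hat = A + np.eye(3)                              # add self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt           # D^-1/2 (A + I) D^-1/2
H = np.maximum(A_norm @ X @ W, 0.0)                # one GCN layer with ReLU
print(H.shape)                                     # (3, 8)
```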
How could you do Polygon Simplification with Graph Neural Networks?
do node classification (which node to keep or remove or move) -> softmax output
do node regression (predict displacement for “move” nodes; multiply the displacement values with the node label, where every label that is not “move” is set to 0, so that only “move” nodes actually move)
What does Point-Pair Graph Learning do (in simple terms)?
Point-Pair Graph Learning basically detects points (nodes) first, then learns which pairs of points should be connected (predicts edges).
What is PGGNET?
PGGNET is a Deep Learning Framework for Point-Pair Graph Learning, which learns the graph structure from images by detecting nodes from an RGB image (junction detection), predicting edges (line segment alignment) & constructing a graph from these.
How does PGGNet work?
PGGNet uses RGB images as an input.
A convolutional layer predicts a junction heatmap (to detect nodes) from a feature map and keeps the K strongest junctions, which become the graph nodes.
For K junctions, a K x K adjacency matrix is constructed, where each pair is a potential edge. The feature map is then reused to sample feature values between two junctions/for every edge.
K is limited here since the number of candidate edges grows quadratically (K^2).
For each possible edge, PGGNet classifies whether the junctions are connected or not connected and outputs the learned adjacency matrix.
What is Point2Roof?
Point2Roof is a special case of Point-Pair Graph learning where polygons (roofs) are reconstructed from point clouds.
How does Point2Roof work?
Point2Roof uses point clouds as an input.
First, for each 3D point, it predicts whether it’s near a vertex (corner).
Then, for each candidate corner, an offset vector is predicted with regression, which pulls the points closer to the true corner (vertex). The final vertex positions are computed from the nearby points. These are the graph nodes.
Next, for each pair of nodes, the model decides whether an edge between them exists. For this, it’s assumed that every vertex is connected by an edge (complete graph).
For each vertex pair, their features are combined (the order doesn’t matter = symmetric). The network then learns which feature channels are important for each vertex pair (learns weights). Each feature channel of the vertex pair is then scaled with these weights so that more important ones are “boosted”. Then, max pooling is applied to keep the stronger signal from either vertex for each feature dimension (if one vertex strongly indicates that an edge should exist, the signal survives).
This then goes through a classifier which decides whether to keep the edge or not.
What is “Squeeze and Excitation” in the context of Point2Roof?
Squeeze = pairwise feature aggregation
Excitation = channel-wise attention weights
So “Squeeze” combines the information from the detected vertices, while “excitation” learns from the combined information how strongly each channel should be activated.
This is a step done to predict edges in Point2Roof. The next step is to choose which edge is removed and which is kept.