Define Machine Learning
field of building and understanding methods that leverage data to perform a given task
A method learns from experience E with respect to some class of tasks T and performance measure P,
if its performance at tasks in T (measured by P) improves with experience E
What are the different learning methods in ML ?
Supervised Learning
Labelled data
Can perform classification (discrete labels) or regression (continuous labels)
Unsupervised Learning
Unlabelled data
Reinforcement Learning
Agents performing actions in environment
Difference between supervised and unsupervised learning?
Supervised: learns a mapping from inputs to known targets using labelled data
Unsupervised: finds structure in data without any labels
What is semi-supervised ?
partly labelled data
What is weakly supervised ?
noisy labels
Examples of supervised machine learning methods
Linear Regression
Bayesian Regression
Support Vector Machines
Examples of unsupervised machine learning methods
PCA
Cluster algorithms (K-Means, Gaussian Mixture Models)
Autoencoder, GAN
Tasks of Machine Learning
- Classification
- Regression
- Text-to-speech
- Synthesis & Sampling
- Probability density estimation
- Anomaly detection
- De-noising
Performance Metrics of Machine Learning
- Distance
- Euclidean distance
- Manhattan distance
- Supervised
- Confusion Matrix
- MSE/RMSE
- Accuracy
- Precision
- Recall
- Unsupervised
- Pairwise correlation
- Mahalanobis distance
- Inter-/Intra-Cluster Distance
What is Precision?
What proportion of positive "predictions" was actually correct?
What is Recall ?
What proportion of actual positives was identified correctly?
Explain the confusion matrix
- True Positives: Positive predicted and actual positive
- False Positives: Positive predicted and actual negative
- True Negatives: Negative predicted and actual negative
- False Negatives: Negative predicted and actual positive
What is the standard linear regression equation?
General linear regression model: y(x, w) = w_0 + w_1*x_1 + ... + w_D*x_D
Assumption: the target variable t is the sum of the model's predicted value y(x,w) and some noise epsilon: t = y(x,w) + epsilon
What is the distribution of the noise that is assumed in Linear Regression ?
epsilon follows a normal distribution
with precision β
precision β is the inverse of the variance of the normal distribution:
in mathematical terms, β = 1/σ^2
a higher precision means a lower variance
What is the Maximum Likelihood Estimate (ML) ?
statistical method used for estimating the parameters of a model.
choose the parameter values that make the observed data most probable.
given a set of data and a statistical model, MLE finds the parameter values that maximize the likelihood function,
which measures how well the model explains the observed data.
How to get the MLE in linear regression ?
objective: minimise the Least Squared Error between the target t and the estimate y(x)
find the minimum of E: differentiate with respect to w and set derivative equal to zero
solve the normal equations Φ^T Φ w = Φ^T t
solving this for w, we find the least squares estimate w_ML = (Φ^T Φ)^(-1) Φ^T t (see the sketch below)
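A minimal numpy sketch of this closed-form solution; the synthetic data and the shape of the design matrix are assumptions for illustration:

```python
import numpy as np

# Toy data: t = 2x + 1 plus Gaussian noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
t = 2 * x + 1 + rng.normal(scale=0.1, size=50)

# Design matrix with a bias column of ones
Phi = np.column_stack([np.ones_like(x), x])

# Normal equations: (Phi^T Phi) w = Phi^T t
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)
print(w_ml)  # approximately [1.0, 2.0]
```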
What is the Maximum A Posteriori (MAP) Estimate ?
method used in Bayesian statistics to estimate a model's parameters.
closely related to the Maximum Likelihood Estimate (MLE),
key difference: MAP incorporates prior knowledge about the parameters through a prior distribution.
MAP estimate is found by maximizing the posterior: w_MAP = argmax_w p(w|t) ∝ p(t|w) p(w)
the evidence is the same for all parameter values (it is a constant based on the observed data), so it can be dropped during the maximization
not only about fitting the model to the data (as in MLE) but also about fitting it in a way that is consistent with what was believed about the parameters before the data was seen.
can lead to different estimates from MLE, when the prior is strong or the data is limited.
What is multivariate linear regression ?
technique that models the linear relationship between
multiple independent variables (also known as predictors or features)
and a dependent variable (also known as the response or outcome).
What is the design matrix ?
matrix that captures all the data used to make predictions about your dependent variable
the design matrix includes:
Intercept Term: If your model includes an intercept term (also known as a bias), the first column of the design matrix is typically a column of all ones.
Independent Variables:
Each column in the design matrix represents one independent variable (also known as a feature or predictor) in your dataset.
Each row represents an observation.
What are basis functions ?
functions used to represent the data within some space in a way that makes it easier to model.
basis functions transform the input variables into a new space where linear relationships can be more easily detected and modeled.
way to understand basis functions:
Simple Linear Basis Functions: the basis functions are just the identity function of the predictors. For example, if you have one predictor x, the basis function is φ(x) = x and the model is just y = c + w*x
Polynomial Basis Functions: polynomial basis functions like x, x^2, x^3, ..., x^n allow modeling nonlinear relationships while still using linear regression techniques (see the sketch below)
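A sketch of a polynomial basis expansion (degree 3 and the sine target are arbitrary illustration choices); the same least-squares solve then fits a non-linear curve with a model that is still linear in w:

```python
import numpy as np

def poly_design_matrix(x, degree):
    # Columns are the basis functions phi_j(x) = x^j, j = 0..degree
    return np.column_stack([x**j for j in range(degree + 1)])

x = np.linspace(-1, 1, 20)
t = np.sin(np.pi * x)                        # non-linear target
Phi = poly_design_matrix(x, degree=3)
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)  # still linear in w
```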
Why use basis functions ?
transform the input space,
allowing for more flexibility in modeling relationships between the independent variables (features) and the dependent variable (target).
effects and benefits of using basis functions in regression models:
Modeling Non-Linear Relationships:
can capture non-linear relationships by transforming the input data into a higher-dimensional space
where the relationship between the input and the output becomes linear.
For instance, a quadratic basis function can allow a linear model to fit a parabolic curve.
Increased Model Complexity:
effectively increase the complexity of the model.
advantageous if the true relationship between variables is complex and the basic linear model is not sufficient.
Improved Fit:
The use of appropriate basis functions can lead to a better fit of the model to the data, which can result in more accurate predictions.
What is the recipe for ML linear regression ?
Learn model parameters: construct the target vector t and the design matrix Φ; then maximize the likelihood to solve for the weights; then you are ready to make predictions
Apply the model to the data in the test set; Evaluate the RMSE between regressed estimate and measured target variable
Differences between MLE and MAP ?
Prior Information:
MAP incorporates prior knowledge through a prior probability distribution. ML does not
Objective:
ML maximizes the likelihood of observing the data given the parameters.
MAP maximizes the posterior probability of the parameters given the data and prior.
Results:
ML estimates are purely data-driven.
MAP estimates are influenced by both data and prior beliefs.
Convergence with Data:
For large data sets, MAP estimates may converge to ML estimates, assuming the prior is not extremely strong.
Define regularization
used to prevent overfitting
by adding a penalty on the size of the model parameters to the loss function used to train the model.
This encourages the model to be simpler, making it generalize better to new data by keeping the weights small and reducing the model's complexity.
Common types of regularization include L1 (Lasso) and L2 (Ridge) regularization.
Example
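A minimal sketch of L2 (ridge) regression in closed form; lasso (L1) has no closed form and is usually solved iteratively, e.g. with coordinate descent. The penalty strength `lam` is an arbitrary illustration value:

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    # Minimizes ||Phi w - t||^2 + lam * ||w||^2
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ t)

# Larger lam shrinks the weights toward zero (simpler model)
```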
Lasso vs. Ridge Regression (and when to use what?)
Use L2 (Ridge) Regression when:
multicollinearity in your features
want to include all features
more features than observations
Use L1 (Lasso) Regression when:
want a sparse model
want to perform feature selection
a limited dataset: When you have a smaller dataset, Lasso can help by selecting only the most important features, which may lead to better performance on unseen data.
Reflect Bayesian regression for linear models
incorporates prior beliefs about parameters
updates these beliefs after observing data.
helps to regularize the solution, preventing overfitting.
1. Define the Prior
Establish the prior distribution for the parameters, typically a Gaussian distribution, `p(w) = N(w|m0, S0)`.
`m0` and `S0` represent the mean and covariance of the prior belief about the parameters.
2. Formulate the Likelihood Function
The likelihood function `p(T|X, w, β)` describes the probability of observing the target values `T` given the input features `X`, parameters `w`, and noise precision `β`.
3. Determine the Posterior Distribution
Using Bayes' theorem to calculate the posterior distribution, `p(w|T, X, β)`,
updates our beliefs about the parameters after observing the data.
the posterior is proportional to the product of the likelihood and the prior.
4. Compute the Posterior Mean and Covariance
Calculate the mean of the posterior distribution using the formula `mN = SN(S0^(-1)m0 + βΦ^T T)`.
Determine the covariance of the posterior distribution with `SN = (S0^(-1) + βΦ^T Φ)^(-1)`.
5. Interpret the Posterior as a Normal Distribution
the posterior is still a Gaussian distribution with updated mean `mN` and covariance `SN`.
the peak of the posterior distribution is the most probable estimate of the parameters (MAP estimate), which is `w_MAP = mN`.
6. Role of the Conjugate Prior
a conjugate prior is chosen so that the posterior distribution remains in the same family as the prior, simplifying calculations.
7. Role of Evidence
evidence, `p(T|X, β)`, is not incorporated when we're only interested in the parameter values that maximize the posterior.
8. Benefits of Bayesian Linear Regression
Bayesian regression accounts for uncertainty in parameter estimates.
it prevents overfitting by incorporating prior knowledge and updating beliefs in a principled manner.
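A numpy sketch of the update formulas from step 4, assuming the design matrix `Phi`, targets `T`, prior parameters `m0`, `S0`, and noise precision `beta` are given:

```python
import numpy as np

def posterior(Phi, T, m0, S0, beta):
    # SN = (S0^-1 + beta * Phi^T Phi)^-1
    SN = np.linalg.inv(np.linalg.inv(S0) + beta * Phi.T @ Phi)
    # mN = SN (S0^-1 m0 + beta * Phi^T T)
    mN = SN @ (np.linalg.inv(S0) @ m0 + beta * Phi.T @ T)
    return mN, SN  # posterior mean (= w_MAP) and covariance
```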
Relate Bayesian sequential learning to regression
sequential learning is the process of updating beliefs about model parameters with each new piece of data.
it treats the posterior from previous data as the new prior.
Update the Prior with New Data
When a new data point arrives,
use the posterior distribution from the previous `N-1` samples as the prior for the new data point.
Calculate the New Posterior
Formulate the posterior probability for the first `N-1` samples
then update this posterior with the likelihood of the new `N-th` data point to get the posterior for `N` samples.
Sequential Update Equation
sequential update for the posterior probability with the new data point, which involves multiplying the previous posterior by the likelihood of the new data point.
Advantages
Allows the model to be updated in real time as new data comes in.
Useful when data arrives sequentially over time, e.g., in online learning or real-time prediction systems.
Apply Full Bayesian approach by computing the predictive distribution by integration over all models
Recall that the posterior distribution `p(w|T, X, α, β)` is a Gaussian with mean `mN` and covariance `SN`.
`mN` and `SN` are derived from the observed data.
The predictive distribution `p(t|x, T, α, β)` allows for making predictions about new, unseen data.
This distribution is calculated by integrating the noise model with the posterior distribution over the parameters.
The predictive distribution is obtained by integrating the product of the likelihood (noise model) `p(t|x, w, β)` and the posterior `p(w|T, X, α, β)` over all possible weights `w`.
The likelihood function for the noise is given by a Gaussian distribution, which represents deviations of the observed target values from the model predictions.
the integral of the product of two Gaussians is itself a Gaussian.
The resulting Gaussian represents the predictive distribution.
The predicted mean `y(x, mN)` is the mean of the predictive distribution, which uses the posterior mean `mN` of the parameters.
Determine the predictive variance `σ^2_N(x)` which quantifies the uncertainty of the prediction.
The predictive variance includes a term from the noise model and the posterior covariance.
the resulting predictive distribution `N(t|y(x, mN), σ^2_N(x))` provides a mean and a variance for the prediction, capturing the uncertainty.
the Full Bayesian approach is focused on predicting distributions, not just point estimates, which is essential for capturing uncertainty in predictions.
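A sketch of the predictive mean and variance for a new input, using the standard result σ^2_N(x) = 1/β + φ(x)^T S_N φ(x); `phi_x` is assumed to be the new input's feature vector:

```python
import numpy as np

def predictive(phi_x, mN, SN, beta):
    mean = phi_x @ mN                       # y(x, mN)
    var = 1.0 / beta + phi_x @ SN @ phi_x   # noise term + parameter uncertainty
    return mean, var
```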
Flat prior: with a flat (uniform) prior, the MAP estimate coincides with the MLE.
No data: with no observed data, the posterior equals the prior.
Describe Linear Discriminant Functions
used to find a linear combination of features that separates two or more classes
resulting combination is used as linear classifier
The decision rule is a linear equation: y(x) = w^T x + w_0
It creates a decision surface in the feature space where y(x) = 0, dividing the space into regions for different classes.
An input is assigned to class C_1 if y(x) >= 0 and to class C_2 if y(x) < 0, assuming binary classification and linearly separable classes
Advantages of LDA
learns the boundaries directly from the data.
no need to estimate the probability density function (pdf) of the data.
weights in the function give insight into the model: the sign indicates positive or negative effect, and the magnitude indicates the importance of a feature.
What approaches can be used for LDA ?
Least squares
Fisher’s Linear Discriminant
Perceptron
How does the LDA method - Least Squares work ?
Given a dataset with features and labels
x_n represents the feature vector
t_n represents the target label
aim to find weights that minimize the sum of squared differences between predicted values and actual targets: min_W ||XW - T||_F^2
X is the design matrix
W is the weight matrix
T is the target matrix
The subscript F denotes the Frobenius norm (a measure of the size of a matrix)
Pro’s and Con’s for LDA-Method Least Squares
Pros
Closed form solution
Cons
Not robust (sensitive to outliers)
Outputs are not probabilities (not constrained to (0,1))
Pro’s and Con’s for LDA-Method Perceptron
Pros
Suitable for large datasets as it processes one sample at a time
Guaranteed to converge to a solution if classes are linearly separable
Cons
No unique solution; depends on initial weights and order of data points
Will not converge if classes are not linearly separable
Does not generalize to multi-class problems (only for 2 classes)
Outputs are not probabilities (not constrained to (0,1))
Pro’s and Con’s for LDA-Method Fisher’s Linear Discriminant
Pros:
Dimensionality Reduction
suitable for multi-class problems
closed form solution
Cons (assumptions; works best if):
Class means should differ
Gaussian distribution within classes
Similar sample sizes among classes
Mathematical Approach for Fisher’s Linear discriminant
Minimize within-class scatter
Maximize between-class scatter
objective is to maximize the Fisher criterion, which is the ratio of the between-class scatter to the within-class scatter.
To find the best projection, we need to maximize J(w); setting its derivative to zero yields w = c * S_W^(-1)(m_2 - m_1)
c is a constant
the rest of the equation gives the direction that maximizes the separation between the projected class means while also considering the spread of the classes (see the sketch below)
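A numpy sketch of the two-class Fisher direction for data matrices `X1`, `X2` (rows are samples; both names are illustration placeholders):

```python
import numpy as np

def fisher_direction(X1, X2):
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: sum of the two class scatter matrices
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)  # direction maximizing J(w)
    return w / np.linalg.norm(w)
```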
Step-by-step LDA - Perceptron
Input data D (x is the feature vector, t is the corresponding target label in {-1, 1})
Minimize the misclassification error with stochastic gradient descent, iteratively updating the weights (see the sketch below)
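A minimal perceptron sketch using the standard SGD update w ← w + η·t_n·x_n for misclassified points; the learning rate and epoch count are arbitrary:

```python
import numpy as np

def perceptron(X, t, lr=1.0, epochs=100):
    # X: (N, D) features including a bias column; t: labels in {-1, +1}
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_n, t_n in zip(X, t):
            if t_n * (w @ x_n) <= 0:   # misclassified sample
                w += lr * t_n * x_n    # move the boundary toward x_n
    return w
```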
What are Probabilistic Generative Models ?
assume that the data for each class is generated from a Gaussian distribution
for class c_k, the class-conditional density is a Gaussian: p(x|c_k) = N(x|μ_k, Σ_k)
decision rule for classification is to choose the class that maximizes the discriminant function
LDA for Probabilistic Generatve Models
1. Assume Gaussian Distribution for each class
2. Calculate Mean and Covariance for each class
3. Compute Prior Probabilities for each class
4. Decision Function: The decision function for each class is derived from the log of the posterior probability. By applying Bayes' theorem and simplifying, it can be expressed in terms of the class mean, covariance, and prior.
5. Classify New Samples: compute the decision function for each class and assign x to the class with the maximum value of the decision function.
LDA typically makes two additional simplifications to the covariance matrices:
Common Covariance Matrix: assumes that all classes share the same covariance matrix, Σ, which leads to linear decision boundaries.
Diagonal and Equal Variances: In the simplest case, LDA can further assume that the covariance matrix is diagonal with equal variances across all dimensions, which simplifies the computation even further.
Probabilistic Generative Models - Cases of Assumptions
- Different covariance cases
Individual Covariances (no linear decision boundary)
One common covariance
One common diagonal covariance
One common diagonal and equal variance
—> Check that assumptions are made by the data
—> Preprocessing is important!
Define what kernel methods are
are a class of algorithms
map data into a higher-dimensional space using a kernel function
making it easier to perform linear separations between classes.
Commonly used in support vector machines (SVMs), for classification, regression, and other tasks.
The kernel function measures similarity between pairs of data points in the original space without explicitly performing the transformation.
Popular kernel functions include linear, polynomial, radial basis function (RBF), and sigmoid.
They enable complex decision boundaries in the original feature space, improving the flexibility and accuracy of machine learning models.
How would you apply kernel methods to new problems ?
Understand the problem domain
Choose an appropriate kernel
Preprocess the data: Prepare your data by cleaning (removing noise and outliers) and normalizing it if necessary. This step is crucial for the effective application of kernel methods.
Split the data
Train the model: Use a kernel-based algorithm like SVM to train your model on the training set. During training, the algorithm uses the kernel function to transform the data into a higher-dimensional space, where it finds the optimal boundary between classes or regression fit.
Tune hyperparameters
Evaluate the model
Apply to new data: Once satisfied with the model's performance, you can apply it to new problem instances.
Define what a Gaussian Process is
a probability distribution over possible functions that fit a set of points. For any collection of points, the joint distribution of the GP's outputs is multivariate Gaussian.
In contrast to Bayesian linear regression, which models target variables with a linear function and Gaussian noise, GPs model the target directly, without specifying intermediate weights or assuming a particular functional form.
The GP is fully characterized by a mean function (often assumed to be zero for simplicity) and a covariance function or kernel, which defines the relationship between different points in the input space.
The covariance matrix derived from the kernel function captures the essence of the GP. The kernel defines the smoothness and general behavior of the functions drawn from the process.
For new samples, GPs provide a predictive distribution which is also Gaussian, giving not just an estimate for the target but also the uncertainty associated with that estimate.
used in regression tasks to make predictions about new data points. The GP's ability to provide a measure of uncertainty with its predictions is particularly useful in many applications, like optimization and active learning.
How does the Kernel trick work ?
enables them to operate in a high-dimensional space without explicitly computing the coordinates of the data in that space.
Linear Regression with Kernels
Kernels transform the input data into a high-dimensional space, allowing linear models to capture non-linear relationships.
Least Squares Error
cost function for regression minimizes the sum of squared errors between predictions and actual target values, with a regularization term to prevent overfitting.
Dual Representation
weights are expressed in terms of a new parameter set, which is a linear combination of the input data,
utilizing a kernel matrix that encapsulates the relationships between data points in the transformed space.
Rewriting the Cost Function
The cost function is reformulated in terms of the dual parameters and the kernel matrix, which is used to minimize the cost and train the model.
Advantages of the Dual Form
Only the kernel matrix is utilized, bypassing the need to compute high-dimensional feature vectors.
improves computational efficiency, especially when the transformed feature space is much larger than the number of data points.
Why Use the Dual Form?
Inverting a matrix related to the number of features (in the transformed space) is computationally costly.
The dual form involves inverting a matrix related to the number of data points, which is typically smaller and computationally less expensive.
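A sketch of the dual-form solution a = (K + λI)^(-1) t; prediction for a new point only needs kernel evaluations k(x, x_n). The RBF kernel and the `lam`/`gamma` values are illustration choices:

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_ridge_fit(X, t, lam=0.1, gamma=1.0):
    N = len(X)
    K = np.array([[rbf(X[i], X[j], gamma) for j in range(N)] for i in range(N)])
    a = np.linalg.solve(K + lam * np.eye(N), t)   # dual parameters
    return a

def kernel_ridge_predict(x_new, X, a, gamma=1.0):
    # y(x) = sum_n a_n k(x, x_n): never touches the feature space explicitly
    return sum(a_n * rbf(x_new, x_n, gamma) for a_n, x_n in zip(a, X))
```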
What is a valid kernel K with elements k(x_n,x_m) ?
also known as Gram matrix K
symmetric
positive semidefinite (for any vector v, v^T K v >= 0)
High kernel value for small distance (high similarity)
Low kernel value for large distance (low similarity)
Define what SVMs are
supervised machine learning algorithm
SVMs aim to find the best separating hyperplane that divides the data into classes.
"best" hyperplane is the one that maximizes the margin between different classes, where the margin is defined as the distance between the hyperplane and the nearest data points from each class, known as support vectors.
Support vectors are the data points nearest to the hyperplane; the position of the hyperplane is entirely dependent on these points.
In classification tasks, SVMs can separate data points into two or more classes with a hyperplane in the feature space. For two classes, the goal is to find the optimal dividing line (in 2D), plane (in 3D), or hyperplane (in higher dimensions).
SVMs can employ the kernel trick to handle non-linearly separable data. Kernel functions implicitly map input features into high-dimensional feature spaces where a linear separation is possible
Advantages:
effective in high-dimensional spaces, even when the number of dimensions exceeds the number of samples.
memory efficient since they use a subset of training points in the decision function (support vectors).
They can provide different kernel functions for the decision function and can specify custom kernels.
Limitations:
do not directly provide probability estimates, which are calculated using an expensive five-fold cross-validation.
can be less effective on very large datasets or datasets with a lot of noise.
What is the Max-Margin Problem
Max-Margin problem is a formulation used to find the hyperplane that maximizes the margin between two classes in the feature space.
Objective Function
The goal is to find a hyperplane that correctly classifies all data points with the smallest possible weight-vector norm.
Mathematically, it's about minimizing (1/2)||w||^2
The minimization is subject to constraints ensuring that all data points x_n are classified correctly with margin: t_n(w^T φ(x_n) + w_0) >= 1
Lagrange Multipliers
To solve this constrained optimization problem, Lagrange multipliers a_n are introduced for each constraint.
The Lagrangian L(w, w_0, a) is constructed by combining the objective function with the constraints, weighted by the Lagrange multipliers.
The problem transforms into minimizing the Lagrangian with respect to w and w_0, and maximizing it with respect to a_n.
Gradient Equations and Dual Representation
setting the gradients of the Lagrangian with respect to w and w_0 equal to zero, conditions are derived that allow w to be expressed as a sum of the data points x_n, scaled by the product of their corresponding Lagrange multiplier a_n and label t_n
A condition that the sum of the products of the Lagrange multipliers and labels must equal zero is also derived.
This leads to the dual representation which only depends on the Lagrange multipliers a_n, and the problem can be reformulated into maximizing the dual representation L(a)
Kernel Function
The kernel function K(x_n, x_m) represents the inner product of φ(x_n) and φ(x_m) in the feature space, allowing the SVM to work in a higher-dimensional space without explicitly computing the coordinates.
Maximizing the Dual Representation
The dual problem involves maximizing L(a) under the constraints that all a_n are non-negative and the sum of the products of a_n and t_n equals zero.
The dual representation simplifies the problem as it removes the need to work directly with the weight vector w and allows the use of the kernel trick.
Through the Max-Margin problem, SVMs find a hyperplane that not only separates the data into classes but also stays as far away as possible from the closest data points of any class, aiming for a better generalization to new data points.
How can you train a SVM ?
Step 1: Estimate the Bias w_0
Assuming the support vectors are known, estimate the bias w_0 by averaging over the support vectors S: w_0 = (1/N_S) Σ_{n∈S} (t_n - Σ_{m∈S} a_m t_m k(x_n, x_m))
This involves the labels of the support vectors, the Lagrange multipliers, and the kernel evaluations.
Step 2: Get the support vectors, which are defined by a_n > 0
There is no direct closed-form solution for the multipliers a_n
The Lagrange multipliers need to be estimated numerically
Sequential Minimal Optimization (SMO)
SMO is a common method for solving the optimization problem of SVM efficiently.
SVM - Pros and Cons
Define Correlation
Measure of how much two variables are "linearly" related
If one variable tends to go up when the other does, they have a positive correlation.
If one goes up while the other goes down, they have a negative correlation.
If changes in one variable don't consistently relate to changes in the other, they might have no or very little correlation.
Correlation is neither good nor bad; it depends on the application
Pearson correlation coefficient: r = cov(X, Y) / (σ_X σ_Y)
Define Principal Component Analysis
Goal of PCA is to project data onto fewer dimensions while keeping as much variation in the data as possible
Find projection that
maximizes variance
minimizes reprojection error
Find a low-dimensional space such that
when x_n is projected there, "information loss" is minimized
new directions must be uncorrelated which means that the covariance matrix is diagonal
Compute Principal Component
Define Eigenvector
a vector that changes at most by a scalar factor when a linear transformation is applied
a vector whose direction remains unchanged when a linear transformation is applied to it
PCA - How to compute Principal Components
Compute the mean (to center the data)
Compute the covariance matrix (M is a matrix where each row is the mean vector)
Get the eigenvalues and eigenvectors: solve det(C - λI) = 0 for the eigenvalues λ, then (C - λI)v = 0 for the eigenvectors v
Choose the number of components K (in the example, K = 1) and project onto the top-K eigenvectors (see the sketch below)
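A numpy sketch of these steps, using `np.linalg.eigh`, which suits symmetric matrices such as covariance matrices:

```python
import numpy as np

def pca(X, K=1):
    X_centered = X - X.mean(axis=0)         # step 1: center the data
    C = np.cov(X_centered, rowvar=False)    # step 2: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)    # step 3: eigendecomposition
    order = np.argsort(eigvals)[::-1]       # sort by decreasing variance
    components = eigvecs[:, order[:K]]      # step 4: top-K eigenvectors
    return X_centered @ components          # projected data
```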
PCA for high dimensional data
Properties of PCA
PCA is an unsupervised method
It is deterministic, providing the same output for a given input data.
PCA has an analytical solution, which means the principal components are computed through direct computation rather than iterative methods.
involves creating a linear combination of samples to form new principal components.
Critical considerations for PCA include:
Preprocessing of data is crucial for effective PCA.
The number of principal components to retain must be determined.
PCA is not as suitable for very large datasets.
Independent Component Analysis (ICA)
ICA aims to find a set of independent components from the observed data.
Unlike PCA which finds uncorrelated components, ICA finds components that are statistically independent.
used to solve problems like the cocktail party problem, where the goal is to separate mixed signals into their original sources.
PCA vs. ICA
Applications of PCA
PCA is used for data compression and dimensionality reduction.
It serves as an important step in data preprocessing.
PCA aids in creating models for:
Approximating original datasets.
Interpolating data for smoother transitions.
Generating new data samples based on the principal components.
A caution is given that PCA does not inherently understand the semantic meaning of the directions in reduced space.
What are the Principal Components and what property do they have ?
vectors that capture the underlying structure of the data in a dataset after PCA was applied
Each PC represents the direction in the dataset along which the variance is maximized. The first principal component captures the most variance, the second principal component captures the second most…
PCs are orthogonal to each other in the feature space.
PC are constructed as linear combinations of the original features
Each PC is associated with an eigenvalue from the covariance (or correlation) matrix of the data. The eigenvalue measures the amount of variance captured by its corresponding principal component.
Cost Function & Responsibilities of K-Means
Minimizing the K-Means Cost Function / Expectation-Maximization Algorithm
E-step: each data point gets assigned to the nearest cluster center, minimizing the cost function J with respect to the responsibilities r_nk
M-step: once the points are assigned, each cluster center μ_k is recalculated by averaging all the points in its cluster; this minimizes J with respect to μ_k
Local convergence is guaranteed
Global minimum is not guaranteed
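A minimal K-means sketch showing the alternating E and M steps; random initialization from the data points is one common implementation choice:

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]   # initial centers
    for _ in range(iters):
        # E-step: assign each point to its nearest center
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        r = d.argmin(axis=1)
        # M-step: recompute each center as the mean of its points
        # (a robust version would also handle clusters that become empty)
        mu = np.array([X[r == k].mean(axis=0) for k in range(K)])
    return mu, r
```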
Limits of K-means / Alternatives
Hard assignments of data points to clusters —> small shift of data point can flip it to a different cluster
Not clear how to choose value of K
Alternative: replace 'hard' clustering of K-means with "soft" probabilistic assignments
Define Gaussian Mixtures
A function that is made up of several Gaussian (normal) distributions
Each of these distributions represents a cluster within the data,
They are combined (or 'mixed') to model the overall distribution of the data.
Sampling from a Gaussian Mixture
In simulation, the mixture parameters are known
To generate a data point:
Draw a component k with probability p(k)
Draw a sample x from that component's Gaussian
Repeat for each new data point
How to fit Gaussian Mixture Models
Expectation-Step:
Calculate responsibilities
Maximization-Step:
Update the means
Update covariances
Update mixing coefficients
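A sketch of one EM iteration for a GMM, using `scipy.stats.multivariate_normal` for the Gaussian densities; `pis`, `mus`, `Sigmas` are the current mixture parameters:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pis, mus, Sigmas):
    N, K = len(X), len(pis)
    # E-step: responsibilities r[n, k] ∝ pi_k * N(x_n | mu_k, Sigma_k)
    r = np.column_stack([pis[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
                         for k in range(K)])
    r /= r.sum(axis=1, keepdims=True)
    # M-step: update means, covariances, and mixing coefficients
    Nk = r.sum(axis=0)
    mus = (r.T @ X) / Nk[:, None]
    Sigmas = [(r[:, k, None] * (X - mus[k])).T @ (X - mus[k]) / Nk[k]
              for k in range(K)]
    pis = Nk / N
    return pis, mus, Sigmas
```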
Differences between hard clustering and fuzzy clustering
Hard Clustering
Each data point belongs to exactly one cluster.
Clear boundaries between clusters.
Easy interpretation and implementation.
Use when distinct classifications are needed.
Fuzzy Clustering
Data points can belong to multiple clusters, each with a certain degree of membership
Soft boundaries; data points can be shared among clusters.
Use when data exhibits overlap between classes or when the boundaries are not clear.
Different Covariance Shapes for GMMs
Spherical and Diagonal assumptions are computationally less demanding and may prevent overfitting on small datasets but at the cost of model flexibility.
Tied and Full covariance matrices provide more flexibility to capture complex data structures but increase the risk of overfitting and require more computational resources.
Reflect the structure of MLP networks
Contains an input layer, (at least) one hidden layer, and an output layer
A hidden layer processes the inputs x through weights and a bias and applies a non-linear activation function
The output layer computes the final output y using the hidden layer's outputs and its own weights and biases (see the sketch below)
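A numpy sketch of a forward pass through one hidden layer; tanh as the non-linearity and the identity output are arbitrary choices (the identity suits regression):

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    h = np.tanh(W1 @ x + b1)   # hidden layer: weights, bias, non-linearity
    y = W2 @ h + b2            # output layer (identity output for regression)
    return y
```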
How to do regression with a NN ?
How to do binary classification with a NN ?
How to do multiclass classification with a NN ?
Explain the role of MLPs in deep learning
DL utilizes MLPs with multiple hidden layers
MLPs learn a hierarchy of features, with each layer capturing increasingly abstract representations of the data
They can approximate any continuous function, making them versatile for various tasks
MLPs use backpropagation for efficient training, allowing them to learn from data by adjusting weights to minimize error
serve as the basis for more specialized deep learning architectures like CNNs and RNNs
The output y of the network is a composition of functions corresponding to each layer's transformation, including the activation functions h and the layer weights and biases W,b
Explain backpropagation
What are common problems when training deep neural networks ?
Vanishing Gradient
Exploding Gradient
Overfitting/Underfitting
Hyperparameters
Parts of a neuron
What activation functions are there ?
Explain Vanishing Gradient
In deep networks, gradients can become very small
exponentially decreasing as they propagate back through the layers.
makes it hard to update the weights in the earlier layers
Activation functions like ReLU can mitigate this issue, because its derivative does not saturate
Explain exploding gradients
If the gradients are large, their effects can get multiplied through the layers, leading to even larger gradients.
leading to large changes in weights and unstable training
Gradient clipping is a common solution.
Reasons for exploding gradients:
In very deep networks, gradient contributions multiply through the layers; large gradients therefore grow even larger.
If the network's weights are initialized too large, or grow too large during training, the gradients can explode.
Non-saturating functions like ReLU can also contribute, because their derivative (either 0 or 1) does not dampen the signal; during backpropagation this can lead to large gradients if many ReLU units are active at once. (See the clipping sketch below.)
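A sketch of gradient clipping inside a PyTorch training step; `model`, `loss_fn`, and the `max_norm` value are placeholders:

```python
import torch

def training_step(model, optimizer, loss_fn, x, t, max_norm=1.0):
    optimizer.zero_grad()
    loss = loss_fn(model(x), t)
    loss.backward()
    # Rescale gradients so their global norm does not exceed max_norm
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()
```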
Batches in Gradient Descent
Training algorithms for Neural Networks
Explain overfitting
Regularization in Neural Networks
Explain a CNN
Structured for 2D Data: CNNs are specialized neural networks for processing data with grid-like topology (e.g., images).
Layered Architecture: Typically consists of convolutional layers, pooling layers, and fully connected layers.
Convolutional Layers
Utilize filters/kernels to perform convolution operations.
Capture spatial features like edges, patterns, and textures.
Each filter detects different features by sliding over the input image.
Activation Functions
Applied after convolution to introduce non-linearity.
ReLU (Rectified Linear Unit) is commonly used.
Pooling Layers
Reduce spatial dimensions (downsampling).
Make feature detection more robust to small translations and distortions.
Common methods include max pooling
Fully Connected Layers:
Neurons have full connections to all activations in the previous layer.
Integrate learned features from convolutional and pooling layers for classification.
Output Layer
Gives the final prediction, often using a softmax function for classification tasks.
Learnable Parameters
Weights in filters and fully connected layers are learned during training.
Spatial Hierarchy of Features:
Early layers capture low-level features; deeper layers build up to high-level features.
Efficiency
Share weights and use fewer parameters compared to fully connected networks, making them computationally efficient.
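A minimal PyTorch sketch of this layer pattern for, e.g., 28x28 grayscale images; all sizes and the 10-class output are illustrative:

```python
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),  # convolution: spatial features
    nn.ReLU(),                                   # non-linearity
    nn.MaxPool2d(2),                             # downsampling: 28x28 -> 14x14
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 14x14 -> 7x7
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 10),                   # fully connected classifier
)
```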
Explain Padding & Striding
Explain Batch Normalisation (CNN)
Explain Pooling (CNN)
Observe, Explain, Optimize
How to monitor training of NNs ?
- Tracking loss & other metrics
- Inspecting weights, biases and other tensors
- Inspecting representations to some degree (i.e. embeddings)
- Displaying training data
Interpret Training Characteristics
- Reminder: tracking loss and averaged metrics
- Convergence time
- Absolute best loss/metric values
- Relative training behaviour: stability, robustness,...
- Inspecting weights, biases, activations, gradients and other tensors
- Converging/Diverging ?
- Not changing over training time ?
- Sparse ?
- Within a good norm ?
- Metrics are task-specific and not always meaningful
Examples of Hyperparameters
hidden units
number of layers,
activation function,
convolution (stride/filters),
epochs,
learning rate,
batch size,
optimizer,
regularization,
momentum,...
We optimize these by monitoring the validation loss
How to find good meta-parameters?
- Systematic procedure:
- Intuition + Grid Search (most common)
- Random Search (if we have no idea)
- Bayesian Optimization (best in theory)
- Evolutionary Algorithms, Gradient-based
- In practice:
- Intuition first
- then informed Random Search or Bayesian Opt.
- Always iteratively:
Start coarse search to observe behaviour, then increase granularity
Grid vs. Random Search
Grid Search
Tests all possible combinations of the parameters.
Finds the best parameters if they are within the grid.
Time-consuming: Can be very slow, especially with a large number of parameters or if each model takes a long time to train.
Easy to Implement: set up with clear and simple logic.
More suitable when the parameter space is small
Random Search
Samples parameter settings at random for a fixed number of iterations.
Efficiency: Can find a good set of parameters faster than grid search when the parameter space is large.
Less Precision: May miss the exact best parameters but often finds a close approximation with significantly less computation.
More scalable to high-dimensional spaces.
Better suited for when the dataset or parameter space is large.
Both:
Easy to implement and parallelize
Asynchronous and stoppable/pausable at any given time
But non-adaptive: Still beaten by more complex search algorithms
Bayesian approaches are more intelligent (but hard to parallelize & have own hyperparameters)
—> Random search is better than naive grid search (not all hyperparameters are significant)
Bayesian Optimization
Bayesian optimization
Assume that model performance is a smooth function in the space of hyper-parameters:
Impose a Gaussian Process prior
Search for hyper-parameters that have the largest chance of improving on the current results
Tools: BoTorch, Keras Tuner
Challenges in Neural Network Optimisation - SGD
- Additional problems by SGD:
- Ill-conditioned gradients
- Plateaus, Saddle Points
- Inexact Gradients
- Additional problems by deep architectures:
- Cliffs and Exploding Gradients
- Long-Term Dependencies
- Convergence is never guaranteed!
- Use rule of thumb and related experience for training and representation methods
- Making good use of regularisation
Interpretability vs. Explainability
Interpretability
Being able to determine cause and effect from a ML model
Explainability
Knowing what a node represents and its importance to the model's performance
Explaining network parameters and activation - Weights
Inspecting Weight Matrices
Q: What is the relation between (hidden layer) connections?
Approach: Visualise connection strength directly
Difficulties:
- Lack of Contextualization
- Indirect Interaction
- Dimensionality and Scale
Explaining network parameters and activation - Visualize features
Visualizing features by optimization
Start from random noise image
Optimize image to activate particular neuron:
Calculate gradient for increasing neuron responses
Adjust image based on gradient
Objectives
Applicable to unit or layer of interest
Deconvolution
Forward Pass:
input image is passed through the network up to a certain layer.
During this forward pass, all the activations are stored.
These activations represent what filters have responded to in the image.
Select Activation:
To visualize the features that a particular filter has learned to recognize
select a specific activation from a specific filter within the layer of interest
This activation map shows where the filter responded strongly
Reverse Mapping (Deconvolution):
Starting from the selected activation, you work backwards through the network
(this is where the term "deconvolution" is often used, although it's not technically deconvolution in the strict mathematical sense).
map the activations back to the pixel space of the input image to see what part of the image caused the activation. This involves:
Unpooling: Reversing the max-pooling operation by placing the activations back into the location of the maximum values that were recorded during the pooling in the forward pass.
Transposed Convolution: Applying transposed convolution operations using the stored weights from the forward pass. This step aims to reconstruct the image area that would induce the activations in the forward pass.
Iterate Back to Input:
You iterate this process back through the layers of the network until you reach the input layer.
At each layer, you're essentially asking, "What input would have caused this filter to activate in this way?"
Visualization:
The resulting mapped image often highlights the patterns or parts of the original input that the filter is responsive to.
For example, if the filter has learned to recognize edges at a certain orientation, the visualization might show those edges from the input image
Visualising Features via Gradient based Localisation
Gradient-weighted class activation mapping
Attribution of local input importance for class
Attribution / Saliency Map
similar to Grad-CAM
Representation Reduction
How can you identify under- / overfitting?
Underfitting:
High Training Error: The model does not perform well even on the training data.
Simplistic Model: The model is too simple to capture the underlying structure of the data (high bias)
Close Training and Validation Error: Both errors are high, but they are relatively close to each other.
Overfitting:
Low Training Error: The model performs exceptionally well on the training data.
High Validation/Test Error: There is a significant drop in performance on the validation or test dataset compared to the training dataset.
Large Gap Between Errors: There is a substantial gap between training error and validation error, with training error being much lower.
Types of Sequence Learning + Examples
one to many
many to one
many to many
Types of RNNs (superficial)
Simple RNN
Previous activation adds context to the current activation
Examples: Elman network, Jordan Network
Fully Recurrent Neural Network
Often called an auto-associator
Examples: Hopfield Network (binary), Boltzmann Machine (stochastic)
One to Many - Vector to sequence (RNN)
Many to one - Sequence to Vector
Many to Many - Sequence to sequence
Representations
One-hot encoding
Simplest way to represent things in neural networks
one neuron to each concept/feature (Localist Representation)
Easy to understand
Easy to code by hand
used to represent inputs to a net
Easy to learn
Easy to associate with other representations or responses
One-hot encoding in machine learning and natural language processing contexts
localist models are inefficient whenever data has componential structure --> not enough neurons to code all possibilities
Softmax
desirable in classification: the output vector models a probability distribution over the classes
For classification: the network produces raw output scores, trained with a cross-entropy loss
These are normalized with softmax so that the values are in [0,1] and add up to 1
Drawback: Computationally expensive for very large vectors (exp)
Representation Distributed
Using simultaneity to bind things together
Round, yellow fruit: one neuron ?
"Distributed representation" means a many-to-many relationship between two types of representation (such as concepts and neurons)
Each concept is represented by many neurons
Each neuron participates in the representation of many concepts
Example: How to distinguish from representing yellow circle and blue triangle
Word Embeddings
Millions of words: Need distributed representations!
Approach: Learning word embeddings:
Map words to continuous, lower dimensional vectors
Captures word meaning in the semantic space
Resulting word vector should contain linguistic context information, relating it to other words
Preprocessing Sequences in a Nutshell (Text, Speech/Sound)
Text
Consider character vs. word level
Cleaning: special characters, capitals
Stemming, e.g. PorterStemmer
Tokenization:
Character/word into atomic units
Build vocabulary over all units
Speech
Basic format: RAW, WAV, PCM signal
sampling frequency
bit depth
Conversions
e.g. STFT
Learning with backpropagation through time
Unfolding the network over time yields a deep feedforward network (in the example: 3 steps)
It is then trained with standard backpropagation
Vanishing/Exploding Gradient!
- How To Tackle Vanishing Gradient Problem
First RNN Constraint: Avoid Error Multiplication - LSTM
(Gating)
Activation of LSTM
Gated Recurrent Unit (GRU)
merges the cell state and hidden state
resulting in a more efficient model with fewer parameters than LSTMs
Second RNN Constraint - Multiplicity of Time
Third RNN constraint: No training in hidden layers
Define Markov Chains
mathematical system
hop from one "state" (a situation or set of values) to another.
Different states in a state space
probability of hopping from one state to any other state
Markov Chain gives: a transition matrix with the probability of moving from each state to every other state
Examples of states in Markov Chains
Hidden Markov Models
3 Main Problems of HMM
Evaluate with Forward Algorithm
Calculate how likely a sequence of observations is given a specific HMM.
Decode with Viterbi Algorithm
Determine the most likely sequence of hidden states that produced the observed sequence.
Learn with Baum-Welch Algorithm
Find the HMM parameters that maximize the probability of the observed sequence.
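A sketch of the forward algorithm for the evaluation problem, assuming a transition matrix `A`, emission matrix `B`, and initial distribution `pi` are given:

```python
import numpy as np

def forward(obs, A, B, pi):
    # obs: observation indices; A[i, j]: P(state j | state i)
    # B[j, o]: P(observation o | state j); pi[j]: P(initial state j)
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # recursion: sum over previous states
    # (a practical version would rescale alpha to avoid underflow)
    return alpha.sum()                  # P(observation sequence | model)
```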
What to do with Continuous Latent Variables ?
Discretise continuous data
vector quantization
Speech: phonemes
Visual phonemes: visemes
Define Embeddings
transform categorical, discrete, or high-dimensional data into continuous vectors of much lower dimensionality
which are then used in an ML model
Manifolds
Define Word-Embeddings
Latent Semantic Analysis - How to derive word embeddings?
Word2Vec
GloVe
Data Augmentation
General Idea: replace empirical distribution with smoothed distribution
Approach: build automated augmentation in data loader
Transfer Learning
machine learning technique where a model developed for one task is reused as the starting point for a model on a second task
Transfer Learning vs. Fine-tuning - Different approaches
Fine - Tuning
Self-Supervised Pre-Training
Contrastive Learning: CLIP
What does it mean to freeze weights and layers ?
prevent the updating of the parameters (weights and biases) of those layers during training.
done when applying transfer learning, where a pre-trained model is used as a starting point, and only some layers are fine-tuned for a new task.
ensures that the learned features from the pre-trained model are preserved while only the unfrozen layers are allowed to adjust and adapt to the new data.
Sequence Learning with 1D CNNs
sequences are fed into the network as a sliding window with a fixed width.
the network looks at a fixed number of elements at a time
in the example, each word is represented by a 6-dimensional vector
Convolution applies a filter or kernel to extract features from the sequence,
Pooling reduces the dimensionality and captures the most salient features.
The output is a sentiment polarity, which is categorized into two classes after being processed through a fully-connected layer
What is the alignment problem in seq-to-seq learning (Example: Problem in Machine Translation)
the challenge of determining which words in the source language correspond to which words in the target language
involves aligning elements of two languages that have different structures and word order
Define the attention mechanism for seq 2 seq taks
allows the model to focus on different parts of the input sequence
when generating each part of the output sequence,
thereby improving the alignment between input and output elements in tasks like translation.
What are the attention mechanism basics ?
Keys: elements from the input sequence
Query: current element being processed in the output sequence
Values: representations from the input sequence that are used to construct the output
mathematically, the output is a weighted sum: output = Σ_i α_i v_i
For each query, the attention function computes a set of attention weights (α_i),
which are then used to create a weighted sum of the values (v_i).
The attention weights (α_i) are computed using a score function that measures how well each key corresponds to the current query.
done by using a softmax function to ensure the weights sum up to 1, giving a probability-like distribution over the keys.
The score function can be a separate feedforward neural network that is trained jointly with the rest of the model.
What is self-attention ?
weighs the importance of different parts of the input when processing each word (or token) within the same sequence.
The goal is to learn attention weights that best relate each word to its context within the sequence.
Soft Attention vs Hard Attention
Hard Attention:
Selects specific parts of the input data to focus on and ignores the rest completely
non-differentiable, meaning it doesn't allow the use of standard backpropagation methods for training
better performance because it's more focused
more challenging to train due to the need for alternative methods like reinforcement learning or Monte Carlo methods.
Soft Attention:
Weights all parts of the input data to varying degrees without completely ignoring any part.
differentiable, which allows the model to be trained using gradient descent.
Global Attention vs. Local Attention
Global attention
Considers all input positions when computing the alignment scores
Expensive
Local attention
Attends only to a window of input positions; a practical tradeoff between soft and hard attention
Alignment can be monotonic or predictive
Contextual Word Embeddings
Traditional Word Embeddings
context-free, meaning each word is given the same representation regardless of its meaning in context.
For example, "bank" would have the same vector representation in both "bank account" and "bank of a river."
Embeddings from Language Model (ELMo):
provides deep contextualized word representations.
considers the entire sentence to determine each word's embedding
uses a bidirectional LSTM (Long Short-Term Memory) language model
processes text both from left to right and right to left, capturing information from the entire sentence.
context in which a word is used (its syntactic and semantic characteristics) influences the word's embedding.
allows to model complex characteristics of word use and how these uses vary across different linguistic contexts
Architectures for Sequence Processing
Define Transformers
neural network architecture that rely on self-attention mechanisms
to process sequential data in parallel and capture dependencies,
without relying on recurrent layers like in RNNs.
Architecture:
stacked encoders and decoders
encoders process the input sequence in parallel to produce a representation
decoders generate the output sequence from this representation
also using self-attention and attending to the encoder's output.
Define Positional Encoding
a vector that represents the position of words in a sequence to provide the model with information about the order of words.
needed to maintain the sequence information (word-order) which is vital for understanding language structure and meaning
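A sketch of the sinusoidal positional encoding from the original Transformer paper; this is one common choice, and learned position embeddings are an alternative (assumes an even `d_model`):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe                      # added to the token embeddings
```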
Define Multi-Head Attention
core feature of the Transformer model.
runs several attention processes in parallel (the 'heads'),
allowing the model to focus on different parts of the input sequence and capture various aspects of the information.
Scaled Dot-Product Attention:
Each attention head performs a scaled dot-product attention
involves calculating the dot product of the query with all keys
scaling the result by the square root of the dimension of the keys
applying a softmax to obtain weights on the values, and then producing an output.
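A numpy sketch of one head's scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V:

```python
import numpy as np

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # dot products of queries and keys, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                # weighted sum of the values
```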
Why is Multi-head Attention effective ?
- No recurrence needed
- Self-attention:
- Connects embedding & positional information
- Multiple heads:
- Learn different type of relations: structure-semantic
- Example: next-word, verb, subject
How to train/optimise Transformers ?
Unsupervised pre-training
Related to word2vec: learn embeddings in the (lower) transformer blocks
Typical tasks: language modelling or sentence prediction on an unsupervised corpus U, maximising the likelihood
Supervised fine-tuning
Continue learning on downstream task (possibly fix k lower blocks)
Crucial modification: adapt token representation
Often the only feasible step for normal labs (with no massive TPU cluster)
How to use Transformers for Computer Vision
The image is chopped into 16x16 patches, instead of filtering the whole image (as in a CNN)
Position embeddings are added, and relationships at different levels are learned
Particular strength: emerging attention maps on different levels
Define BERT
Bidirectional Encoder Representations from Transformers
Transformer-based architecture, focus on encoder blocks
pre-trains on a large corpus to learn bidirectional representations of text
Contextual model
can then be fine-tuned for a variety of language tasks.
Fine tuning BERT on Different Tasks
Define GPT
Generative Pre-Training
uses transformer architecture, trained on a large corpus of text in an unsupervised manner
to generate human-like text by predicting the next word in a sequence given the words that come before it
Focus on transformer-decoder blocks
autoregressive, meaning they predict the next word based on the sequence of all previous words
Model Architecture:
consist of multiple transformer-decoder layers (12x to 48x indicates the number of layers; GPT-3 has 96 layers and 175 billion parameters).
They use masked self-attention, where each position can attend to all positions up to and including itself during training.
Describe a Text-to-Text Transformer
Focus on training data selection
massive SuperGLUE benchmarks (supervised)
randomly corrupted tokens (unsupervised)
Contextual model (Transformer based)
treats all tasks as text-to-text problems, where input text is transformed into output text using a seq2seq Transformer model
Describe ChatGPT training process
Strength and Limitations of Transformers
Transformers are sophisticated pattern matching machines
Best performing embedding for many downstream tasks
Continuous processing, Parallelisation, long memory
Currently best performances on many NLP and CV problems
Successfully deployed in Google Search,...
Criticism is vast:
(Research) competition only possible for big-tech companies
Computationally expensive training
Need for vast training data
Only works with such vast data
Does not "understand" natural language
Training data is full of bias
“GPT-3 is a better bullshit artist than its predecessor, but it's still a bullshit artist.” – Gary Marcus
“Focusing on raw computing power misses the point entirely […] We don't know how to make a machine really intelligent - even if it were the size of the universe.“- Stuart Russell
What is the key component of a transformer ?
self-attention mechanism
allows the model to weigh the significance of different parts of the input data differently and is crucial for capturing the context within sequences.
Define Inference
General
“Inference is a conclusion that you draw about something by using information that you already have about it.”
compute the probability distribution over one set of variables given another
ML Context
inference often refers to the process of estimating or concluding about the posterior distribution of a latent variable Z, given observed data X.
Motivation behind approximation
What are variational methods ?
The objective is to identify a function that achieves a specific goal, such as minimizing a cost function or maximizing entropy.
What is Kullback-Leibler Divergence ?
measure of how one probability distribution diverges from a second, expected probability distribution
used for variational inference
Goal: to approximate the true posterior distribution `p(Z|X)` with a simpler distribution `q(Z)`
Example: given two Gaussian Probability Density Functions (PDFs), p(x) and q(x), representing two different distributions.
In variational inference, p(x) could represent the true distribution of the data, while q(x) represents the approximating variational distribution.
The KL Divergence integrates the discrepancy between the two distributions across the range of values,
quantifying the difference between them.
The divergence is calculated using the formula: KL(p||q) = ∫ p(x) log(p(x) / q(x)) dx
Goal of Kullback Leibler Divergence
Asymmetry of KL
min KL(p||q)
represents minimizing the KL Divergence where the true posterior (p) is the first argument.
known as the "forward" or "inclusive" KL Divergence.
tends to produce an estimate (q) that covers all the modes of the true posterior but may not capture them accurately;
it avoids assigning zero probability to areas where the true distribution has mass, even at the cost of placing mass where the true distribution has little
min KL(q||p)
represents minimizing the KL Divergence where the estimate (q) is the first argument.
known as the "reverse" or "exclusive" KL Divergence.
tends to produce an estimate that captures the mode of the true distribution very accurately but may ignore other modes;
ensures that all the probability mass of the estimated distribution is placed where the true distribution has its probability mass.
choice between minimizing KL(p||q) versus KL(q||p) has a significant impact on the behavior of the estimation process
affects the approximation of the true posterior distribution and can lead to different estimates
which might be more suitable for different applications depending on the desired outcome
(e.g., capturing all modes versus focusing on the most significant mode)
Properties of KL divergence
Positive (or zero): always greater than or equal to zero
Monotone: KL Divergence does not decrease as the probability mass function of q moves away from p.
Additive for Independent Distributions: If p and q represent independent distributions, then the divergence of their product is the sum of their divergences.
Not Symmetric: KL Divergence is not symmetric
Sensitive to Change of Scale: The measure changes if the scale of the probability distributions changes.
Integral or Sum Form: KL Divergence can be expressed as an integral for continuous distributions or a sum for discrete distributions.
Quantifies Information Gain: KL Divergence measures the amount of information gained by transitioning from one distribution to another, often interpreted in the context of how much information is lost when a distribution q is used to approximate another distribution p.
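A small sketch that makes the positivity and asymmetry properties concrete for discrete distributions:

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL(p||q) = sum_i p_i * log(p_i / q_i); terms with p_i = 0 contribute 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p, q = [0.5, 0.4, 0.1], [0.3, 0.3, 0.4]
print(kl_divergence(p, q))  # >= 0 (positivity)
print(kl_divergence(q, p))  # a different value: KL is not symmetric
```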
Understand Expectation Notation
Describe Evidence Lower Bound (ELBO)
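The card text is missing here; the standard decomposition (for observed data X, latents Z, and variational distribution q) is:

$$\log p(X) = \underbrace{\mathbb{E}_{q(Z)}[\log p(X,Z)] - \mathbb{E}_{q(Z)}[\log q(Z)]}_{\text{ELBO}\ \mathcal{L}(q)} + \mathrm{KL}\big(q(Z)\,\|\,p(Z \mid X)\big)$$

Since the KL term is non-negative, the ELBO lower-bounds the log evidence; maximizing it over q simultaneously tightens the bound and pushes q toward the true posterior.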
Optimization Goals for approximating distributions / variational Inference
What is the Mean Field Theory ?
simplifies the problem by assuming that the unknown distribution can be factorized into M disjoint groups,
which means that the distribution over the latent variables Z is approximated by a product of independent distributions q_i(Z_i) for each group of variables
Steps for Applying Mean Field Theory
Identify the True Distribution: Begin with the given distribution p(z) that we wish to approximate.
Assume Independence: Postulate that the complex distribution can be represented as a product of simpler, independent distributions for each variable.
Factorize the Distribution: Construct a factorized distribution q(z) as the product of the individual independent distributions for each variable.
Measure Divergence: Apply KL Divergence to assess the deviation of the factorized distribution q(z) from the true distribution p(z).
Iterate for Better Approximation: Modify the factorized distribution by iteratively considering the dependencies, which can lead to a closer approximation of the true distribution.
Attention
Known Distribution: Initially, only the true distribution p(z) is known, and the goal is to approximate it.
No Gaussian Assumption for q: The approximation does not need to assume that q is Gaussian; this is an outcome of the approximation, not a precondition.
Coupled Estimates: The estimates for the individual distributions in q(z) are interconnected, requiring iterative adjustments to refine the estimates and improve the approximation.
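The "coupled estimates" point corresponds to the standard coordinate-ascent result for the mean-field family (stated here from the general variational-inference literature, not from the card itself):

$$q(Z) = \prod_{i=1}^{M} q_i(Z_i), \qquad \log q_j^{*}(Z_j) = \mathbb{E}_{i \neq j}\left[\log p(X, Z)\right] + \text{const}$$

Each optimal factor q_j* depends on expectations taken under all the other factors, which is exactly why the updates must be iterated until convergence.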
Steps in Variational Inference
Define Autoencoder
neural network model used for unsupervised learning
aiming to learn a compressed representation of the input data
by approximating an input x with a reconstruction r
Explain the Autoencoder components
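A minimal PyTorch-style sketch of the two components (layer sizes are illustrative):

```python
import torch.nn as nn

class Autoencoder(nn.Module):
    """Encoder compresses x to a latent code z; decoder reconstructs r from z."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim))

    def forward(self, x):
        z = self.encoder(x)        # bottleneck: the compressed representation
        return self.decoder(z)     # reconstruction r, trained so that r ≈ x
```

Training minimizes a reconstruction loss such as the MSE between r and x.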
Why care about latent space ?
Compact representation
Dimensionality reduction
"finding simpler representations"
information retrieval
Ideally approximates the real distribution of the observed data
When is an Autoencoder equal to PCA ?
Variants of Autoencoders
Applications of Autoencoders
Feature extraction
Denoising
Inpainting
Segmentation
Define Variational Autoencoder
A variational autoencoder is a generative autoencoder: it replaces the deterministic latent code z with a stochastic sampling operation
Directed model that approximates inference, i.e. distribution in latent space
Process of Variational Autoencoder
- What is the reparameterization trick ?
the stochasticity is separated from the parameters
enabling the use of backpropagation through deterministic computations.
- Why need the reparameterization trick ?
Problem: backpropagation through the sampling node, which is the difficulty of passing deterministic gradients through a stochastic sampling process.
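A minimal sketch of the usual Gaussian reparameterization (assuming the encoder outputs a mean and a log-variance):

```python
import torch

def reparameterize(mu, logvar):
    """z = mu + sigma * eps with eps ~ N(0, I): the randomness is moved into eps."""
    std = torch.exp(0.5 * logvar)    # sigma, computed from log-variance for stability
    eps = torch.randn_like(std)      # stochastic node, independent of the parameters
    return mu + eps * std            # deterministic in mu/std, so gradients flow through
```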
How to train a variational autoencoder ?
Define encoder and decoder e.g. depth
Feed image(s) through encoder
Sampling
Compute loss
Update parameters
repeat
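A hedged sketch of one training step matching the list above; an `encoder` returning (mu, logvar) and a `decoder` mapping z to a reconstruction are assumed to exist:

```python
import torch
import torch.nn.functional as F

def vae_train_step(encoder, decoder, optimizer, x):
    mu, logvar = encoder(x)                         # feed image(s) through encoder
    std = torch.exp(0.5 * logvar)
    z = mu + std * torch.randn_like(std)            # sampling (reparameterized)
    r = decoder(z)
    recon = F.mse_loss(r, x, reduction="sum")       # reconstruction term
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL to N(0, I)
    loss = recon + kl                               # compute loss (negative ELBO)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                # update parameters; then repeat
    return loss.item()
```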
Applications of VAE
Image Generation: VAEs can generate new images that resemble a training dataset, which is useful in art creation, design, and entertainment.
Anomaly Detection: By learning to represent normal data, VAEs can identify anomalies or outliers in datasets, which is valuable in fields like fraud detection or fault diagnosis.
Drug Discovery: VAEs help in generating molecular structures by learning the distribution of molecular data, thereby aiding in the discovery of new drug candidates.
Feature Extraction and Dimensionality Reduction: VAEs are used to learn lower-dimensional representations of data, which can serve as feature vectors for other machine learning tasks.
Semi-Supervised Learning: They can be employed in scenarios where only a small subset of data is labeled, leveraging the unlabeled data to improve learning efficiency.
Reinforcement Learning: In reinforcement learning, VAEs can learn to encode states and rewards, assisting in the creation of more efficient and generalizable policies.
Text Generation: VAEs can also be adapted for generating coherent and diverse text, and are used in natural language processing for tasks like dialogue generation and machine translation.
Speech Synthesis: They are used in generating human-like speech from text or other forms of data, which is useful in virtual assistants and other speech-based interfaces.
Style Transfer: VAEs can learn the style of one dataset and apply it to another, which is popular in image and video editing applications.
Interpolation: Because VAEs learn smooth latent representations, they can interpolate between data points to create transitions, such as morphing one image into another.
What is the ß-VAE
Idea: Learn disentangled representations by weighting the KL divergence term in the loss function by a factor β.
By manipulating β, the model can be tuned to focus more on the latent space factorization (β>1)
or on reconstruction fidelity (β=1).
Example (figure on the slide): faces generated by β-VAE vs. a standard VAE.
With a high β (250), β-VAE learns to disentangle the rotation of the face images, meaning that changes in a single latent variable correspond to changes in rotation only.
In contrast, the standard VAE with β=1 does not disentangle features as clearly, resulting in changes in identity and expression as well.
VAEs can yield blurry reconstructions
a higher value of β can lead to even blurrier images
as the model prioritizes learning disentangled representations over reconstructing high-fidelity images.
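The weighted objective described in the "Idea" above (standard β-VAE form):

$$\mathcal{L}_{\beta} = \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - \beta\,\mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)$$

with β = 1 recovering the standard VAE objective.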
VAEs with constraints
The loss function for these VAEs includes:
the standard reconstruction loss (L_reconstruct), which ensures the output closely resembles the input.
a prior term over the latent variables, which encourages their distribution to match some predefined prior.
a flow term, possibly from normalizing flows, which are a method for improving the flexibility of the approximate posterior distribution in VAEs.
Facial Expression Manipulation Task:
practical application of this constrained VAE framework
Loss Function Terms:
The terms λ1 and λ2 are hyperparameters that weigh the contribution of the respective terms in the loss function
allowing for balancing between accurate reconstruction, adherence to the prior, and the additional flow constraint.
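One plausible reading of the combined objective described above (the exact formula on the slide may differ):

$$\mathcal{L} = \mathcal{L}_{\text{reconstruct}} + \lambda_1\,\mathcal{L}_{\text{prior}} + \lambda_2\,\mathcal{L}_{\text{flow}}$$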
What can generative models be used for ?
Density estimation
Outlier detection
Synthetic data
Data augmentation
Missing data
Semi-supervised learning
De-bias dataset
Latent spaces
Lower dimensional representation
interpolation
Multi-modal outputs
Simulate possible futures for planning or for simulation in RL
What generative models do you know ?
Gaussian Mixture Models
Variational Autoencoders
Define GAN
GANs consist of two neural networks that compete against each other in a game theoretic scenario:
Discriminator (D)
distinguishes between real data samples and fake data samples generated by the Generator.
trained to output the probability that a given sample is real.
Generator (G)
This network generates new data samples from random noise input z, which follows a probability distribution p_z(z)
usually a Gaussian distribution N(0,I) where I is the identity matrix.
Data flow in a GAN
Real data samples are fed to the Discriminator
The Generator takes in random noise and outputs fake samples
The discriminator then assesses both real and fake samples and assigns each a probability of being real rather than fake
Objective Function of GAN
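The card text is missing here; the standard minimax objective from the original GAN formulation is:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

D tries to push D(x) toward 1 on real data and D(G(z)) toward 0 on fakes; G tries the opposite.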
Why does VAE yield blurry images but GANs do not ?
VAEs
VAEs learn an "explicit" distribution of the data,
which involves directly modeling the probability density of the latent space
It has two objectives: minimizing the reconstruction error and staying close to a predefined latent distribution
when adherence to the prior is weighted heavily (e.g., via the β parameter), it conflicts with learning an exact reconstruction
—> blurry images
Generative Adversarial Networks (GANs):
GANs learn an "implicit" distribution of the data.
There is no explicit density estimation, which means GANs can focus more on producing sharp, high-quality images
because they are not penalized for deviations in an explicit probability space, unlike VAEs
Bernoulli - GANs
How to train a GAN ?
Training iterations: The training process involves several iterations.
Training the Discriminator:
A minibatch of noise samples is drawn from a noise prior p(z) and passed to the generator.
The generator produces a minibatch of fake data samples from the noise samples
A minibatch of real data samples is drawn from the data generating distribution p_data(x).
Update the discriminator by ascending its stochastic gradient: This step involves feeding both real and fake data into the discriminator and adjusting its parameters to maximize the probability of correctly classifying real and fake data.
Training the Generator:
Again, a minibatch of noise samples is drawn from the noise prior p(z).
The generator uses this noise to produce a minibatch of fake data samples.
These fake data samples are then passed to the discriminator, which classifies them as real or fake.
The generator is updated by descending its stochastic gradient: the feedback from the discriminator (the probability of the fake data being real) is used to update the generator's weights, encouraging it to produce data that the discriminator will classify as real.
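A hedged PyTorch-style sketch of one iteration of the procedure above; G, D, their optimizers, and a batch of real data are assumed to exist, and D is assumed to output a probability in [0, 1]. Note it uses the common non-saturating generator loss (maximize log D(G(z))) rather than literally descending log(1 − D(G(z))):

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_D, opt_G, real, noise_dim=100):
    b = real.size(0)
    ones, zeros = torch.ones(b, 1), torch.zeros(b, 1)

    # Discriminator: ascend log D(x) + log(1 - D(G(z)))
    fake = G(torch.randn(b, noise_dim)).detach()   # block gradients into G here
    loss_D = F.binary_cross_entropy(D(real), ones) + \
             F.binary_cross_entropy(D(fake), zeros)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator: make D classify fresh fakes as real
    loss_G = F.binary_cross_entropy(D(G(torch.randn(b, noise_dim))), ones)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```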
Iteration Example
Leftmost Panel: The initial state where the discriminator can easily distinguish between the data (solid blue line) and the generator's output (dotted green line).
Second Panel: The generator improves, and its distribution starts to overlap with the data distribution.
Third Panel: The generator gets even better, further overlapping with the data distribution, making it harder for the discriminator to differentiate.
Rightmost Panel: The discriminator's decision boundary (D(x) = 0.5) is shown where it is now uncertain about half of the generated data, indicating the generator has improved significantly.
Advantages and Disadvantages of GANs
Loss function:
Competition between networks is sole training criterion
unsupervised
No approximation of inference needed
Can represent sharp, even degenerate distributions
Interpolation in latent space = interpolation between samples
If max V(G,D) is convex, procedure is guaranteed to converge
Disadvantages
No explicit representation of distribution
Only generator, no encoder, hence unknown latent code
little data typically leads to discriminator overfitting
Difficult to train
Mode collapse
No established quality metric (evaluation is difficult)
Quality Metrics for GANs
Variants of GANs
Embedding Process in GANs (Image to latent code)
Define Diffusion models
latent variable models
work with latent variables instead of observed variables
use the latent variables to generate data that resembles the original data
Dimensionality of latent variables
have the same size and shape as the data you're working with.
When generating images, latent variables will be structured like images
Training of Diffusion Models
Forward Process (Diffusion)
model starts with your data (e.g. an image) and gradually adds random noise to it, step by step, until the data turns into pure noise
Reverse Process (Reconstruction)
the model learns how to reverse the process, essentially learning how to remove the noise step by step to reveal the original picture.
This is the actual generation process — it's like having a noisy image and cleaning it up to get a clear picture.
Training the Model
to train this model, you don't need to know exactly how to go from the noisy image back to the original one directly.
Instead, you use something called the variational lower bound, a technique from statistics that helps you estimate the reverse process.
It's like guessing the steps needed to clean up the foggy picture without knowing the exact way it was fogged up.
Explain the Forward Process in Diffusion Models
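The card text is missing here; the forward process has a well-known closed form in DDPM: x_t = √(ᾱ_t)·x_0 + √(1 − ᾱ_t)·ε with ε ~ N(0, I). A small sketch under that assumption:

```python
import torch

def forward_diffuse(x0, t, betas):
    """Sample x_t ~ q(x_t | x_0) directly, without iterating through every step."""
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t]   # cumulative product of (1 - beta)
    eps = torch.randn_like(x0)                         # the noise the model later learns to predict
    xt = alpha_bar.sqrt() * x0 + (1.0 - alpha_bar).sqrt() * eps
    return xt, eps
```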
Explain the Reverse Process in Diffusion models
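The card text is missing here; in the standard DDPM parameterization, each learned reverse step is a Gaussian:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big)$$

with μ_θ usually expressed through a network ε_θ(x_t, t) trained to predict the noise that was added in the forward process.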
Extensions of Diffusion models
Diffusion Probabilistic Models (2015): The foundational concept of diffusion models was introduced, setting the stage for subsequent developments.
Denoising Diffusion Probabilistic Models (DDPM, 2020): A specific type of diffusion model that focuses on generating images by denoising, but it has the drawback of being slow in image generation.
Variational Diffusion Models (VDM, 2021): These models were introduced to speed up the optimization process. They prioritize optimizing the likelihood of data over the quality of the samples generated, aiming for faster training times.
Denoising Diffusion Implicit Models (DDIM, 2021): A variation of DDPM that uses non-Markovian diffusion processes (meaning the process does not strictly follow a memoryless Markov property) while retaining the same training objectives. DDIM models are significantly faster (10 to 50 times) than DDPM and allow for semantically meaningful interpolations in latent space.
Latent Diffusion Model (LDM, 2022): This model performs the diffusion process in a compressed latent space rather than in the pixel space, which can be computationally more efficient and can potentially generate higher quality images.
Classifier Guided Diffusion: An extension of diffusion models where guidance is provided by incorporating knowledge about different classes into the diffusion process. This can help in generating images that are more aligned with specific classes, enhancing control over the generation process.
Which component in a diffusion model is a Markov Chain ?
Forward Process as a Markov Chain:
each step of adding noise to the data depends only on the state of the data from the immediate previous step.
gradually transforms the data into a pure noise state over a series of steps.
In each step, the future state X_{t+1} is conditionally independent of past states given the current state X_t,
which is the defining characteristic of a Markov chain.
Reverse Process as a Markov Chain:
each denoising step depends only on the state of the noisy data from the immediate previous step.
The model learns to reverse the noise-adding process by predicting the clean data state at each step, conditioned only on the current noisy state.
The goal is to ultimately reconstruct the original data from the noise.
What are elements of a Graph ?
Vertices or Nodes: points on a graph where lines intersect or end.
Edges and Directions: lines that connect the vertices. They can have directions, which means they point from one vertex to another.
Universal/Global Attributes: These are the characteristics or properties that apply to the entire graph.
Vector Representation:
each of the elements mentioned above (nodes, edges, global attributes) can be represented by a feature vector
What are different types of Graphs ?
Euclidean Graph
structured graph where the positions of the vertices are fixed and regularly spaced, much like a grid. (structure and neighbourhood in euclidean space)
In Euclidean space, the placement of vertices and edges is based on geometry, which means they have a consistent, ordered arrangement. (Geometric alignment)
Arbitrary Graph (Non-Euclidean)
does not have a regular structure.
vertices and edges are arranged in a complex pattern and can be irregular, meaning they don’t follow a predictable spacing or pattern.
Relationships between nodes in this graph can be non-linear, which means they don’t just move in straight lines or predictable curves.
Data from a Graph perspective
Data
represented by a graph with nodes (or vertices) connected by edges
nodes are colored, suggesting they have values or attributes associated with them.
Signal
represents the attributes or features of the nodes.
shown as separate nodes with colors, but no connecting edges.
like having different pieces of information or data points without yet understanding how they are connected.
Structure
the connections between nodes
it represents only the relationships or how each piece of data is related to another.
the framework or scaffold without the specific details (features) filled in.
—-
When combined, the signal and structure make up the data in the graph.
data in a graph is made up of the individual pieces of information (signal) and the way those pieces are connected or related (structure).
Both are crucial for algorithms to learn patterns and make predictions.
Graph structures
Image Pixels
Each square is a pixel with coordinates like (0-0, 0-1, etc.).
The color intensity can represent different features such as brightness or color in the image.
Graph
Image pixels can be converted into a graph format.
Each pixel is now a node (or vertex) in the graph, and the nodes are connected with edges.
The edges might represent the adjacency or relationship between pixels, such as proximity in the image.
Certain nodes are highlighted, indicating special features or importance, like a key part of the image that may be the focus of analysis.
Adjacency Matrix
a square matrix used to represent a finite graph.
elements of the matrix indicate whether pairs of vertices are adjacent or not in the graph.
a blue square indicates an edge between two pixels (nodes), and a white square indicates no edge.
provides a numerical way to encode the structure of the graph, which can be processed by algorithms.
What is an adjacency matrix ?
table that shows which nodes (points in the graph) are connected to which. The matrix has rows and columns labeled by graph vertices.
can be "sparse" which means that most connections are absent (most of the matrix is zero),
can be "banded" indicating that only adjacent nodes are connected (so non-zero values appear in a band or stripe pattern).
can also represent the "weight" of the connections, meaning it shows not just if two nodes are connected, but also how strong or important that connection is.
What is an Adjacency List ?
a more "condensed" way of representing a graph that takes up less space when the graph has fewer connections.
for each node, you have a sublist of other nodes that it's connected to.
It's easy to add new nodes to this representation because you just add another list.
The weights of the relationships can be included as a third value in the sublists.
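A small sketch contrasting the two representations for a toy undirected graph:

```python
import numpy as np

edges = [(0, 1), (0, 2), (2, 3)]   # toy undirected graph on 4 nodes
n = 4

# Adjacency matrix: A[i, j] = 1 iff nodes i and j are connected (unweighted here)
A = np.zeros((n, n), dtype=int)
for i, j in edges:
    A[i, j] = A[j, i] = 1

# Adjacency list: per node, the list of its neighbours (weights could be added as pairs)
adj = {i: [] for i in range(n)}
for i, j in edges:
    adj[i].append(j)
    adj[j].append(i)

print(A)     # mostly zeros -> "sparse"
print(adj)   # {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
```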
Geometric Learning Principles: Need Geometric Priors
principles of geometric learning, specifically the need for geometric priors, which are assumptions or inbuilt knowledge about geometry in the learning models.
Geometric Learning for Learning Stable Representations
Graph Neural Networks
GNN - Basic Example
Tasks for GNNs
Convolutional on Graphs
Convolutional Graph Neural Networks
Pooling on Graphs
Convolutional GNNs or Graph Convolution Networks (GCN)
Graph AutoEncoder
Message passing in Graph networks
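The card text is missing here; one common instantiation is GCN-style mean aggregation, sketched below in NumPy (names and activation choice are illustrative):

```python
import numpy as np

def message_passing_layer(H, A, W):
    """One step: each node averages its neighbours' features, then transforms them.

    H: (n, d) node features, A: (n, n) adjacency matrix, W: (d, d_out) learned weights.
    """
    A_hat = A + np.eye(A.shape[0])          # add self-loops so a node keeps its own signal
    deg = A_hat.sum(axis=1, keepdims=True)
    messages = (A_hat @ H) / deg            # aggregate: mean over the neighbourhood
    return np.maximum(messages @ W, 0.0)    # update: linear transform + ReLU
```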
Define Domain Adaptation
In many cases, source domain and target domain are different
Example: graphical images vs. real photos
Example: Movie reviews vs. product reviews
Goal: Fit model to source domain, then modify parameters to be compatible with the target domain —> domain adaptation
Definition: Vast but task-specific field of how a classifier can learn from a source domain and generalize to a target domain
- Define Multi-Task Learning
simultaneously train one model on different special target domains (tasks)
- How do Multi-Task Learning models work ?
Learning with Multiple Heads
model has multiple outputs, each corresponding to a different task
The same underlying model (with shared parameters) is used
it can handle different types of input and output, depending on the task
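A minimal sketch of the multi-headed design (shared trunk, one head per task; all names and sizes illustrative):

```python
import torch.nn as nn

class MultiHeadModel(nn.Module):
    """Shared parameters in the trunk; task-specific parameters in the heads."""
    def __init__(self, input_dim, hidden_dim, task_output_dims):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(hidden_dim, d) for d in task_output_dims)

    def forward(self, x, task_id):
        shared = self.trunk(x)              # common representation for all tasks
        return self.heads[task_id](shared)  # output for the requested task
```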
Key Questions
Conditioning the Model for Individual Tasks
How do we design the model so that it can handle different tasks effectively?
Should we have completely separate parts with no shared parameters,
or should we use a 'multi-headed' approach where there is a common core followed by task-specific layers?
Forming the Objective
How should the goal of the learning process be defined when there are multiple tasks?
Should we prioritize some tasks over others (weighted),
or should the model learn how to balance the tasks on its own (dynamically)?
Optimizing the Objective
How do we adjust the model to meet this objective?
One approach mentioned is to use mini-batches, which are small subsets of data.
We can either sample a mini-batch of different tasks
or a mini-batch of datapoints for each task during the training process.
Multi-task Learning Challenges
Negative Transfer
learning one task adversely affects the performance on another.
Instead of the knowledge from one task helping another, it hinders it.
Cross-task interference:
interference between tasks can lead to negative transfer
the model incorrectly applies what it has learned from one task to another.
Meta-parameter differences
Different tasks may require different settings for their meta-parameters (like learning rate, number of layers, etc.)
finding a set that works for all tasks can be challenging.
Limited representational capacity
The model might not have enough capacity to learn all the tasks effectively, which can lead to a decrease in performance.
Small Data Sizes & Overfitting
model learns the details and noise in the training data to an extent that it negatively impacts the model's performance on new data.
sharing more data among tasks can act as a form of regularization, which helps to prevent overfitting by encouraging the model to learn more general patterns that apply across tasks.
How to Choose Task Combinations?
Deciding which tasks to combine in a multi-task learning scenario is not straightforward, especially if there are many potential tasks.
Compute "Inter-task affinities"
evaluate how related or beneficial the tasks are to one another, to inform the combination of tasks
a network with connections between tasks such as segmentation, keypoints, edges, normals, and depth, likely representing the inter-task relationships.
Multi-task Learning Generalisation
General Goal
create models that can generalize well to new, unseen data.
the model should be able to make accurate predictions or decisions on data it hasn't encountered during training.
Domain Generalization
concept of training a model on multiple domains
The goal is to achieve low loss on new data distribution, which means the model should make as few errors as possible on the new domain it's tested on.
Mathematical Formulation
represents the objective of finding the best function which minimizes the expected loss across all tasks.
way to formalize the learning process to make sure the model performs well on new data (domain generalization).
This is further illustrated with a formula that combines the losses from all tasks during training to achieve the best performance on the test domain.
Model diagram
a general model with a shared parameter set θ that branches out into multiple "task heads".
Each head ψ corresponds to a different task the model has learned during training.
There is also a "task head" for the test domain (ψ_{test}), which indicates that the model is trying to apply what it has learned to a new domain.
How to Learn
meta-learning is a strategy to improve learning
"learning to learn"
involves designing models that can learn new tasks with minimal data by effectively leveraging past knowledge.
higher-level learning process where the model is not only learning the tasks but also learning how to adapt to new tasks more efficiently.
Define Meta-Learning
Basic Learning
standard process of training a machine learning model on a dataset to learn a specific task, like distinguishing between images of dogs and otters.
Meta Learning
Meta learning goes beyond basic learning.
the algorithm itself learns how to adapt to new tasks quickly with minimal data.
involves training on a variety of "special target domains" (different types of tasks or data, like images of various categories)
and then testing the model's ability to apply what it has learned to a new, unseen domain.
Gradient-based: Model-Agnostic Meta-Learning (MAML)
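The card text is missing here; a hedged sketch of the MAML inner/outer loop, where `model_loss(params, batch)` and the task objects with a `sample()` method are hypothetical helpers:

```python
import torch

def maml_step(params, tasks, model_loss, inner_lr=0.01, outer_lr=0.001):
    """One meta-update: adapt per task (inner loop), then update the shared initialization."""
    meta_grads = [torch.zeros_like(p) for p in params]
    for task in tasks:
        support, query = task.sample()                       # few-shot train/test split
        grads = torch.autograd.grad(model_loss(params, support), params,
                                    create_graph=True)       # keep graph for 2nd-order terms
        adapted = [p - inner_lr * g for p, g in zip(params, grads)]  # inner gradient step
        outer = torch.autograd.grad(model_loss(adapted, query), params)
        meta_grads = [m + g for m, g in zip(meta_grads, outer)]
    # Outer step: move the initialization in the averaged meta-gradient direction
    return [(p - outer_lr * m / len(tasks)).detach().requires_grad_()
            for p, m in zip(params, meta_grads)]
```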
Meta Learning - Different approaches
Define Few-Shot Learning
technique where the model is designed to learn from a very small amount of data, referred to as "few shots".
In the context of meta-learning, few-shot learning refers to the model's ability to adapt to new tasks with very limited data.
Zero-shot learning is mentioned as a related concept where the model makes predictions for tasks without having seen any labeled examples at all.
C-way N-shot Terminology
"C-way" refers to the number of classes (distinct categories) involved in a task.
3-way" would mean there are three different classes.
"N-shots" indicates the number of examples per class.
"2-shot" means there are two examples for each class.
if tasks do not share classes, the learning process forces the model to focus on features that are specific to each class, rather than relying on generic features or shortcuts (like the background of images).
Define Reinforcement Learning
type of machine learning where a program (often called an agent) learns to make decisions by trying out actions and seeing what happens
much like how a person learns to play a new game by trying different moves and remembering which ones worked well (trial and error)
Variables of RL
State S_t
situation the program finds itself in at any given moment.
In a game, this could be the arrangement of pieces on a board.
The state is what the program observes and uses to decide what to do next.
sometimes the program doesn't get to see everything (partial observation o_t) and must make the best guess.
Action a_t(S_t)
Based on the state, the program decides to take an action.
For example, in a game like chess, an action would be moving a pawn.
The action depends on the current state of the game.
Reward r_t:
After the program takes an action, it gets a reward based on how good that action was.
This is like getting points in a game. The reward helps the program understand if the action it took was beneficial or not.
Policy π_θ:
the strategy the program uses to decide which actions to take as it goes along.
Think of it as the program's game plan or set of rules it follows to try to win or achieve its goal.
Goal of RL
Goal
aim is for the program to select actions that will give it the most reward not just immediately but in the long term too.
It's like planning several moves ahead in a game instead of just the next move.
Actions and Consequences
Actions the program takes can have long-term effects,
which means a good action now could lead to more rewards later on.
Delayed Reward
Sometimes, the reward for an action isn't immediate.
The program might have to wait a while to see if an action was truly good.
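This long-term goal is usually formalized as maximizing the expected discounted return (γ is a discount factor, not defined on the card):

$$G_t = \sum_{k=0}^{\infty} \gamma^k\, r_{t+k}, \qquad 0 \le \gamma < 1$$

A γ close to 1 makes the agent value delayed rewards almost as much as immediate ones.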
Advantages & Disadvantages of RL
Advantages of RL
Partial target information
can learn effectively even when they don't have full information about the outcome of their actions.
They're designed to work with incomplete knowledge about the environment.
Good in low-dimensional (low-D) tasks:
effective in problems where the number of factors to consider (dimensions) is relatively small,
which can include many real-world scenarios.
Disadvantages of RL
Sample inefficiency:
need a lot of trial-and-error before learning how to do something well
This means they require a lot of data (or samples) from the environment, which can be inefficient.
High variance and instability:
The measures of performance (like gradients, which show the direction to improve) can be inconsistent (high variance).
This makes the training process unstable and challenging.
Different Approaches in RL
Value-based Reinforcement Learning
Q-Function
SARSA-Algorithm
Q-Learning
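The card text is missing here; the standard tabular Q-learning update, as a small sketch:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    target = r + gamma * np.max(Q[s_next])   # off-policy: bootstrap with the best next action
    Q[s, a] += alpha * (target - Q[s, a])    # move the estimate toward the target
    return Q
```

SARSA (the previous card) differs in using Q[s_next, a_next] for the action actually taken, making it on-policy.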
Policy Gradient Reinforcement Learning
Deep Reinforcement Learning Task
Deep Q-Network
Model-based RL
Meta Reinforcement Learning
How is reinforcement learning different from (un)supervised ML ?
RL learns from interaction, optimizing actions based on rewards over time and focusing on making a sequence of decisions
SL learns from labeled examples with immediate error correction
UL is about finding structure in data without labels or rewards