Define Machine Learning
field concerned with understanding and building methods that leverage data to perform a certain task
A method learns from experience E with respect to some class of tasks T and performance measure P
if its performance at tasks in T, as measured by P, improves with experience E
What are the different learning methods in ML ?
Supervised Learning
Labelled data
Can perform classification or regression
use of continuous or discrete labels
Unsupervised Learning
Unlabelled data
Reinforcement Learning
Agents performing actions in environment
Difference between supervised and unsupervised learning?
Supervised: learns a mapping from inputs to known target labels
Unsupervised: finds structure (clusters, components) in unlabelled data
What is semi-supervised ?
partly labelled data
What is weakly supervised ?
noisy labels
Examples of supervised machine learning methods
Linear Regression
Bayesian Regression
Support Vector Machines
Examples of unsupervised machine learning methods
PCA
Cluster algorithms (K-Means, Gaussian Mixture Models)
Autoencoder, GAN
Tasks of Machine Learning
- Classification
- Regression
- Text-to-speech
- Synthesis & Sampling
- Probability density estimation
- Anomaly detection
- De-noising
Performance Metrics of Machine Learning
- Distance
- Euclidean distance
- Manhattan distance
- Supervised
- Confusion Matrix
- MSE/RMSE
- Accuracy
- Precision
- Recall
- Unsupervised
- Pairwise correlation
- Mahalanobis distance
- Inter-/Intra-Cluster Distance
What is Precision?
What proportion of positive "predictions" was actually correct?
What is Recall ?
What proportion of actual positives was identified correctly?
Explain the confusion matrix
- True Positives: Positive predicted and actual positive
- False Positives: Positive predicted and actual negative
- True Negatives: Negative Predicted and actual negative
- False Negatives: Negative Predicted and actual positive
What is the standard linear regression equation?
y(x, w) = w_0 + w_1 x_1 + ... + w_D x_D
General linear regression model
y(x, w) = Σ_j w_j φ_j(x) = w^T φ(x), with basis functions φ_j(x)
Assumption: target variable t is the sum of the predicted value of model y(x,w) and some noise epsilon
What is the distribution of the noise that is assumed in Linear Regression ?
epsilon follows a normal distribution
with precision β
precision β is the inverse of the variance of the normal distribution.
In mathematical terms, β = 1/σ^2
a higher precision means a lower variance
What is the Maximum Likelihood Estimate (ML) ?
statistical method used for estimating the parameters of a model.
choose the parameter values that make the observed data most probable.
given a set of data and a statistical model, MLE finds the parameter values that maximize the likelihood function,
which measures how well the model explains the observed data.
How to get the MLE in linear regression ?
objective: minimise the least-squares error between the target t and the estimate y(x)
find the minimum of E: differentiate with respect to w and set the derivative equal to zero
this yields the normal equations: Φ^T Φ w = Φ^T t
solving this for w, we find the least-squares estimate w_ML = (Φ^T Φ)^(-1) Φ^T t
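A minimal sketch of this recipe in Python (toy data assumed; numpy only): build the design matrix, solve the normal equations, recover the weights.

```python
# Sketch: maximum-likelihood / least-squares weights via the normal
# equations w_ML = (Phi^T Phi)^(-1) Phi^T t, on assumed toy data.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=50)
t = 0.5 + 2.0 * x + rng.normal(0.0, 0.1, size=50)  # noisy linear target

Phi = np.column_stack([np.ones_like(x), x])        # design matrix: bias + x
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)     # solve the normal equations
print(w_ml)                                        # close to [0.5, 2.0]
```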
What is the Maximum A Posteriori (MAP) Estimate ?
method used in Bayesian statistics to estimate a model's parameters.
closely related to the Maximum Likelihood Estimate (MLE),
key difference: MAP incorporates prior knowledge about the parameters through a prior distribution.
MAP estimate is found by maximizing the posterior: w_MAP = argmax_w p(w|t) ∝ p(t|w) p(w)
the evidence is the same for all parameter values (it is a constant based on the observed data)
not only about fitting the model to the data (as in MLE) but also about fitting it in a way that is consistent with what was believed about the parameters before the data was seen.
can lead to different estimates from MLE, when the prior is strong or the data is limited.
What is multivariate linear regression ?
technique that models the linear relationship between
multiple independent variables (also known as predictors or features)
and a dependent variable (also known as the response or outcome).
What is the design matrix ?
matrix that captures all the data used to make predictions about your dependent variable
the design matrix includes:
Intercept Term: If your model includes an intercept term (also known as a bias), the first column of the design matrix is typically a column of all ones.
Independent Variables:
Each column in the design matrix represents one independent variable (also known as a feature or predictor) in your dataset.
Each row represents an observation.
What are basis functions ?
functions used to represent the data within some space in a way that makes it easier to model.
basis functions transform the input variables into a new space where linear relationships can be more easily detected and modeled.
way to understand basis functions:
Simple Linear Basis Functions: the basis functions are just the identity of the predictors. For example, with one predictor x, the basis function is phi(x) = x and the model is y = w_0 + w_1*x
Polynomial Basis Functions: polynomial basis functions like x,x^2,x^3...x^n; allow to model nonlinear relationships while still using linear regression techniques
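A small illustration (a sketch, not from the notes): polynomial basis functions become columns of the design matrix, after which ordinary linear least squares applies unchanged.

```python
# Sketch: polynomial basis functions phi_j(x) = x^j; the model stays
# linear in w even though the fit is non-linear in x.
import numpy as np

def poly_design_matrix(x, degree):
    # Columns x^0 (bias), x^1, ..., x^degree
    return np.column_stack([x ** j for j in range(degree + 1)])

x = np.linspace(-1.0, 1.0, 20)
t = np.sin(np.pi * x)                          # non-linear target
Phi = poly_design_matrix(x, degree=3)
w = np.linalg.lstsq(Phi, t, rcond=None)[0]     # linear least squares in basis space
```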
Why use basis functions ?
transform the input space,
allowing for more flexibility in modeling relationships between the independent variables (features) and the dependent variable (target).
effects and benefits of using basis functions in regression models:
Modeling Non-Linear Relationships:
can capture non-linear relationships by transforming the input data into a higher-dimensional space
where the relationship between the input and the output becomes linear.
For instance, a quadratic basis function can allow a linear model to fit a parabolic curve.
Increased Model Complexity:
effectively increase the complexity of the model.
advantageous if the true relationship between variables is complex and the basic linear model is not sufficient.
Improved Fit:
The use of appropriate basis functions can lead to a better fit of the model to the data, which can result in more accurate predictions.
What is the recipe for ML linear regression ?
Learn model parameters: Construct target vector t and design matrix Φ; then maximize the likelihood to obtain the weights; then you are ready to make predictions
Apply the model to the data in the test set; Evaluate the RMSE between regressed estimate and measured target variable
Differences between MLE and MAP ?
Prior Information:
MAP incorporates prior knowledge through a prior probability distribution; MLE does not
Objective:
MLE maximizes the likelihood of observing the data given the parameters.
MAP maximizes the posterior probability of the parameters given the data and prior.
Results:
MLE estimates are purely data-driven.
MAP estimates are influenced by both data and prior beliefs.
Convergence with Data:
For large data sets, MAP estimates converge to MLE estimates, provided the prior is not extremely strong.
Define regularization
used to prevent overfitting
by adding a penalty on the size of the model parameters to the loss function used to train the model.
This encourages the model to be simpler, making it generalize better to new data by keeping the weights small and reducing the model's complexity.
Common types of regularization include L1 (Lasso) and L2 (Ridge) regularization.
Lasso vs. Ridge Regression (and when to use what?)
Use L2 (Ridge) Regression when:
multicollinearity in your features
want to include all features
more features than observations
Use L1 (Lasso) Regression when:
want a sparse model
want to perform feature selection
a limited dataset: When you have a smaller dataset, Lasso can help by selecting only the most important features, which may lead to better performance on unseen data.
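A short sketch of the contrast, assuming scikit-learn is available (synthetic data): with only two informative features, Lasso zeroes out the rest while Ridge merely shrinks them.

```python
# Sketch: L2 (Ridge) shrinks all weights; L1 (Lasso) drives many to exactly zero.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, size=100)  # 2 informative features

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print(ridge.coef_)   # all 10 coefficients non-zero but shrunken
print(lasso.coef_)   # most coefficients exactly zero -> implicit feature selection
```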
Reflect Bayesian regression for linear models
incorporates prior beliefs about parameters
updates these beliefs after observing data.
helps to regularize the solution, preventing overfitting.
1. Define the Prior
Establish the prior distribution for the parameters, typically a Gaussian distribution, `p(w) = N(w|m0, S0)`.
`m0` and `S0` represent the mean and covariance of the prior belief about the parameters.
2. Formulate the Likelihood Function
The likelihood function `p(T|X, w, β)` describes the probability of observing the target values `T` given the input features `X`, parameters `w`, and noise precision `β`.
3. Determine the Posterior Distribution
Using Bayes' theorem to calculate the posterior distribution, `p(w|T, X, β)`,
updates our beliefs about the parameters after observing the data.
the posterior is proportional to the product of the likelihood and the prior.
4. Compute the Posterior Mean and Covariance
Calculate the mean of the posterior distribution using the formula `mN = SN(S0^(-1)m0 + βΦ^T T)`.
Determine the covariance of the posterior distribution with `SN = (S0^(-1) + βΦ^T Φ)^(-1)`.
5. Interpret the Posterior as a Normal Distribution
the posterior is still a Gaussian distribution with updated mean `mN` and covariance `SN`.
the peak of the posterior distribution is the most probable estimate of the parameters (MAP estimate), which is `w_MAP = mN`.
6. Role of the Conjugate Prior
a conjugate prior is chosen so that the posterior distribution remains in the same family as the prior, simplifying calculations.
7. Role of Evidence
evidence, `p(T|X, β)`, is not incorporated when we're only interested in the parameter values that maximize the posterior.
8. Benefits of Bayesian Linear Regression
Bayesian regression accounts for uncertainty in parameter estimates.
it prevents overfitting by incorporating prior knowledge and updating beliefs in a principled manner.
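The posterior update in step 4 is short enough to sketch directly in numpy (the prior m0, S0 and the noise precision β are assumed inputs):

```python
# Sketch of step 4: mN = SN (S0^-1 m0 + beta Phi^T t),
# SN = (S0^-1 + beta Phi^T Phi)^-1.
import numpy as np

def posterior(Phi, t, m0, S0, beta):
    S0_inv = np.linalg.inv(S0)
    SN = np.linalg.inv(S0_inv + beta * Phi.T @ Phi)   # posterior covariance
    mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)        # posterior mean (= w_MAP)
    return mN, SN
```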
Relate Bayesian sequential learning to regression
sequential learning is the process of updating beliefs about model parameters with each new piece of data.
it treats the posterior from previous data as the new prior.
Update the Prior with New Data
When a new data point arrives,
use the posterior distribution from the previous `N-1` samples as the prior for the new data point.
Calculate the New Posterior
Formulate the posterior probability for the first `N-1` samples
then update this posterior with the likelihood of the new `N-th` data point to get the posterior for `N` samples.
Sequential Update Equation
sequential update for the posterior probability with the new data point, which involves multiplying the previous posterior by the likelihood of the new data point.
Advantages
of allowing for the model to be updated in real-time as new data comes in.
Data arrives sequentially over time, such as online learning or real-time prediction systems.
Apply Full Bayesian approach by computing the predictive distribution by integration over all models
Recall that the posterior distribution `p(w|T, X, α, β)` is a Gaussian with mean `mN` and covariance `SN`.
`mN` and `SN` are derived from the observed data.
The predictive distribution `p(t|x, T, α, β)` allows for making predictions about new, unseen data.
This distribution is calculated by integrating the noise model with the posterior distribution over the parameters.
The predictive distribution is obtained by integrating the product of the likelihood (noise model) `p(t|x, w, β)` and the posterior `p(w|T, X, α, β)` over all possible weights `w`.
The likelihood function for the noise is given by a Gaussian distribution, which represents deviations of the observed target values from the model predictions.
the integral of the product of two Gaussians is itself a Gaussian.
The resulting Gaussian represents the predictive distribution.
The predicted mean `y(x, mN)` is the mean of the predictive distribution, which uses the posterior mean `mN` of the parameters.
Determine the predictive variance `σ^2_N(x)` which quantifies the uncertainty of the prediction.
The predictive variance includes a term from the noise model and the posterior covariance.
the resulting predictive distribution `N(t|y(x, mN), σ^2_N(x))` provides a mean and a variance for the prediction, capturing the uncertainty.
the Full Bayesian approach is focused on predicting distributions, not just point estimates, which is essential for capturing uncertainty in predictions.
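Using the posterior from the previous sketch, the predictive mean and variance for a new input reduce to a few lines (phi_x is the basis-function vector of the new point):

```python
# Sketch: predictive distribution N(t | y(x, mN), sigma^2_N(x)) with
# sigma^2_N(x) = 1/beta + phi(x)^T SN phi(x).
import numpy as np

def predict(phi_x, mN, SN, beta):
    mean = phi_x @ mN                          # y(x, mN)
    var = 1.0 / beta + phi_x @ SN @ phi_x      # noise term + parameter uncertainty
    return mean, var
```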
Flat prior, no data: before any data is observed, the prior is flat (uninformative) and the posterior equals the prior; each observed data point then sharpens the posterior
Describe Linear Discriminant Functions
used to find a linear combination of features that separates two or more classes
resulting combination is used as linear classifier
The decision rule for LDA is a linear equation: y(x) = w^T x + w_0
It creates a decision surface in the feature space where y(x) = 0, dividing the space into regions for different classes.
An input is assigned to class C_1 if y(x) >= 0 and to class C_2 if y(x) < 0, assuming binary classification and linearly separable classes
Advantages of LDA
learns the boundaries directly from the data.
no need to estimate the probability density function (pdf) of the data.
weights in the function give insight into the model: the sign indicates positive or negative effect, and the magnitude indicates the importance of a feature.
What approaches can be used for LDA ?
Least squares
Fisher’s Linear Discriminant
Perceptron
How does the LDA method - Least Squares work ?
Given a dataset with features and labels
x_n represents the feature vector
t_n represents the target label
aim to find weights that minimize the sum of squared differences between predicted values and actual target values:
min_W ||XW - T||_F^2
X is the design matrix
W is the weight matrix
T is the target matrix
The subscript F denotes the Frobenius norm (a measure of the size of a matrix)
Pros and Cons for LDA-Method Least Squares
Pros
Closed form solution
Cons
Not robust (sensitive to outliers)
Outputs are not probabilities (not constrained to (0,1))
Pros and Cons for LDA-Method Perceptron
Pros
Suitable for large datasets as it processes one sample at a time
Guaranteed to converge to a solution if classes are linearly separable
Cons
no unique solution; depends on initial weights and order of data points
will not converge if classes are not linearly separable
does not generalize to multi-class problems (only for 2 classes)
Outputs are not probabilities (not constrained to (0,1))
Pros and Cons for LDA-Method Fisher's Linear Discriminant
Pros:
Dimensionality Reduction
suitable for multi-class problems
closed form solution
Cons (works best only if):
Class means should differ
Gaussian distribution within classes
Similar sample sizes among classes
Mathematical Approach for Fisher’s Linear discriminant
Minimize within-class scatter
Maximize between-class scatter
objective is to maximize the Fisher criterion, the ratio of between-class to within-class scatter: J(w) = (w^T S_B w) / (w^T S_W w)
To find the best projection, we need to maximize J(w); the solution is w = c * S_W^(-1) (m_2 - m_1)
c is a constant
the rest of the equation gives the direction that maximizes the separation between the projected class means while also accounting for the within-class spread
Step-by-step LDA - Perceptron
Input data D (x is the feature vector, t is the corresponding target label in {-1, 1})
Minimize the misclassification error with stochastic gradient descent, iteratively updating the weights on misclassified samples
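A minimal perceptron sketch of exactly this loop (numpy; labels in {-1, +1}); the update fires only on misclassified samples:

```python
# Sketch: perceptron trained with SGD on the misclassification criterion.
import numpy as np

def train_perceptron(X, t, lr=1.0, epochs=100):
    Xb = np.column_stack([np.ones(len(X)), X])   # absorb bias into the weights
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for x_n, t_n in zip(Xb, t):
            if t_n * (w @ x_n) <= 0:             # misclassified (or on boundary)
                w += lr * t_n * x_n              # SGD step
    return w
```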
What are Probabilistic Generative Models ?
assume that the data for each class is generated from a Gaussian distribution
for class C_k, the class-conditional density is a Gaussian: p(x|C_k) = N(x|μ_k, Σ_k)
decision rule for classification is to choose the class that maximizes the discriminant function
LDA for Probabilistic Generative Models
1. Assume Gaussian Distribution for each class
2. Calculate Mean and Covariance for each class
3. Compute Prior Probabilities for each class
4. Decision Function: The decision function for each class is derived from the log of the posterior probability. By applying Bayes' theorem and simplifying, the decision function can be expressed in terms of each class's mean, covariance, and prior.
5. Classify New Samples: compute the decision function for each class and assign x to the class with the maximum value of the decision function.
LDA typically makes two additional simplifications to the covariance matrices:
Common Covariance Matrix: assumes that all classes share the same covariance matrix, Σ, which leads to linear decision boundaries.
Diagonal and Equal Variances: In the simplest case, LDA can further assume that the covariance matrix is diagonal with equal variances across all dimensions, which simplifies the computation even further.
Probabilistic Generative Models - Cases of Assumptions
- Different covariance cases
Individual Covariances (no linear decision boundary)
One common covariance
One common diagonal covariance
One common diagonal and equal variance
—> Check that the assumptions are met by the data
—> Preprocessing is important!
Define what kernel methods are
are a class of algorithms
map data into a higher-dimensional space using a kernel function
making it easier to perform linear separations between classes.
Commonly used in support vector machines (SVMs), for classification, regression, and other tasks.
measures similarity between pairs of data points in the original space without explicitly performing the transformation.
Popular kernel functions include linear, polynomial, radial basis function (RBF), and sigmoid.
They enable complex decision boundaries in the original feature space, improving the flexibility and accuracy of machine learning models.
How would you apply kernel methods to new problems ?
Understand the problem domain
Choose an appropriate kernel
Preprocess the data: Prepare your data by cleaning (removing noise and outliers) and normalizing it if necessary. This step is crucial for the effective application of kernel methods.
Split the data
Train the model: Use a kernel-based algorithm like SVM to train your model on the training set. During training, the algorithm uses the kernel function to transform the data into a higher-dimensional space, where it finds the optimal boundary between classes or regression fit.
Tune hyperparameters
Evaluate the model
Apply to new data: Once satisfied with the model's performance, you can apply it to new problem instances.
Define what a Gaussian Process is
a probability distribution over possible functions that fit a set of points. For any collection of points, the joint distribution of the GP's outputs is multivariate Gaussian.
In contrast to Bayesian linear regression, which models target variables with a linear function and Gaussian noise, GPs model the target directly, without specifying intermediate weights or assuming a particular functional form.
The GP is fully characterized by a mean function (often assumed to be zero for simplicity) and a covariance function or kernel, which defines the relationship between different points in the input space.
The covariance matrix derived from the kernel function captures the essence of the GP. The kernel defines the smoothness and general behavior of the functions drawn from the process.
For new samples, GPs provide a predictive distribution which is also Gaussian, giving not just an estimate for the target but also the uncertainty associated with that estimate.
used in regression tasks to make predictions about new data points. The GP's ability to provide a measure of uncertainty with its predictions is particularly useful in many applications, like optimization and active learning.
How does the Kernel trick work ?
enables algorithms to operate in a high-dimensional space without explicitly computing the coordinates of the data in that space.
Linear Regression with Kernels
by transforming input data into a high-dimensional space, allowing linear models to capture non-linear relationships.
Least Squares Error
cost function for regression minimizes the sum of squared errors between predictions and actual target values, with a regularization term to prevent overfitting.
Dual Representation
weights are expressed in terms of a new parameter set, which is a linear combination of the input data,
utilizing a kernel matrix that encapsulates the relationships between data points in the transformed space.
Rewriting the Cost Function
The cost function is reformulated in terms of the dual parameters and the kernel matrix, which is used to minimize the cost and train the model.
Advantages of the Dual Form
Only the kernel matrix is utilized, bypassing the need to compute high-dimensional feature vectors.
improves computational efficiency, especially when the transformed feature space is much larger than the number of data points.
Why Use the Dual Form?
Inverting a matrix related to the number of features (in the transformed space) is computationally costly.
The dual form involves inverting a matrix related to the number of data points, which is typically smaller and computationally less expensive.
What is a valid kernel K with elements k(x_n,x_m) ?
also known as Gram matrix K
symmetric
positive semidefinite (for any vector v, v^T K v >= 0)
kernel values are high for small distances (high similarity)
and low for large distances (low similarity)
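A quick numerical sanity check (a sketch with an assumed RBF kernel): build the Gram matrix and verify symmetry and positive semidefiniteness.

```python
# Sketch: RBF Gram matrix; valid kernels give symmetric, PSD matrices.
import numpy as np

def rbf_gram(X, gamma=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    return np.exp(-gamma * sq)                           # high value = high similarity

X = np.random.default_rng(0).normal(size=(5, 2))
K = rbf_gram(X)
assert np.allclose(K, K.T)                     # symmetric
assert np.all(np.linalg.eigvalsh(K) >= -1e-9)  # eigenvalues >= 0 (PSD)
```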
Define what SVMs are
supervised machine learning algorithm
SVMs aim to find the best separating hyperplane that divides the data into classes.
"best" hyperplane is the one that maximizes the margin between different classes, where the margin is defined as the distance between the hyperplane and the nearest data points from each class, known as support vectors.
Support vectors are the data points nearest to the hyperplane; the position of the hyperplane is entirely dependent on these points.
In classification tasks, SVMs can separate data points into two or more classes with a hyperplane in the feature space. For two classes, the goal is to find the optimal dividing line (in 2D), plane (in 3D), or hyperplane (in higher dimensions).
SVMs can employ the kernel trick to handle non-linearly separable data. Kernel functions implicitly map input features into high-dimensional feature spaces where a linear separation is possible
Advantages:
effective in high-dimensional spaces, even when the number of dimensions exceeds the number of samples.
memory efficient since they use a subset of training points in the decision function (support vectors).
They can provide different kernel functions for the decision function and can specify custom kernels.
Limitations:
do not directly provide probability estimates, which are calculated using an expensive five-fold cross-validation.
can be less effective on very large datasets or datasets with a lot of noise.
What is the Max-Margin Problem
Max-Margin problem is a formulation used to find the hyperplane that maximizes the margin between two classes in the feature space.
Objective Function
The goal is to find a hyperplane with the smallest possible weight vector w that still correctly classifies the data points.
Mathematically, it's about minimizing (1/2)||w||^2
subject to constraints ensuring that all data points x_n are classified correctly: t_n (w^T φ(x_n) + w_0) >= 1 for all n
Lagrange Multipliers
To solve this constrained optimization problem, Lagrange multipliers a_n are introduced for each constraint.
The Lagrangian L(w, w_0, a) is constructed by combining the objective function with the constraints, weighted by the Lagrange multipliers.
The problem transforms into minimizing the Lagrangian with respect to w and w_0, and maximizing it with respect to a_n.
Gradient Equations and Dual Representation
setting the gradients of the Lagrangian with respect to w and w_0 equal to zero, conditions are derived that allow w to be expressed as a sum of the data points x_n, scaled by the product of their corresponding Lagrange multiplier a_n and label t_n
A condition that the sum of the products of the Lagrange multipliers and labels must equal zero is also derived.
This leads to the dual representation which only depends on the Lagrange multipliers a_n, and the problem can be reformulated into maximizing the dual representation L(a)
Kernel Function:
The kernel function k(x_n, x_m) represents the inner product of φ(x_n) and φ(x_m) in the feature space, allowing the SVM to work in a higher-dimensional space without explicitly computing the coordinates.
Maximizing the Dual Representation:
The dual problem involves maximizing L(a) under the constraints that all a_n are non-negative and the sum of the products of a_n and t_n equals zero.
The dual representation simplifies the problem as it removes the need to work directly with the weight vector w and allows the use of the kernel trick.
Through the Max-Margin problem, SVMs find a hyperplane that not only separates the data into classes but also stays as far away as possible from the closest data points of any class, aiming for a better generalization to new data points.
How can you train a SVM ?
Step 1: Estimate the Bias w_0
Assuming the support vectors S are known, estimate the bias as w_0 = (1/N_S) Σ_{n∈S} ( t_n - Σ_{m∈S} a_m t_m k(x_n, x_m) )
This involves the labels of the support vectors, the Lagrange multipliers, and the kernel evaluations.
Step 2: How to get the support vectors defined by a_n
no direct or feasible way to find support vectors just from multipliers a_n
Need to estimate the Lagrange Multipliers
Sequential Minimal Optimization (SMO)
SMO is a common method for solving the optimization problem of SVM efficiently.
SVM - Pros and Cons
Pros: effective in high dimensions, memory efficient (only support vectors), flexible via kernels
Cons: no direct probability estimates, less effective on very large or noisy datasets
Define Correlation
Measure of how much two variables are "linearly" related
If one variable tends to go up when the other does, they have a positive correlation.
If one goes up while the other goes down, they have a negative correlation.
If changes in one variable don't consistently relate to changes in the other, they might have no or very little correlation.
Correlation is neither good nor bad; it depends on the application
Pearson correlation coefficient: ρ(X, Y) = cov(X, Y) / (σ_X σ_Y)
Define Principal Component Analysis
Goal of PCA is to project data onto fewer dimensions while keeping as much variation in the data as possible
Find projection that
maximizes variance
minimizes reprojection error
Find a low-dimensional space such that
when x_n is projected there, "information loss" is minimized
new directions must be uncorrelated which means that the covariance matrix is diagonal
Compute Principal Component
Define Eigenvector
a vector that changes at most by a scalar factor when a linear transformation is applied
a vector whose direction remains unchanged when a linear transformation is applied to it
PCA - How to compute Principal Components
Compute Mean (to center data)
Compute Covariance Matrix (M is a matrix where each row is a mean vector)
Get eigenvalues from det(A - λI) = 0, then the corresponding eigenvectors
Project onto the top K eigenvectors (in the example, K = 1)
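The steps above fit in a few lines of numpy (a sketch; eigenvector sign conventions may differ between libraries):

```python
# Sketch: PCA via eigendecomposition of the covariance matrix.
import numpy as np

def pca(X, k=1):
    Xc = X - X.mean(axis=0)             # 1. center the data
    C = np.cov(Xc, rowvar=False)        # 2. covariance matrix
    vals, vecs = np.linalg.eigh(C)      # 3. eigenvalues / eigenvectors
    order = np.argsort(vals)[::-1]      #    sort by explained variance
    W = vecs[:, order[:k]]              # 4. top-k principal components
    return Xc @ W                       # project data onto the components
```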
PCA for high dimensional data
When the dimension D exceeds the number of samples N, compute eigenvectors of the N×N matrix (1/N) X X^T instead of the D×D covariance matrix and map them back, which is much cheaper
Properties of PCA
PCA is an unsupervised method
It is deterministic, providing the same output for a given input data.
PCA has an analytical solution, which means the principal components are computed through direct computation rather than iterative methods.
involves creating a linear combination of samples to form new principal components.
Critical considerations for PCA include:
Preprocessing of data is crucial for effective PCA.
The number of principal components to retain must be determined.
PCA is not as suitable for very large datasets.
Independent Component Analysis (ICA)
ICA aims to find a set of independent components from the observed data.
Unlike PCA which finds uncorrelated components, ICA finds components that are statistically independent.
used to solve problems like the cocktail party problem, where the goal is to separate mixed signals into their original sources.
PCA vs. ICA
PCA finds orthogonal directions of maximal variance (uncorrelated components, ordered by explained variance); ICA finds statistically independent components and imposes no variance ordering
Applications of PCA
PCA is used for data compression and dimensionality reduction.
It serves as an important step in data preprocessing.
PCA aids in creating models for:
Approximating original datasets.
Interpolating data for smoother transitions.
Generating new data samples based on the principal components.
A caution is given that PCA does not inherently understand the semantic meaning of the directions in reduced space.
What are the Principal Components and what property do they have ?
vectors that capture the underlying structure of the data in a dataset after PCA was applied
Each PC represents the direction in the dataset along which the variance is maximized. The first principal component captures the most variance, the second principal component captures the second most…
PCs are orthogonal to each other in the feature space.
PC are constructed as linear combinations of the original features
Each PC is associated with an eigenvalue from the covariance (or correlation) matrix of the data. The eigenvalue measures the amount of variance captured by its corresponding principal component.
Cost Function & Responsibilities of K-Means
J = Σ_n Σ_k r_nk ||x_n - μ_k||^2, where the responsibilities r_nk ∈ {0,1} are one-hot indicators assigning each point n to exactly one cluster k
Minimizing K-Means Cost Function / Expectation-Maximization Algorithm
E-step: Each data point gets assigned to the nearest cluster center, effectively minimizing cost function J with respect to the responsibility r_nk
M-step: Once the points are assigned, the cluster centers μ_k are recalculated by averaging all the points in each cluster. This minimizes J with respect to μ_k
Local convergence is guaranteed
Global minimum is not guaranteed
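A minimal K-means sketch of the E/M loop above (numpy; assumes no cluster goes empty):

```python
# Sketch: K-means with hard assignments (E-step) and mean updates (M-step).
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]      # initial centers
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - mu[None], axis=2)  # point-center distances
        r = d.argmin(axis=1)                               # E-step: nearest center
        mu = np.array([X[r == j].mean(axis=0)              # M-step: recompute means
                       for j in range(k)])                 # (assumes no empty cluster)
    return mu, r
```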
Limits of K-means / Alternatives
Hard assignments of data points to clusters —> small shift of data point can flip it to a different cluster
Not clear how to choose value of K
Alternative: replace 'hard' clustering of K-means with "soft" probabilistic assignments
Define Gaussian Mixtures
A function that is made up of several Gaussian (normal) distributions
Each of these distributions represents a cluster within the data,
They are combined (or 'mixed') to model the overall distribution of the data.
Sampling from a Gaussian Mixture
In simulation, the mixture parameters are known
To generate a data point:
Draw one of the components k with probability p(k) = π_k
Draw a sample x from that component's Gaussian N(μ_k, Σ_k)
Repeat for each new data point
How to fit Gaussian Mixture Models
Expectation-Step:
Calculate responsibilities
Maximization-Step:
Update the means
Update covariances
Update mixing coefficients
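The E-step can be sketched directly (assuming scipy is available for the Gaussian density):

```python
# Sketch: responsibilities gamma_nk proportional to pi_k * N(x_n | mu_k, Sigma_k),
# normalized across components.
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pis, mus, Sigmas):
    dens = np.column_stack([
        pi * multivariate_normal.pdf(X, mean=mu, cov=S)
        for pi, mu, S in zip(pis, mus, Sigmas)
    ])
    return dens / dens.sum(axis=1, keepdims=True)   # rows sum to 1
```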
Differences between hard clustering and fuzzy clustering
Hard Clustering
Each data point belongs to exactly one cluster.
Clear boundaries between clusters.
Easy interpretation and implementation.
Use when distinct classifications are needed.
Fuzzy Clustering
Data points can belong to multiple clusters, each with a certain degree of membership
Soft boundaries; data points can be shared among clusters.
Use when data exhibits overlap between classes or when the boundaries are not clear.
Different Covariance Shapes for GMMs
Spherical and Diagonal assumptions are computationally less demanding and may prevent overfitting on small datasets but at the cost of model flexibility.
Tied and Full covariance matrices provide more flexibility to capture complex data structures but increase the risk of overfitting and require more computational resources.
Reflect the structure of MLP networks
Contains an input layer, (min.) one hidden layer, and an output layer
A hidden layer processes the inputs x through weights and a bias and applies a non-linear activation function
The output layer computes the final output y using the hidden layer's outputs and its own weights and biases
How to do regression with a NN?
Use a linear output layer and train with a squared-error (MSE) loss
How to do binary classification with a NN?
Use a sigmoid output unit giving p(C_1|x) and train with binary cross-entropy loss
How to do multiclass classification with a NN?
Use a softmax output layer giving a distribution over classes and train with cross-entropy loss
Explain the role of MLPs in deep learning
DL utilizes MLPs with multiple hidden layers
MLPs learn a hierarchy of features, with each layer capturing increasingly abstract representations of the data
They can approximate any continuous function, making them versatile for various tasks
MLPs use backpropagation for efficient training, allowing them to learn from data by adjusting weights to minimize error
serve as the basis for more specialized deep learning architectures like CNNs and RNNs
The output y of the network is a composition of functions corresponding to each layer's transformation, including the activation functions h and the layer weights and biases W,b
Explain backpropagation
Computes the gradient of the loss with respect to every weight by applying the chain rule layer by layer, from the output back to the input; the gradients are then used by gradient descent to update the weights
What are common problems when training deep neural networks ?
Vanishing Gradient
Exploding Gradient
Overfitting/Underfitting
Hyperparameters
Parts of a neuron
Inputs, weights, bias, weighted sum, activation function, output
What activation functions are there ?
Sigmoid, tanh, ReLU and variants (e.g., Leaky ReLU), softmax (at the output)
Explain Vanishing Gradient
In deep networks, gradients can become very small
exponentially decreasing as they propagate back through the layers.
makes it hard to update the weights in the earlier layers
Activation functions like ReLU can mitigate this issue because their derivative does not saturate for positive inputs
Explain exploding gradients
If the gradients are large, their effects can get multiplied through the layers, leading to even larger gradients.
leading to large changes in weights and unstable training
Gradient clipping is a common solution.
Reasons for exploding gradients:
In very deep networks, gradients can accumulate through layers. If the gradients are large, their effects can get multiplied through the layers, leading to even larger gradients.
If the network's weights are initialized too large or grow too large during training, the gradients can explode
non-saturating functions like ReLU can contribute to the exploding gradients problem because their derivatives can be large. For example, the derivative of the ReLU function is either 0 or 1, and during backpropagation, this can lead to large gradients if many ReLU units are active at once.
Batches in Gradient Descent
Batch GD computes the gradient on the full training set per update; stochastic GD on a single sample; mini-batch GD on a small subset, the usual compromise between gradient quality and speed
Training algorithms for Neural Networks
SGD, SGD with momentum, RMSProp, Adam, ...
Explain overfitting
The model fits the training data (including its noise) so closely that it fails to generalize: low training error but high validation/test error
Regularization in Neural Networks
L2 weight decay, dropout, early stopping, data augmentation
Explain a CNN
Structured for 2D Data: CNNs are specialized neural networks for processing data with grid-like topology (e.g., images).
Layered Architecture: Typically consists of convolutional layers, pooling layers, and fully connected layers.
Convolutional Layers
Utilize filters/kernels to perform convolution operations.
Capture spatial features like edges, patterns, and textures.
Each filter detects different features by sliding over the input image.
Activation Functions
Applied after convolution to introduce non-linearity.
ReLU (Rectified Linear Unit) is commonly used.
Pooling Layers
Reduce spatial dimensions (downsampling).
Make the detected features more robust to small translations and distortions.
Common methods include max pooling
Fully Connected Layers:
Neurons have full connections to all activations in the previous layer.
Integrate learned features from convolutional and pooling layers for classification.
Output Layer
Gives the final prediction, often using a softmax function for classification tasks.
Learnable Parameters
Weights in filters and fully connected layers are learned during training.
Spatial Hierarchy of Features:
Early layers capture low-level features; deeper layers build up to high-level features.
Efficiency
Share weights and use fewer parameters compared to fully connected networks, making them computationally efficient.
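A minimal CNN sketch wiring these layers together for 28x28 grayscale inputs (PyTorch is an assumption; the notes name no framework):

```python
# Sketch: conv -> ReLU -> pool -> fully connected classifier.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),  # 16 learned filters
    nn.ReLU(),                                   # non-linearity
    nn.MaxPool2d(2),                             # downsample 28x28 -> 14x14
    nn.Flatten(),
    nn.Linear(16 * 14 * 14, 10),                 # fully connected output layer
)
logits = model(torch.randn(1, 1, 28, 28))        # softmax is applied in the loss
```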
Explain Padding & Striding
Padding adds (typically zero-valued) borders around the input so the filter can cover edge pixels and the output size is controlled; stride is the step size with which the filter slides, and larger strides downsample the output
Explain Batch Normalisation (CNN)
Normalizes layer activations over the current mini-batch (zero mean, unit variance, plus a learned scale and shift), stabilizing and speeding up training
Explain Pooling (CNN)
Aggregates neighbouring activations (e.g., max pooling) to reduce spatial dimensions and make features more robust to small shifts
Observe, Explain, Optimize
How to monitor training of NNs ?
- Tracking loss & other metrics
- Inspecting weights, biases and other tensors
- Inspecting representations to some degree (i.e. embeddings)
- Displaying training data
Interpret Training Characteristics
- Reminder: tracking loss and averaged metrics
- Convergence time
- Absolute best loss/metric values
- Relative training behaviour: stability, robustness,...
- Inspecting weights, biases, activations, gradients and other tensors
- Converging/Diverging ?
- Not changing over training time ?
- Sparse ?
- Within a good norm ?
- Metrics are task-specific and not always meaningful
Examples of Hyperparameters
hidden units
number of layers,
activation function,
convolution (stride/filters),
epochs,
learning rate,
batch size,
optimizer,
regularization,
momentum,...
We optimize these by monitoring the validation loss
How to find good meta-parameters?
- Systematic procedure:
- Intuition + Grid Search (most common)
- Random Search (if we have no idea)
- Bayesian Optimization (best in theory)
- Evolutionary Algorithms, Gradient-based
- In practice:
- Intuition first
- then informed Random Search or Bayesian Opt.
- Always iteratively:
Start coarse search to observe behaviour, then increase granularity
Grid vs. Random Search
Grid Search
Tests all possible combinations of the parameters.
Finds the best parameters if they are within the grid.
Time-consuming: Can be very slow, especially with a large number of parameters or if each model takes a long time to train.
Easy to Implement: set up with clear and simple logic.
More suitable when the parameter space is small
Random Search
Samples parameter settings at random for a fixed number of iterations.
Efficiency: Can find a good set of parameters faster than grid search when the parameter space is large.
Less Precision: May miss the exact best parameters but often finds a close approximation with significantly less computation.
More scalable to high-dimensional spaces.
Better suited for when the dataset or parameter space is large.
Both:
Easy to implement and parallelize
Asynchronous and stoppable/pausable at any given time
But non-adaptive: Still beaten by more complex search algorithms
Bayesian approaches are more intelligent (but hard to parallelize & have own hyperparameters)
—> Random search is better than naive grid search (not all hyperparameters are significant)
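A naive random-search sketch (the evaluate() function is hypothetical; it would train a model and return its validation loss):

```python
# Sketch: random search over two hyperparameters, keeping the best config.
import random

def random_search(evaluate, n_trials=20, seed=0):
    rng = random.Random(seed)
    best_loss, best_cfg = float("inf"), None
    for _ in range(n_trials):
        cfg = {
            "lr": 10 ** rng.uniform(-5, -1),             # log-uniform learning rate
            "batch_size": rng.choice([16, 32, 64, 128]),
        }
        loss = evaluate(cfg)                             # hypothetical: train + validate
        if loss < best_loss:
            best_loss, best_cfg = loss, cfg
    return best_cfg, best_loss
```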
Bayesian Optimization
Assume that model performance is a smooth function in the space of hyper-parameters:
Impose a Gaussian Process prior
Search for hyper-parameters that have the largest chance of improving given the current results
BoTorch, Keras Tuner
Challenges in Neural Network Optimisation - SGD
- Additional problems by SGD:
- Ill-conditioned gradients
- Plateaus, Saddle Points
- Inexact Gradients
- Additional problems by deep architectures:
- Cliffs and Exploding Gradients
- Long-Term Dependencies
- Convergence is never guaranteed!
- Use rule of thumb and related experience for training and representation methods
- Making good use of regularisation
Interpretability vs. Explainability
Interpretability
Being able to determine cause and effect from a ML model
Explainability
Knowing what a node represents and its importance to the model's performance
Explaining network parameters and activation - Weights
Inspecting Weight Matrices
Q: What is the relation between (hidden layer) connections?
Approach: Visualise connection strength directly
Difficulties:
- Lack of Contextualization
- Indirect Interaction
- Dimensionality and Scale
Explaining network parameters and activation - Visualize features
Visualizing features by optimization
Start from random noise image
Optimize image to activate particular neuron:
Calculate gradient for increasing neuron responses
Adjust image based on gradient
Objectives
Applicable to unit or layer of interest
Deconvolution
Forward Pass:
input image is passed through the network up to a certain layer.
During this forward pass, all the activations are stored.
These activations represent what filters have responded to in the image.
Select Activation:
To visualize the features that a particular filter has learned to recognize
select a specific activation from a specific filter within the layer of interest
This activation map shows where the filter responded strongly
Reverse Mapping (Deconvolution):
Starting from the selected activation, you work backwards through the network
(this is where the term "deconvolution" is often used, although it's not technically deconvolution in the strict mathematical sense).
map the activations back to the pixel space of the input image to see what part of the image caused the activation. This involves:
Unpooling: Reversing the max-pooling operation by placing the activations back into the location of the maximum values that were recorded during the pooling in the forward pass.
Transposed Convolution: Applying transposed convolution operations using the stored weights from the forward pass. This step aims to reconstruct the image area that would induce the activations in the forward pass.
Iterate Back to Input:
You iterate this process back through the layers of the network until you reach the input layer.
At each layer, you're essentially asking, "What input would have caused this filter to activate in this way?"
Visualization:
The resulting mapped image often highlights the patterns or parts of the original input that the filter is responsive to.
For example, if the filter has learned to recognize edges at a certain orientation, the visualization might show those edges from the input image
Visualising Features via Gradient based Localisation
Gradient-weighted class activation mapping
Attribution of local input importance for class
Attribution / Saliency Map
similar to Grad-CAM
Representation Reduction
How can you identify under- / overfitting?
Underfitting:
High Training Error: The model does not perform well even on the training data.
Simplistic Model: The model is too simple to capture the underlying structure of the data (high bias)
Close Training and Validation Error: Both errors are high, but they are relatively close to each other.
Overfitting:
Low Training Error: The model performs exceptionally well on the training data.
High Validation/Test Error: There is a significant drop in performance on the validation or test dataset compared to the training dataset.
Large Gap Between Errors: There is a substantial gap between training error and validation error, with training error being much lower.
Types of Sequence Learning + Examples
one to many (e.g., image captioning)
many to one (e.g., sentiment classification)
many to many (e.g., machine translation)
Types of RNNs (superficial)
Simple RNN
Previous activation adds context to the current activation
Examples: Elman network, Jordan Network
Fully Connected Recurrent Network
Often called auto-associator
Examples: Hopfield Network (binary), Boltzmann machine (stochastic)
One to Many - Vector to sequence (RNN)
Many to one - Sequence to Vector
Many to Many - Sequence to sequence
Representations
One-hot encoding
Simplest way to represent things in neural networks
one neuron to each concept/feature (Localist Representation)
Easy to understand
Easy to code by hand
used to represent inputs to a net
Easy to learn
Easy to associate with other representations or responses
One-hot encoding in machine learning and natural language processing contexts
localist models are inefficient whenever data has componential structure --> not enough neurons to code all possibilities
Softmax
desirable in classification: the output vector models a probability distribution over the classes
the network produces raw output scores, trained with a cross-entropy loss
these can be normalized with softmax so that the values are in [0,1] and add up to 1
Drawback: Computationally expensive for very large vectors (exp)
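A numerically stable softmax sketch (subtracting the max leaves the result unchanged but avoids overflow in exp):

```python
# Sketch: softmax mapping scores to values in [0, 1] that sum to 1.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # about [0.66, 0.24, 0.10]
```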
Distributed Representation
Using simultaneity to bind things together
Round, yellow fruit: one neuron ?
"Distributed representation" means a many-to-many relationship between two types of representation (such as concepts and neurons)
Each concept is represented by many neurons
Each neuron participates in the representation of many concepts
Example: How to distinguish from representing yellow circle and blue triangle
Word Embeddings
Millions of words: Need distributed representations!
Approach: Learning word embeddings:
Map words to continuous, lower dimensional vectors
Captures word meaning in the semantic space
Resulting word vector should contain linguistic context information, relating it to other words
Preprocessing Sequences in a Nutshell (Text, Speech/Sound)
Text
Consider character vs. word level
Cleaning: special characters, capitals
Stemming, e.g. PorterStemmer
Tokenization:
Character/word into atomic units
Build vocabulary over all units
Speech
Basic format: RAW, WAV, PCM signal
sampling frequency
bit depth
Conversions
e.g. STFT
Learning with backpropagation through time
Unfolding the network over time yields a deep feedforward network (in the example: 3 steps)
Then trained like usual BP
Vanishing/Exploding Gradient!
- How To Tackle Vanishing Gradient Problem
First RNN Constraint: Avoid Error Multiplication - LSTM (Gating)
LSTM avoids error multiplication by carrying the cell state additively through time, controlled by input, forget, and output gates
Activation of LSTM
Gated Recurrent Unit (GRU)
merges the cell state and hidden state
resulting in a more efficient model with fewer parameters than LSTMs
Second RNN Constraint - Multiplicity of Time
Third RNN constraint: No training in hidden layers
Define Markov Chains
mathematical system
hop from one "state" (a situation or set of values) to another.
Different states in a state space
probability of hopping from one state to any other state
Markov Chain gives: an initial state distribution and a matrix of transition probabilities between states
Examples of states in Markov Chains
Hidden Markov Models
A Markov chain over hidden states that cannot be observed directly; each state emits an observation according to an emission probability distribution
3 Main Problems of HMM
Evaluate with Forward Algorithm
Calculate how likely a sequence of observations is given a specific HMM.
Decode with Viterbi Algorithm
Determine the most likely sequence of hidden states that produced the observed sequence.
Learn with Baum-Welch Algorithm
Find the HMM parameters that maximize the probability of the observed sequence.
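The evaluation problem is small enough to sketch (numpy; A is the state-transition matrix, B the emission matrix, pi the initial state distribution):

```python
# Sketch: forward algorithm computing p(observations | HMM).
import numpy as np

def forward(obs, A, B, pi):
    alpha = pi * B[:, obs[0]]            # initialize with the first observation
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # propagate states, weight by emission
    return alpha.sum()                   # likelihood of the whole sequence
```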
What to do with Continuous Latent Variables ?
Discretise continuous data
vector quantization
Speech: phonemes
Visual phonemes: visemes
Define Embeddings
transform categorical, discrete, or high-dimensional data into continuous vectors of much lower dimensionality
these vectors are then used as inputs to ML models
Manifolds
Define Word-Embeddings
Dense, low-dimensional vector representations of words whose geometry reflects semantic similarity
Latent Semantic Analysis - How to derive word embeddings?
Build a word-document (or word-context) co-occurrence matrix and apply SVD; the reduced singular vectors serve as word embeddings
Word2Vec
Learns embeddings with a shallow network by predicting context words from a target word (skip-gram) or the target word from its context (CBOW)
GloVe
Learns embeddings by factorizing global word co-occurrence statistics
Data Augmentation
General Idea: replace empirical distribution with smoothed distribution
Approach: build automated augmentation in data loader
Transfer Learning
machine learning technique where a model developed for one task is reused as the starting point for a model on a second task
Transfer Learning vs. Fine-tuning - Different approaches
Fine-Tuning: continue training some or all pre-trained weights on the new task (vs. keeping the pre-trained features frozen as a fixed feature extractor)
Self-Supervised Pre-Training: learn representations from unlabelled data via pretext tasks (e.g., masked-token or next-token prediction)
Contrastive Learning: CLIP jointly trains an image encoder and a text encoder so that matching image-text pairs get similar embeddings and non-matching pairs do not
What does it mean to freeze weights and layers ?
prevent the updating of the parameters (weights and biases) of those layers during training.
done when applying transfer learning, where a pre-trained model is used as a starting point, and only some layers are fine-tuned for a new task.
ensures that the learned features from the pre-trained model are preserved while only the unfrozen layers are allowed to adjust and adapt to the new data.
Sequence Learning with 1D CNNs
sequences are fed into the network as a sliding window with a fixed width.
the network looks at a fixed number of elements at a time
in the example, each word is represented by a 6-dimensional vector
Convolution applies a filter or kernel to extract features from the sequence,
Pooling reduces the dimensionality and to capture the most salient features.
The output is a sentiment polarity, which is categorized into two classes after being processed through a fully-connected layer
What is the alignment problem in seq-to-seq learning (Example: Problem in Machine Translation)
the challenge of determining which words in the source language correspond to which words in the target language
involves aligning elements of two languages that have different structures and word order
Define the attention mechanism for seq 2 seq taks
allows the model to focus on different parts of the input sequence
when generating each part of the output sequence,
thereby improving the alignment between input and output elements in tasks like translation.
What are the attention mechanism basics ?
Keys: elements from the input sequence
Query: current element being processed in the output sequence
Values: representations from the input sequence that are used to construct the output
mathematically, attention is a weighted sum: output = Σ_i α_i v_i
For each query, the attention function computes a set of attention weights (α_i),
which are then used to create a weighted sum of the values (v_i).
The attention weights (α_i) are computed using a score function that measures how well each key corresponds to the current query.
done by using a softmax function to ensure the weights sum up to 1, giving a probability-like distribution over the keys.
The score function can be a separate feedforward neural network that is trained jointly with the rest of the model.
What is self-attention ?
to weigh the importance of different parts of the input when processing each word (or token) within the same sequence.
The goal is to learn weightings that capture how strongly the current word relates to the other words in its context within the sequence.
Soft Attention vs Hard Attention
Hard Attention:
Selects specific parts of the input data to focus on and ignores the rest completely
non-differentiable, meaning it doesn't allow the use of standard backpropagation methods for training
better performance because it's more focused
more challenging to train due to the need for alternative methods like reinforcement learning or Monte Carlo methods.
Soft Attention:
Weights all parts of the input data to varying degrees without completely ignoring any part.
differentiable, which allows the model to be trained using gradient descent.
Global Attention vs. Local Attention
Global attention
Considers all inputs of window-width for alignment score
Expensive
Local attention
Practical tradeoff between soft and hard attention
Alignment can be monotonic or predictive
Contextual Word Embeddings
Traditional Word Embeddings
context-free, meaning each word is given the same representation regardless of its meaning in context.
For example, "bank" would have the same vector representation in both "bank account" and "bank of a river."
Embeddings from Language Model (ELMo):
provides deep contextualized word representations.
considers the entire sentence to determine each word's embedding
uses a bidirectional LSTM (Long Short-Term Memory) model
processes text both from left to right and right to left, capturing information from the entire sentence.
context in which a word is used (its syntactic and semantic characteristics) influences the word's embedding.
allows to model complex characteristics of word use and how these uses vary across different linguistic contexts
Architectures for Sequence Processing
Define Transformers
neural network architecture that rely on self-attention mechanisms
to process sequential data in parallel and capture dependencies,
without relying on recurrent layers like in RNNs.
Architecture:
stacked encoders and decoders
encoders process the input sequence in parallel to produce a representation
decoders generate the output sequence from this representation
also using self-attention and attending to the encoder's output.
Define Positional Encoding
a vector that represents the position of words in a sequence to provide the model with information about the order of words.
needed to maintain the sequence information (word-order) which is vital for understanding language structure and meaning
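A sketch of the sinusoidal encoding from the original Transformer paper (one common choice; learned positional encodings are an alternative):

```python
# Sketch: PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(...).
import numpy as np

def positional_encoding(seq_len, d_model):   # d_model assumed even
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions
    pe[:, 1::2] = np.cos(angles)             # odd dimensions
    return pe                                # added to the token embeddings
```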
Define Multi-Head Attention
core feature of the Transformer model.
runs several attention processes in parallel (the 'heads'),
allowing the model to focus on different parts of the input sequence and capture various aspects of the information.
Scaled Dot-Product Attention:
Each attention head performs scaled dot-product attention: Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
this involves calculating the dot product of the query with all keys,
scaling the result by the square root of the dimension of the keys,
and applying a softmax to obtain weights on the values, producing the output.
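The same computation as a numpy sketch (single head, 2-D Q/K/V matrices assumed):

```python
# Sketch: Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
import numpy as np

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # query-key similarity
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                     # softmax over the keys
    return w @ V                                           # weighted sum of values
```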
Why is Multi-head Attention effective ?
- No recurrence needed
- Self-attention:
- Connects embedding & positional information
- Multiple heads:
- Learn different type of relations: structure-semantic
- Example: next-word, verb, subject
How to train/optimise Transformers ?
Unsupervised pre-training
Related to word2vec: Learn embedding into (lower) transformer blocks
Typical tasks: language modelling or sentence prediction for unsupervised corpus U, maximise likelihood
Supervised fine-tuning
Continue learning on downstream task (possibly fix k lower blocks)
Crucial modification: adapt token representation
Often the only feasible step for normal labs (with no massive TPU cluster)
How to use Transformers for Computer Vision
Image chopped into 16x16 patches, instead of filtering whole image (CNN)
Position embedding and different levels of relationships learned
Particular strength: emerging attention maps on different levels
Define BERT
Bidirectional Encoder Representations from Transformers
Transformer-based architecture, focus on encoder blocks
pre-trains on a large corpus to learn bidirectional representations of text
Contextual model
can then be fine-tuned for a variety of language tasks.
Fine tuning BERT on Different Tasks
Define GPT
Generative Pre-Training
uses transformer architecture, trained on a large corpus of text in an unsupervised manner
to generate human-like text by predicting the next word in a sequence given the words that come before it
Focus on transformer-decoder blocks
autoregressive, meaning they predict the next word based on the sequence of all previous words
Model Architecture:
consist of multiple transformer-decoder layers (12x to 48x indicates the number of layers; GPT-3 has 96 layers and 175 billion parameters).
They use masked self-attention, where each position can attend to all positions up to and including itself during training.
Describe a Text-to-Text Transformer
Focus on training data selection
massive SuperGLUE benchmarks (supervised)
randomly corrupted tokens (unsupervised)
Contextual model (Transformer based)
treats all tasks as text-to-text problems, where input text is transformed into output text using a seq2seq Transformer model
Describe ChatGPT training process
1. Generative pre-training of a GPT model on large text corpora
2. Supervised fine-tuning on human-written demonstration dialogues
3. Reinforcement learning from human feedback (RLHF): train a reward model from human preference rankings, then optimize the policy against it (e.g., with PPO)
Strength and Limitations of Transformers
Transformers are sophisticated pattern matching machines
Best performing embedding for many downstream tasks
Continuous processing, Parallelisation, long memory
Currently best performances on many NLP and CV problems
Successfully deployed in Google Search,...
Criticism is vast:
(Research) competition only possible for big-tech companies
Computationally expensive training
Need for vast training data
Does only work with such vast data
Does not "understand" natural language
Training data is full of bias
“GPT-3 is a better bullshit artist than its predecessor, but it's still a bullshit artist.” – Gary Marcus
“Focusing on raw computing power misses the point entirely […] We don't know how to make a machine really intelligent - even if it were the size of the universe.“- Stuart Russell
What is the key component of a transformer ?
self-attention mechanism
allows the model to weigh the significance of different parts of the input data differently and is crucial for capturing the context within sequences.
Define Inference
General
“inference is a conclusion that you draw about something by using information that you already have about it.”
compute the probability distribution over one set of variables given another
ML Context
inference often refers to the process of estimating or concluding about the posterior distribution of a latent variable Z, given observed data X.
Motivation behind approximation
What are variational methods ?
The objective is to identify a function that achieves a specific goal, such as minimizing a cost function or maximizing entropy.
What is Kullback-Leibler Divergence ?
measure of how one probability distribution diverges from a second, expected probability distribution
used for variational inference
Goal: to approximate the true posterior distribution `p(Z|X)` with a simpler distribution `q(Z)`
Given two Gaussian Probability Density Functions (PDFs): p(x) and q(x).
These curves represent two different distributions.
In variational inference, p(x) could represent the true distribution of data, while q(x) represents the approximating variational distribution.
The shaded area represents the integral of the KL Divergence across the range of values.
It quantifies the difference between the two distributions.
The divergence is calculated using the formula: KL(p||q) = ∫ p(x) log( p(x) / q(x) ) dx
Goal of Kullback-Leibler Divergence
quantify how much information is lost when the approximation q is used in place of the true distribution p; variational inference minimizes this divergence over q
Asymmetry of KL
min KL(p||q)
represents minimizing the KL Divergence where the true posterior (p) is the first argument.
known as the "forward" or "inclusive" KL Divergence.
tends to produce an estimate (q) that covers all the modes of the true posterior but may not capture them accurately;
it is zero-avoiding: it forces the estimate to assign probability everywhere the true distribution has probability mass
min KL(q||p)
represents minimizing the KL Divergence where the estimate (q) is the first argument.
known as the "reverse" or "exclusive" KL Divergence.
tends to produce an estimate that captures one mode of the true distribution very accurately but may ignore other modes;
it is zero-forcing: it ensures the estimated distribution places no probability mass where the true distribution has none
choice between minimizing KL(p||q) versus KL(q||p) has a significant impact on the behavior of the estimation process
affects the approximation of the true posterior distribution and can lead to different estimates
which might be more suitable for different applications depending on the desired outcome
(e.g., capturing all modes versus focusing on the most significant mode)
Properties of KL divergence
Positive (or zero): always greater than or equal to zero
Monotone: KL Divergence does not decrease as the probability mass function of q moves away from p.
Additive for Independent Distributions: If p and q represent independent distributions, then the divergence of their product is the sum of their divergences.
Not Symmetric: KL Divergence is not symmetric
Sensitive to Change of Scale: The measure changes if the scale of the probability distributions changes.
Integral or Sum Form: KL Divergence can be expressed as an integral for continuous distributions or a sum for discrete distributions.
Quantifies Information Gain: KL Divergence measures the amount of information gained by transitioning from one distribution to another, often interpreted in the context of how much information is lost when a distribution q is used to approximate another distribution p.
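A small sketch with made-up probability vectors that makes the positivity and asymmetry properties concrete for discrete distributions:

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as probability vectors."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                       # terms with p(x) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
print(kl_divergence(p, q))   # differs from kl_divergence(q, p): not symmetric
print(kl_divergence(q, p))
print(kl_divergence(p, p))   # 0.0: a distribution has zero divergence from itself
```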
Understand Expectation Notation
Describe Evidence Lower Bound (ELBO)
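In the usual notation, with observed data X, latent variables Z, and an approximating distribution q(Z), the log evidence decomposes as
$$ \log p(X) = \underbrace{\mathbb{E}_{q(Z)}\big[\log p(X, Z) - \log q(Z)\big]}_{\text{ELBO}} + \mathrm{KL}\big(q(Z)\,\|\,p(Z \mid X)\big) $$
Since the KL term is non-negative, the ELBO is a lower bound on $\log p(X)$; maximizing the ELBO over q is therefore equivalent to minimizing the KL divergence between q and the true posterior.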
Optimization Goals for approximating distributions / variational Inference
What is the Mean Field Theory ?
simplifies the problem by assuming that the unknown distribution can be factorized into M disjoint groups,
$q(Z) = \prod_{i=1}^{M} q_i(Z_i)$,
i.e. the distribution over the latent variables Z is approximated by a product of independent distributions $q_i(Z_i)$, one for each group of variables
Steps for Applying Mean Field Theory
Identify the True Distribution: Begin with the given distribution $p(z)$ that we wish to approximate.
Assume Independence: Postulate that the complex distribution can be represented as a product of simpler, independent distributions for each variable.
Factorize the Distribution: Construct a factorized distribution q(z) as the product of the individual independent distributions for each variable.
Measure Divergence: Apply KL Divergence to assess the deviation of the factorized distribution q(z) from the true distribution p(z).
Iterate for Better Approximation: Modify the factorized distribution by iteratively considering the dependencies, which can lead to a closer approximation of the true distribution.
Attention
Known Distribution: Initially, only the true distribution p(z) is known, and the goal is to approximate it.
No Gaussian Assumption for q: The approximation does not need to assume that q is Gaussian; this is an outcome of the approximation, not a precondition.
Coupled Estimates: The estimates for the individual distributions in q(z) are interconnected, requiring iterative adjustments to refine the estimates and improve the approximation.
Steps in Variational Inference
Define Autoencoder
neural network model used for unsupervised learning
aiming to learn a compressed representation of the input data
by approximating an input x with a reconstruction r
Explain the Autoencoder components
Encoder: maps the input x to a low-dimensional latent code z (the bottleneck)
Decoder: maps the latent code z back to a reconstruction r of the input
trained with a reconstruction loss (e.g. MSE) between x and r
Why care about latent space ?
Compact representation
Dimensionality reduction
"finding simpler representations"
information retrieval
Ideally approximates the real distribution of the observed data
When is an Autoencoder equal to PCA ?
when encoder and decoder are linear (a single bottleneck layer, no non-linear activations) and the loss is the mean squared error; such an autoencoder learns the same subspace as PCA
Variants of Autoencoders
Applications of Autoencoders
Feature extraction
Denoising
Inpainting
Segmentation
Define Variational Autoencoder
A variational autoencoder is a
Generative Autoencoder: replaces the deterministic latent code z with a stochastic sampling operation
Directed model that approximates inference, i.e. the distribution in latent space
Process of Variational Autoencoder
- What is the reparameterization trick ?
the stochasticity is separated from the parameters: instead of sampling $z \sim \mathcal{N}(\mu, \sigma^2)$ directly, one draws $\epsilon \sim \mathcal{N}(0, I)$ and sets $z = \mu + \sigma \odot \epsilon$
enabling the use of backpropagation, since the computation from $\mu$ and $\sigma$ to $z$ is now deterministic
- Why need the reparameterization trick ?
Problem: backpropagation cannot pass through a sampling node, since gradients are not defined through a stochastic sampling operation.
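A minimal NumPy sketch of the trick for a diagonal Gaussian; the encoder outputs here are made-up values:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])        # illustrative encoder outputs
log_var = np.array([0.0, -2.0])

# Without the trick: z = rng.normal(mu, sigma) blocks gradients to mu, sigma.
# With the trick: draw parameter-free noise, then apply a deterministic transform.
eps = rng.standard_normal(mu.shape)       # eps ~ N(0, I), independent of parameters
z = mu + np.exp(0.5 * log_var) * eps      # z ~ N(mu, diag(exp(log_var)))
# z is now a deterministic function of (mu, log_var) given eps, so
# gradients w.r.t. mu and log_var are well defined.
```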
How to train a variational autoencoder ?
Define encoder and decoder architectures (e.g. depth)
Feed image(s) through encoder
Sampling
Compute loss
Update parameters
repeat
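A sketch of the "compute loss" step above, assuming a diagonal Gaussian encoder, a standard normal prior, and an MSE reconstruction term (binary cross-entropy is also common for images); PyTorch-style, names illustrative:

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, log_var):
    """ELBO-based loss: reconstruction term + KL(q(z|x) || N(0, I))."""
    recon = F.mse_loss(x_recon, x, reduction="sum")             # reconstruction error
    # Closed-form KL of a diagonal Gaussian against a standard normal prior:
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl
```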
Applications of VAE
Image Generation: VAEs can generate new images that resemble a training dataset, which is useful in art creation, design, and entertainment.
Anomaly Detection: By learning to represent normal data, VAEs can identify anomalies or outliers in datasets, which is valuable in fields like fraud detection or fault diagnosis.
Drug Discovery: VAEs help in generating molecular structures by learning the distribution of molecular data, thereby aiding in the discovery of new drug candidates.
Feature Extraction and Dimensionality Reduction: VAEs are used to learn lower-dimensional representations of data, which can serve as feature vectors for other machine learning tasks.
Semi-Supervised Learning: They can be employed in scenarios where only a small subset of data is labeled, leveraging the unlabeled data to improve learning efficiency.
Reinforcement Learning: In reinforcement learning, VAEs can learn to encode states and rewards, assisting in the creation of more efficient and generalizable policies.
Text Generation: VAEs can also be adapted for generating coherent and diverse text, and are used in natural language processing for tasks like dialogue generation and machine translation.
Speech Synthesis: They are used in generating human-like speech from text or other forms of data, which is useful in virtual assistants and other speech-based interfaces.
Style Transfer: VAEs can learn the style of one dataset and apply it to another, which is popular in image and video editing applications.
Interpolation: Because VAEs learn smooth latent representations, they can interpolate between data points to create transitions, such as morphing one image into another.
What is the β-VAE ?
Idea: Learn disentangled representations by weighting the KL divergence term in the loss function by a factor β.
By manipulating β, the model can be tuned to focus more on the latent space factorization (β>1)
or on reconstruction fidelity (β=1).
Example: faces generated by β-VAE vs. a standard VAE.
With a high β (250), β-VAE learns to disentangle the rotation of the face images, meaning that changes in a single latent variable correspond to changes in rotation only.
In contrast, the standard VAE with β=1 does not disentangle features as clearly, resulting in changes in identity and expression as well.
VAEs can yield blurry reconstructions
a higher value of β can lead to even blurrier images
as the model prioritizes learning disentangled representations over reconstructing high-fidelity images.
VAEs with constraints
The loss function for these VAEs includes
the standard reconstruction loss ($L_{reconstruct}$), which ensures the output closely resembles the input.
a prior distribution over the latent variables, which encourages their distribution to match some predefined prior.
a flow term may refer to an additional constraint, possibly from normalizing flows, which are a method for improving the flexibility of the approximate posterior distribution in VAEs.
Facial Expression Manipulation Task:
practical application of this constrained VAE framework
Loss Function Terms:
The terms λ1 and λ2 are hyperparameters that weigh the contribution of the respective terms in the loss function
allowing for balancing between accurate reconstruction, adherence to the prior, and the additional flow constraint.
What can generative models be used for ?
Density estimation
Outlier detection
Synthetic data
Data augmentation
Missing data
Semi-supervised learning
De-bias dataset
Latent spaces
Lower dimensional representation
interpolation
Multi-modal outputs
Simulate possible futures for planning or for simulated RL
What generative models do you know ?
Gaussian Mixture Models
Variational Autoencoders
Define GAN
GANs consist of two neural networks that compete against each other in a game theoretic scenario:
Discriminator (D)
distinguishes between real data samples and fake data samples generated by the Generator.
trained to output the probability that a given sample is real.
Generator (G)
This network generates new data samples from random noise input z, which follows a probability distribution p_z(z)
usually a Gaussian distribution N(0,I) where I is the identity matrix.
Data flow in a GAN
Real data samples are fed to the Discriminator
The Generator takes in random noise and outputs fake samples
The discriminator then assesses both real and fake samples and assigns a probability to the likelihood of each sample being real rather than fake
Objective Function of GAN
$$ \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big] $$
the Discriminator tries to maximize V (classify correctly), while the Generator tries to minimize it (fool the Discriminator).
Why does VAE yield blurry images but GANs do not ?
VAEs
VAEs learn an "explicit" distribution of the data,
which involves directly modeling the probability density of the latent space
It has two objectives: minimizing the reconstruction error and keeping the latent distribution close to a predefined prior
when adherence to the prior is weighted heavily, for example via the β-parameter, this conflicts with learning an exact reconstruction
—> blurry images
Generative Adversarial Networks (GANs):
GANs learn an "implicit" distribution of the data.
There is no explicit density estimation, which means GANs can focus more on producing sharp, high-quality images
because they are not penalized for deviations in an explicit probability space, unlike VAEs
Bernoulli - GANs
How to train a GAN ?
Training iterations: The training process involves several iterations.
Training the Discriminator:
A minibatch of noise samples is drawn from a noise prior p(z) and passed to the generator.
The generator produces a minibatch of fake data samples from the noise samples
A minibatch of real data samples is drawn from the data generating distribution p_data(x).
Update the discriminator by ascending its stochastic gradient: This step involves feeding both real and fake data into the discriminator and adjusting its parameters to maximize the probability of correctly classifying real and fake data.
Training the Generator:
Again, a minibatch of noise samples is drawn from the noise prior p(z).
The generator uses this noise to produce a minibatch of fake data samples.
These fake data samples are then passed to the discriminator, which classifies them as real or fake.
The generator is updated by descending its stochastic gradient: the feedback from the discriminator (the probability of the fake data being real) is used to update the generator's weights, encouraging it to produce data that the discriminator will classify as real.
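Putting the two phases together, a compact PyTorch-style sketch of one training iteration; G, D, and the optimizers are assumed to be defined elsewhere, D outputting a probability per sample, and the common non-saturating generator loss is used instead of literally descending log(1 − D(G(z))):

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_g, opt_d, real, z_dim=64):
    """One training iteration for an unconditional GAN (illustrative sketch)."""
    b = real.size(0)
    ones, zeros = torch.ones(b, 1), torch.zeros(b, 1)

    # --- Discriminator: maximize log D(x) + log(1 - D(G(z))) ---
    z = torch.randn(b, z_dim)
    fake = G(z).detach()                     # do not backprop into G in this phase
    loss_d = (F.binary_cross_entropy(D(real), ones)
              + F.binary_cross_entropy(D(fake), zeros))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # --- Generator: maximize log D(G(z)) (non-saturating variant) ---
    z = torch.randn(b, z_dim)
    loss_g = F.binary_cross_entropy(D(G(z)), ones)
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```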
Iteration Example
Leftmost Panel: The initial state where the discriminator can easily distinguish between the data (solid blue line) and the generator's output (dotted green line).
Second Panel: The generator improves, and its distribution starts to overlap with the data distribution.
Third Panel: The generator gets even better, further overlapping with the data distribution, making it harder for the discriminator to differentiate.
Rightmost Panel: The discriminator's decision boundary (D(x) = 0.5) is shown where it is now uncertain about half of the generated data, indicating the generator has improved significantly.
Advantages and Disadvantages of GANs
Loss function:
Competition between networks is sole training criterion
unsupervised
No approximation of inference needed
Can represent sharp, even degenerate distributions
Interpolation in latent space = interpolation between samples
If max_D V(G,D) is convex, the procedure is guaranteed to converge
Disadvantages
No explicit representation of distribution
Only generator, no encoder, hence unknown latent code
little data typically leads to discriminator overfitting
Difficult to train
Mode collapse
No established metric for the quality of generated samples
Quality Metrics for GANs
Variants of GANs
Embedding Process in GANs (Image to latent code)
Define Diffusion models
latent variable models
work with latent variables instead of observed variables
use the latent variables to generate data that resembles the original data
Dimensionality of latent variables
have the same size and shape as the data you're working with.
When generating images, latent variables will be structured like images
Training of Diffusion Models
Forward Process (Diffusion)
model starts with your data (e.g. an image) and gradually adds random noise to it, step by step, until the data turns into pure noise
Reverse Process (Reconstruction)
the model learns how to reverse the process, essentially learning how to remove the noise step by step to reveal the original picture.
This is the actual generation process — it's like having a noisy image and cleaning it up to get a clear picture.
Training the Model
to train this model, you don't need to know exactly how to go from the noisy image back to the original one directly.
Instead, you use something called the variational lower bound, a technique from statistics that helps you estimate the reverse process.
It's like guessing the steps needed to clean up the foggy picture without knowing the exact way it was fogged up.
Explain the Forward Process in Diffusion Models
starting from a data sample $x_0$, Gaussian noise is added over $T$ steps according to $q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I\big)$, where the variance schedule $\beta_t$ controls how much noise is added at step $t$; after many steps $x_T$ is approximately pure noise.
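Because every forward step is Gaussian, $x_t$ can be sampled from $x_0$ in closed form, $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon$ with $\bar\alpha_t = \prod_{s \le t}(1-\beta_s)$. A small NumPy sketch; the schedule and toy data are illustrative:

```python
import numpy as np

def diffuse(x0, t, betas, rng):
    """Sample x_t directly from x_0 via the closed form of the forward process:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps,  eps ~ N(0, I)."""
    alpha_bar = np.cumprod(1.0 - betas)      # alpha_bar_t = prod_{s<=t} (1 - beta_s)
    eps = rng.standard_normal(x0.shape)      # fresh Gaussian noise
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

betas = np.linspace(1e-4, 0.02, 1000)        # an illustrative linear noise schedule
x0 = np.ones((8, 8))                         # toy "image"
x_noisy = diffuse(x0, t=500, betas=betas, rng=np.random.default_rng(0))
```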
Explain the Reverse Process in Diffusion models
a neural network learns to denoise step by step: $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)\big)$, i.e. the model predicts the parameters of each denoising step, starting from pure noise $x_T$ and ending at a data sample $x_0$.
Extensions of Diffusion models
Diffusion Probabilistic Models (2015): The foundational concept of diffusion models was introduced, setting the stage for subsequent developments.
Denoising Diffusion Probabilistic Models (DDPM, 2020): A specific type of diffusion model that focuses on generating images by denoising, but it has the drawback of being slow in image generation.
Variational Diffusion Models (VDM, 2021): These models were introduced to speed up the optimization process. They prioritize optimizing the likelihood of data over the quality of the samples generated, aiming for faster training times.
Denoising Diffusion Implicit Models (DDIM, 2021): A variation of DDPM that uses non-Markovian diffusion processes (meaning the process does not strictly follow a memoryless Markov property) while retaining the same training objectives. DDIM models are significantly faster (10 to 50 times) than DDPM and allow for semantically meaningful interpolations in latent space.
Latent Diffusion Model (LDM, 2022): This model performs the diffusion process in a compressed latent space rather than in the pixel space, which can be computationally more efficient and can potentially generate higher quality images.
Classifier Guided Diffusion: An extension of diffusion models where guidance is provided by incorporating knowledge about different classes into the diffusion process. This can help in generating images that are more aligned with specific classes, enhancing control over the generation process.
Which component in a diffusion model is a Markov Chain ?
Forward Process as a Markov Chain:
each step of adding noise to the data depends only on the state of the data from the immediate previous step.
gradually transforms the data into a pure noise state over a series of steps.
In each step, the future state $X_{t+1}$ is conditionally independent of past states given the current state $X_t$,
which is the defining characteristic of a Markov chain.
Reverse Process as a Markov Chain:
each denoising step depends only on the state of the noisy data from the immediate previous step.
The model learns to reverse the noise-adding process by predicting the clean data state at each step, conditioned only on the current noisy state.
The goal is to ultimately reconstruct the original data from the noise.
What are elements of a Graph ?
Vertices or Nodes: points on a graph where lines intersect or end.
Edges and Directions: lines that connect the vertices. They can have directions, which means they point from one vertex to another.
Universal/Global Attributes: These are the characteristics or properties that apply to the entire graph.
Vector Representation:
each of the elements mentioned above can be represented by a vector
a vector has magnitude and direction.
What are different types of Graphs ?
Euclidean Graph
structured graph where the positions of the vertices are fixed and regularly spaced, much like a grid. (structure and neighbourhood in euclidean space)
In Euclidean space, the placement of vertices and edges is based on geometry, which means they have a consistent, ordered arrangement. (Geometric alignment)
Arbitrary Graph (Non-Euclidean)
does not have a regular structure.
vertices and edges are arranged in a complex pattern and can be irregular, meaning they don’t follow a predictable spacing or pattern.
Relationships between nodes in this graph can be non-linear, which means they don’t just move in straight lines or predictable curves.
Data from a Graph perspective
Data
represented by a graph with nodes (or vertices) connected by edges
nodes are colored, suggesting they have values or attributes associated with them.
Signal
represents the attributes or features of the nodes.
shown as separate nodes with colors, but no connecting edges.
like having different pieces of information or data points without yet understanding how they are connected.
Structure
the connections between nodes
it represents only the relationships or how each piece of data is related to another.
the framework or scaffold without the specific details (features) filled in.
When combined, the signal and structure make up the data in the graph.
data in a graph is made up of the individual pieces of information (signal) and the way those pieces are connected or related (structure).
Both are crucial for algorithms to learn patterns and make predictions.
Graph structures
Image Pixels
Each square is a pixel with coordinates like (0-0, 0-1, etc.).
The color intensity can represent different features such as brightness or color in the image.
Graph
Image pixels can be converted into a graph format.
Each pixel is now a node (or vertex) in the graph, and the nodes are connected with edges.
The edges might represent the adjacency or relationship between pixels, such as proximity in the image.
Certain nodes are highlighted, indicating special features or importance, like a key part of the image that may be the focus of analysis.
Adjacency Matrix
a square matrix used to represent a finite graph.
elements of the matrix indicate whether pairs of vertices are adjacent or not in the graph.
a blue square indicates an edge between two pixels (nodes), and a white square indicates no edge.
provides a numerical way to encode the structure of the graph, which can be processed by algorithms.
What is an adjacency matrix ?
table that shows which nodes (points in the graph) are connected to which. The matrix has rows and columns labeled by graph vertices.
can be "sparse" which means that most connections are absent (most of the matrix is zero),
can be "banded" indicating that only adjacent nodes are connected (so non-zero values appear in a band or stripe pattern).
can also represent the "weight" of the connections, meaning it shows not just if two nodes are connected, but also how strong or important that connection is.
What is an Adjacency List ?
a more "condensed" way of representing a graph that takes up less space when the graph has fewer connections.
for each node, you have a sublist of other nodes that it's connected to.
It's easy to add new nodes to this representation because you just add another list.
The weights of the relationships can be included as a third value in the sublists.
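A small sketch contrasting the two representations for the same toy undirected graph:

```python
import numpy as np

# Undirected toy graph with 4 nodes and edges 0-1, 0-2, 2-3
edges = [(0, 1), (0, 2), (2, 3)]
n = 4

# Adjacency matrix: dense n x n, entry 1 where an edge exists (mostly zero = sparse)
A = np.zeros((n, n), dtype=int)
for i, j in edges:
    A[i, j] = A[j, i] = 1            # symmetric because the graph is undirected

# Adjacency list: per-node list of neighbours; compact for sparse graphs
adj_list = {v: [] for v in range(n)}
for i, j in edges:
    adj_list[i].append(j)
    adj_list[j].append(i)
# adj_list == {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
```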
Geometric Learning Principles: Need Geometric Priors
principles of geometric learning, specifically the need for geometric priors, which are assumptions or inbuilt knowledge about geometry in the learning models.
Geometric Learning for Learning Stable Representations
Graph Neural Networks
GNN - Basic Example
Tasks for GNNs
Convolutional on Graphs
Convolutional Graph Neural Networks
Pooling on Graphs
Convolutional GNNs or Graph Convolution Networks (GCN)
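For reference, the widely used GCN propagation rule from Kipf & Welling (2017) is
$$ H^{(l+1)} = \sigma\!\big( \hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} H^{(l)} W^{(l)} \big), \qquad \hat{A} = A + I, $$
where $\hat{A}$ is the adjacency matrix with added self-loops, $\hat{D}$ its degree matrix, $H^{(l)}$ the node features at layer $l$, and $W^{(l)}$ a learned weight matrix.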
Graph AutoEncoder
Message passing in Graph networks
Define Domain Adaptation
In many cases, source domain and target domain are different
Example: graphical images vs. real photos
Example: Movie reviews vs. product reviews
Goal: Fit model to source domain, then modify parameters to be compatible with target domain —> domain adaptation
Definition: Vast but task-specific field concerned with how a classifier can learn from a source domain and generalize to a target domain
- Define Multi-Task Learning
simultaneously train one model on several tasks / different special target domains
- How do Multi-Task Learning models work ?
Learning with Multiple Heads
model has multiple outputs, each corresponding to a different task
The same underlying model (with shared parameters) is used
it can handle different types of input and output, depending on the task
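A minimal PyTorch sketch of such a multi-headed model; the task names and layer sizes are made up:

```python
import torch
import torch.nn as nn

class MultiHeadModel(nn.Module):
    """Shared trunk with one output head per task (illustrative sketch)."""
    def __init__(self, in_dim, hidden, task_dims):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())  # shared parameters
        self.heads = nn.ModuleDict(
            {name: nn.Linear(hidden, out_dim) for name, out_dim in task_dims.items()}
        )

    def forward(self, x, task):
        return self.heads[task](self.trunk(x))   # route through the requested head

model = MultiHeadModel(in_dim=16, hidden=32, task_dims={"classify": 10, "regress": 1})
y = model(torch.randn(4, 16), task="classify")   # shape (4, 10)
```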
Key Questions
Conditioning the Model for Individual Tasks
How do we design the model so that it can handle different tasks effectively?
Should we have completely separate parts with no shared parameters,
or should we use a 'multi-headed' approach where there is a common core followed by task-specific layers?
Forming the Objective
How should the goal of the learning process be defined when there are multiple tasks?
Should we prioritize some tasks over others (weighted),
or should the model learn how to balance the tasks on its own (dynamically)?
Optimizing the Objective
How do we adjust the model to meet this objective?
One approach mentioned is to use mini-batches, which are small subsets of data.
We can either sample a mini-batch of different tasks
or a mini-batch of datapoints for each task during the training process.
Multi-task Learning Challenges
Negative Transfer
learning one task adversely affects the performance on another.
Instead of the knowledge from one task helping another, it hinders it.
Cross-task interference:
interference between tasks can lead to negative transfer
the model incorrectly applies what it has learned from one task to another.
Meta-parameter differences
Different tasks may require different settings for their meta-parameters (like learning rate, number of layers, etc.)
finding a set that works for all tasks can be challenging.
Limited representational capacity
The model might not have enough capacity to learn all the tasks effectively, which can lead to a decrease in performance.
Small Data Sizes & Overfitting
model learns the details and noise in the training data to an extent that it negatively impacts the model's performance on new data.
sharing more data among tasks can act as a form of regularization, which helps to prevent overfitting by encouraging the model to learn more general patterns that apply across tasks.
How to Choose Task Combinations?
Deciding which tasks to combine in a multi-task learning scenario is not straightforward, especially if there are many potential tasks.
Compute "Inter-task affinities"
evaluate how related or beneficial the tasks are to one another, to inform the combination of tasks
a network with connections between tasks such as segmentation, keypoints, edges, normals, and depth, likely representing the inter-task relationships.
Multi-task Learning Generalisation
General Goal
create models that can generalize well to new, unseen data.
the model should be able to make accurate predictions or decisions on data it hasn't encountered during training.
Domain Generalization
concept of training a model on multiple domains
The goal is to achieve low loss on new data distribution, which means the model should make as few errors as possible on the new domain it's tested on.
Mathematical Formulation
the objective is to find the function/parameters minimizing the expected loss across all training tasks, e.g. $\theta^* = \arg\min_\theta \sum_{i} \mathcal{L}_i(\theta)$.
way to formalize the learning process to make sure the model performs well on new data (domain generalization).
This is further illustrated with a formula that combines the losses from all tasks during training to achieve the best performance on the test domain.
Model diagram
a general model with a shared parameter set (θ) that branches out into multiple "task heads".
Each head ψ corresponds to a different task the model has learned during training.
There is also a "task head" for the test domain (ψ_{test}), which indicates that the model is trying to apply what it has learned to a new domain.
How to Learn
meta-learning is a strategy to improve learning
"learning to learn"
involves designing models that can learn new tasks with minimal data by effectively leveraging past knowledge.
higher-level learning process where the model is not only learning the tasks but also learning how to adapt to new tasks more efficiently.
Define Meta-Learning
Basic Learning
standard process of training a machine learning model on a dataset to learn a specific task, like distinguishing between images of dogs and otters.
Meta Learning
Meta learning goes beyond basic learning.
the algorithm itself learns how to adapt to new tasks quickly with minimal data.
involves training on a variety of "special target domains" (different types of tasks or data, like images of various categories)
and then testing the model's ability to apply what it has learned to a new, unseen domain.
Gradient-based: Model-Agnostic Meta-Learning (MAML)
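MAML's two-level update (Finn et al., 2017): an inner gradient step adapts the shared initialization $\theta$ to each task $\mathcal{T}_i$, and the outer (meta) step moves $\theta$ so that these adapted parameters perform well:
$$ \theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(f_\theta), \qquad \theta \leftarrow \theta - \beta \nabla_\theta \sum_i \mathcal{L}_{\mathcal{T}_i}(f_{\theta_i'}) $$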
Meta Learning - Different approaches
Define Few-Shot Learning
technique where the model is designed to learn from a very small amount of data, referred to as "few shots".
In the context of meta-learning, few-shot learning refers to the model's ability to adapt to new tasks with very limited data.
Zero-shot learning is mentioned as a related concept where the model makes predictions for tasks without having seen any labeled examples at all.
C-way N-shot Terminology
"C-way" refers to the number of classes (distinct categories) involved in a task.
3-way" would mean there are three different classes.
"N-shots" indicates the number of examples per class.
"2-shot" means there are two examples for each class.
if tasks do not share classes, the learning process forces the model to focus on features that are specific to each class, rather than relying on generic features or shortcuts (like the background of images).
Define Reinforcement Learning
type of machine learning where a program (often called an agent) learns to make decisions by trying out actions and seeing what happens
much like how a person learns to play a new game by trying different moves and remembering which ones worked well (trial and error)
Variables of RL
State S_t
situation the program finds itself in at any given moment.
In a game, this could be the arrangement of pieces on a board.
The state is what the program observes and uses to decide what to do next.
sometimes the program doesn't get to see everything (partial observation o_t) and must make the best guess.
Action a_t(S_t)
Based on the state, the program decides to take an action.
For example, in a game like chess, an action would be moving a pawn.
The action depends on the current state of the game.
Reward r_t:
After the program takes an action, it gets a reward based on how good that action was.
This is like getting points in a game. The reward helps the program understand if the action it took was beneficial or not.
Policy π_θ:
the strategy the program uses to decide which actions to take as it goes along.
Think of it as the program's game plan or set of rules it follows to try to win or achieve its goal.
Goal of RL
Goal
aim is for the program to select actions that will give it the most reward not just immediately but in the long term too.
It's like planning several moves ahead in a game instead of just the next move.
Actions and Consequences
Actions the program takes can have long-term effects,
which means a good action now could lead to more rewards later on.
Delayed Reward
Sometimes, the reward for an action isn't immediate.
The program might have to wait a while to see if an action was truly good.
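Formally, the standard objective is to maximize the expected (discounted) return,
$$ \max_\theta \; \mathbb{E}_{\pi_\theta}\Big[ \sum_{t=0}^{T} \gamma^t r_t \Big], \qquad 0 \le \gamma \le 1, $$
where the discount factor $\gamma$ trades off immediate against delayed rewards.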
Advantages & Disadvantages of RL
Partial target information
can learn effectively even when they don't have full information about the outcome of their actions.
They're designed to work with incomplete knowledge about the environment.
Good in low-dimensional (low-D) tasks:
effective in problems where the number of factors to consider (dimensions) is relatively small,
which can include many real-world scenarios.
Disadvantages of RL
Sample inefficiency:
need a lot of trial-and-error before learning how to do something well
This means they require a lot of data (or samples) from the environment, which can be inefficient.
High variance and instability:
The measures of performance (like gradients, which show the direction to improve) can be inconsistent (high variance).
This makes the training process unstable and challenging.
Different Approaches in RL
Value-based Reinforcement Learning
Q-Function
SARSA-Algorithm
Q-Learning
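For reference, the standard tabular update rules behind these two headers: SARSA is on-policy (it bootstraps from the action actually taken next), Q-Learning is off-policy (it bootstraps from the greedy action):
$$ \text{SARSA:}\quad Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_t + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \big] $$
$$ \text{Q-Learning:}\quad Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \big] $$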
Policy Gradient Reinforcement Learning
Deep Reinforcement Learning Task
Deep Q-Network
Model-based RL
Meta Reinforcement Learning
How is reinforcement learning different from (un)supervised ML ?
RL learns from interactions, optimizing actions based on rewards over time, and focuses on making a sequence of decisions
SL learns from labeled examples with immediate error correction
UL is about finding structure in data without labels or rewards