ML

by Alexander R.

Reflect on Bayesian regression for linear models

  • incorporates prior beliefs about parameters

  • updates these beliefs after observing data.

  • helps to regularize the solution, preventing overfitting.

1. Define the Prior

  • Establish the prior distribution for the parameters, typically a Gaussian distribution, `p(w) = N(w|m0, S0)`.

  • `m0` and `S0` represent the mean and covariance of the prior belief about the parameters.

2. Formulate the Likelihood Function

  • The likelihood function `p(T|X, w, β)` describes the probability of observing the target values `T` given the input features `X`, parameters `w`, and noise precision `β`.

3. Determine the Posterior Distribution

  • Bayes' theorem is used to calculate the posterior distribution `p(w|T, X, β)`.

  • This updates our beliefs about the parameters after observing the data.

  • The posterior is proportional to the product of the likelihood and the prior.

4. Compute the Posterior Mean and Covariance

  • Calculate the mean of the posterior distribution using the formula `mN = SN(S0^(-1)m0 + βΦ^T T)`.

  • Determine the covariance of the posterior distribution with `SN = (S0^(-1) + βΦ^T Φ)^(-1)` (see the numerical sketch after this list).

5. Interpret the Posterior as a Normal Distribution

  • the posterior is still a Gaussian distribution with updated mean `mN` and covariance `SN`.

  • the peak of the posterior distribution is the most probable estimate of the parameters (MAP estimate), which is `w_MAP = mN`.

6. Role of the Conjugate Prior

  • a conjugate prior is chosen so that the posterior distribution remains in the same family as the prior, simplifying calculations.

7. Role of Evidence

  • The evidence `p(T|X, β)` acts as a normalizing constant that does not depend on `w`, so it can be ignored when we are only interested in the parameter values that maximize the posterior.

8. Benefits of Bayesian Linear Regression

  • Bayesian regression accounts for uncertainty in parameter estimates.

  • it prevents overfitting by incorporating prior knowledge and updating beliefs in a principled manner.
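
A minimal numerical sketch of steps 1-5 above, assuming a toy polynomial design matrix Φ and made-up values for the prior (`m0`, `S0`) and the noise precision `β`; everything except the two posterior formulas is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=20)
T = np.sin(2 * np.pi * X) + rng.normal(0, 0.2, size=20)   # synthetic targets

Phi = np.vander(X, 4, increasing=True)   # design matrix Φ (cubic polynomial features)
beta = 25.0                              # assumed noise precision
m0 = np.zeros(4)                         # prior mean
S0 = np.eye(4) / 2.0                     # prior covariance (alpha = 2)

# Posterior covariance: SN = (S0^-1 + β Φ^T Φ)^-1
SN = np.linalg.inv(np.linalg.inv(S0) + beta * Phi.T @ Phi)
# Posterior mean (and MAP estimate): mN = SN (S0^-1 m0 + β Φ^T T)
mN = SN @ (np.linalg.inv(S0) @ m0 + beta * Phi.T @ T)

print("w_MAP =", mN)
```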


Apply the Full Bayesian approach: compute the predictive distribution by integrating over all models


  • Recall that the posterior distribution `p(w|T, X, α, β)` is a Gaussian with mean `mN` and covariance `SN`.

  • `mN` and `SN` are derived from the observed data.

  • The predictive distribution `p(t|x, T, α, β)` allows for making predictions about new, unseen data.

  • This distribution is obtained by integrating the product of the likelihood (noise model) `p(t|x, w, β)` and the posterior `p(w|T, X, α, β)` over all possible weights `w`.

  • The likelihood function for the noise is given by a Gaussian distribution, which represents deviations of the observed target values from the model predictions.

  • the integral of the product of two Gaussians is itself a Gaussian.

  • The resulting Gaussian represents the predictive distribution.

  • The predicted mean `y(x, mN)` is the mean of the predictive distribution, which uses the posterior mean `mN` of the parameters.

  • The predictive variance `σ^2_N(x)` quantifies the uncertainty of the prediction.

  • It combines a term from the noise model and a term from the posterior covariance: `σ^2_N(x) = 1/β + φ(x)^T SN φ(x)`.

  • the resulting predictive distribution `N(t|y(x, mN), σ^2_N(x))` provides a mean and a variance for the prediction, capturing the uncertainty.

  • the Full Bayesian approach is focused on predicting distributions, not just point estimates, which is essential for capturing uncertainty in predictions.
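
A minimal sketch of evaluating the predictive distribution `N(t|y(x, mN), σ^2_N(x))` at a new input, assuming the same toy polynomial setup as in the sketch above (the data, prior precision, and test point are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=20)
T = np.sin(2 * np.pi * X) + rng.normal(0, 0.2, size=20)
Phi = np.vander(X, 4, increasing=True)
beta, S0inv = 25.0, 2.0 * np.eye(4)                 # noise precision β, prior precision α·I

SN = np.linalg.inv(S0inv + beta * Phi.T @ Phi)      # posterior covariance
mN = beta * SN @ Phi.T @ T                          # posterior mean (with m0 = 0)

x_new = 0.3                                         # new, unseen input
phi = np.vander([x_new], 4, increasing=True)[0]     # feature vector φ(x)
mean = phi @ mN                                     # predictive mean y(x, mN)
var = 1.0 / beta + phi @ SN @ phi                   # predictive variance σ²_N(x)
print(f"t ~ N({mean:.3f}, {var:.3f})")
```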


What is the Max-Margin Problem?

  • The Max-Margin problem is a formulation used to find the hyperplane that maximizes the margin between two classes in the feature space.

Objective Function

  • The goal is to find a hyperplane with the smallest possible weight vector w that still correctly classifies the data points.

  • Mathematically, it amounts to minimizing \( \frac{1}{2} \|w\|^2 \).

  • The minimization is subject to constraints ensuring that all data points \( x_n \) are classified correctly with a margin of at least one: \( t_n (w^T \phi(x_n) + w_0) \geq 1 \) for all \( n \).


Lagrange Multipliers

  • To solve this constrained optimization problem, Lagrange multipliers a_n are introduced for each constraint.

  • The Lagrangian L(w, w_0, a) is constructed by combining the objective function with the constraints, weighted by the Lagrange multipliers.

  • The problem transforms into minimizing the Lagrangian with respect to w and w_0, and maximizing it with respect to a_n.


Gradient Equations and Dual Representation

  • Setting the gradients of the Lagrangian with respect to w and w_0 to zero yields conditions that allow w to be expressed as a sum of the data points x_n, each scaled by the product of its Lagrange multiplier a_n and label t_n: \( w = \sum_n a_n t_n \phi(x_n) \).

  • A second condition, that the sum of the products of the Lagrange multipliers and labels must equal zero, \( \sum_n a_n t_n = 0 \), is also derived.


  • This leads to the dual representation, which depends only on the Lagrange multipliers a_n; the problem can be reformulated as maximizing \( L(a) = \sum_n a_n - \frac{1}{2} \sum_n \sum_m a_n a_m t_n t_m K(x_n, x_m) \).

Kernel Function

  • The kernel function \( K(x_n, x_m) \) represents the inner product of \( \phi(x_n) \) and \( \phi(x_m) \) in the feature space, allowing the SVM to work in a higher-dimensional space without explicitly computing the coordinates.

Maximizing the Dual Representation

  • The dual problem involves maximizing \( L(a) \) under the constraints that all \( a_n \) are non-negative and the sum of the products of \( a_n \) and \( t_n \) equals zero.

  • The dual representation simplifies the problem as it removes the need to work directly with the weight vector \( w \) and allows the use of the kernel trick.

Through the Max-Margin problem, SVMs find a hyperplane that not only separates the data into classes but also stays as far away as possible from the closest data points of any class, aiming for a better generalization to new data points.
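
A small sketch of solving the dual problem numerically on toy 2-D data with a linear kernel, using scipy's general-purpose SLSQP solver rather than a dedicated SVM library; the data, kernel choice, and solver are assumptions for illustration:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 0.5, (10, 2)), rng.normal([-2, -2], 0.5, (10, 2))])
t = np.hstack([np.ones(10), -np.ones(10)])            # labels t_n in {+1, -1}

K = X @ X.T                                           # linear kernel K(x_n, x_m)

def neg_dual(a):                                      # -L(a), since we minimize
    return -(a.sum() - 0.5 * (a * t) @ K @ (a * t))

cons = {"type": "eq", "fun": lambda a: a @ t}         # constraint: sum_n a_n t_n = 0
bounds = [(0, None)] * len(t)                         # constraint: a_n >= 0
res = minimize(neg_dual, np.zeros(len(t)), bounds=bounds, constraints=cons)

a = res.x
w = (a * t) @ X                                       # w = sum_n a_n t_n x_n
sv = a > 1e-5                                         # support vectors have a_n > 0
w0 = np.mean(t[sv] - X[sv] @ w)                       # bias from the support vectors
print("w =", w, "w_0 =", w0)
```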

Explaining network parameters and activations - Visualizing features

Visualizing features by optimization

  • Start from random noise image

  • Optimize image to activate particular neuron:

    • Calculate gradient for increasing neuron responses

    • Adjust image based on gradient

  • Objectives

    • Applicable to unit or layer of interest
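
A minimal PyTorch sketch of this optimization, assuming an (untrained) torchvision ResNet-18 stands in for the network of interest; the target layer, channel index, learning rate, and iteration count are arbitrary choices:

```python
import torch
from torchvision import models

model = models.resnet18(weights=None).eval()       # stand-in for a trained network
for p in model.parameters():
    p.requires_grad_(False)                        # only the image is optimized

acts = {}
model.layer3.register_forward_hook(lambda m, i, o: acts.update(out=o))

img = torch.randn(1, 3, 224, 224, requires_grad=True)   # start from random noise
opt = torch.optim.Adam([img], lr=0.05)

for _ in range(100):
    opt.zero_grad()
    model(img)
    # objective: mean response of one channel in the layer of interest
    loss = -acts["out"][0, 7].mean()               # channel 7 is arbitrary
    loss.backward()                                # gradient for increasing the response
    opt.step()                                     # adjust the image based on the gradient

print(img.detach().shape)                          # optimized image that excites the channel
```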



Deconvolution

  1. Forward Pass:

    • input image is passed through the network up to a certain layer.

    • During this forward pass, all the activations are stored.

    • These activations represent what filters have responded to in the image.

  2. Select Activation:

    • To visualize the features that a particular filter has learned to recognize

    • select a specific activation from a specific filter within the layer of interest

    • This activation map shows where the filter responded strongly

  3. Reverse Mapping (Deconvolution):

    • Starting from the selected activation, you work backwards through the network

    • (this is where the term "deconvolution" is often used, although it's not technically deconvolution in the strict mathematical sense).

    • map the activations back to the pixel space of the input image to see what part of the image caused the activation. This involves:

      • Unpooling: Reversing the max-pooling operation by placing the activations back into the location of the maximum values that were recorded during the pooling in the forward pass.

      • Transposed Convolution: Applying transposed convolution operations using the stored weights from the forward pass. This step aims to reconstruct the image area that would induce the activations in the forward pass.

  4. Iterate Back to Input:

    • You iterate this process back through the layers of the network until you reach the input layer.

    • At each layer, you're essentially asking, "What input would have caused this filter to activate in this way?"

  5. Visualization:

    • The resulting mapped image often highlights the patterns or parts of the original input that the filter is responsive to.

    • For example, if the filter has learned to recognize edges at a certain orientation, the visualization might show those edges from the input image
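
A minimal PyTorch sketch of the unpooling / transposed-convolution steps for a single toy conv-ReLU-pool block; the layer sizes and the selected filter are made up, and a real deconvnet repeats this for every layer back to the input:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
torch.set_grad_enabled(False)          # the deconvnet pass needs no gradients

conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1)        # toy filter bank
pool = torch.nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = torch.nn.MaxUnpool2d(2, stride=2)

x = torch.randn(1, 3, 32, 32)          # stand-in input image

# 1. Forward pass: store activations and the max-pooling "switch" locations
a = F.relu(conv(x))
p, switches = pool(a)

# 2. Select one activation map and zero out all the others
sel = torch.zeros_like(p)
sel[:, 5] = p[:, 5]                    # filter index 5 is arbitrary

# 3. Reverse mapping: unpool with the recorded switches, rectify, then apply the
#    transposed convolution using the same weights as in the forward pass
r = unpool(sel, switches, output_size=a.shape)
r = F.relu(r)
reconstruction = F.conv_transpose2d(r, conv.weight, padding=1)

print(reconstruction.shape)            # back in input pixel space: (1, 3, 32, 32)
```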


Visualising Features via Gradient-based Localisation

  • Gradient-weighted class activation mapping

  • Attribution of local input importance for class
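
A compact Grad-CAM sketch in PyTorch, assuming a torchvision ResNet-18 (here with random weights), hooks on `layer4`, and an arbitrary target class; all of these choices are illustrative:

```python
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights=None).eval()       # random weights, for illustration only
acts, grads = {}, {}

model.layer4.register_forward_hook(lambda m, i, o: acts.update(a=o))
model.layer4.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

x = torch.randn(1, 3, 224, 224)                    # stand-in input image
target_class = 42                                  # hypothetical class index

score = model(x)[0, target_class]
score.backward()                                   # gradients of the class score

# Grad-CAM: weight each feature map by the spatially averaged gradient of the class score
weights = grads["g"].mean(dim=(2, 3), keepdim=True)                 # (1, C, 1, 1)
cam = F.relu((weights * acts["a"]).sum(dim=1, keepdim=True))        # (1, 1, H, W)
cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
print(cam.shape)                                   # localization map over the input
```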



Attribution / Saliency Map

  • similar to Grad-CAM


Representation Reduction


Representations

One-hot encoding

  • Simplest way to represent things in neural networks

  • one neuron to each concept/feature (Localist Representation)

  • Easy to understand

  • Easy to code by hand

  • used to represent inputs to a net

  • Easy to learn

  • Easy to associate with other representations or responses

  • Widely used in machine learning and natural language processing contexts

  • localist models are inefficient whenever data has componential structure --> not enough neurons to code all possibilities
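
A tiny numpy sketch of one-hot (localist) encoding; the example classes and labels are made up:

```python
import numpy as np

classes = ["apple", "banana", "lemon"]
labels = np.array([0, 2, 1, 1])                 # indices into the class list
one_hot = np.eye(len(classes))[labels]          # one neuron/dimension per concept
print(one_hot)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [0. 1. 0.]]
```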


Softmax

  • Desirable in classification: the output vector models a probability distribution over the classes

  • For classification, the network produces raw output values (logits), typically trained with a cross-entropy loss

  • These can be normalized with softmax so that the values lie in [0,1] and add up to 1

  • Drawback: Computationally expensive for very large vectors (exp)
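
A minimal, numerically stable softmax sketch in numpy (the logits are made up); subtracting the maximum before the `exp` avoids overflow for large values:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                  # subtract the max for numerical stability of exp
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # raw network outputs
p = softmax(logits)
print(p, p.sum())                    # values in [0, 1] that add up to 1
```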


Distributed Representation

  • Using simultaneity to bind things together

    • Round, yellow fruit: one neuron ?

  • "Distributed representation" means a many-to-many relationship between two types of representation (such as concepts and neurons)

    • Each concept is represented by many neurons

    • Each neuron participates in the representation of many concepts

  • Example: with distributed features, how to distinguish representing a yellow circle and a blue triangle from a blue circle and a yellow triangle?


Word Embeddings

  • Millions of words: Need distributed representations!

  • Approach: Learning word embeddings:

    • Map words to continuous, lower dimensional vectors

    • Captures word meaning in the semantic space

  • Resulting word vector should contain linguistic context information, relating it to other words
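
A minimal sketch of a learnable embedding table with `torch.nn.Embedding`; the toy vocabulary and embedding dimension are assumptions, and in practice the table would be trained (e.g. with a skip-gram or language-modelling objective) so that related words end up close in the vector space:

```python
import torch

vocab = {"king": 0, "queen": 1, "apple": 2, "banana": 3}
emb = torch.nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)   # trainable table

ids = torch.tensor([vocab["king"], vocab["queen"]])
vectors = emb(ids)                       # continuous, lower-dimensional word vectors
sim = torch.cosine_similarity(vectors[0], vectors[1], dim=0)
print(vectors.shape, sim.item())         # after training, similar words score high here
```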


Applications of VAE





  1. Image Generation: VAEs can generate new images that resemble a training dataset, which is useful in art creation, design, and entertainment.

  2. Anomaly Detection: By learning to represent normal data, VAEs can identify anomalies or outliers in datasets, which is valuable in fields like fraud detection or fault diagnosis.

  3. Drug Discovery: VAEs help in generating molecular structures by learning the distribution of molecular data, thereby aiding in the discovery of new drug candidates.

  4. Feature Extraction and Dimensionality Reduction: VAEs are used to learn lower-dimensional representations of data, which can serve as feature vectors for other machine learning tasks.

  5. Semi-Supervised Learning: They can be employed in scenarios where only a small subset of data is labeled, leveraging the unlabeled data to improve learning efficiency.

  6. Reinforcement Learning: In reinforcement learning, VAEs can learn to encode states and rewards, assisting in the creation of more efficient and generalizable policies.

  7. Text Generation: VAEs can also be adapted for generating coherent and diverse text, and are used in natural language processing for tasks like dialogue generation and machine translation.

  8. Speech Synthesis: They are used in generating human-like speech from text or other forms of data, which is useful in virtual assistants and other speech-based interfaces.

  9. Style Transfer: VAEs can learn the style of one dataset and apply it to another, which is popular in image and video editing applications.

  10. Interpolation: Because VAEs learn smooth latent representations, they can interpolate between data points to create transitions, such as morphing one image into another.


How to train a GAN?



  1. Training iterations: The training process involves several iterations.

  2. Training the Discriminator:

    • A minibatch of noise samples is drawn from a noise prior p(z) and passed to the generator.

    • The generator produces a minibatch of fake data samples from the noise samples

    • A minibatch of real data samples is drawn from the data generating distribution p_data(x).

    • Update the discriminator by ascending its stochastic gradient: This step involves feeding both real and fake data into the discriminator and adjusting its parameters to maximize the probability of correctly classifying real and fake data.

  3. Training the Generator:

    • Again, a minibatch of noise samples is drawn from the noise prior p(z).

    • The generator uses this noise to produce a minibatch of fake data samples.

    • These fake data samples are then passed to the discriminator, which classifies them as real or fake.

    • The generator is updated by descending its stochastic gradient.

    • The feedback from the discriminator (the probability of the fake data being real) is used to update the generator's weights, encouraging it to produce data that the discriminator will classify as real.
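
A minimal PyTorch sketch of this training loop on toy 1-D data, using the common non-saturating BCE formulation for the generator update rather than the exact minimax gradient; the tiny MLP architectures, batch size, and learning rates are made up:

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))                # generator
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())  # discriminator
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
bce = nn.BCELoss()

for _ in range(100):                                    # training iterations
    # --- train the discriminator ---
    z = torch.randn(64, 8)                              # minibatch from the noise prior p(z)
    fake = G(z).detach()                                # fake samples (no gradient into G)
    real = torch.randn(64, 1) * 0.5 + 3.0               # minibatch from p_data(x)
    loss_D = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()                                        # reward correct real/fake classification

    # --- train the generator ---
    z = torch.randn(64, 8)
    loss_G = bce(D(G(z)), torch.ones(64, 1))            # push D(G(z)) towards "real"
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()

print(G(torch.randn(5, 8)).detach().squeeze())          # samples should drift towards p_data
```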


Iteration Example

  • Leftmost Panel: The initial state where the discriminator can easily distinguish between the data (solid blue line) and the generator's output (dotted green line).

  • Second Panel: The generator improves, and its distribution starts to overlap with the data distribution.

  • Third Panel: The generator gets even better, further overlapping with the data distribution, making it harder for the discriminator to differentiate.

  • Rightmost Panel: The discriminator's decision boundary (D(x) = 0.5) is shown where it is now uncertain about half of the generated data, indicating the generator has improved significantly.


Extensions of Diffusion models




  • Diffusion Probabilistic Models (2015): The foundational concept of diffusion models was introduced, setting the stage for subsequent developments.

  • Denoising Diffusion Probabilistic Models (DDPM, 2020): A specific type of diffusion model that focuses on generating images by denoising, but it has the drawback of being slow in image generation.

  • Variational Diffusion Models (VDM, 2021): These models were introduced to speed up the optimization process. They prioritize optimizing the likelihood of data over the quality of the samples generated, aiming for faster training times.

  • Denoising Diffusion Implicit Models (DDIM, 2021): A variation of DDPM that uses non-Markovian diffusion processes (meaning the process does not strictly follow a memoryless Markov property) while retaining the same training objectives. DDIM models are significantly faster (10 to 50 times) than DDPM and allow for semantically meaningful interpolations in latent space.

  • Latent Diffusion Model (LDM, 2022): This model performs the diffusion process in a compressed latent space rather than in the pixel space, which can be computationally more efficient and can potentially generate higher quality images.

  • Classifier Guided Diffusion: An extension of diffusion models where guidance is provided by incorporating knowledge about different classes into the diffusion process. This can help in generating images that are more aligned with specific classes, enhancing control over the generation process.

