Draw an encoder-decoder model
What are the 3 types of attention in a transformer?
Encoder self-attention
Each encoder position attends to all positions of the input (both before and after it).
Encoder-decoder attention
At each decoder step, the decoder attends to the encoder's outputs for all input positions.
Masked decoder self-attention
At each decoder step, the decoder attends only to its own previously generated positions.
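Below is a minimal NumPy sketch of these three patterns, using a single head and no learned query/key/value projections; the names (enc, dec) and shapes are illustrative assumptions, not taken from the source.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d)) V
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # masked positions get ~zero weight
    return softmax(scores) @ v

rng = np.random.default_rng(0)
enc = rng.normal(size=(4, 8))  # 4 source positions, model dim 8
dec = rng.normal(size=(3, 8))  # 3 target positions generated so far

# 1) Encoder self-attention: every input position sees every input position.
enc_out = attention(enc, enc, enc)

# 2) Masked decoder self-attention: a causal mask hides future target positions.
causal = np.tril(np.ones((3, 3), dtype=bool))
dec_out = attention(dec, dec, dec, mask=causal)

# 3) Encoder-decoder (cross) attention: decoder queries, encoder keys/values.
cross_out = attention(dec_out, enc_out, enc_out)
```

The only differences between the three are which states supply the queries, keys, and values, and whether a causal mask is applied.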
2 Ways of adapting language models
Tuning: Adapting (modifying) model parameters
Prompting: Adapting the model inputs (natural-language prompts/instructions), leaving parameters unchanged
2 Ways of fine-tuning pre-trained models
Whole model: Run an optimization defined on your task data that updates all model parameters
Head-tuning: Run an optimization defined on your task data that updates the parameters of the model “head”
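A toy PyTorch contrast between the two setups, using a hypothetical TinyClassifier whose names and sizes are invented for illustration (in practice you would pick one setup, not both).

```python
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    """Stand-in for a pre-trained model: an encoder plus a task-specific head."""
    def __init__(self, vocab=1000, dim=64, num_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, dim), nn.ReLU())
        self.head = nn.Linear(dim, num_classes)  # the task "head"

    def forward(self, token_ids):
        return self.head(self.encoder(token_ids).mean(dim=1))

model = TinyClassifier()

# Whole-model fine-tuning: the optimizer updates every parameter.
opt_full = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Head-tuning: freeze the encoder and optimize only the head's parameters.
for p in model.encoder.parameters():
    p.requires_grad = False
opt_head = torch.optim.AdamW(model.head.parameters(), lr=1e-3)
```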
Main categories of parameter-efficient fine-tuning (PEFT) approaches
Additive methods (Adapters, Soft prompts)
Reparameterization-based (e.g., LoRA)
Selective (e.g., BitFit)
What are Adapters, and how do they improve fine-tuning efficiency?
Adapters add extra trainable parameters to the pre-trained model layers while keeping the base model unchanged.
Key properties:
Only adapter layers are trained, keeping the core model frozen.
Low-dimensional bottleneck projections (down-project, then up-project) keep the added parameter count small.
Memory-efficient: Only stores adapter parameters per task, not full models.
Trade-off: More memory-efficient than full fine-tuning, but still requires a full forward/backward pass.
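A minimal PyTorch sketch of one bottleneck adapter block in the spirit of Houlsby et al. (2019); the dimensions and the residual wiring here are illustrative assumptions, not the exact architecture from the source.

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Small bottleneck module added inside a transformer block."""
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)  # down-projection
        self.up = nn.Linear(bottleneck, dim)    # up-projection back to the model dim
        self.act = nn.GELU()

    def forward(self, hidden):
        # Residual connection: the adapter only learns a small correction.
        return hidden + self.up(self.act(self.down(hidden)))

adapter = Adapter()
print(sum(p.numel() for p in adapter.parameters()))  # ~100k params vs. millions per transformer layer
```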
The 3 steps in adapter-based fine-tuning
Insert small trainable layers into each transformer block.
Train only these adapter layers, while keeping the main model unchanged.
Store only the adapter parameters, allowing easy switching between tasks.
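A hedged sketch of these three steps, assuming a model that exposes its transformer blocks as model.blocks and reusing the Adapter class from the sketch above (how each block's forward pass calls its adapter is omitted here).

```python
import torch
import torch.nn as nn

def add_adapters(model, dim=768):
    # Step 1: insert one small trainable adapter per transformer block.
    adapters = nn.ModuleList()
    for block in model.blocks:          # `model.blocks` is an assumed attribute name
        block.adapter = Adapter(dim)
        adapters.append(block.adapter)
    return adapters

def freeze_base(model, adapters):
    # Step 2: freeze the whole base model, then re-enable only the adapters.
    for p in model.parameters():
        p.requires_grad = False
    for p in adapters.parameters():
        p.requires_grad = True

# Step 3: per task, store (and later load) only the adapter weights.
# torch.save(adapters.state_dict(), "task_A_adapters.pt")
```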
What is Selective Fine-Tuning, and how does it work?
Selective fine-tuning updates only a subset of model parameters, reducing resource usage.
Name 3 selection criteria for selective fine-tuning
Selection criteria:
Layer depth-based (e.g., fine-tune only deeper layers).
Layer type-based (e.g., update only self-attention layers).
Individual parameter selection (e.g., update only bias terms).
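A sketch of the first two criteria (depth-based and type-based), assuming Hugging Face-style parameter names such as encoder.layer.11.attention...; the bias-only case is shown under BitFit below.

```python
import torch.nn as nn

def select_trainable(model: nn.Module, mode: str = "top_layers"):
    # Flip requires_grad so that only the selected subset is fine-tuned.
    for name, p in model.named_parameters():
        if mode == "top_layers":        # layer depth-based: only the last two layers
            p.requires_grad = any(f"encoder.layer.{i}." in name for i in (10, 11))
        elif mode == "attention_only":  # layer type-based: only self-attention parameters
            p.requires_grad = ".attention." in name
```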
What is BitFit, and how does it improve efficiency?
BitFit is a selective fine-tuning method that updates only the bias terms in:
Self-attention layers.
MLP layers.
Why is BitFit efficient?
Updates only ~0.05% of parameters.
All other weights stay frozen, so the model keeps its structure and per-task storage is tiny.
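A minimal BitFit-style sketch in PyTorch: freeze everything, unfreeze parameters whose names end in "bias", and report the trainable fraction (the exact percentage depends on the model).

```python
import torch.nn as nn

def apply_bitfit(model: nn.Module):
    # Keep only bias vectors trainable; this simple name filter also catches
    # LayerNorm biases, which plain BitFit tunes as well.
    for name, p in model.named_parameters():
        p.requires_grad = name.endswith("bias")
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.3f}%)")
```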
What are the limitations of fine-tuning pre-trained models?
Limitations:
Requires large labeled datasets for fine-tuning.
More pre-training can reduce the need for labeled data, but increases training time.