Draw an encoder-decoder model
What are the 3 types of attention in a transformer?
Encoder self-attention
Each encoder position attends to all positions of the input (both before and after it).
Encoder-decoder attention
At each decoder step, the decoder attends to the encoder's outputs for all input positions.
Masked decoder self-attention
At each decoder step, the decoder attends only to its own previously generated positions.
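Below is a minimal NumPy sketch of these three patterns, using a single head and no learned query/key/value projections; the names (enc, dec) and shapes are illustrative assumptions, not taken from the source.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d)) V
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # masked positions get ~zero weight
    return softmax(scores) @ v

rng = np.random.default_rng(0)
enc = rng.normal(size=(4, 8))  # 4 source positions, model dim 8
dec = rng.normal(size=(3, 8))  # 3 target positions generated so far

# 1) Encoder self-attention: every input position sees every input position.
enc_out = attention(enc, enc, enc)

# 2) Masked decoder self-attention: a causal mask hides future target positions.
causal = np.tril(np.ones((3, 3), dtype=bool))
dec_out = attention(dec, dec, dec, mask=causal)

# 3) Encoder-decoder (cross) attention: decoder queries, encoder keys/values.
cross_out = attention(dec_out, enc_out, enc_out)
```

The only differences between the three are which states supply the queries, keys, and values, and whether a causal mask is applied.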
2 Ways of adapting language models
Tuning: Adapting (modifying) model parameters
Prompting: Adapting the model inputs (natural-language prompts/instructions), leaving parameters unchanged
2 Ways of fine-tuning pre-trained models
Whole model: Run an optimization defined on your task data that updates all model parameters
Head-tuning: Run an optimization defined on your task data that updates the parameters of the model “head”
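A toy PyTorch contrast between the two setups, using a hypothetical TinyClassifier whose names and sizes are invented for illustration (in practice you would pick one setup, not both).

```python
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    """Stand-in for a pre-trained model: an encoder plus a task-specific head."""
    def __init__(self, vocab=1000, dim=64, num_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, dim), nn.ReLU())
        self.head = nn.Linear(dim, num_classes)  # the task "head"

    def forward(self, token_ids):
        return self.head(self.encoder(token_ids).mean(dim=1))

model = TinyClassifier()

# Whole-model fine-tuning: the optimizer updates every parameter.
opt_full = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Head-tuning: freeze the encoder and optimize only the head's parameters.
for p in model.encoder.parameters():
    p.requires_grad = False
opt_head = torch.optim.AdamW(model.head.parameters(), lr=1e-3)
```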
Main categories of parameter-efficient fine-tuning (PEFT) approaches
Additive methods (Adapters, Soft prompts)
Reparameterization-based (e.g., LoRA)
Selective (e.g., BitFit)
What are Adapters, and how do they improve fine-tuning efficiency?
Adapters add extra trainable parameters to the pre-trained model layers while keeping the base model unchanged.
Key properties:
Only adapter layers are trained, keeping the core model frozen.
Low-dimensional bottleneck projections (down-project, then up-project) keep the added parameter count small.
Memory-efficient: Only stores adapter parameters per task, not full models.
Trade-off: More memory-efficient than full fine-tuning, but still requires a full forward/backward pass.
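A minimal PyTorch sketch of one bottleneck adapter block in the spirit of Houlsby et al. (2019); the dimensions and the residual wiring here are illustrative assumptions, not the exact architecture from the source.

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Small bottleneck module added inside a transformer block."""
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)  # down-projection
        self.up = nn.Linear(bottleneck, dim)    # up-projection back to the model dim
        self.act = nn.GELU()

    def forward(self, hidden):
        # Residual connection: the adapter only learns a small correction.
        return hidden + self.up(self.act(self.down(hidden)))

adapter = Adapter()
print(sum(p.numel() for p in adapter.parameters()))  # ~100k params vs. millions per transformer layer
```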
The 3 steps in adapter-based fine-tuning
Insert small trainable layers into each transformer block.
Train only these adapter layers, while keeping the main model unchanged.
Store only the adapter parameters, allowing easy switching between tasks.
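A hedged sketch of these three steps, assuming a model that exposes its transformer blocks as model.blocks and reusing the Adapter class from the sketch above (how each block's forward pass calls its adapter is omitted here).

```python
import torch
import torch.nn as nn

def add_adapters(model, dim=768):
    # Step 1: insert one small trainable adapter per transformer block.
    adapters = nn.ModuleList()
    for block in model.blocks:          # `model.blocks` is an assumed attribute name
        block.adapter = Adapter(dim)
        adapters.append(block.adapter)
    return adapters

def freeze_base(model, adapters):
    # Step 2: freeze the whole base model, then re-enable only the adapters.
    for p in model.parameters():
        p.requires_grad = False
    for p in adapters.parameters():
        p.requires_grad = True

# Step 3: per task, store (and later load) only the adapter weights.
# torch.save(adapters.state_dict(), "task_A_adapters.pt")
```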
What is Selective Fine-Tuning, and how does it work?
Selective fine-tuning updates only a subset of model parameters, reducing resource usage.
Name 3 selection criteria for selective fine-tuning
Selection criteria:
Layer depth-based (e.g., fine-tune only deeper layers).
Layer type-based (e.g., update only self-attention layers).
Individual parameter selection (e.g., update only bias terms).
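A sketch of the first two criteria (depth-based and type-based), assuming Hugging Face-style parameter names such as encoder.layer.11.attention...; the bias-only case is shown under BitFit below.

```python
import torch.nn as nn

def select_trainable(model: nn.Module, mode: str = "top_layers"):
    # Flip requires_grad so that only the selected subset is fine-tuned.
    for name, p in model.named_parameters():
        if mode == "top_layers":        # layer depth-based: only the last two layers
            p.requires_grad = any(f"encoder.layer.{i}." in name for i in (10, 11))
        elif mode == "attention_only":  # layer type-based: only self-attention parameters
            p.requires_grad = ".attention." in name
```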
What is BitFit, and how does it improve efficiency?
BitFit is a selective fine-tuning method that updates only the bias terms in:
Self-attention layers.
MLP layers.
Why is BitFit efficient?
Updates only ~0.05% of parameters.
All other weights stay frozen, so the model keeps its structure and per-task storage is tiny.
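A minimal BitFit-style sketch in PyTorch: freeze everything, unfreeze parameters whose names end in "bias", and report the trainable fraction (the exact percentage depends on the model).

```python
import torch.nn as nn

def apply_bitfit(model: nn.Module):
    # Keep only bias vectors trainable; this simple name filter also catches
    # LayerNorm biases, which plain BitFit tunes as well.
    for name, p in model.named_parameters():
        p.requires_grad = name.endswith("bias")
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.3f}%)")
```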
What are the limitations of fine-tuning pre-trained models?
Limitations:
Requires large labeled datasets for fine-tuning.
More pre-training can reduce the need for labeled data, but increases training time.