What is Knowledge Distillation, and why is it used in NLP?
Knowledge Distillation is a technique where a smaller model (student) learns from a larger, more complex model (teacher).
Advantages:
Reduces model size while retaining effectiveness.
Helps deploy NLP models efficiently on limited hardware.
Allows faster inference while keeping high accuracy.
Why is Knowledge Distillation important for large NLP models?
Challenges with large models:
Require high computational power.
Slow inference speed.
Difficult to deploy on edge devices.
Knowledge Distillation solves these by creating a smaller, optimized model that retains most of the teacher model’s performance.
What are the two levels of supervision in Knowledge Distillation?
Levels of Supervision (loss sketch for both levels below):
Final output scores (soft labels) → The student model learns from the teacher’s output probability distribution.
Makes it easy to use an ensemble of teachers.
Architecture-independent: teacher and student can use completely different architectures.
Intermediate representations → The student model mimics the teacher’s hidden layers.
Locks us into a certain architecture (hidden sizes must match or be projected).
Potentially a similar parameter setting to the teacher.
Provides many more supervision signals than just the final score.
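A minimal PyTorch sketch of both supervision levels, assuming `teacher_logits`, `student_logits`, `teacher_hidden`, and `student_hidden` come from forward passes of the respective models (all names are illustrative, not from a specific library):
```python
import torch.nn.functional as F

def soft_label_loss(student_logits, teacher_logits, temperature=2.0):
    """Level 1: student matches the teacher's softened output distribution."""
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL(teacher || student), scaled by T^2 as is common in distillation.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

def hidden_state_loss(student_hidden, teacher_hidden):
    """Level 2: student mimics an intermediate teacher representation.
    Requires matching (or projected) hidden sizes -> architecture-dependent."""
    return F.mse_loss(student_hidden, teacher_hidden)
```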
What is DistilBERT, and how does it compare to BERT?
DistilBERT is a smaller, distilled version of BERT that retains roughly 97% of BERT’s language-understanding performance while being about 40% smaller and 60% faster.
Key Differences:
6 layers instead of 12.
Retains general-purpose nature.
Trained with knowledge distillation to achieve high efficiency.
Benefit: DistilBERT is faster and lighter, making it ideal for real-world applications (loading example below).
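A small usage sketch with the Hugging Face transformers library; `distilbert-base-uncased` is the standard hub checkpoint, and the printed numbers are approximate:
```python
# DistilBERT is used exactly like BERT, just with fewer layers and parameters.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

print(sum(p.numel() for p in model.parameters()))  # ~66M parameters (BERT-base: ~110M)
print(model.config)                                # shows 6 transformer layers instead of 12

inputs = tokenizer("Knowledge distillation makes models smaller.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)             # torch.Size([1, seq_len, 768])
```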
Name the 3 steps of IR distillation.
Steps in IR Distillation (sketch below):
Train a teacher model with a binary relevance loss.
Use the teacher’s output scores.
Train a smaller student model using those scores.
This improves retrieval models without needing extensive labeled datasets.
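A hedged sketch of the three steps. The CrossEncoder checkpoint is a commonly used MS MARCO teacher from sentence-transformers; `student_model` and `score_pairs` are hypothetical placeholders for whatever smaller ranking model is being trained:
```python
# Sketch: distill a cross-encoder teacher into a smaller student ranker.
import torch
from sentence_transformers import CrossEncoder

pairs = [("what is knowledge distillation", "KD trains a small student model ..."),
         ("what is knowledge distillation", "The weather today is sunny.")]

# Step 1 + 2: teacher trained with a binary relevance loss; use its output scores.
teacher = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
teacher_scores = torch.tensor(teacher.predict(pairs))    # cache once, reuse for training

# Step 3: train the smaller student to reproduce the cached teacher scores.
student_scores = score_pairs(student_model, pairs)       # hypothetical student scorer
loss = torch.nn.functional.mse_loss(student_scores, teacher_scores)
loss.backward()
```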
What is Margin-MSE Loss, and why is it used in distillation?
Margin-MSE Loss optimizes the margin between relevant and non-relevant passages, rather than absolute scores.
Benefits:
Makes no assumption about the model architecture -> We can mix and match different neural ranking models
We can pre-compute teacher scores once and reuse them (loss sketch below).
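A minimal Margin-MSE implementation in PyTorch; the tensor names are illustrative, each holding scores for (query, positive passage) and (query, negative passage) pairs:
```python
import torch

def margin_mse_loss(student_pos, student_neg, teacher_pos, teacher_neg):
    """Match the student's score margin to the teacher's margin, not the absolute scores."""
    student_margin = student_pos - student_neg
    teacher_margin = teacher_pos - teacher_neg   # teacher margins can be pre-computed
    return torch.nn.functional.mse_loss(student_margin, teacher_margin)
```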
How is Margin-MSE Loss applied in Dense Retrieval?
Steps:
A teacher model ranks documents based on relevance.
The student model learns to maintain ranking differences between relevant and non-relevant passages.
This allows the student to generalize well without memorizing exact scores (training-step sketch below).
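A hedged training-step sketch for a dense (bi-encoder) retriever using the `margin_mse_loss` defined above; `encode`, `queries`, the passage batches, and the teacher score tensors are placeholders:
```python
def dot_score(query_vecs, passage_vecs):
    # Dense retrieval relevance score: dot product of query and passage embeddings.
    return (query_vecs * passage_vecs).sum(dim=-1)

# One training step (teacher scores pre-computed once, as noted above).
s_pos = dot_score(encode(queries), encode(pos_passages))   # placeholder encoder
s_neg = dot_score(encode(queries), encode(neg_passages))
loss = margin_mse_loss(s_pos, s_neg, teacher_pos_scores, teacher_neg_scores)
loss.backward()
```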
What is KL-Divergence, and how is it used in Knowledge Distillation?
Kullback-Leibler (KL) Divergence measures the difference between two probability distributions.
In Knowledge Distillation, we use the class probability distributions of the teacher and student networks as P and Q and minimize KL(P || Q), so the student’s output distribution matches the teacher’s (example below).
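A small PyTorch example with made-up logits, showing KL(P || Q) between teacher and student class probabilities:
```python
import torch
import torch.nn.functional as F

teacher_logits = torch.tensor([[2.0, 0.5, -1.0]])   # made-up values
student_logits = torch.tensor([[1.5, 0.8, -0.5]])

p = F.softmax(teacher_logits, dim=-1)                # teacher distribution P
log_q = F.log_softmax(student_logits, dim=-1)        # student distribution Q (log-space)

# KL(P || Q) = sum_i P(i) * log(P(i) / Q(i)); kl_div expects log-probs as input.
kl = F.kl_div(log_q, p, reduction="batchmean")
print(kl.item())
```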
What is Cross-Domain Knowledge Distillation, and how is it evaluated?
Goal: Train a dense retrieval model that works across different domains without additional fine-tuning -> Zero-shot transfer
Evaluated using the BEIR Zero-Shot Benchmark, which tests how well a dense retrieval model generalizes to unseen collections (minimal evaluation sketch below).
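A hedged sketch of a BEIR zero-shot evaluation, roughly following the library’s quickstart; the dataset (SciFact) and model names are examples, and the exact API may differ between beir versions:
```python
# Zero-shot evaluation: the dense model was trained on MSMARCO only,
# then evaluated unchanged on another BEIR collection (here: SciFact).
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

data_path = util.download_and_unzip(
    "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip", "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

retriever = EvaluateRetrieval(DRES(models.SentenceBERT("msmarco-distilbert-base-tas-b")),
                              score_function="dot")
results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg)   # e.g. NDCG@10 on the unseen collection
```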
Why do DR models struggle in zero-shot settings? 3 reasons
1. Generalization: DR models often do not generalize to other query distributions.
2. Quirks: MSMARCO training data contains too many quirks specific to that collection -> models adapt to the training data rather than the task.
3. Pool Bias: Many (older or smaller) collections are heavily biased towards BM25 results -> this ultimately requires re-annotation campaigns.