What is Knowledge Distillation, and why is it used in NLP?
Knowledge Distillation is a technique where a smaller model (student) learns from a larger, more complex model (teacher).
Advantages:
Reduces model size while retaining effectiveness.
Helps deploy NLP models efficiently on limited hardware.
Allows faster inference while keeping high accuracy.
Why is Knowledge Distillation important for large NLP models?
Challenges with large models:
Require high computational power.
Slow inference speed.
Difficult to deploy on edge devices.
Knowledge Distillation solves these by creating a smaller, optimized model that retains most of the teacher model’s performance.
What are the two levels of supervision in Knowledge Distillation?
Levels of Supervision (loss sketch for both levels below):
Final output scores (soft labels) → The student model learns from the teacher’s output probability distribution.
Makes it easy to use an ensemble of teachers.
Architecture-independent: teacher and student can use completely different architectures.
Intermediate representations → The student model mimics the teacher’s hidden layers.
Locks us into a certain architecture (hidden sizes must match or be projected).
Potentially a similar parameter setting to the teacher.
Provides many more supervision signals than just the final score.
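A minimal PyTorch sketch of both supervision levels, assuming `teacher_logits`, `student_logits`, `teacher_hidden`, and `student_hidden` come from forward passes of the respective models (all names are illustrative, not from a specific library):
```python
import torch.nn.functional as F

def soft_label_loss(student_logits, teacher_logits, temperature=2.0):
    """Level 1: student matches the teacher's softened output distribution."""
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL(teacher || student), scaled by T^2 as is common in distillation.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

def hidden_state_loss(student_hidden, teacher_hidden):
    """Level 2: student mimics an intermediate teacher representation.
    Requires matching (or projected) hidden sizes -> architecture-dependent."""
    return F.mse_loss(student_hidden, teacher_hidden)
```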
What is DistilBERT, and how does it compare to BERT?
DistilBERT is a smaller, distilled version of BERT that retains roughly 97% of BERT’s language-understanding performance while being about 40% smaller and 60% faster.
Key Differences:
6 layers instead of 12.
Retains general-purpose nature.
Trained with knowledge distillation to achieve high efficiency.
Benefit: DistilBERT is faster and lighter, making it ideal for real-world applications (loading example below).
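A small usage sketch with the Hugging Face transformers library; `distilbert-base-uncased` is the standard hub checkpoint, and the printed numbers are approximate:
```python
# DistilBERT is used exactly like BERT, just with fewer layers and parameters.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

print(sum(p.numel() for p in model.parameters()))  # ~66M parameters (BERT-base: ~110M)
print(model.config)                                # shows 6 transformer layers instead of 12

inputs = tokenizer("Knowledge distillation makes models smaller.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)             # torch.Size([1, seq_len, 768])
```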
Name the 3 steps of IR distillation.
Steps in IR Distillation (sketch below):
Train a teacher model with a binary relevance loss.
Use the teacher’s output scores.
Train a smaller student model using those scores.
This improves retrieval models without needing extensive labeled datasets.
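A hedged sketch of the three steps. The CrossEncoder checkpoint is a commonly used MS MARCO teacher from sentence-transformers; `student_model` and `score_pairs` are hypothetical placeholders for whatever smaller ranking model is being trained:
```python
# Sketch: distill a cross-encoder teacher into a smaller student ranker.
import torch
from sentence_transformers import CrossEncoder

pairs = [("what is knowledge distillation", "KD trains a small student model ..."),
         ("what is knowledge distillation", "The weather today is sunny.")]

# Step 1 + 2: teacher trained with a binary relevance loss; use its output scores.
teacher = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
teacher_scores = torch.tensor(teacher.predict(pairs))    # cache once, reuse for training

# Step 3: train the smaller student to reproduce the cached teacher scores.
student_scores = score_pairs(student_model, pairs)       # hypothetical student scorer
loss = torch.nn.functional.mse_loss(student_scores, teacher_scores)
loss.backward()
```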
What is Margin-MSE Loss, and why is it used in distillation?
Margin-MSE Loss optimizes the margin between relevant and non-relevant passages, rather than absolute scores.
Benefits:
Makes no assumption about the model architecture -> We can mix and match different neural ranking models
We can pre-compute teacher scores once and reuse them (loss sketch below).
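A minimal Margin-MSE implementation in PyTorch; the tensor names are illustrative, each holding scores for (query, positive passage) and (query, negative passage) pairs:
```python
import torch

def margin_mse_loss(student_pos, student_neg, teacher_pos, teacher_neg):
    """Match the student's score margin to the teacher's margin, not the absolute scores."""
    student_margin = student_pos - student_neg
    teacher_margin = teacher_pos - teacher_neg   # teacher margins can be pre-computed
    return torch.nn.functional.mse_loss(student_margin, teacher_margin)
```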
How is Margin-MSE Loss applied in Dense Retrieval?
Steps:
A teacher model ranks documents based on relevance.
The student model learns to maintain ranking differences between relevant and non-relevant passages.
This allows the student to generalize well without memorizing exact scores (training-step sketch below).
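A hedged training-step sketch for a dense (bi-encoder) retriever using the `margin_mse_loss` defined above; `encode`, `queries`, the passage batches, and the teacher score tensors are placeholders:
```python
def dot_score(query_vecs, passage_vecs):
    # Dense retrieval relevance score: dot product of query and passage embeddings.
    return (query_vecs * passage_vecs).sum(dim=-1)

# One training step (teacher scores pre-computed once, as noted above).
s_pos = dot_score(encode(queries), encode(pos_passages))   # placeholder encoder
s_neg = dot_score(encode(queries), encode(neg_passages))
loss = margin_mse_loss(s_pos, s_neg, teacher_pos_scores, teacher_neg_scores)
loss.backward()
```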
What is KL-Divergence, and how is it used in Knowledge Distillation?
Kullback-Leibler (KL) Divergence measures the difference between two probability distributions.
In Knowledge Distillation, we use the class probability distributions of the teacher and student networks as P and Q and minimize KL(P || Q), so the student’s output distribution matches the teacher’s (example below).
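A small PyTorch example with made-up logits, showing KL(P || Q) between teacher and student class probabilities:
```python
import torch
import torch.nn.functional as F

teacher_logits = torch.tensor([[2.0, 0.5, -1.0]])   # made-up values
student_logits = torch.tensor([[1.5, 0.8, -0.5]])

p = F.softmax(teacher_logits, dim=-1)                # teacher distribution P
log_q = F.log_softmax(student_logits, dim=-1)        # student distribution Q (log-space)

# KL(P || Q) = sum_i P(i) * log(P(i) / Q(i)); kl_div expects log-probs as input.
kl = F.kl_div(log_q, p, reduction="batchmean")
print(kl.item())
```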
What is Cross-Domain Knowledge Distillation, and how is it evaluated?
Goal: Train a dense retrieval model that works across different domains without additional fine-tuning -> Zero-shot transfer
Evaluated using the BEIR Zero-Shot Benchmark, which tests how well a dense retrieval model generalizes to unseen collections (minimal evaluation sketch below).
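A hedged sketch of a BEIR zero-shot evaluation, roughly following the library’s quickstart; the dataset (SciFact) and model names are examples, and the exact API may differ between beir versions:
```python
# Zero-shot evaluation: the dense model was trained on MSMARCO only,
# then evaluated unchanged on another BEIR collection (here: SciFact).
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

data_path = util.download_and_unzip(
    "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip", "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

retriever = EvaluateRetrieval(DRES(models.SentenceBERT("msmarco-distilbert-base-tas-b")),
                              score_function="dot")
results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg)   # e.g. NDCG@10 on the unseen collection
```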
Why do DR models struggle in zero-shot settings? 3 reasons
1. Generalization: DR models often do not generalize to other query distributions.
2. Quirks: MSMARCO training data contains too many quirks specific to that collection -> models adapt to the training data rather than the task.
3. Pool Bias: Many (older or smaller) collections are heavily biased towards BM25 results -> this ultimately requires re-annotation campaigns.