What is the Transformer Architecture, and why is it important?
The Transformer is a neural network architecture that processes sequences without recurrence by using self-attention mechanisms.
Advantages:
More efficient to train than RNNs, since it does not have to process tokens one step at a time and can be parallelized.
Better at capturing long-range dependencies in text.
Foundation for state-of-the-art NLP models (BERT, GPT).
How does self-attention help the Transformer process sequences?
Self-attention allows each word to attend to every other word in a sequence, capturing relationships regardless of distance.
Key Properties:
Helps with contextualization (understanding word meaning based on surrounding context).
Enables parallel processing (unlike RNNs, which process data sequentially).
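A minimal NumPy sketch of the scaled dot-product self-attention described above (function name, shapes, and random weights are illustrative, not taken from any library):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over one sequence.

    X:           (seq_len, d_model) token embeddings
    Wq, Wk, Wv:  (d_model, d_k) learned projections (random here for illustration)
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # every token scored against every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # context-weighted mix of values per token

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                          # 4 tokens, embedding dim 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)           # -> (4, 8)
```

Because the score matrix is computed for all token pairs at once, the whole sequence can be processed in parallel rather than step by step.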
Why are Transformers computationally expensive?
Self-attention has O(n²) complexity in the sequence length, because every token interacts with every other token.
-> This makes processing long sequences expensive, requiring high memory and computing power.
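A rough sense of how this grows, counting the entries of the n × n score matrix (float32 assumed; this is plain arithmetic, not a benchmark):

```python
# memory for a single n x n attention-score matrix in float32 (one head, one layer)
for n in (512, 2048, 8192):
    pairs = n * n
    print(f"n={n:5d}  pairwise scores={pairs:>12,}  ~{pairs * 4 / 2**20:6.0f} MiB")
```

Doubling the sequence length quadruples both the number of pairwise scores and the memory they occupy.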
What is multi-head self-attention, and why is it useful?
Multi-head self-attention runs several attention heads in parallel, each with its own learned projections, letting the model focus on different parts of a sentence simultaneously.
Benefits:
Helps capture multiple relationships between words.
Improves contextual understanding.
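A self-contained NumPy sketch of the multi-head mechanism (head count, dimensions, and random weights are illustrative stand-ins for learned parameters):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, heads, d_model, rng):
    """Run several attention heads in parallel subspaces and concatenate them."""
    d_k = d_model // heads
    outputs = []
    for _ in range(heads):
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(d_k))           # this head's own attention pattern
        outputs.append(A @ V)                         # (seq_len, d_k) per head
    Wo = rng.normal(size=(d_model, d_model))          # output projection
    return np.concatenate(outputs, axis=-1) @ Wo      # back to (seq_len, d_model)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                           # 4 tokens, d_model = 8
print(multi_head_attention(X, heads=2, d_model=8, rng=rng).shape)  # -> (4, 8)
```

Each head computes its own attention pattern over a lower-dimensional subspace, so different heads can specialize in different word relationships.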
What are the 4 main components of a Transformer model?
Self-attention layers → Capture word relationships.
Feedforward layers → Process attended features.
Layer normalization → Stabilizes training.
Positional encoding → Adds order information to words.
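As a concrete example of the fourth component, a short sketch of the sinusoidal positional encoding from the original Transformer paper (dimensions are illustrative):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: even dims use sine, odd dims use cosine,
    with wavelengths growing geometrically across dimensions."""
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model / 2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# added to the token embeddings so that attention can distinguish word order
print(positional_encoding(seq_len=6, d_model=8).shape)  # -> (6, 8)
```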
How does the Transformer differ from RNNs?
Transformer:
Uses self-attention instead of recurrence.
Processes entire sequences in parallel.
More scalable for large datasets.
RNNs:
Process sequences step by step (slower).
Struggle with long-range dependencies due to vanishing gradients.
What are the special tokens used in BERT?
[CLS] → Classification token (used to represent entire sequences).
[MASK] → Used in masked language modeling (MLM) for training.
[SEP] → Separator token (marks the boundary between sentence pairs and the end of the input).
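A quick way to see where these tokens are placed, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint are available:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Encoding a sentence pair inserts [CLS] at the start and [SEP] after each segment.
encoded = tokenizer("The cat sat on the mat.", "It was asleep.")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# prints something like:
# ['[CLS]', 'the', 'cat', 'sat', 'on', 'the', 'mat', '.', '[SEP]', 'it', 'was', 'asleep', '.', '[SEP]']
```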
How does BERT’s pretraining process work?
Masked Language Model (MLM) – Predicts masked words in a sentence.
Next Sentence Prediction (NSP) – Determines if one sentence follows another.
After pretraining, BERT is fine-tuned for specific NLP tasks.
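The MLM half of this can be probed with a fill-mask pipeline (again assuming the transformers library and the bert-base-uncased checkpoint; the predictions shown will vary by model):

```python
from transformers import pipeline

# BERT was pretrained to recover [MASK] tokens from their bidirectional context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for pred in fill_mask("The Transformer relies on [MASK] attention instead of recurrence."):
    print(f"{pred['token_str']:>12s}  {pred['score']:.3f}")
```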
What is SPLADE, and how does it differ from BERT?
SPLADE combines BERT’s semantic understanding with sparse, interpretable representations.
-> It generates sparse vectors by activating only relevant terms from a vocabulary.
How does SPLADE generate sparse and interpretable vector representations?
Uses BERT embeddings to encode each input token.
Expands the dense vectors into weights over the full vocabulary (term expansion).
Enforces sparsity of the resulting vectors through an activation function (log of 1 + ReLU) -> keeping only the most relevant terms.
Applies regularization during training to further control the sparsity.
-> This makes SPLADE more explainable and efficient than fully dense retrieval models.
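A rough sketch of that pipeline with the transformers library. The checkpoint name is a published SPLADE model cited from memory, and the log(1 + ReLU) plus max-pooling step follows the SPLADE papers, so treat this as illustrative rather than a faithful reimplementation:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "naver/splade-cocondenser-ensembledistil"   # published SPLADE checkpoint (name from memory)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

text = "what is the transformer architecture"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits                      # (1, seq_len, vocab_size) MLM scores

# log-saturated ReLU keeps weights positive and dampens large activations,
# then max-pooling over the sequence leaves one weight per vocabulary term
weights = torch.log1p(torch.relu(logits)).max(dim=1).values.squeeze(0)

# the resulting vector is sparse: most vocabulary terms get zero weight
nonzero = weights.nonzero().squeeze(1)
top = nonzero[weights[nonzero].argsort(descending=True)][:10]
for idx in top:
    print(f"{tokenizer.convert_ids_to_tokens(int(idx)):>15s}  {weights[idx]:.2f}")
```

The nonzero terms can be read off directly as vocabulary words, which is what makes the representation interpretable and compatible with inverted-index retrieval.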