Definition Language Model
Model that assigns probabilities to sequences of words
What is an N-gram language model, and how does it work?
N-gram models predict the next word from the previous N-1 words.
Example: estimate probabilities by counting occurrences (maximum likelihood estimate):
P(word | preceding words) ≈ C(preceding words + word) / C(preceding words)
Use the pseudo-word <S> to mark the sentence beginning.
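A minimal sketch of the counting estimate for a bigram model; the toy corpus and function names are illustrative, not from the card:

```python
from collections import Counter

# Toy corpus; <S> marks the sentence beginning (a pseudo-word).
corpus = [["<S>", "i", "like", "cats"],
          ["<S>", "i", "like", "dogs"],
          ["<S>", "you", "like", "cats"]]

bigram_counts = Counter()
unigram_counts = Counter()
for sentence in corpus:
    for prev, word in zip(sentence, sentence[1:]):
        bigram_counts[(prev, word)] += 1
        unigram_counts[prev] += 1

def bigram_prob(prev, word):
    """P(word | prev) = C(prev, word) / C(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("<S>", "i"))     # 2/3
print(bigram_prob("like", "cats")) # 2/3
```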
Definition N-Gram
Sequence of n tokens
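For illustration, a small helper that extracts the n-grams of a token sequence (the function name is an assumption):

```python
def ngrams(tokens, n):
    """Return all n-grams (tuples of n consecutive tokens)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(["the", "cat", "sat", "down"], 2))
# [('the', 'cat'), ('cat', 'sat'), ('sat', 'down')]
```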
What are the 2 challenges of using N-Gram models?
Challenges:
Probabilities are ≤ 1, so multiplying many small probabilities → Numerical underflow.
Rare or unseen N-Grams → Probability becomes zero, affecting generation & evaluation.
Solution:
Store log probabilities instead of raw probabilities.
Logarithms prevent numerical underflow while preserving ranking order.
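A quick sketch of why log probabilities help: the raw product of many small probabilities underflows to zero, while the sum of their logs stays representable (the values are illustrative):

```python
import math

# Many small probabilities: the raw product underflows to 0.0,
# while summing log probabilities stays well-behaved.
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p
print(product)   # 0.0 (numerical underflow)

log_sum = sum(math.log(p) for p in probs)
print(log_sum)   # ~ -1151.3, still usable for comparing sequences
```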
What are the two main approaches to evaluate a language model?
Extrinsic evaluation:
Test LM in an actual application (e.g., speech recognition).
Expensive but measures real-world impact.
Intrinsic evaluation:
Evaluate LM using a gold standard test set.
Prevent test data from appearing in training.
How do N-Gram models generate text? (4 Steps)
Start with a seed word or phrase.
Predict the next word based on conditional probabilities.
Sample from the probability distribution.
Repeat until completion.
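A toy sketch of these four steps with a bigram model; the corpus, the </S> end-of-sentence marker, and the function names are illustrative assumptions:

```python
import random
from collections import Counter, defaultdict

corpus = [["<S>", "i", "like", "cats", "</S>"],
          ["<S>", "i", "like", "dogs", "</S>"],
          ["<S>", "you", "like", "cats", "</S>"]]

# For every context word, collect the words that follow it (with counts).
followers = defaultdict(Counter)
for sentence in corpus:
    for prev, word in zip(sentence, sentence[1:]):
        followers[prev][word] += 1

def generate(seed="<S>", max_len=10):
    """Generate text by repeatedly sampling the next word from P(word | prev)."""
    tokens = [seed]
    while tokens[-1] != "</S>" and len(tokens) < max_len:
        candidates = followers[tokens[-1]]
        words = list(candidates.keys())
        weights = list(candidates.values())
        tokens.append(random.choices(words, weights=weights, k=1)[0])
    return " ".join(tokens[1:])  # drop the <S> pseudo-word

print(generate())
```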
How do you calculate perplexity?
Perplexity of a test set W = w_1 ... w_N: PP(W) = P(w_1 w_2 ... w_N)^(-1/N), i.e. the inverse probability of the test set, normalized by the number of words.
Minimizing perplexity -> maximizing the probability of the test set.
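A small sketch of the perplexity computation from per-word log probabilities; the probability values are made up for illustration:

```python
import math

def perplexity(log_probs):
    """PP = exp(-(1/N) * sum of log P(w_i | history)), using natural logs."""
    n = len(log_probs)
    return math.exp(-sum(log_probs) / n)

# Per-word log probabilities assigned by some language model (illustrative values).
log_probs = [math.log(0.2), math.log(0.1), math.log(0.25), math.log(0.05)]
print(perplexity(log_probs))  # ~ 7.95, lower is better
```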
What is the Out-of-Vocabulary (OOV) problem, why is it a challenge, and what are 2 solutions?
Problems caused by OOV words:
During evaluation: Test set contains unseen words → Probability becomes zero → Perplexity undefined.
During generation: LM cannot generate words it has never seen.
Solutions:
Replace rare words with <UNK> (unknown token).
Use smoothing techniques (e.g., Laplace smoothing) to assign probabilities to unseen words.
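A minimal sketch of the <UNK> replacement step; the frequency threshold and example corpus are illustrative choices:

```python
from collections import Counter

def replace_rare_words(sentences, min_count=2):
    """Map words seen fewer than min_count times to <UNK>."""
    counts = Counter(word for sentence in sentences for word in sentence)
    vocab = {word for word, c in counts.items() if c >= min_count}
    return [[word if word in vocab else "<UNK>" for word in sentence]
            for sentence in sentences]

train = [["i", "like", "cats"], ["you", "like", "cats"], ["i", "like", "rhinoceroses"]]
print(replace_rare_words(train))
# [['i', 'like', 'cats'], ['<UNK>', 'like', 'cats'], ['i', 'like', '<UNK>']]
```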
What are 5 smoothing techniques, and why are they needed in N-Gram models?
Purpose: Assign some probability to unseen N-Grams to prevent zero probabilities.
Smoothing techniques:
Use <UNK> -> Convert OOV and rare words to <UNK>.
Laplace Smoothing → Add 1 to every count.
Add-k Smoothing → Add a smaller fraction k (e.g., 0.5, 0.05).
Backoff → Use smaller N-Grams when larger ones are missing.
Interpolation → Combine probabilities from different N-Gram sizes.
🚀 Effect: Prevents zero probabilities & improves generalization.
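A small sketch of add-k smoothing for bigrams (k = 1 gives Laplace smoothing); the counts and vocabulary size are illustrative:

```python
from collections import Counter

def smoothed_prob(bigram_counts, unigram_counts, vocab_size, prev, word, k=1.0):
    """Add-k estimate: (C(prev, word) + k) / (C(prev) + k * |V|).
    k = 1.0 is Laplace smoothing; smaller k (e.g., 0.05) adds less probability mass."""
    return (bigram_counts[(prev, word)] + k) / (unigram_counts[prev] + k * vocab_size)

bigram_counts = Counter({("i", "like"): 2})
unigram_counts = Counter({"i": 2})
vocab_size = 1000

print(smoothed_prob(bigram_counts, unigram_counts, vocab_size, "i", "like"))  # seen bigram
print(smoothed_prob(bigram_counts, unigram_counts, vocab_size, "i", "hate"))  # unseen, but > 0
```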
What are the limitations of word-level tokenization, and how does subword tokenization solve them?
Word-level tokenization problems:
Struggles with out-of-vocab (OOV) words
Fails to generalize well to unseen or rare terms
Requires a larger vocab to account for all word variations
Subword tokenization solution:
Breaks words into smaller units -> flexible tokenization
Guarantees near-complete coverage, even for unseen text
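A toy greedy longest-match splitter to illustrate the idea; real subword tokenizers learn their vocabularies from data, while the vocabulary here is hand-picked:

```python
def subword_split(word, vocab):
    """Greedy longest-match split of a word into subword units from vocab."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:              # no match: fall back to a single character
            pieces.append(word[start])
            start += 1
        else:
            pieces.append(word[start:end])
            start = end
    return pieces

vocab = {"token", "ization", "un", "seen", "s"}
print(subword_split("tokenizations", vocab))  # ['token', 'ization', 's']
print(subword_split("unseen", vocab))         # ['un', 'seen']
```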
What is WordPiece Tokenization, and how does it differ from BPE?
WordPiece Tokenization:
Similar to Byte Pair Encoding -> Initializes the vocabulary with every character in the training data.
Learns a fixed number of merge rules.
Difference from BPE:
Selects the symbol pair that maximizes training-data likelihood, not just frequency.
Avoids merging symbols that reduce overall probability.
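A sketch of the difference in merge selection: BPE picks the most frequent pair, while WordPiece is commonly described as scoring pairs by count(a, b) / (count(a) * count(b)); all counts below are made up:

```python
from collections import Counter

# Pair and symbol statistics over a toy corpus (illustrative counts).
pair_counts = Counter({("h", "u"): 10, ("u", "g"): 15, ("p", "u"): 5})
symbol_counts = Counter({"h": 12, "u": 30, "g": 20, "p": 5})

# BPE: merge the most frequent pair.
bpe_pick = max(pair_counts, key=pair_counts.get)

# WordPiece: merge the pair with the highest likelihood gain,
# commonly scored as count(a, b) / (count(a) * count(b)).
def wordpiece_score(pair):
    a, b = pair
    return pair_counts[pair] / (symbol_counts[a] * symbol_counts[b])

wp_pick = max(pair_counts, key=wordpiece_score)

print(bpe_pick)  # ('u', 'g'): highest raw frequency
print(wp_pick)   # ('p', 'u'): rarer symbols, but a high score relative to their counts
```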
What is SentencePiece Tokenization, and how does it differ from WordPiece?
SentencePiece Tokenization:
Does not assume spaces separate words.
Treats spaces as characters.
Uses the ▁ symbol (rendered like an underscore) to represent spaces and adds it to the base vocabulary.
🚀 Advantages:
Works for languages without spaces (e.g., Chinese, Japanese).
No need for pre-tokenization.
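A small sketch of the space-marking idea only, not the actual SentencePiece implementation:

```python
# SentencePiece-style pre-processing: treat the input as a raw character stream
# and mark spaces explicitly with the ▁ symbol, so no pre-tokenization is needed.
def mark_spaces(text):
    return "▁" + text.replace(" ", "▁")

marked = mark_spaces("new york city")
print(list(marked))
# ['▁', 'n', 'e', 'w', '▁', 'y', 'o', 'r', 'k', '▁', 'c', 'i', 't', 'y']
```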