Definition Language Model
Model that assigns probabilities to sequences of words
What is an N-gram language model, and how does it work?
N-gram models predict the next word from the previous N-1 words.
Example: estimate probabilities by counting occurrences (maximum likelihood estimate):
P(word | preceding words) ≈ C(preceding words + word) / C(preceding words)
Use the pseudo-word <S> to mark the sentence beginning.
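A minimal sketch of the counting estimate for a bigram model; the toy corpus and function names are illustrative, not from the card:

```python
from collections import Counter

# Toy corpus; <S> marks the sentence beginning (a pseudo-word).
corpus = [["<S>", "i", "like", "cats"],
          ["<S>", "i", "like", "dogs"],
          ["<S>", "you", "like", "cats"]]

bigram_counts = Counter()
unigram_counts = Counter()
for sentence in corpus:
    for prev, word in zip(sentence, sentence[1:]):
        bigram_counts[(prev, word)] += 1
        unigram_counts[prev] += 1

def bigram_prob(prev, word):
    """P(word | prev) = C(prev, word) / C(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("<S>", "i"))     # 2/3
print(bigram_prob("like", "cats")) # 2/3
```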
Definition N-Gram
Sequence of n tokens
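For illustration, a small helper that extracts the n-grams of a token sequence (the function name is an assumption):

```python
def ngrams(tokens, n):
    """Return all n-grams (tuples of n consecutive tokens)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(["the", "cat", "sat", "down"], 2))
# [('the', 'cat'), ('cat', 'sat'), ('sat', 'down')]
```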
What are the 2 challenges of using N-Gram models?
Challenges:
Probabilities are ≤ 1, so multiplying many small probabilities → Numerical underflow.
Rare or unseen N-Grams → Probability becomes zero, affecting generation & evaluation.
Solution:
Store log probabilities instead of raw probabilities.
Logarithms prevent numerical underflow while preserving ranking order.
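A quick sketch of why log probabilities help: the raw product of many small probabilities underflows to zero, while the sum of their logs stays representable (the values are illustrative):

```python
import math

# Many small probabilities: the raw product underflows to 0.0,
# while summing log probabilities stays well-behaved.
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p
print(product)   # 0.0 (numerical underflow)

log_sum = sum(math.log(p) for p in probs)
print(log_sum)   # ~ -1151.3, still usable for comparing sequences
```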
What are the two main approaches to evaluate a language model?
Extrinsic evaluation:
Test LM in an actual application (e.g., speech recognition).
Expensive but measures real-world impact.
Intrinsic evaluation:
Evaluate LM using a gold standard test set.
Prevent test data from appearing in training.
How do N-Gram models generate text? (4 Steps)
Start with a seed word or phrase.
Predict the next word based on conditional probabilities.
Sample from the probability distribution.
Repeat until completion.
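A toy sketch of these four steps with a bigram model; the corpus, the </S> end-of-sentence marker, and the function names are illustrative assumptions:

```python
import random
from collections import Counter, defaultdict

corpus = [["<S>", "i", "like", "cats", "</S>"],
          ["<S>", "i", "like", "dogs", "</S>"],
          ["<S>", "you", "like", "cats", "</S>"]]

# For every context word, collect the words that follow it (with counts).
followers = defaultdict(Counter)
for sentence in corpus:
    for prev, word in zip(sentence, sentence[1:]):
        followers[prev][word] += 1

def generate(seed="<S>", max_len=10):
    """Generate text by repeatedly sampling the next word from P(word | prev)."""
    tokens = [seed]
    while tokens[-1] != "</S>" and len(tokens) < max_len:
        candidates = followers[tokens[-1]]
        words = list(candidates.keys())
        weights = list(candidates.values())
        tokens.append(random.choices(words, weights=weights, k=1)[0])
    return " ".join(tokens[1:])  # drop the <S> pseudo-word

print(generate())
```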
How do you calculate perplexity?
Perplexity of a test set W = w_1 ... w_N: PP(W) = P(w_1 w_2 ... w_N)^(-1/N), i.e. the inverse probability of the test set, normalized by the number of words.
Minimizing perplexity -> maximizing the probability of the test set.
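A small sketch of the perplexity computation from per-word log probabilities; the probability values are made up for illustration:

```python
import math

def perplexity(log_probs):
    """PP = exp(-(1/N) * sum of log P(w_i | history)), using natural logs."""
    n = len(log_probs)
    return math.exp(-sum(log_probs) / n)

# Per-word log probabilities assigned by some language model (illustrative values).
log_probs = [math.log(0.2), math.log(0.1), math.log(0.25), math.log(0.05)]
print(perplexity(log_probs))  # ~ 7.95, lower is better
```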
What is the Out-of-Vocabulary (OOV) problem, why is it a challenge, and what are 2 solutions?
Problems caused by OOV words:
During evaluation: Test set contains unseen words → Probability becomes zero → Perplexity undefined.
During generation: LM cannot generate words it has never seen.
Solutions:
Replace rare words with <UNK> (unknown token).
Use smoothing techniques (e.g., Laplace smoothing) to assign probabilities to unseen words.
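A minimal sketch of the <UNK> replacement step; the frequency threshold and example corpus are illustrative choices:

```python
from collections import Counter

def replace_rare_words(sentences, min_count=2):
    """Map words seen fewer than min_count times to <UNK>."""
    counts = Counter(word for sentence in sentences for word in sentence)
    vocab = {word for word, c in counts.items() if c >= min_count}
    return [[word if word in vocab else "<UNK>" for word in sentence]
            for sentence in sentences]

train = [["i", "like", "cats"], ["you", "like", "cats"], ["i", "like", "rhinoceroses"]]
print(replace_rare_words(train))
# [['i', 'like', 'cats'], ['<UNK>', 'like', 'cats'], ['i', 'like', '<UNK>']]
```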
What are 5 smoothing techniques, and why are they needed in N-Gram models?
Purpose: Assign some probability to unseen N-Grams to prevent zero probabilities.
Smoothing techniques:
Use <UNK> -> Convert OOV and rare words to <UNK>.
Laplace Smoothing → Add 1 to every count.
Add-k Smoothing → Add a smaller fraction k (e.g., 0.5, 0.05).
Backoff → Use smaller N-Grams when larger ones are missing.
Interpolation → Combine probabilities from different N-Gram sizes.
🚀 Effect: Prevents zero probabilities & improves generalization.
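A small sketch of add-k smoothing for bigrams (k = 1 gives Laplace smoothing); the counts and vocabulary size are illustrative:

```python
from collections import Counter

def smoothed_prob(bigram_counts, unigram_counts, vocab_size, prev, word, k=1.0):
    """Add-k estimate: (C(prev, word) + k) / (C(prev) + k * |V|).
    k = 1.0 is Laplace smoothing; smaller k (e.g., 0.05) adds less probability mass."""
    return (bigram_counts[(prev, word)] + k) / (unigram_counts[prev] + k * vocab_size)

bigram_counts = Counter({("i", "like"): 2})
unigram_counts = Counter({"i": 2})
vocab_size = 1000

print(smoothed_prob(bigram_counts, unigram_counts, vocab_size, "i", "like"))  # seen bigram
print(smoothed_prob(bigram_counts, unigram_counts, vocab_size, "i", "hate"))  # unseen, but > 0
```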
What are the limitations of word-level tokenization, and how does subword tokenization solve them?
Word-level tokenization problems:
Struggles with out-of-vocab (OOV) words
Fails to generalize well to unseen or rare terms
Requires a larger vocab to account for all word variations
Subword tokenization solution:
Breaks words into smaller units -> flexible tokenization
Guarantees near-complete coverage, even for unseen text
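A toy greedy longest-match splitter to illustrate the idea; real subword tokenizers learn their vocabularies from data, while the vocabulary here is hand-picked:

```python
def subword_split(word, vocab):
    """Greedy longest-match split of a word into subword units from vocab."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:              # no match: fall back to a single character
            pieces.append(word[start])
            start += 1
        else:
            pieces.append(word[start:end])
            start = end
    return pieces

vocab = {"token", "ization", "un", "seen", "s"}
print(subword_split("tokenizations", vocab))  # ['token', 'ization', 's']
print(subword_split("unseen", vocab))         # ['un', 'seen']
```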
What is WordPiece Tokenization, and how does it differ from BPE?
WordPiece Tokenization:
Similar to Byte Pair Encoding -> Initializes the vocabulary with every character in the training data.
Learns a fixed number of merge rules.
Difference from BPE:
Selects the symbol pair that maximizes training-data likelihood, not just frequency.
Avoids merging symbols that reduce overall probability.
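A sketch of the difference in merge selection: BPE picks the most frequent pair, while WordPiece is commonly described as scoring pairs by count(a, b) / (count(a) * count(b)); all counts below are made up:

```python
from collections import Counter

# Pair and symbol statistics over a toy corpus (illustrative counts).
pair_counts = Counter({("h", "u"): 10, ("u", "g"): 15, ("p", "u"): 5})
symbol_counts = Counter({"h": 12, "u": 30, "g": 20, "p": 5})

# BPE: merge the most frequent pair.
bpe_pick = max(pair_counts, key=pair_counts.get)

# WordPiece: merge the pair with the highest likelihood gain,
# commonly scored as count(a, b) / (count(a) * count(b)).
def wordpiece_score(pair):
    a, b = pair
    return pair_counts[pair] / (symbol_counts[a] * symbol_counts[b])

wp_pick = max(pair_counts, key=wordpiece_score)

print(bpe_pick)  # ('u', 'g'): highest raw frequency
print(wp_pick)   # ('p', 'u'): rarer symbols, but a high score relative to their counts
```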
What is SentencePiece Tokenization, and how does it differ from WordPiece?
SentencePiece Tokenization:
Does not assume spaces separate words.
Treats spaces as characters.
Uses the ▁ symbol (rendered like an underscore) to represent spaces and adds it to the base vocabulary.
🚀 Advantages:
Works for languages without spaces (e.g., Chinese, Japanese).
No need for pre-tokenization.
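A small sketch of the space-marking idea only, not the actual SentencePiece implementation:

```python
# SentencePiece-style pre-processing: treat the input as a raw character stream
# and mark spaces explicitly with the ▁ symbol, so no pre-tokenization is needed.
def mark_spaces(text):
    return "▁" + text.replace(" ", "▁")

marked = mark_spaces("new york city")
print(list(marked))
# ['▁', 'n', 'e', 'w', '▁', 'y', 'o', 'r', 'k', '▁', 'c', 'i', 't', 'y']
```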