Why are count-based language models (e.g., N-Grams) insufficient?
Limitations of count-based models:
Struggle with long-distance dependencies.
Require large corpora to estimate probabilities accurately.
Have an exponential growth in model size as context length increases.
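To make the sparsity point concrete, here is a minimal bigram (2-gram) sketch with a made-up toy corpus; any context not seen in training gets probability zero:

```python
from collections import Counter

# Tiny toy corpus (illustrative only, not from the card)
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count unigrams and bigrams
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(w1, w2):
    """MLE estimate P(w2 | w1) = count(w1, w2) / count(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

print(bigram_prob("the", "cat"))   # seen bigram  -> non-zero probability
print(bigram_prob("the", "bird"))  # unseen bigram -> 0.0 (sparsity problem)
```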
How does the Fixed-Window Neural Language Model work?
It concatenates the embeddings of the previous n words, feeds them through a hidden layer, and outputs a softmax distribution over the vocabulary.
Training: Optimize its parameters θ such that the model assigns high probability to the target (next) word.
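A minimal PyTorch sketch of this architecture, assuming illustrative sizes (vocab_size, window, emb_dim, hidden_dim are not from the card):

```python
import torch
import torch.nn as nn

class FixedWindowLM(nn.Module):
    """Minimal fixed-window neural LM: concat embeddings -> hidden layer -> softmax."""
    def __init__(self, vocab_size=10000, window=4, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)           # word embeddings
        self.hidden = nn.Linear(window * emb_dim, hidden_dim)  # W grows with the window
        self.out = nn.Linear(hidden_dim, vocab_size)           # scores over the vocabulary

    def forward(self, window_ids):               # window_ids: (batch, window)
        e = self.emb(window_ids).flatten(1)      # (batch, window * emb_dim)
        h = torch.tanh(self.hidden(e))           # hidden representation
        return self.out(h)                       # logits; softmax / cross-entropy follows

# Training objective: assign high probability to the gold next token,
# i.e. minimize the cross-entropy between the logits and the target word.
model = FixedWindowLM()
logits = model(torch.randint(0, 10000, (2, 4)))              # two example windows
loss = nn.functional.cross_entropy(logits, torch.tensor([42, 7]))
```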
What are the 3 steps for feeding text to a neural net?
Steps for converting text into input:
Turn each word into a unique index.
Convert the index into a one-hot vector.
Use matrix multiplication to lookup the word embedding.
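The one-hot/matrix-multiplication view can be checked in a few lines of NumPy (vocabulary and embedding dimensions are illustrative):

```python
import numpy as np

vocab = ["the", "cat", "sat"]            # step 1: each word gets a unique index
idx = vocab.index("cat")                 # -> 1

one_hot = np.zeros(len(vocab))           # step 2: one-hot vector for that index
one_hot[idx] = 1.0

E = np.random.randn(len(vocab), 4)       # embedding matrix, one 4-dim row per word
embedding = one_hot @ E                  # step 3: matrix multiplication selects row idx

assert np.allclose(embedding, E[idx])    # identical to a direct table lookup
```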
What are the remaining problems of the fixed-window neural LM? (3)
Fixed window is too small
Enlarging the window enlarges W (the window can never be large enough).
It’s not deep enough to capture nuanced contextual meanings
How do Neural Language Models improve upon N-Grams? (2)
Key improvements:
Tackle the sparsity problem by learning dense embeddings.
Model size is O(n), not O(exp(n)), with n being the window size (see the size comparison below).
Neural LMs share information across semantically similar prefixes and thereby overcome the sparsity issue (N-Grams treat all prefixes as independent).
Effect: Better generalization to unseen text.
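A back-of-the-envelope size comparison, assuming an illustrative vocabulary of 10,000 words, a window of 4, and small embedding/hidden sizes (none of these numbers come from the card):

```python
# Illustrative sizes, not from the card
V, n, d, h = 10_000, 4, 64, 128

# Count-based n-gram model: one entry per possible n-gram -> exponential in n
ngram_table = V ** n                         # 1e16 possible 4-grams

# Fixed-window neural LM: embeddings + hidden layer + output layer -> linear in n
neural_params = V * d + (n * d) * h + h * V  # roughly 2 million parameters

print(f"{ngram_table:.2e} vs {neural_params:.2e}")
```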
What are the challenges of RNNs?
Limitations of RNNs:
They quickly forget portions of the input.
Vanishing gradients → Hard to retain long-term dependencies.
Exploding gradients → Large updates destabilize training.
Difficult to parallelize → Must process text sequentially.
→ Solution: Use Self-Attention (Transformers).
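A toy illustration of the vanishing-gradient and sequential-processing points, using an assumed scalar recurrence h_t = w * h_{t-1} (the weight value 0.5 is arbitrary):

```python
import torch

# Backprop through the recurrence multiplies by w at every step, so the gradient
# w.r.t. the first hidden state shrinks when |w| < 1 and blows up when |w| > 1.
w = torch.tensor(0.5)
h0 = torch.tensor(1.0, requires_grad=True)

h = h0
for _ in range(50):       # 50 time steps, processed strictly one after another
    h = w * h

h.backward()
print(h0.grad)            # 0.5 ** 50 ≈ 8.9e-16 -> vanishing gradient
```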
What is the role of Self-Attention in language modeling?
Self-Attention helps models focus on relevant words in a sentence.
Benefits:
Creates context-aware representations.
Helps maintain long-range dependencies.
Fewer sequential operations: O(1) vs. O(n) for recurrent models.
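A minimal sketch of scaled dot-product self-attention; the dimensions and random projection matrices are illustrative, not a full Transformer layer. Note that all positions are handled in one batch of matrix multiplications, hence the O(1) sequential operations:

```python
import torch
import torch.nn.functional as F

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X of shape (seq_len, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # project into queries, keys, values
    scores = Q @ K.T / K.shape[-1] ** 0.5     # every position scores every other position
    weights = F.softmax(scores, dim=-1)       # relevance of each word to each word
    return weights @ V                        # context-aware representations

d = 16
X = torch.randn(5, d)                         # 5 tokens with illustrative dimensions
out = self_attention(X, torch.randn(d, d), torch.randn(d, d), torch.randn(d, d))
print(out.shape)                              # torch.Size([5, 16])
```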
How is a Transformer trained for Language Modeling?
Training process:
Each position predicts the next token (the target sequence is the input shifted by one position).
Compute token-wise probability distributions over vocabulary.
Calculate the loss by comparing predictions to actual tokens.
Backpropagate and update parameters.
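A sketch of how the shift-by-one targets and the token-wise loss fit together; the token ids are made up and random logits stand in for a real Transformer's output:

```python
import torch
import torch.nn.functional as F

tokens = torch.tensor([[5, 8, 2, 9, 3]])          # one example sequence of token ids
inputs = tokens[:, :-1]                           # fed to the model:  [5, 8, 2, 9]
targets = tokens[:, 1:]                           # gold next tokens:  [8, 2, 9, 3]

# Stand-in for Transformer output logits: (batch, positions, vocab_size)
logits = torch.randn(1, 4, 100, requires_grad=True)

# Token-wise distributions compared against the gold tokens, combined into one loss
loss = F.cross_entropy(logits.reshape(-1, 100), targets.reshape(-1))
loss.backward()                                   # backpropagate, then update parameters
```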
Steps for training a Transformer LM:
Compute, for each position, its probability distribution over the whole vocabulary.
Compute, for each position, the loss between that distribution and the gold output label.
Sum the position-wise loss values to obtain a global loss.
Using this loss, backpropagate and update the Transformer parameters.
Use an attention mask to prevent information leakage from future tokens.
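A minimal sketch of the causal (look-ahead) mask that enforces the last step; the sequence length and scores are illustrative:

```python
import torch

seq_len = 5

# Causal mask: position i may only attend to positions <= i
mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

scores = torch.randn(seq_len, seq_len)              # raw attention scores
scores = scores.masked_fill(~mask, float("-inf"))   # hide future positions
weights = torch.softmax(scores, dim=-1)             # rows sum to 1 over visible tokens

print(weights[0])                                   # the first token sees only itself
```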