What is one-hot encoding, and why is it inefficient for word representation?
One-hot encoding represents words as binary vectors where only one position is 1, and all others are 0.
Inefficiencies:
Very sparse (high-dimensional).
No semantic relationships between words.
Not scalable for large vocabularies.
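A minimal sketch in Python (the toy vocabulary below is invented for illustration):

```python
import numpy as np

# Toy vocabulary; real vocabularies easily reach 100k+ words,
# which is what makes one-hot vectors so sparse.
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a vector of length |vocab| with a single 1."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("cat"))                   # [0. 1. 0. 0. 0.]
# Every pair of distinct words is orthogonal, so no notion of similarity survives.
print(one_hot("cat") @ one_hot("mat"))  # 0.0
```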
How do word embeddings improve upon one-hot encoding?
Word embeddings are dense vector representations that capture semantic relationships between words.
Advantages:
Lower dimensionality (100-300 dimensions instead of thousands).
Encodes meaning (similar words have closer vectors).
Mathematical operations can be performed on vectors (e.g., king - man + woman ≈ queen).
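A toy illustration of the analogy arithmetic (the 3-dimensional vectors below are invented purely to show the idea; real embeddings are learned from data):

```python
import numpy as np

# Invented "embeddings"; real ones have 100-300 learned dimensions.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.2, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.3, 0.8]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = emb["king"] - emb["man"] + emb["woman"]
# The word whose vector is nearest to the analogy result:
best = max(emb, key=lambda w: cosine(emb[w], target))
print(best)  # queen (in practice the query words are excluded from the candidates)
```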
What is the difference between Skip-gram and CBOW in Word2Vec?
Skip-gram: Predicts context words given the center word. Works well with small training data and for rare words.
CBOW: Predicts the center word given its context words. Faster to train and slightly more accurate for frequent words.
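Both variants are exposed in gensim's Word2Vec through the sg flag; a minimal sketch, assuming gensim >= 4 is installed and using a toy corpus:

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (real training needs far more data).
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# sg=1 -> Skip-gram: predict context words from the center word.
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

# sg=0 -> CBOW: predict the center word from its averaged context.
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

print(skipgram.wv["cat"].shape)            # (50,)
print(cbow.wv.most_similar("cat", topn=2))
```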
Why is negative sampling used in Word2Vec?
Negative sampling speeds up training by only updating a small subset of weights instead of the entire vocabulary.
Instead of computing probabilities for all words, it selects a few negative words (random words unlikely to be in the context).
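A rough sketch of one skip-gram update with negative sampling (simplified: negatives are drawn uniformly here, whereas Word2Vec uses a smoothed unigram distribution; all hyperparameters are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, k, lr = 10_000, 100, 5, 0.025   # k = number of negative samples

# Input (center-word) and output (context-word) embedding tables.
W_in = rng.normal(scale=0.1, size=(vocab_size, dim))
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pair(center, context):
    """Update only 1 + k output rows, not all vocab_size of them."""
    negatives = rng.integers(0, vocab_size, size=k)   # simplified: uniform sampling
    v = W_in[center]
    v_grad = np.zeros(dim)
    for word, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        u = W_out[word]
        g = sigmoid(v @ u) - label        # gradient of the logistic loss
        v_grad += g * u
        W_out[word] -= lr * g * v
    W_in[center] -= lr * v_grad

train_pair(center=42, context=7)
```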
Why is subword tokenization useful in word representation?
Subword tokenization breaks words into smaller units (e.g., unhappiness → un + happy + ness).
Helps with rare words
Allows for better generalization to unseen words
Reduces vocabulary size while preserving meaning
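A toy greedy longest-match segmenter makes this concrete (the subword inventory is hand-picked for illustration; real tokenizers learn it from data, e.g. with BPE below):

```python
# Hand-picked subword inventory, for illustration only.
subwords = {"un", "happi", "happy", "ness"}

def tokenize(word):
    """Greedy longest-match segmentation into known subwords."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):     # try the longest piece first
            if word[i:j] in subwords:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])            # unknown: fall back to a single character
            i += 1
    return pieces

print(tokenize("unhappiness"))  # ['un', 'happi', 'ness']
print(tokenize("unhappy"))      # ['un', 'happy']
```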
How does Byte-Pair Encoding (BPE) work in word tokenization?
BPE merges the most frequent character pairs iteratively to form subwords.
Example: t h a t -> t h at -> th at -> that
Compresses text efficiently.
Works well for morphologically complex languages.
3 main advantages of BPE
Compression efficiency: common sequences are treated as single tokens
Adaptability: the encoding can be tuned to different types of text
Flexibility: Can handle out-of-vocabulary text
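A compact sketch of the BPE merge loop (end-of-word markers omitted; the word frequencies below are invented):

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs over a {"t h a t": frequency} vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Rewrite every occurrence of the pair as a single merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words split into characters, with invented frequencies.
vocab = {"t h a t": 5, "t h e": 4, "h a t": 3, "a t": 2}

for step in range(3):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)          # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(step, best, sorted(vocab))
# Reproduces the merge sequence above: a+t -> at, t+h -> th, th+at -> that
```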
What is the difference between pretraining and fine-tuning in word embeddings?
Pretraining: Train embeddings on large, general datasets (e.g., Wikipedia, Common Crawl).
Fine-tuning: Adapt embeddings to specific tasks (e.g., medical terminology, legal documents).
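A minimal sketch with gensim (assuming gensim >= 4; the two corpora below are tiny stand-ins for a general corpus and a domain corpus):

```python
from gensim.models import Word2Vec

# Stand-in corpora: in practice the general corpus is huge (Wikipedia, Common Crawl)
# and the domain corpus is task-specific (e.g., clinical notes, legal documents).
general_corpus = [["the", "patient", "sat", "down"],
                  ["the", "dog", "sat", "on", "the", "mat"]]
domain_corpus = [["the", "patient", "presented", "with", "acute", "dyspnea"],
                 ["acute", "dyspnea", "responded", "to", "treatment"]]

# "Pretraining": learn general-purpose vectors.
model = Word2Vec(general_corpus, vector_size=50, min_count=1, epochs=20)

# "Fine-tuning": extend the vocabulary and continue training on domain text.
model.build_vocab(domain_corpus, update=True)
model.train(domain_corpus, total_examples=len(domain_corpus), epochs=20)

print(model.wv.most_similar("patient", topn=3))
```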
What are some limitations of Word2Vec?
Multiple senses per word -> a single vector per word form, no contextualization after training.
Does not preserve word order / context -> cannot create n-gram representations without running into sparsity (not enough training data).