What is one-hot encoding, and why is it inefficient for word representation?
One-hot encoding represents words as binary vectors where only one position is 1, and all others are 0.
Inefficiencies:
Very sparse (high-dimensional).
No semantic relationships between words.
Not scalable for large vocabularies.
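A minimal sketch in Python (the toy vocabulary below is invented for illustration):

```python
import numpy as np

# Toy vocabulary; real vocabularies easily reach 100k+ words,
# which is what makes one-hot vectors so sparse.
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a vector of length |vocab| with a single 1."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("cat"))                   # [0. 1. 0. 0. 0.]
# Every pair of distinct words is orthogonal, so no notion of similarity survives.
print(one_hot("cat") @ one_hot("mat"))  # 0.0
```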
How do word embeddings improve upon one-hot encoding?
Word embeddings are dense vector representations that capture semantic relationships between words.
Advantages:
Lower dimensionality (100-300 dimensions instead of thousands).
Encodes meaning (similar words have closer vectors).
Mathematical operations can be performed on vectors (e.g., king - man + woman ≈ queen).
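A toy illustration of the analogy arithmetic (the 3-dimensional vectors below are invented purely to show the idea; real embeddings are learned from data):

```python
import numpy as np

# Invented "embeddings"; real ones have 100-300 learned dimensions.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.2, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.3, 0.8]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = emb["king"] - emb["man"] + emb["woman"]
# The word whose vector is nearest to the analogy result:
best = max(emb, key=lambda w: cosine(emb[w], target))
print(best)  # queen (in practice the query words are excluded from the candidates)
```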
What is the difference between Skip-gram and CBOW in Word2Vec?
Skip-gram: Predicts context words given the center word. Works well with small training data and for rare words.
CBOW: Predicts the center word given its context words. Faster to train and slightly more accurate for frequent words.
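Both variants are exposed in gensim's Word2Vec through the sg flag; a minimal sketch, assuming gensim >= 4 is installed and using a toy corpus:

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (real training needs far more data).
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# sg=1 -> Skip-gram: predict context words from the center word.
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

# sg=0 -> CBOW: predict the center word from its averaged context.
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

print(skipgram.wv["cat"].shape)            # (50,)
print(cbow.wv.most_similar("cat", topn=2))
```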
Why is negative sampling used in Word2Vec?
Negative sampling speeds up training by only updating a small subset of weights instead of the entire vocabulary.
Instead of computing probabilities for all words, it selects a few negative words (random words unlikely to be in the context).
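A rough sketch of one skip-gram update with negative sampling (simplified: negatives are drawn uniformly here, whereas Word2Vec uses a smoothed unigram distribution; all hyperparameters are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, k, lr = 10_000, 100, 5, 0.025   # k = number of negative samples

# Input (center-word) and output (context-word) embedding tables.
W_in = rng.normal(scale=0.1, size=(vocab_size, dim))
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pair(center, context):
    """Update only 1 + k output rows, not all vocab_size of them."""
    negatives = rng.integers(0, vocab_size, size=k)   # simplified: uniform sampling
    v = W_in[center]
    v_grad = np.zeros(dim)
    for word, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        u = W_out[word]
        g = sigmoid(v @ u) - label        # gradient of the logistic loss
        v_grad += g * u
        W_out[word] -= lr * g * v
    W_in[center] -= lr * v_grad

train_pair(center=42, context=7)
```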
Why is subword tokenization useful in word representation?
Subword tokenization breaks words into smaller units (e.g., unhappiness → un + happy + ness).
Helps with rare words
Allows for better generalization to unseen words
Reduces vocabulary size while preserving meaning
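A toy greedy longest-match segmenter makes this concrete (the subword inventory is hand-picked for illustration; real tokenizers learn it from data, e.g. with BPE below):

```python
# Hand-picked subword inventory, for illustration only.
subwords = {"un", "happi", "happy", "ness"}

def tokenize(word):
    """Greedy longest-match segmentation into known subwords."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):     # try the longest piece first
            if word[i:j] in subwords:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])            # unknown: fall back to a single character
            i += 1
    return pieces

print(tokenize("unhappiness"))  # ['un', 'happi', 'ness']
print(tokenize("unhappy"))      # ['un', 'happy']
```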
How does Byte-Pair Encoding (BPE) work in word tokenization?
BPE merges the most frequent character pairs iteratively to form subwords.
Example: t h a t -> t h at -> th at -> that
Compresses text efficiently.
Works well for morphologically complex languages.
3 main advantages of BPE
Compression efficiency: common sequences are treated as single tokens
Adaptability: the encoding can be tuned to different types of text
Flexibility: Can handle out-of-vocabulary text
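A compact sketch of the BPE merge loop (end-of-word markers omitted; the word frequencies below are invented):

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs over a {"t h a t": frequency} vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Rewrite every occurrence of the pair as a single merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words split into characters, with invented frequencies.
vocab = {"t h a t": 5, "t h e": 4, "h a t": 3, "a t": 2}

for step in range(3):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)          # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(step, best, sorted(vocab))
# Reproduces the merge sequence above: a+t -> at, t+h -> th, th+at -> that
```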
What is the difference between pretraining and fine-tuning in word embeddings?
Pretraining: Train embeddings on large, general datasets (e.g., Wikipedia, Common Crawl).
Fine-tuning: Adapt embeddings to specific tasks (e.g., medical terminology, legal documents).
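A minimal sketch with gensim (assuming gensim >= 4; the two corpora below are tiny stand-ins for a general corpus and a domain corpus):

```python
from gensim.models import Word2Vec

# Stand-in corpora: in practice the general corpus is huge (Wikipedia, Common Crawl)
# and the domain corpus is task-specific (e.g., clinical notes, legal documents).
general_corpus = [["the", "patient", "sat", "down"],
                  ["the", "dog", "sat", "on", "the", "mat"]]
domain_corpus = [["the", "patient", "presented", "with", "acute", "dyspnea"],
                 ["acute", "dyspnea", "responded", "to", "treatment"]]

# "Pretraining": learn general-purpose vectors.
model = Word2Vec(general_corpus, vector_size=50, min_count=1, epochs=20)

# "Fine-tuning": extend the vocabulary and continue training on domain text.
model.build_vocab(domain_corpus, update=True)
model.train(domain_corpus, total_examples=len(domain_corpus), epochs=20)

print(model.wv.most_similar("patient", topn=3))
```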
What are some limitations of Word2Vec?
Multiple senses per word -> a single vector per word form, no contextualization after training.
Does not preserve word order / context -> cannot create n-gram representations without running into sparsity (not enough training data).