What is the nomenclature in text embedding?
document -> single file such as PDF
corpus -> collection of documents
word -> minimal string of characters with meaning to humans
vocabulary -> union of all words in all of the documents
stop words -> list of common, non-informative words, e.g. me, no, the, are, …
token -> string of characters but not necessarily a word -> result of tokenization process
What are common preprocessing techniques?
stop word removal (remove words with little semantic meaning, e.g. a)
map to lower case
punctuation removal (might lose some information, e.g. ?, !)
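A minimal Python sketch of these steps; the stop word list and the regex here are just illustrative choices, not a canonical implementation:

```python
import re

STOP_WORDS = {"me", "no", "the", "are", "a"}  # illustrative list, not exhaustive

def preprocess(text):
    text = text.lower()                     # map to lower case
    text = re.sub(r"[^\w\s]", " ", text)    # punctuation removal (loses "?", "!", ...)
    words = text.split()
    return [w for w in words if w not in STOP_WORDS]  # stop word removal

print(preprocess("Are the dogs barking? No!"))  # -> ['dogs', 'barking']
```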
What is tokenization?
demarcating and partitioning
a string of characters
into smaller units called tokens
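The simplest possible tokenizer is a whitespace split; real tokenizers (e.g. subword tokenizers) are more involved. Sketch:

```python
def tokenize(text):
    # simplest possible tokenizer: split on whitespace
    return text.split()

print(tokenize("the dog climbs the tree"))  # -> ['the', 'dog', 'climbs', 'the', 'tree']
```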
What is stemming?
reducing inflected words to their stem
dogs -> dog
goes -> go
saying -> say
easily -> easy
…
=> map semantically equivalent words to same tokens
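One common implementation is NLTK's PorterStemmer; note that the exact stems it produces can differ slightly from the examples above:

```python
from nltk.stem import PorterStemmer  # requires nltk to be installed

stemmer = PorterStemmer()
for word in ["dogs", "goes", "saying", "easily"]:
    # exact output depends on the stemming algorithm (Porter here)
    print(word, "->", stemmer.stem(word))
```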
What is the bag of words algorithm?
we have a corpus
preprocess it (stemming, lowercasing, stop word removal)
-> obtain vocabulary as vector of words
-> embed documents as vectors with the same dimensionality as the vocabulary
-> if the document contains a vocabulary word -> set the corresponding index of the embedding vector to 1 …
e.g. vocab: [aaron, boy, climb, zucchini, …]
Document {aaron eat zucchini}
embedded: [1, 0, 0, 1, …]
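A minimal sketch of binary bag-of-words, assuming the corpus has already been preprocessed into lists of tokens:

```python
def build_vocabulary(corpus):
    # union of all words across all (preprocessed) documents, sorted for a stable index
    return sorted({word for document in corpus for word in document})

def bag_of_words(document, vocabulary):
    # binary BoW: 1 if the vocabulary word occurs in the document, else 0
    present = set(document)
    return [1 if word in present else 0 for word in vocabulary]

corpus = [["aaron", "eat", "zucchini"], ["boy", "climb"]]
vocab = build_vocabulary(corpus)        # ['aaron', 'boy', 'climb', 'eat', 'zucchini']
print(bag_of_words(corpus[0], vocab))   # -> [1, 0, 0, 1, 1]
```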
What are pros and cons of BoW?
pro:
simple
good as baseline
con:
way too simplistic to capture semantics
sparse
How to compute TF?
Term Frequency
tf(t, d) = (number of times t appears in document d) / (total number of terms in document d)
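A direct translation of this definition into Python (a minimal sketch; `document` is assumed to be a list of tokens):

```python
def tf(term, document):
    # term frequency: relative count of `term` in `document` (a list of tokens)
    return document.count(term) / len(document)

print(tf("dog", ["the", "dog", "chases", "the", "dog"]))  # -> 2/5 = 0.4
```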
How to calculate IDF?
Inverse document frequency
idf(t) = (number of documents) / (number of documents with term t in it)
How to calculate TF-IDF?
TF-IDF(t, d) = tf(t, d) * idf(t)
What is a popular alternative way to calc tf-idf?
tf(t,d) * log2(idf(t))
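A sketch of idf and tf-idf (including the log2 variant), reusing the tf function from the sketch above; it assumes the term occurs in at least one document of the corpus:

```python
import math

def idf(term, corpus):
    # inverse document frequency: total documents / documents containing the term
    containing = sum(1 for document in corpus if term in document)
    return len(corpus) / containing

def tf_idf(term, document, corpus, use_log=False):
    # plain variant: tf * idf; popular alternative: tf * log2(idf)
    weight = idf(term, corpus)
    if use_log:
        weight = math.log2(weight)
    return tf(term, document) * weight

corpus = [["aaron", "eat", "zucchini"], ["the", "boy", "climb"]]
print(tf_idf("zucchini", corpus[0], corpus))                # tf = 1/3, idf = 2 -> 0.666...
print(tf_idf("zucchini", corpus[0], corpus, use_log=True))  # tf * log2(2) -> 0.333...
```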
What is the intuition behind TF-IDF?
TF:
significance:
how important is term within a document?
IDF:
specificity:
how rare is term t in the whole corpus? (rarer terms get higher idf)
TF-IDF:
the larger tf-idf(t, d) -> the more descriptive t is for document d …
How should we preprocess when using TF-IDF?
lowercase
stemming
tokenization
How can we use TF-IDF for embedding?
create matrix
rows: documents
columns: terms
calculate tf-idf for each row (document)
=> rows are document embeddings …
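In practice a library can build exactly this document-term matrix, e.g. scikit-learn's TfidfVectorizer (it uses a smoothed, logged idf internally, so the values differ slightly from the plain formula above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["aaron eats zucchini", "the boy climbs a tree"]
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
matrix = vectorizer.fit_transform(corpus)   # rows: documents, columns: terms

print(vectorizer.get_feature_names_out())   # vocabulary (columns)
print(matrix.toarray())                     # each row is a document embedding
```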
What is word2vec?
revolutionary new technique
-> word embeddings rather than document embeddings …
What is the general idea behind word2vec?
embed words in a multidimensional space to capture their meaning
especially in relation to other embeddings
e.g. man + (queen - woman) = king
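With trained vectors, e.g. pretrained embeddings loaded through gensim, the analogy can be checked roughly like this; the dataset name is just one of gensim's published options (these happen to be GloVe vectors, but any word2vec-style KeyedVectors work the same way):

```python
import gensim.downloader as api

# load pretrained word vectors (downloads on first use)
vectors = api.load("glove-wiki-gigaword-50")

# man + (queen - woman) should land near "king"
print(vectors.most_similar(positive=["man", "queen"], negative=["woman"], topn=1))
```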
On what data do we train word2vec with skipgram architecture?
we have raw text
build the vocabulary from it
encode the words as one-hot vectors
we have the training samples
for each word v_i
the set of tuples with neighboring words
up to a certain distance c:
(X, y) = {(v_i, v_{i+j}) | -c ≤ j ≤ c, j ≠ 0}
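A minimal sketch of building these training pairs from a token sequence, assuming a window size c:

```python
def skipgram_pairs(tokens, c=2):
    # for each word, pair it with every neighbor at distance 1..c on either side
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(-c, c + 1):
            if j != 0 and 0 <= i + j < len(tokens):
                pairs.append((center, tokens[i + j]))
    return pairs

print(skipgram_pairs(["the", "dog", "chases", "the", "cat"], c=1))
# -> [('the', 'dog'), ('dog', 'the'), ('dog', 'chases'), ...]
```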
How do we train our network?
the goal is to maximize the model's performance w.r.t. predicting that two words are in each other's neighborhood
-> so given input x, predict whether a word is a neighbor
-> using softmax
Do we in practice maximize a likelihood?
no, we minimize the negative log of it …
Why do we minimize the negative log instead of maximizing?
mathematically equivalent since log is monotonic
implemented optimizers are often minimizers
numerically stable, as p(xi) can become very small
convenience: log(ab) = log(a)+log(b)
=> negative log likelihood loss
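A minimal numpy sketch of softmax plus negative log likelihood, as it would be applied to the skip-gram scores over the vocabulary:

```python
import numpy as np

def softmax(scores):
    scores = scores - scores.max()          # subtract max for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()

def nll_loss(scores, target_index):
    # negative log likelihood of the true neighbor word under the softmax
    probs = softmax(scores)
    return -np.log(probs[target_index])

scores = np.array([2.0, 0.5, -1.0])         # unnormalized scores over the vocabulary
print(nll_loss(scores, target_index=0))     # small loss if the true word gets high probability
```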
What are the word vectors after training our network in word2vec skipgram architecture?
network:
similarity = BAx
-> embedding matrix A
encoding layer B
=> rows of the embedding matrix A (the row corresponding to the one-hot encoded index) correspond to the word embeddings …
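A small numpy sketch of why rows of A are the word embeddings: multiplying a one-hot vector with A just selects one row.

```python
import numpy as np

vocab_size, h = 5, 3
A = np.random.rand(vocab_size, h)    # embedding matrix A (|V| x h)

word_index = 2
x = np.zeros(vocab_size)
x[word_index] = 1.0                  # one-hot encoding of the word

# with the row-vector convention, x @ A selects exactly row `word_index` of A,
# i.e. that word's embedding (with column vectors, it would be a column of A instead)
assert np.allclose(x @ A, A[word_index])
```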
How does the training time between TF-IDF and word2vec compare?
tf-idf: fast
w2v: slow
How does the inference time between TF-IDF and word2vec compare?
w2v: fast (lookup in embedding matrix)
How do semantics between TF-IDF and word2vec compare?
tf-idf: summarization via word counts
w2v: semantics w.r.t. only neighboring words
How do embedding between TF-IDF and word2vec compare?
tf-idf: document embedding
w2v: word embedding
How does sparsity between TF-IDF and word2vec compare?
tf-idf: sparse, size |V|
w2v: dense, size h
What are some shared limitations between word2vec and tf-idf?
no word sense disambiguation (e.g. cell (biological cell) vs. cell (prison cell) vs. cell (mobile phone)) …
cannot handle out of vocabulary words
ignores order of words (up to neighbor relationships)
based on word counts (tf-idf) or co-occurrence (w2v) -> far from real semantic understanding
What options exist for sentence / document embedding using w2v?
simple or weighted summation
aggregation via RNN
advanced methods e.g. doc2vec
How to simply embed documents using w2v?
simple summation of the word embeddings of the words in a sentence / document
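Minimal sketch, assuming `embeddings` maps each word to its word2vec vector:

```python
import numpy as np

def embed_sentence_sum(words, embeddings):
    # simple (unweighted) sum of the word vectors; skips out-of-vocabulary words
    return np.sum([embeddings[w] for w in words if w in embeddings], axis=0)

embeddings = {"dog": np.array([1.0, 0.0]), "barks": np.array([0.0, 2.0])}
print(embed_sentence_sum(["the", "dog", "barks"], embeddings))  # -> [1. 2.]
```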
How to weight sentence / document embedding in w2v?
weighted summation via TF-IDF
-> sum over all words in the sentence:
TF-IDF(word, sentence) * word embedding
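The weighted version as a sketch, reusing the tf_idf function defined earlier and treating the sentence itself as the document:

```python
import numpy as np

def embed_sentence_tfidf(words, embeddings, corpus):
    # weighted sum: each word vector is scaled by its tf-idf weight in this sentence
    return np.sum(
        [tf_idf(w, words, corpus) * embeddings[w] for w in words if w in embeddings],
        axis=0,
    )
```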