What is the nomenclature in text embedding?
document -> single file such as PDF
corpus -> collection of documents
word -> minimal string of characters with meaning to humans
vocabulary -> union of all words in all of the documents
stop words -> list of common, non-informative words, e.g. me, no, the, are, …
token -> string of characters but not necessarily a word -> result of tokenization process
What are common preprocessing techniques?
stop word removal (remove words with little semantic meaning, e.g. a)
map to lower case
punctuation removal (might lose some information, e.g. ?, !)
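A minimal Python sketch of these steps; the stop word list and the regex here are just illustrative choices, not a canonical implementation:

```python
import re

STOP_WORDS = {"me", "no", "the", "are", "a"}  # illustrative list, not exhaustive

def preprocess(text):
    text = text.lower()                     # map to lower case
    text = re.sub(r"[^\w\s]", " ", text)    # punctuation removal (loses "?", "!", ...)
    words = text.split()
    return [w for w in words if w not in STOP_WORDS]  # stop word removal

print(preprocess("Are the dogs barking? No!"))  # -> ['dogs', 'barking']
```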
What is tokenization?
demarcating and partitioning
a string of characters
into smaller units called tokens
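The simplest possible tokenizer is a whitespace split; real tokenizers (e.g. subword tokenizers) are more involved. Sketch:

```python
def tokenize(text):
    # simplest possible tokenizer: split on whitespace
    return text.split()

print(tokenize("the dog climbs the tree"))  # -> ['the', 'dog', 'climbs', 'the', 'tree']
```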
What is stemming?
reducing inflected words to their stem
dogs -> dog
goes -> go
saying -> say
easily -> easy
…
=> map semantically equivalent words to same tokens
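One common implementation is NLTK's PorterStemmer; note that the exact stems it produces can differ slightly from the examples above:

```python
from nltk.stem import PorterStemmer  # requires nltk to be installed

stemmer = PorterStemmer()
for word in ["dogs", "goes", "saying", "easily"]:
    # exact output depends on the stemming algorithm (Porter here)
    print(word, "->", stemmer.stem(word))
```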
What is the bag of words algorithm?
we have a corpus
preprocess it (stemming, lowercasing, stop word removal)
-> obtain vocabulary as vector of words
-> embed documents as vectors with the same dimensionality as the vocabulary
-> if the document contains a vocabulary word -> set the corresponding index of the embedding vector to 1 …
e.g. vocab: [aaron, boy, climb, zucchini, …]
Document {aaron eat zucchini}
embedded: [1, 0, 0, 1, …]
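A minimal sketch of binary bag-of-words, assuming the corpus has already been preprocessed into lists of tokens:

```python
def build_vocabulary(corpus):
    # union of all words across all (preprocessed) documents, sorted for a stable index
    return sorted({word for document in corpus for word in document})

def bag_of_words(document, vocabulary):
    # binary BoW: 1 if the vocabulary word occurs in the document, else 0
    present = set(document)
    return [1 if word in present else 0 for word in vocabulary]

corpus = [["aaron", "eat", "zucchini"], ["boy", "climb"]]
vocab = build_vocabulary(corpus)        # ['aaron', 'boy', 'climb', 'eat', 'zucchini']
print(bag_of_words(corpus[0], vocab))   # -> [1, 0, 0, 1, 1]
```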
What are pros and cons of BoW?
pro:
simple
good as baseline
con:
way too simplistic to capture semantics
sparse
How to compute TF?
Term Frequency
tf(t, d) = (number of times t appears in document d) / (total number of terms in document d)
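A direct translation of this definition into Python (a minimal sketch; `document` is assumed to be a list of tokens):

```python
def tf(term, document):
    # term frequency: relative count of `term` in `document` (a list of tokens)
    return document.count(term) / len(document)

print(tf("dog", ["the", "dog", "chases", "the", "dog"]))  # -> 2/5 = 0.4
```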
How to calculate IDF?
Inverse document frequency
idf(t) = (number of documents) / (number of documents with term t in it)
How to calculate TF-IDF?
TF-IDF(t, d) = tf(t, d) * idf(t)
What is a popular alternative way to calc tf-idf?
tf(t,d) * log2(idf(t))
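A sketch of idf and tf-idf (including the log2 variant), reusing the tf function from the sketch above; it assumes the term occurs in at least one document of the corpus:

```python
import math

def idf(term, corpus):
    # inverse document frequency: total documents / documents containing the term
    containing = sum(1 for document in corpus if term in document)
    return len(corpus) / containing

def tf_idf(term, document, corpus, use_log=False):
    # plain variant: tf * idf; popular alternative: tf * log2(idf)
    weight = idf(term, corpus)
    if use_log:
        weight = math.log2(weight)
    return tf(term, document) * weight

corpus = [["aaron", "eat", "zucchini"], ["the", "boy", "climb"]]
print(tf_idf("zucchini", corpus[0], corpus))                # tf = 1/3, idf = 2 -> 0.666...
print(tf_idf("zucchini", corpus[0], corpus, use_log=True))  # tf * log2(2) -> 0.333...
```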
What is the intuition behind TF-IDF?
TF:
significance:
how important is term within a document?
IDF:
specificity:
how rare is term t in the whole corpus? (rarer terms get higher idf)
TF-IDF:
the larger tf-idf(t, d) -> the more descriptive t is for document d …
How should we preprocess when using TF-IDF?
lowercase
stemming
tokenization
How can we use TF-IDF for embedding?
create matrix
rows: documents
columns: terms
calculate tf-idf for each row (document)
=> rows are document embeddings …
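In practice a library can build exactly this document-term matrix, e.g. scikit-learn's TfidfVectorizer (it uses a smoothed, logged idf internally, so the values differ slightly from the plain formula above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["aaron eats zucchini", "the boy climbs a tree"]
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
matrix = vectorizer.fit_transform(corpus)   # rows: documents, columns: terms

print(vectorizer.get_feature_names_out())   # vocabulary (columns)
print(matrix.toarray())                     # each row is a document embedding
```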
What is word2vec?
revolutionary new technique
-> word embeddings rather than document embeddings …
What is the general idea behind word2vec?
embed words in a multidimensional space to capture their meaning
especially in relation to other embeddings
e.g. man + (queen - woman) = king
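With trained vectors, e.g. pretrained embeddings loaded through gensim, the analogy can be checked roughly like this; the dataset name is just one of gensim's published options (these happen to be GloVe vectors, but any word2vec-style KeyedVectors work the same way):

```python
import gensim.downloader as api

# load pretrained word vectors (downloads on first use)
vectors = api.load("glove-wiki-gigaword-50")

# man + (queen - woman) should land near "king"
print(vectors.most_similar(positive=["man", "queen"], negative=["woman"], topn=1))
```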
On what data do we train word2vec with skipgram architecture?
we have raw text
build the vocabulary from it
encode the words as one-hot vectors
we have the training samples
for each word v_i
the set of tuples with neighboring words
up to a certain distance c:
(X, y) = {(v_i, v_{i+j}) | -c ≤ j ≤ c, j ≠ 0}
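A minimal sketch of building these training pairs from a token sequence, assuming a window size c:

```python
def skipgram_pairs(tokens, c=2):
    # for each word, pair it with every neighbor at distance 1..c on either side
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(-c, c + 1):
            if j != 0 and 0 <= i + j < len(tokens):
                pairs.append((center, tokens[i + j]))
    return pairs

print(skipgram_pairs(["the", "dog", "chases", "the", "cat"], c=1))
# -> [('the', 'dog'), ('dog', 'the'), ('dog', 'chases'), ...]
```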
How do we train our network?
the goal is to maximize the model's performance w.r.t. predicting that two words are in each other's neighborhood
-> so given input x, predict whether a word is a neighbor
-> using softmax
Do we in practice maximize a likelihood?
no, we minimize the negative log of it …
Why do we minimize the negative log instead of maximizing?
mathematically equivalent since log is monotonic
implemented optimizers are often minimizers
numerically stable, as p(xi) can become very small
convenience: log(ab) = log(a)+log(b)
=> negative log likelihood loss
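A minimal numpy sketch of softmax plus negative log likelihood, as it would be applied to the skip-gram scores over the vocabulary:

```python
import numpy as np

def softmax(scores):
    scores = scores - scores.max()          # subtract max for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()

def nll_loss(scores, target_index):
    # negative log likelihood of the true neighbor word under the softmax
    probs = softmax(scores)
    return -np.log(probs[target_index])

scores = np.array([2.0, 0.5, -1.0])         # unnormalized scores over the vocabulary
print(nll_loss(scores, target_index=0))     # small loss if the true word gets high probability
```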
What are the word vectors after training our network in word2vec skipgram architecture?
network:
similarity = BAx
-> embedding matrix A
encoding layer B
=> rows of the embedding matrix A (the row corresponding to the one-hot encoded index) correspond to the word embeddings …
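A small numpy sketch of why rows of A are the word embeddings: multiplying a one-hot vector with A just selects one row.

```python
import numpy as np

vocab_size, h = 5, 3
A = np.random.rand(vocab_size, h)    # embedding matrix A (|V| x h)

word_index = 2
x = np.zeros(vocab_size)
x[word_index] = 1.0                  # one-hot encoding of the word

# with the row-vector convention, x @ A selects exactly row `word_index` of A,
# i.e. that word's embedding (with column vectors, it would be a column of A instead)
assert np.allclose(x @ A, A[word_index])
```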
How does the training time between TF-IDF and word2vec compare?
tf-idf: fast
w2v: slow
How does the inference time between TF-IDF and word2vec compare?
w2v: fast (lookup in embedding matrix)
How do semantics between TF-IDF and word2vec compare?
tf-idf: summarization via word counts
w2v: semantics w.r.t. only neighboring words
How do embedding between TF-IDF and word2vec compare?
tf-idf: document embedding
w2v: word embedding
How does sparsity between TF-IDF and word2vec compare?
tf-idf: sparse, size |V|
w2v: dense, size h
What are some shared limitations between word2vec and tf-idf?
no word sense disambiguation (e.g. cell (biological cell) vs. cell (prison cell) vs. cell (mobile phone)) …
cannot handle out of vocabulary words
ignores order of words (up to neighbor relationships)
based on word counts (tf-idf) or co-occurrence (w2v) -> far from real semantic understanding
What options exist for sentence / document embedding using w2v?
simple or weighted summation
aggregation via RNN
advanced methods e.g. doc2vec
How to simply embed documents using w2v?
simple summation of the word embeddings of the words in a sentence / document
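Minimal sketch, assuming `embeddings` maps each word to its word2vec vector:

```python
import numpy as np

def embed_sentence_sum(words, embeddings):
    # simple (unweighted) sum of the word vectors; skips out-of-vocabulary words
    return np.sum([embeddings[w] for w in words if w in embeddings], axis=0)

embeddings = {"dog": np.array([1.0, 0.0]), "barks": np.array([0.0, 2.0])}
print(embed_sentence_sum(["the", "dog", "barks"], embeddings))  # -> [1. 2.]
```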
How to weight sentence / document embedding in w2v?
weighted summation via TF-IDF
-> sum over all words in the sentence:
TF-IDF(word, sentence) * word embedding
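The weighted version as a sketch, reusing the tf_idf function defined earlier and treating the sentence itself as the document:

```python
import numpy as np

def embed_sentence_tfidf(words, embeddings, corpus):
    # weighted sum: each word vector is scaled by its tf-idf weight in this sentence
    return np.sum(
        [tf_idf(w, words, corpus) * embeddings[w] for w in words if w in embeddings],
        axis=0,
    )
```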