What is a document?
single document file (e.g. mail, pdf, doc, …)
What is a corpus?
collection of documents
What is a word?
minimal string of characters
that has meaning to humans
What is a vocabulary?
union of all words in all of the documents
What are stop words?
list of common, non-informative words
e.g. me, no, the, are….
What is a token?
string of characters
but not necessarily a word
-> result of tokenization process…
What are common preprocessing techniques?
stop word removal
map to lower case
punctuation removal
What is tokenization?
demarcating and partitioning a string
into smaller units…
What is stemming?
reducing inflected words to their word stem
=> map semantically equivalent words to same tokens
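A minimal Python sketch of the steps from the last three cards (lowercasing, punctuation and stop word removal, tokenization, stemming); the stop word list is a toy example and NLTK's PorterStemmer is assumed to be installed:

```python
import string
from nltk.stem import PorterStemmer  # assumes nltk is available

# toy stop word list -- in practice use a full list (e.g. nltk.corpus.stopwords)
STOP_WORDS = {"me", "no", "the", "are", "a", "is", "to"}
stemmer = PorterStemmer()

def preprocess(text: str) -> list[str]:
    # map to lower case
    text = text.lower()
    # punctuation removal
    text = text.translate(str.maketrans("", "", string.punctuation))
    # tokenization: demarcate and partition the string into smaller units (here: whitespace split)
    tokens = text.split()
    # stop word removal
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # stemming: reduce inflected words to their word stem
    return [stemmer.stem(t) for t in tokens]

print(preprocess("The attackers are copying vulnerable snippets!"))
```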
What is bag of words (BoW) and how does it work?
assign each word in the vocabulary a one-hot encoded vector
=> e.g. sort the vocabulary alphabetically to fix the vector positions
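A small sketch of the BoW idea on a made-up toy corpus: sort the vocabulary alphabetically, give each word a one-hot vector, and represent a document as the sum of the one-hot vectors of its words (i.e. word counts):

```python
import numpy as np

docs = [["cat", "sat", "on", "mat"], ["dog", "sat", "on", "dog"]]  # toy tokenized corpus

# vocabulary = union of all words, sorted alphabetically -> fixes the vector positions
vocab = sorted({w for d in docs for w in d})
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word: str) -> np.ndarray:
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

# BoW document vector = sum of the one-hot vectors of its words (= word counts)
def bow(doc: list[str]) -> np.ndarray:
    return sum(one_hot(w) for w in doc)

print(vocab)         # ['cat', 'dog', 'mat', 'on', 'sat']
print(bow(docs[1]))  # [0. 2. 0. 1. 1.]
```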
Pros and cons of BoW?
pro: simple (e.g. good as baseline for further methods)
con: simple, sparse
What does TF-IDF express?
significance of a word for a document in the context of a corpus
-> TF: significance of word for document
-> IDF: specificity of the word for the document w.r.t. the whole corpus
=> larger is better…
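Written as a formula (standard textbook definition, not taken from the card): for term t, document d, and a corpus of N documents,

```latex
\[
  \mathrm{tfidf}(t,d) = \mathrm{tf}(t,d)\cdot \mathrm{idf}(t),
  \qquad
  \mathrm{idf}(t) = \log\frac{N}{\lvert\{d : t \in d\}\rvert}
\]
% tf(t,d): how often t occurs in d (significance of the word for the document)
% idf(t): words that occur in few documents get a high weight (specificity w.r.t. the corpus)
```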
How should one preprocess when using TF-IDF?
lowercase
stemming
tokenization
How to vectorize documents using TF-IDF?
calculate TF-IDF for each word (in the vocabulary -> also words that are in the corpus but not in the document…)
-> create vector from it…
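A from-scratch sketch of this vectorization on a toy corpus (standard tf and log-idf weighting; a library such as scikit-learn's TfidfVectorizer does the same with more options):

```python
import math
from collections import Counter

docs = [["copy", "paste", "code"], ["paste", "vulnerable", "code", "code"]]  # toy tokenized corpus

vocab = sorted({w for d in docs for w in d})
N = len(docs)
# document frequency: in how many documents does each word occur?
df = {w: sum(w in d for d in docs) for w in vocab}

def tfidf_vector(doc: list[str]) -> list[float]:
    counts = Counter(doc)
    vec = []
    for w in vocab:  # iterate over the whole vocabulary, not only the words in the document
        tf = counts[w] / len(doc)
        idf = math.log(N / df[w])
        vec.append(tf * idf)
    return vec

for d in docs:
    print([round(x, 3) for x in tfidf_vector(d)])
```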
Use-Cases of TF-IDF?
examine impact of copy paste from stack overflow
-> collect set of vuln. code snippets
-> apply TF-IDF
-> train classifier
are binaries related / malicious?
What is the idea behind word2vec?
learn the semantics of a word
-> use this knowledge to predict probability of other words being close to a specific word (based on learning on text)
-> represent this as softmax vector (probability for each other word to be related to this word…)
How does word2vec training work?
we have sequence of words
-> initialize a dictionary of the words
-> encode each word as 1-hot vector
model two layer dense network and train:
(X,y) = (input, neighbor)
-> use several neighbors within distance c…
apply softmax loss to it…
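A minimal NumPy sketch of this training loop (toy corpus, full softmax over the vocabulary, window size c; real implementations such as gensim's Word2Vec use tricks like negative sampling instead):

```python
import numpy as np

corpus = "the quick brown fox jumps over the lazy dog".split()  # toy word sequence
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, H, c, lr = len(vocab), 10, 2, 0.05   # vocab size, hidden size, window, learning rate

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, H))    # first dense layer = word embeddings
W_out = rng.normal(scale=0.1, size=(H, V))   # second dense layer -> softmax over vocabulary

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# (X, y) = (input word, neighbor word) for all neighbors within distance c
pairs = [(i, j) for i in range(len(corpus))
         for j in range(max(0, i - c), min(len(corpus), i + c + 1)) if j != i]

for _ in range(100):
    for i, j in pairs:
        x, y = idx[corpus[i]], idx[corpus[j]]
        h = W_in[x]                   # hidden activation (1-hot input just selects a row)
        p = softmax(W_out.T @ h)      # predicted distribution over neighboring words
        d = p.copy(); d[y] -= 1.0     # gradient of -log p[y] w.r.t. the scores
        grad_h = W_out @ d
        W_out -= lr * np.outer(h, d)
        W_in[x] -= lr * grad_h

print(W_in[idx["fox"]])  # learned embedding of "fox"
```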
Word2vec formula and explanation
maximize
1/l (l = number of words)
Sum over all words i
Sum over neighbors j (within distance c)
p_theta(v_i+j | v_i)
=> maximize, for each word,
=> the confidence that actually neighboring words are neighbors
-> as p_theta(v_i+j | v_i) =
(one-hot vector of v_i+j) * softmax(f(v_i))
=> thus, for high confidence, the respective entry of the softmax output should be close to 1 (optimally exactly 1)…
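The same card written as one formula (v_i is the one-hot vector of the i-th word, l the number of words, c the window size, f the two-layer network):

```latex
\[
  \max_{\theta}\; \frac{1}{l} \sum_{i=1}^{l} \sum_{\substack{-c \le j \le c \\ j \ne 0}}
    p_{\theta}(v_{i+j} \mid v_i),
  \qquad
  p_{\theta}(v_{i+j} \mid v_i) = v_{i+j}^{\top}\,\mathrm{softmax}\bigl(f(v_i)\bigr)
\]
```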
How is word2vec in practice adjusted?
not maximization
but minimization of the negative log likelihood…
=> same optimum, but better suited to standard optimizers (minimizers)…
numerically more stable…
Compare TF-IDF and Word2Vec in terms of training time
TF-IDF fast
Word2Vec slow
Compare TF-IDF and Word2Vec in terms of inference time
word2vec -> fast (lookup after training…)
Compare TF-IDF and Word2Vec in terms of semantics
TF-IDF: summarization via word counts
word2vec: semantics w.r.t. only neighboring words
Compare TF-IDF and Word2Vec in terms of embedding
TF-IDF -> document embedding
Word2vec -> word embedding
Compare TF-IDF and Word2Vec in terms of sparsity
TF-IDF: sparse, size |V|
word2vec: dense, size h (hidden dimension)
What are shared limitations between word 2 vec and tf idf?
no word sense disambiguation (cell as in prison, biology, telephone, battery,…)
cannot handle out of vocab words
ignores order of words (up to the neighboring relationship)
based on word counts or word co-occurrence -> far from real semantic understanding…
How can one embed complete documents with word2vec?
simple averaging -> 1/n * Sum over embedding vectors (with n being the number of words in the document)
weighted summation via TF-iDF
-> Sum over words i in document j
TF-IDF(i,j) * embedding of word i
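A sketch of both variants, assuming word2vec embeddings and the TF-IDF weights of the document's words are already computed (the values and variable names below are made-up toy examples):

```python
import numpy as np

# assumed inputs: word2vec embeddings and per-document TF-IDF weights (toy values)
embedding = {"copy": np.array([0.1, 0.3]), "paste": np.array([0.2, 0.0]), "code": np.array([0.4, 0.1])}
doc = ["copy", "paste", "code", "code"]
tfidf_weight = {"copy": 0.23, "paste": 0.0, "code": 0.05}  # TF-IDF(i, j) for this document j

# variant 1: simple averaging -> (1/n) * sum of the word embeddings
avg_embedding = sum(embedding[w] for w in doc) / len(doc)

# variant 2: TF-IDF weighted summation -> sum over words i: TF-IDF(i, j) * embedding of word i
weighted_embedding = sum(tfidf_weight[w] * embedding[w] for w in set(doc))

print(avg_embedding, weighted_embedding)
```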