What is a document?
single document file (e.g. mail, pdf, doc, …)
What is a corpus?
collection of documents
What is a word?
minimal string of characters
that has meaning to humans
What is a vocabulary?
union of all words in all of the documents
What are stop words?
list of common, non-informative words
e.g. me, no, the, are….
What is a token?
string of characters
but not necessarily a word
-> result of tokenization process…
What are common preprocessing techniques?
stop word removal
map to lower case
punctuation removal
What is tokenization?
demarcating and partitioning a string
into smaller units…
What is stemming?
reducing inflected words to their word stem
=> map semantically equivalent words to same tokens
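A minimal Python sketch of the steps from the last three cards (lowercasing, punctuation and stop word removal, tokenization, stemming); the stop word list is a toy example and NLTK's PorterStemmer is assumed to be installed:

```python
import string
from nltk.stem import PorterStemmer  # assumes nltk is available

# toy stop word list -- in practice use a full list (e.g. nltk.corpus.stopwords)
STOP_WORDS = {"me", "no", "the", "are", "a", "is", "to"}
stemmer = PorterStemmer()

def preprocess(text: str) -> list[str]:
    # map to lower case
    text = text.lower()
    # punctuation removal
    text = text.translate(str.maketrans("", "", string.punctuation))
    # tokenization: demarcate and partition the string into smaller units (here: whitespace split)
    tokens = text.split()
    # stop word removal
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # stemming: reduce inflected words to their word stem
    return [stemmer.stem(t) for t in tokens]

print(preprocess("The attackers are copying vulnerable snippets!"))
```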
What is bag of words (BoW) and how does it work?
assign each word in the vocabulary a one-hot encoded vector
=> e.g. sort the vocabulary alphabetically to fix the vector positions
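A small sketch of the BoW idea on a made-up toy corpus: sort the vocabulary alphabetically, give each word a one-hot vector, and represent a document as the sum of the one-hot vectors of its words (i.e. word counts):

```python
import numpy as np

docs = [["cat", "sat", "on", "mat"], ["dog", "sat", "on", "dog"]]  # toy tokenized corpus

# vocabulary = union of all words, sorted alphabetically -> fixes the vector positions
vocab = sorted({w for d in docs for w in d})
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word: str) -> np.ndarray:
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

# BoW document vector = sum of the one-hot vectors of its words (= word counts)
def bow(doc: list[str]) -> np.ndarray:
    return sum(one_hot(w) for w in doc)

print(vocab)         # ['cat', 'dog', 'mat', 'on', 'sat']
print(bow(docs[1]))  # [0. 2. 0. 1. 1.]
```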
Pros and cons of BoW?
pro: simple (e.g. good as baseline for further methods)
con: simple, sparse
What does TF-IDF express?
significance of a word for a document in the context of a corpus
-> TF: significance of word for document
-> IDF: specificity of the word for the document w.r.t. the whole corpus
=> larger is better…
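Written as a formula (standard textbook definition, not taken from the card): for term t, document d, and a corpus of N documents,

```latex
\[
  \mathrm{tfidf}(t,d) = \mathrm{tf}(t,d)\cdot \mathrm{idf}(t),
  \qquad
  \mathrm{idf}(t) = \log\frac{N}{\lvert\{d : t \in d\}\rvert}
\]
% tf(t,d): how often t occurs in d (significance of the word for the document)
% idf(t): words that occur in few documents get a high weight (specificity w.r.t. the corpus)
```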
How should one preprocess when using TF-IDF?
lowercase
stemming
tokenization
How to vectorize documents using TF-IDF?
calculate TF-IDF for each word (in the vocabulary -> also words that are in the corpus but not in the document…)
-> create vector from it…
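A from-scratch sketch of this vectorization on a toy corpus (standard tf and log-idf weighting; a library such as scikit-learn's TfidfVectorizer does the same with more options):

```python
import math
from collections import Counter

docs = [["copy", "paste", "code"], ["paste", "vulnerable", "code", "code"]]  # toy tokenized corpus

vocab = sorted({w for d in docs for w in d})
N = len(docs)
# document frequency: in how many documents does each word occur?
df = {w: sum(w in d for d in docs) for w in vocab}

def tfidf_vector(doc: list[str]) -> list[float]:
    counts = Counter(doc)
    vec = []
    for w in vocab:  # iterate over the whole vocabulary, not only the words in the document
        tf = counts[w] / len(doc)
        idf = math.log(N / df[w])
        vec.append(tf * idf)
    return vec

for d in docs:
    print([round(x, 3) for x in tfidf_vector(d)])
```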
Use-Cases of TF-IDF?
examine impact of copy paste from stack overflow
-> collect set of vuln. code snippets
-> apply TF-IDF
-> train classifier
are binaries related / malicious?
What is the idea behind word2vec?
learn the semantics of a word
-> use this knowledge to predict probability of other words being close to a specific word (based on learning on text)
-> represent this as softmax vector (probability for each other word to be related to this word…)
How does word2vec training work?
we have sequence of words
-> initialize a dictionary of the words
-> encode each word as 1-hot vector
model two layer dense network and train:
(X,y) = (input, neighbor)
-> use several neighbors within distance c…
apply softmax loss to it…
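A minimal NumPy sketch of this training loop (toy corpus, full softmax over the vocabulary, window size c; real implementations such as gensim's Word2Vec use tricks like negative sampling instead):

```python
import numpy as np

corpus = "the quick brown fox jumps over the lazy dog".split()  # toy word sequence
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, H, c, lr = len(vocab), 10, 2, 0.05   # vocab size, hidden size, window, learning rate

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, H))    # first dense layer = word embeddings
W_out = rng.normal(scale=0.1, size=(H, V))   # second dense layer -> softmax over vocabulary

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# (X, y) = (input word, neighbor word) for all neighbors within distance c
pairs = [(i, j) for i in range(len(corpus))
         for j in range(max(0, i - c), min(len(corpus), i + c + 1)) if j != i]

for _ in range(100):
    for i, j in pairs:
        x, y = idx[corpus[i]], idx[corpus[j]]
        h = W_in[x]                   # hidden activation (1-hot input just selects a row)
        p = softmax(W_out.T @ h)      # predicted distribution over neighboring words
        d = p.copy(); d[y] -= 1.0     # gradient of -log p[y] w.r.t. the scores
        grad_h = W_out @ d
        W_out -= lr * np.outer(h, d)
        W_in[x] -= lr * grad_h

print(W_in[idx["fox"]])  # learned embedding of "fox"
```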
Word2vec formula and explanation
maximize
1/l (l = number of words)
Sum over all words i
Sum over neighbors j (within distance c)
p_theta(v_i+j | v_i)
=> maximize, for each word,
=> the confidence that actually neighboring words are neighbors
-> as p_theta(v_i+j | v_i) =
(one-hot vector of v_i+j) * softmax(f(v_i))
=> thus, for high confidence, the respective entry of the softmax output should be close to 1 (optimally exactly 1)…
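The same card written as one formula (v_i is the one-hot vector of the i-th word, l the number of words, c the window size, f the two-layer network):

```latex
\[
  \max_{\theta}\; \frac{1}{l} \sum_{i=1}^{l} \sum_{\substack{-c \le j \le c \\ j \ne 0}}
    p_{\theta}(v_{i+j} \mid v_i),
  \qquad
  p_{\theta}(v_{i+j} \mid v_i) = v_{i+j}^{\top}\,\mathrm{softmax}\bigl(f(v_i)\bigr)
\]
```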
How is word2vec in practice adjusted?
not maximization
but minimization of the negative log likelihood…
=> same optimum, but better suited to standard optimizers (minimizers)…
numerically more stable…
Compare TF-IDF and Word2Vec in terms of training time
TF-IDF fast
Word2Vec slow
Compare TF-IDF and Word2Vec in terms of inference time
word2vec -> fast (lookup after training…)
Compare TF-IDF and Word2Vec in terms of semantics
TF-IDF: summarization via word counts
word2vec: semantics w.r.t. only neighboring words
Compare TF-IDF and Word2Vec in terms of embedding
TF-IDF -> document embedding
Word2vec -> word embedding
Compare TF-IDF and Word2Vec in terms of sparsity
TF-IDF: sparse, size |V|
word2vec: dense, size h (hidden dimension)
What are shared limitations between word 2 vec and tf idf?
no word sense disambiguation (cell as in prison, biology, telephone, battery,…)
cannot handle out of vocab words
ignores order of words (up to the neighboring relationship)
based on word counts or word co-occurrence -> far from real semantic understanding…
How can one embed complete documents with word2vec?
simple averaging -> 1/n * Sum over embedding vectors (with n being the number of words in the document)
weighted summation via TF-iDF
-> Sum over words i in document j
TF-IDF(i,j) * embedding of word i
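A sketch of both variants, assuming word2vec embeddings and the TF-IDF weights of the document's words are already computed (the values and variable names below are made-up toy examples):

```python
import numpy as np

# assumed inputs: word2vec embeddings and per-document TF-IDF weights (toy values)
embedding = {"copy": np.array([0.1, 0.3]), "paste": np.array([0.2, 0.0]), "code": np.array([0.4, 0.1])}
doc = ["copy", "paste", "code", "code"]
tfidf_weight = {"copy": 0.23, "paste": 0.0, "code": 0.05}  # TF-IDF(i, j) for this document j

# variant 1: simple averaging -> (1/n) * sum of the word embeddings
avg_embedding = sum(embedding[w] for w in doc) / len(doc)

# variant 2: TF-IDF weighted summation -> sum over words i: TF-IDF(i, j) * embedding of word i
weighted_embedding = sum(tfidf_weight[w] * embedding[w] for w in set(doc))

print(avg_embedding, weighted_embedding)
```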