What is a standard technique to do calculations on texts?
embed it in a vector space
What is a document?
a single file, e.g. PDF, DOC, TXT, or an e-mail
What is a corpus?
collection of documents
What is a word?
minimal string of characters with meaning to humans e.g. apple, ok, hello,…
What is a vocabulary?
set of all distinct words in a document
What are stop words?
list of common, non-informative words
-> english stop words include me, no, the, are,…
What is a token?
string of characters, but not necessarily a word
-> result of the tokenization process
What are common preprocessing techniques in NLP?
stop word removal
map to lower case
punctuation removal (?, !, ., …)
What is done at tokenization?
process of demarcating and partitioning a string of characters into smaller units -> tokens
=> e.g. "I don't want to sing!"
=> whitespace tokenization: I - don't - want - to - sing!
=> WordPunctTokenizer: I - don - ' - t - want - to - sing - !
=> Treebank word tokenizer: I - do - n't - want - to - sing - !
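A minimal sketch of the three tokenizers above, assuming NLTK is installed (whitespace tokenization is just str.split):

from nltk.tokenize import WordPunctTokenizer, TreebankWordTokenizer

sentence = "I don't want to sing!"
print(sentence.split())                            # ['I', "don't", 'want', 'to', 'sing!']
print(WordPunctTokenizer().tokenize(sentence))     # ['I', 'don', "'", 't', 'want', 'to', 'sing', '!']
print(TreebankWordTokenizer().tokenize(sentence))  # ['I', 'do', "n't", 'want', 'to', 'sing', '!']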
What is stemming?
reduce inflected words to their word stem
dogs -> dog
men -> man
goes -> go
loved -> love
saying -> say
easily -> easy
Goal of stemming?
map semantically equivalent words to the same token
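A minimal sketch with NLTK's PorterStemmer; note that a real stemmer returns stems that are not always dictionary words, so some outputs differ from the idealized examples above:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["dogs", "loved", "goes", "easily", "men"]:
    print(word, "->", stemmer.stem(word))
# dogs -> dog and loved -> love as expected, but goes -> goe and
# easily -> easili are stems rather than words; irregular forms like
# men -> man need lemmatization, not stemming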
How does bag of words work?
preprocess the corpus
turn the vocabulary into a vector (one index per word)
-> for a document, preprocess it
-> turn it into a vector by setting the index of each word that occurs in the document to 1
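A minimal binary bag-of-words sketch on a hypothetical toy corpus:

docs = ["the dog barks", "the cat sleeps"]
vocab = sorted({w for d in docs for w in d.split()})  # ['barks', 'cat', 'dog', 'sleeps', 'the']

def bow(doc):
    words = set(doc.split())
    return [1 if w in words else 0 for w in vocab]

print(bow("the dog sleeps"))  # [0, 0, 1, 1, 1]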
Pros and cons bag of words?
pro:
simple (good as baseline)
con:
simple (way too simplistic to capture semantics)
What is TF-IDF?
term frequency - inverse document frequency
traditional NLP technique
set of documents {d_i}
single term t (or token e.g. love) in a document d
compute
=> how important is the term within the document * how specific is the term to the whole corpus
Describe the two parts of TF-IDF
TF (term frequency)
-> how important is the term within a specific document (significance)
IDF (inverse document frequency)
-> how rare is term t in the whole corpus (specificity)
How to calculate TF-IDF?
tf(t,d) = (number of times term t appears in document d) / (number of terms in document d)
idf(t) = (number of documents) / (number of documents containing term t)
TF-IDF(t,d) = tf(t,d) * idf(t)
=> or popular alternative: tf(t,d) * log_2(idf(t))
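A worked toy example of the log_2 variant (hypothetical three-document corpus):

import math

docs = [["love", "is", "love"], ["love", "hurts"], ["time", "heals"]]

def tf(t, d):    return d.count(t) / len(d)
def idf(t):      return len(docs) / sum(t in d for d in docs)
def tfidf(t, d): return tf(t, d) * math.log2(idf(t))

print(tfidf("love", docs[0]))  # (2/3) * log2(3/2) ≈ 0.390
print(tfidf("time", docs[2]))  # (1/2) * log2(3/1) ≈ 0.792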
What are tf and idf?
tf: term frequency
idf: inverse document frequency
What is the intuition behind TF-IDF?
the larger the value, the more descriptive the term is for the document…
What preprocessing to do for TF-IDF?
lowercase
stemming
tokenization
How do we get from TF-IDF to an embedding?
have a table (rows: documents; columns: tokens)
-> calculate TF-IDF for each row
-> the embedding for a document is its corresponding row…
What to note about TF-IDF implementation in sk-learn?
has built-in (overridable) preprocessing steps discussed earlier
returns a sparse matrix
vocab dict ordered by word occurrence, index values assigned after alphabetical sorting
implementation differs from the formula above:
applies a log to the idf
L2 normalization of each document vector e_j
…
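A minimal sketch with scikit-learn (the defaults noted in the comments are assumptions worth checking against the docs):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the dog barks", "the cat sleeps", "the dog sleeps"]
vec = TfidfVectorizer()             # built-in lowercasing/tokenization, overridable
X = vec.fit_transform(docs)         # sparse matrix of shape (n_docs, |vocab|)
print(vec.get_feature_names_out())  # ['barks' 'cat' 'dog' 'sleeps' 'the']
print(X.toarray().round(2))         # rows are L2-normalized, idf is smoothed and logged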
What are use-cases of TF-IDF in cyber security?
examine the impact of copy-paste from Stack Overflow:
collect a small set of vulnerable code snippets and manually annotate them
apply TF-IDF, train a classifier
classify source code on GitHub, find insecure, copied code at large scale…
given large number of binaries
are they related? (clustering)
are the binaries malicious?
log-file analysis (see the sketch below)
each line is a document
build vocab over all lines
calculate TF-IDF
=> advanced preprocessing might be necessary
SMS-Spam…
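A hedged sketch for the log-file case, treating every log line as its own document (the log lines are hypothetical placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer

log_lines = ["user admin login failed",
             "user bob login ok",
             "user admin login failed"]
X = TfidfVectorizer().fit_transform(log_lines)  # one TF-IDF row per log line
print(X.shape)  # (number of lines, number of distinct tokens)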
How to detect malicious binaries with TF-IDF?
run binaries in sandbox and record OS calls
generate vocabulary from generated reports
construct features via TF-IDF, train model
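A hedged end-to-end sketch of that pipeline (the API-call traces and labels are hypothetical placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reports = ["CreateFile ReadFile CloseHandle",                    # benign-looking trace
           "OpenProcess WriteProcessMemory CreateRemoteThread"]  # injection-like trace
labels = [0, 1]  # 0 = benign, 1 = malicious

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(reports, labels)
print(clf.predict(["ReadFile CloseHandle ReadFile"]))  # class for an unseen trace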
What is a revolutionary new way to embed words?
word2vec
What is the idea of word embedding?
model relations between words -> represent words as vectors…
=> e.g. man + (queen - woman) = king…
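A sketch of that analogy with pretrained vectors via gensim's downloader (assumes internet access; the model name is one of gensim's stock datasets):

import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")  # pretrained word vectors
# man + (queen - woman): the top hit is expected to be "king"
print(wv.most_similar(positive=["man", "queen"], negative=["woman"], topn=1))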
Outline of word2vec process
requirement: raw text C = [w_0 w_1 w_2 …]
initialize dictionary D of words w_i where |D| = d
encode word w_i as one-hot vector v_i = [0,…,0,1,0,…,0] where dim(v_i) = d (essentially a vector where only one index = 1)
model: two-layer dense network f(x) = B(A(x)) := BAx = y
where
A ∈ R^(d x h)
B ∈ R^(h x d)
training data:
(X, y) = {(v_i, v_{i+j}) | -c <= j <= c, j != 0}
What does the TF-IDF vector represent?
vector has the dimensionality of the vocabulary
-> per-document embedding
-> TF-IDF score of each vocabulary word for the specific document
-> significance of each word for the document…
What is the model architecture for Word2vec?
input 1xd (one-hot encoded word)
embedding matrix dxh
hidden state 1xh
encoding layer hxd
softmax
output 1xd
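A minimal numpy sketch of one forward pass through this architecture (toy sizes, random weights):

import numpy as np

d, h = 10, 4                      # vocabulary size, embedding dimension
A = np.random.randn(d, h) * 0.01  # embedding matrix (d x h)
B = np.random.randn(h, d) * 0.01  # encoding layer (h x d)

v = np.zeros(d); v[3] = 1.0       # one-hot input word (1 x d)
hidden = v @ A                    # hidden state (1 x h) = row 3 of A, the word's embedding
logits = hidden @ B               # (1 x d)
p = np.exp(logits) / np.exp(logits).sum()  # softmax over the vocabulary
print(p.shape, p.sum())           # (10,) 1.0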
What is the training data for word2vec?
the one-hot encoded word as input and the neighbors with a certain max distance (c) as outputs
=> sliding window with window size 2c (c to left, c to right) + 1 (pivot element)
How is the network trained?
given words v_0,…,v_{l-1} as one-hot vectors
maximize
(1/l) * Σ_i Σ_{-c <= j <= c, j != 0} log p_θ(v_{i+j} | v_i)
-> basically, maximize the probability that for a given word, the surrounding words are its neighbors
=> train the parameters in such a way that they learn to predict whether another word is in the neighborhood (sliding window)
where p_θ(v_{i+j} | v_i) = v_{i+j}^T softmax(B(A(v_i)))
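In practice one rarely trains this by hand; a minimal sketch with gensim (toy corpus, the parameter values are arbitrary):

from gensim.models import Word2Vec

sentences = [["i", "do", "not", "want", "to", "sing"],
             ["i", "want", "to", "dance"]]
model = Word2Vec(sentences, vector_size=8, window=2, min_count=1, sg=1)  # sg=1: skip-gram
print(model.wv["sing"])                       # dense 8-dim embedding of "sing"
print(model.wv.most_similar("sing", topn=2))  # nearest neighbors in embedding space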
How is the optimization of word2vec done in practice?
minimize the negative log
Why minimize negative log?
mathematically equivalent, since log is monotonic…
implemented optimizers are often minimizers
numerically stable, p(x_i) can become very small
convenience (log(ab) = log(a) + log(b))
Difference TF-IDF and Word2vec?
training time:
fast vs slow
inference time:
fast vs fast (lookup)
semantics:
summarization via word counts vs semantics w.r.t. neighboring words only
embedding:
per-document embedding vs per-word embedding
sparsity:
sparse, size |V| (vocabulary) vs dense, size h (embedding dimension)
What are shared limitations of word2vec and TF-IDF?
no word-sense disambiguation (cell as in prison, telephone, or biology -> the same word is not differentiated by meaning…)
cannot handle out-of-vocabulary words
ignores order of words (up to the neighborhood relationship)
based on word counts (TF-IDF) or word co-occurrence (word2vec), far from real semantic understanding
How does sentence and document embedding work with word2vec?
simple summation of the word vectors of a sentence
weighted summation via TF-IDF
aggregate via RNN
advanced methods e.g. doc2vec
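A hedged sketch of the first two options (assumes model is the gensim Word2Vec trained above; weights is a hypothetical token -> TF-IDF score lookup):

import numpy as np

def sentence_embedding(tokens, weights=None):
    vecs = [model.wv[t] * (weights[t] if weights else 1.0)
            for t in tokens if t in model.wv]
    return np.sum(vecs, axis=0)  # plain or TF-IDF-weighted sum of word vectors

print(sentence_embedding(["i", "want", "to", "sing"]).shape)  # (8,)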