Text Classification Types
Single Label - exactly one label always predicted for each data point
Multi-Label - arbitrary number of labels may be predicted
FastText Architecture
2 Layer NN
First Layer: Embed each word individually
One hot encode
Linear Layer to map to embedding
Mean Pooling over all embeddings
Second layer: Linear Map from pooled to logits
Softmax Classifier
FastText Challenges
By default it's BoW (word order is ignored) -> use n-grams
Sequence Labeling + Applications
For a given sequence predict a label for each element
Applications
POS Tagging
NER
Text Chunking
POS Tagging + In/Out
Task of predicting PoS tags for an already tokenized text
Input: tokenized Text
Output: POS Tag per token
Main POS Tags + examples
- Noun (“house”, “dog”)
- Verb (“to jump”, “to run”)
- Adjective (“cold”, “hot”)
- Preposition (“in”, “on”, “by”)
- Determiner (“the”, “a”)
- Conjunction (“and”, “or”)
Open vs Closed POS Tags
Closed Tags
new words rarely added
ex: conjunctions, prepositions
little semantics, express grammatical relationships
Open Tags
Commonly accept addition of words
ex: verb, nouns
“content words”
POS Challenge + Examples
There is no 1:1 mapping between words and tags
“Can” auxiliary verb or a noun
“will” auxiliary verb, a noun or a proper noun
Universal POS Tags
Each language has different tags
makes it hard to compare, work cross-lingual
Development of universal tagset for 22 different languages
Multilingual POS Tagging
Low-resource POS Tagging
Naive POS Tagger Architecture + Limitations
Embed each word of sentence
Linear map to embeddings
Linear map + softmax for classification
-> Word order not utilized “I can open the can”
POS Tagger Loss Computation
Get log-softmax activations for each token
Compare to index of the gold tag
Determine indices using tag dictionary
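The loss computation above can be sketched in plain Python. This is a minimal illustration, not the course implementation: the tag dictionary, the scores, and the helper names `log_softmax` and `nll_loss` are assumptions made up for the example.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

def nll_loss(logits_per_token, gold_tags, tag_to_index):
    """Mean negative log-likelihood of the gold tags."""
    total = 0.0
    for logits, tag in zip(logits_per_token, gold_tags):
        # Look up the gold tag's index and read off its log-probability.
        total -= log_softmax(logits)[tag_to_index[tag]]
    return total / len(gold_tags)

tag_to_index = {"NOUN": 0, "VERB": 1, "DET": 2}  # hypothetical tag dictionary
logits = [[0.1, 0.2, 3.0], [2.5, 0.3, 0.1]]      # made-up scores for "the dog"
loss = nll_loss(logits, ["DET", "NOUN"], tag_to_index)
```

The loss shrinks as the log-probability assigned to each gold tag approaches 0.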
SingleWordTagger Init Pseudocode
import torch

class SingleWordTagger(torch.nn.Module):
    # 1: store the hyperparameters
    def __init__(self, vocab_size: int, num_tags: int, embedding_dim: int):
        super().__init__()
        self.vocab_size = vocab_size
        self.num_tags = num_tags
        self.embedding_dim = embedding_dim
        # 2: embedding layer, linear classifier, and NLL loss (expects log-probabilities)
        self.embedding = torch.nn.Embedding(vocab_size, embedding_dim)
        self.linear = torch.nn.Linear(embedding_dim, num_tags)
        self.loss_function = torch.nn.NLLLoss()
SingleWordTagger Forward Pseudocode
    # Method of SingleWordTagger; needs `from typing import List, Optional` at module level.
    # 1
    def forward(self, tokens: List[int], pos_tags: Optional[List[int]] = None):
        log_probs_list = []
        # 2
        for token in tokens:
            embedding = self.embedding(torch.tensor([token]))
            features = self.linear(embedding)
            log_probs_list.append(torch.nn.functional.log_softmax(features, dim=-1))
        # 3
        log_probs = torch.cat(log_probs_list, dim=0)
        result = {"log_probs": log_probs}
        # 4
        if pos_tags is not None:
            result["loss"] = self.loss_function(log_probs, torch.tensor(pos_tags))
        return result
1. Forward pass takes as input:
- list of token indices
- optionally, list of PoS tag indices
2. For each token in sentence:
- embed
- linear map
- activate
- append activation to list
3. Concatenate activation vectors into a matrix
4. Only during training: Compute loss by comparing
- log-softmax activations
- indices of the gold PoS tags
FixedWindow Tagger
Classify POS Tags based on word and context
word embedding + embedding of surrounding tokens
concatenate embedding vectors
Classification on concatenated vector
Window size = 1 (one word before and after)
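A small sketch of how the context windows could be built before embedding and concatenation. The function name `context_windows` and the `<PAD>` symbol for positions outside the sentence are assumptions, not part of the original notes.

```python
def context_windows(tokens, context_size=1, pad="<PAD>"):
    """Return, for each token, the token plus context_size neighbours on each side."""
    padded = [pad] * context_size + list(tokens) + [pad] * context_size
    width = 2 * context_size + 1
    return [padded[i:i + width] for i in range(len(tokens))]

windows = context_windows(["I", "can", "open", "the", "can"])
# windows[1] is the window around the first "can", windows[4] around the second
```

The two occurrences of "can" now receive different inputs, which a single-word tagger cannot provide.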
FixedContextTagger Init Pseudocode
import torch

class FixedContextWordTagger(torch.nn.Module):
    def __init__(self, vocab_size: int, num_tags: int, embedding_dim: int, context_size: int):
        super().__init__()  # required before registering submodules
        self.embedding = torch.nn.Embedding(vocab_size, embedding_dim)
        self.context_size = context_size  # size of the context window
        # The input for this layer is the concatenation of all relevant tokens (i.e. the token
        # itself + context_size tokens to the left + context_size tokens to the right)
        self.linear = torch.nn.Linear(embedding_dim * (1 + 2 * context_size), num_tags)
1. Linear map from context window to number of tags
Limitations - Fixed Window POS Tagger
Classification may lack important context
Example:
“mean” is a VERB in most cases
but it is an adjective in “a lean mean fighting machine”
With context size 1, the most important words are not included
Syntax
the study of how words and morphemes combine to form larger units such as phrases and sentences
Syntax is fundamentally language specific.
Syntax vs Semantics
Syntax: grammatically well formed
Semantics: the content makes sense
Ex: “Colorless green ideas sleep furiously”
Word Semantics
Meaning of words.
“great” and “good” are both positive terms
“movie” and “film” are synonyms
“dialogue” and “characterisation” are aspects of movies
Word Semantic Approaches
Lexical - meanings of individual words and their relationships (Synonyms, Antonyms)
Distributional - representing the meaning of words based on the contexts in which they appear
WordNet
Lexical Databases
Words manually grouped into sets of cognitive synonyms
WordNet Limitations conceptual + practical
Conceptual
Manually constructed
Discrete senses (what is a sense? what is a synonym?)
Limited relationships modeled (lunch -> restaurant)
Do perfect synonyms exist? (H2O - water)
Practical
How to connect a word in a sentence to structured knowledge in WordNet?
Ex: “the movie was good”, which good?
Similarity in WordNet
WordNet is hierarchical (tree structure)
Similarity: number of steps on shortest path between two words.
WordNet Hypernym
A hypernym is a categorical description that groups words.
Examples
Car and motorcycle are both “motor vehicle”
tie is a type of “neckwear”
WordNet - Synsets
Group lexemes that are quasi-synonyms
Ex: “tie” -> “necktie”
Lexemes
Cognitive synonym
Ex: “tie” and “ties”
Lemma: “tie”
Lexemes
neckwear
equal scores
relationship
Distributional Semantics + Motivation
representing the meaning of words based on the contexts in which they appear
Analyse a lot of text to derive latent representations of words
Motivation
Manual Specification is too expensive -> automatic
Discrete representation is problematic -> latent representation
Distributional Hypothesis
difference of meaning correlates with difference of distribution
Distributional Semantics Approaches
Count-based - Characterise words through co-occurrence with other words (co-occurrence matrix / vectors)
Prediction-based - Predict co-occurring words in a context (e.g. CBOW)
Co-Occurrence Matrix
Capture frequency with which pairs of words appear together
rows / columns represent words
cell value (i,j) is number of times word i appears in the context of word j
Has a context window
Co-Occurrence Matrix Process
Build Vocabulary (length n)
Create nxn matrix
count how often two words co-occur in the corpus within a fixed-size word window of size k
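The three steps above can be sketched as follows. This is an illustrative toy, with a made-up corpus and an assumed helper name `cooccurrence_counts`; a real corpus would need a sparse matrix.

```python
from collections import defaultdict

def cooccurrence_counts(tokens, k=2):
    """Count how often each ordered word pair occurs within k positions of each other."""
    counts = defaultdict(int)
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - k), min(len(tokens), i + k + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts

tokens = "the cat sat on the mat".split()
counts = cooccurrence_counts(tokens, k=1)

# Step 1: vocabulary; step 2: n x n matrix; step 3: fill in the counts.
vocab = sorted(set(tokens))
matrix = [[counts.get((a, b), 0) for b in vocab] for a in vocab]
```

Each row of `matrix` is the count-based vector of one vocabulary word.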
Co-Occurrence Limitations
Frequent words
Distributions dominated by statistically insignificant co-occurrences (e.g. “the” is everywhere) -> Solved by PMI
Huge vocabulary (easily >100k words) -> 100k-dimensional vectors -> need for compression (SVD)
PMI Matrix
Pointwise Mutual Information Matrix
statistical significance of two words appearing together
pmi(x;y) = log( p(x,y) / (p(x) p(y)) )
p(x), p(y): marginal probabilities of occurrence of each word
p(x,y): co-occurrence probability
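A quick numeric sketch of why PMI discounts frequent words. The probabilities below are invented for illustration, and the natural logarithm is an assumption (the notes do not fix the base).

```python
import math

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information: log of observed vs. expected co-occurrence."""
    return math.log(p_xy / (p_x * p_y))

# Two rare words that co-occur more often than chance would predict.
pmi_rare = pmi(p_xy=0.05, p_x=0.1, p_y=0.1)

# A very frequent word (like "the") co-occurring exactly as often as chance predicts.
pmi_frequent = pmi(p_xy=0.05, p_x=0.5, p_y=0.1)
```

Both pairs have the same raw co-occurrence probability, but only the first gets a positive score.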
SVD
Singular Value Decomposition
Transformation of high-dimensional into low-dimensional space
Singular values: rank the axes (singular vectors); projecting onto the top ones gives minimal projection error
SVD for Co-Occurrence
Sparsity reduction: Converts a large matrix with many zeros to a dense representation
Noise Reduction
High-order co-occurrence:
First-order co-occurrence means that two words directly co-occur
Higher-order co-occurrence means that two words co-occur with similar words
latent Meaning
mapping captures the latent (hidden) meaning in the words and the contexts (embedding)
Similarity of Words
Cosine Similarity
cos(θ) = (A·B) / (|A| |B|)
A, B are embedding vectors
similar = close to 1
not similar = close to 0
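A minimal sketch of the cosine formula in plain Python; the vectors are made-up toy embeddings.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # unrelated vectors
```

Because the result depends only on the angle, vector length (often tied to word frequency) does not affect it.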
Word Embedding + Properties
Numerical representation of a word in a lower-dimensional space
similar words tend to be closer
certain algebraic operations can produce meaningful results (king - man + woman ≈ queen)
Embedding Methods (name)
One Hot
Word2Vec
BERT
GPT
Embedding Evaluation
Intrinsic: Measure whether certain properties exist in embeddings
Word Similarity correlate to human judgement
Analogies
Extrinsic: See if the embedding improves a downstream NLP task
Intrinsic Evaluation Pro / Cons
Pros: fast to test, checks specifically for certain properties
Cons: no clear correlation to usefulness in downstream tasks
Extrinsic Evaluation Pro / Cons
Pros: evaluates usefulness for actual task
Cons: slower to test, harder to isolate embeddings from rest of system
Naive Embedding
Each word is passed through the same embedding layer individually
Task specific Embeddings
Any multilayer NN with embedding layer will learn dense representations of word semantics
Difference between Word2Vec and FastText
Word2Vec multipurpose representation
FastText trained for sentiment will learn embeddings exactly for that task
Pre-trained vs Random initialisation
“from scratch” -> randomly initialised
pre-initialise -> pre-training task (Word2Vec) -> Transfer Learning
Idea: Words that occur together tend to have similar meanings
NN with one hidden layer
One hot encoding as input and output
Pre-Training for NLP Prerequisites
Desiderata
task should be exceedingly difficult -> require general language understanding
Near endless amount of training data
Ex: Language Modelling
Skip-Gram + In/Out
Technique to learn word embeddings
Developed with Word2Vec
Basic Model
Input: A word
Output: A co-occurring word
Training: Fixed Sliding window over text
Skip-Gram Architecture
Two linear layers
First layer (embedding layer) takes the one-hot vector and projects it to a smaller vector
Second layer takes the smaller vector (word embedding) and predicts a co-occurring word
Projection to vocab size + softmax
Named Entity Recognition
Identifying and classifying named entities of predefined categories such as people, organisations…
Goal: extraction of structured information
Skip-Gram Embeddings
Throw away second layer and only use first layers output as embeddings
Naive NER Tagging Problems
Entity boundaries -> Encode entity boundaries within the tag (BIO-2)
First token of entity starts with B, all other with I
NER Tagging schemes
BIO-2
BIOES
BIOES Tagging
- B: Beginning of an entity class
- I: Inside an entity class
- E: End of an entity class
- S: Single-token entity
- O: Out (no entity)
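As a sketch of the scheme, entity spans can be converted to BIOES tags like this. The span format `(start, end_exclusive, label)`, the `B-PER`-style tag strings, and the example sentence are assumptions for illustration.

```python
def spans_to_bioes(num_tokens, spans):
    """Turn (start, end_exclusive, label) entity spans into one BIOES tag per token."""
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"S-{label}"          # single-token entity
        else:
            tags[start] = f"B-{label}"          # first token
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"          # inner tokens
            tags[end - 1] = f"E-{label}"        # last token
    return tags

tokens = ["George", "Washington", "Carver", "visited", "Berlin"]
tags = spans_to_bioes(len(tokens), [(0, 3, "PER"), (4, 5, "LOC")])
```

Dropping the E and S cases (tagging them I and B instead) gives BIO-2.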
NER Evaluation
Benchmark Dataset
Metrics
Precision, Recall, F1-Score
Confusion Matrix in NER
TP: Correct Tag predicted
FP: Wrong Tag predicted
FN: O predicted, even though there is an entity
TN: O token correctly predicted (easy, most tokens are O)
Entity vs Entity Mention
Entity: Unique object which is referenced by one or multiple entity mentions
Mention examples
Apple Inc. is the world's largest…
The new Apple iPhone X…
-> All mention the entity Apple Inc.
FLERT
SOTA NER Tagger
Finetunes transformer on document level context
NLI
Natural Language Inference
label pairs of sentences as entailing/contradictory/neutral
Classification Task
NLI Approaches
Dual Encoder Architecture
Two encoders for the premise and hypothesis -> Concat outputs -> Softmax classifier
Limitation: both encoded separately
Encode as single string
Special separator token in the middle
same encoder for both
Softmax Classifier on Output
GLUE
General Language Understanding Evaluation
11 different NLP tasks, mostly NLI
Text Summarisation + Types
Task: Summarise one (or multiple) texts in a short paragraph
Extractive:
Identify key passages in text and use those in summary
Can be implemented with Sequence Labeling
Abstractive
Summarise in new Words
Can be implemented with Seq2Seq
Seq2Seq + Applications
family of ML approaches for NLP
input: Text
Output: text
Machine Translation
Text Summarisation
Text/Code Generation
Seq2Seq Evaluation
BLEU Score
Compares the machine-written translation $ŷ$ to one or multiple gold translation(s)
n-gram precision
separately compute how many n-grams (usually 1-4 grams) of ŷ are found in each gold translation
Choose the gold translation y with the highest overlap as reference
Percentage of n-grams found in reference
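The n-gram precision described above can be sketched as follows. This follows the simplified description in these notes (pick the best-overlapping reference, report the matched fraction); it is not a full BLEU implementation, which also combines several n-gram orders and applies a brevity penalty. The sentences and function names are made up.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(candidate, references, n):
    """Fraction of candidate n-grams found in the best-matching reference."""
    cand_counts = Counter(ngrams(candidate, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    best = 0
    for ref in references:
        ref_counts = Counter(ngrams(ref, n))
        # Clip each n-gram's count so repeats are not rewarded beyond the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        best = max(best, overlap)
    return best / total

candidate = "the cat sat on the mat".split()
references = ["the cat is on the mat".split(), "a cat sat on a mat".split()]
p1 = ngram_precision(candidate, references, 1)
p2 = ngram_precision(candidate, references, 2)
```

Precision usually drops as n grows, since longer n-grams are harder to match exactly.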
RNN for Seq2Seq
Idea: use two different RNNs with separate vocabularies
encoder: only learns to predict a final encoder hidden state
decoder: trained with a language modeling objective, conditioned on final encoder state
Encoder produces hidden state as input for decoder
Decode step by step and use previous prediction as input into next step
BLEU Limitations
many ways to validly translate a sentence
good translation can get low BLEU score because of low n-gram overlap
Multi-Layer RNN for Seq2Seq
Several RNN Layers
each layer is independent RNN cell
Idea: lower RNNs compute lower-level features, and higher RNNs compute higher level features
Seq2Seq Decoding Approaches
Greedy: Step by step with no way to undo decisions
Exhaustive search decoding: Compute all possible sequences and pick the best one
Beam Search Decoding: Keep track of the k most probable partial translations until a stopping criterion is reached.
Beam Search Decoding
Core idea: On each step of Decoder, keep track of the k most probable partial translations (which we call hypotheses)
k is the beam size (in practice around 5 to 10)
Each hypothesis has score which is log probability
Stopping:
When a hypothesis produces the <STOP> token, put it aside and continue searching the other hypotheses
Continue until timestep T is reached or at least n hypotheses are complete
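A toy sketch of beam search. The `toy_model` function is a hypothetical stand-in for a decoder's next-token distribution, and the stopping rule is simplified to "stop at `max_steps` or when every beam has finished".

```python
import math

def beam_search(next_probs, k=2, max_steps=5, stop="<STOP>"):
    beams = [([], 0.0)]   # (tokens so far, log-probability)
    finished = []
    for _ in range(max_steps):
        candidates = []
        for tokens, score in beams:
            for token, p in next_probs(tokens).items():
                candidates.append((tokens + [token], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:k]:      # keep only the k best hypotheses
            if tokens[-1] == stop:
                finished.append((tokens, score))  # put completed hypotheses aside
            else:
                beams.append((tokens, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

def toy_model(prefix):
    # Hypothetical next-token probabilities, invented for this example.
    if not prefix:
        return {"a": 0.6, "b": 0.4}
    if prefix[-1] == "a":
        return {"x": 0.4, "y": 0.3, "<STOP>": 0.3}
    if prefix[-1] == "b":
        return {"<STOP>": 0.9, "x": 0.1}
    return {"<STOP>": 1.0}

best_tokens, best_score = beam_search(toy_model, k=2)
greedy_tokens, greedy_score = beam_search(toy_model, k=1)
```

With k=1 the search is greedy and commits to "a" first; with k=2 it keeps "b" alive and finds the more probable complete sequence.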
Teacher Forcing
Works by using the actual or expected output from the training dataset at the current time step as input in the next time step, rather than the output generated by the network.
Ratio: teacher forcing is randomly deactivated for a percentage of decoder steps, in which predictions are used instead of the gold input
Information Bottleneck Problem
Can occur when Encoding of the source sentence is a single vector representation (last hidden state of RNN)
Entire semantics of arbitrary long text must be compressed into single hidden state
RNN cell might struggle (limited capacity in hidden state)
Difficult learning problem
Seq2Seq Training
Loss computed for the task “next symbol prediction” on the target language sentence
Backpropagation through entire network
RNN Limitations for Language Modelling
RNNs pass hidden state from step to step (one direction only)
Often we want to model word in full context
A bidirectional RNN is also just two RNNs glued together: no deep bidirectionality
Motivation Attention for Seq2Seq
Attention provides Solution for information bottleneck problem
Idea: do not only decode from last hidden state of encoder
Instead: additionally use direct connection to the encoder to focus on a particular part (i.e. words) of the source sequence
Attention in Seq2seq
In Seq2Seq + Attention, information flows through two “channels” from encoder to decoder:
Recurrence: Information is passed from item to item in the sequence, first encoder then decoder (red box)
Attention: Information is passed more directly via weighted sum and attention output (blue)
Attention Mechanism
NN component that helps to focus on relevant part of input data while making predictions
Instead of relying on a fixed-length representation, the model dynamically weights and combines relevant information from different positions in the input sequence
Attention Approach
Given a sequence, the model calculates attention scores -> represent the importance of each position
Attention scores are normalised -> create a distribution over the input sequence
The model combines the input sequence elements (embeddings) with the corresponding attention weights
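The three steps above (score, normalise, combine) can be sketched with basic dot-product attention in plain Python. The vectors are toy values and the function names are assumptions.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_attention(query, keys, values):
    # 1. One score per input position: dot product of the query with each key.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # 2. Normalise the scores into a distribution over positions.
    weights = softmax(scores)
    # 3. Weighted sum of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

weights, output = dot_product_attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[10.0, 0.0], [0.0, 10.0]],
)
```

The first position matches the query better, so it receives the larger weight and dominates the output.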
Self-Attention
Attention from one state to all states in same set
Computes contextualised representation of word given all other words in sentence
Attention Score Approaches
Basic dot-product attention
Multiplicative attention
additive attention
Benefits of Attention
Improves MT
Solved bottleneck problem
Helps with Vanishing Gradient
By inspecting attention distribution we can see what the decoder was focusing on.
Vanishing Gradients + Solutions
Occur during training of NNs, particular RNNs
when the gradients of the loss function become very small during backpropagation.
leads to very slow / effectively halting learning
arises from the repeated multiplication of gradients that are less than one, leading to exponential decay as they propagate backward through the network
Solutions
Activation functions that maintain gradient magnitude (ReLU)
Other Architectures: LSTMs, Transformers…
Attention Components
Query - transformed representation of the input element, usually input * weight matrix
Key - provides information about the element that can be compared or matched against (also input * weight matrix)
Value - information or content associated with input element
Self-Attention vs RNN
Advantages
maximum interaction distance O(1)
Deeply bidirectional
All word representations can be computed in parallel
Disadvantages
Sometimes we don't want bidirectionality
Attention just does weighted averaging (no nonlinearities)
Word order is no longer encoded (it's BoW)
Problems of Vanilla Self-Attention
Sometimes want no bidirectionality -> Solution: Mask future states
No non-linearities -> Add a small feed-forward network between layers (linear followed by ReLU)
No positional information -> Add position representation to the inputs
Self-Attention Positional Encodings
Idea: represent each position as positional vector
Add positional vector to Value, Key and Query
Self-Attention Block (Parts)
Self Attention
Positional Encodings
Feed Forward
Residual Connection
Layer Normalisation
Residual Connections
Additional direct connections in the network that enable the gradient to bypass attention blocks
- thought to make the loss landscape smoother
Layer Normalisation
Cut down on uninformative variation in hidden vector values by normalising to zero mean and unit standard deviation within each layer
Transformer Idea
Only use attention in encoder + decoder but no RNN
Transformer Architecture (components)
Encoder
Decoder
Masked Attention
Cross-Attention
Linear Layer + Softmax
Softmax
Turns a vector of K real values (our model predictions) into K probabilities that sum to 1: softmax(z)_i = exp(z_i) / Σ_j exp(z_j)
For K = 2 it reduces to the sigmoid (S-shaped curve)
Self-Attention Learnable Positional Encoding +Pro/Cons
most used option
all p_i (positional vectors) are learnable parameters
one-hot of each position -> learn an embedding for each position
Pros
each position gets to be learned to fit data
Cons
Can't extrapolate to indices outside of the defined range
Hard maximum sequence length
RNN
Recurrent Neural Network
Family of NNs
Allow for information to flow in cycles
Multi-Head Attention
Idea: multiple attention heads per layer
Each attention head might focus on some other “aspect” and construct value vectors differently
Just create n independent attention mechanisms and combine output (concat + linear)
RNN intuition
RNNs basically read word by word (transformers see everything at once)
But: RNNs can't go back, only forward
Recurrence in RNNs
Important to remember prior information
Output: hidden state at time t
input 1: representation of input x at time t
input 2: previous hidden state
RNN Strength
model sequences of variable length
RNN Hyperparameters
Input sequence length
Neurons, Layers
Activation Function
Learning Rate
Class threshold
RNN Limits + reasons
Struggle to capture long term dependencies
vanishing gradients
exploding gradients
RNN vanishing gradient solutions
truncated BPTT
Weight regularisation
Gradient clipping
Other activation function (ReLU)
Gated Networks (LSTM)
RNN for Sequence Labeling
Each hidden state through Linear+Softmax to classify POS Tag for example
RNN for Text Classification
Final hidden state as input for softmax classifier
Mean of all hidden states as input to softmax classifier
Exploding Gradient
Happens when large error gradients accumulate during backprop -> excessively large updates/steps
arises from repeated multiplication of gradients through many layers -> exponential growth
Activation functions
Architecture
Seq2Seq task
translate one sentence to another language
MT Challenges
Source sentence needs to be fully understood (less forgiving than sentiment analysis)
Semantic differences (concepts that don't exist in every language)
Underspecification - Languages differ in what can be semantically underspecified
Ambiguities - one word, multiple translations (tie)
Multiple Languages MT
earlier: one model for one type of translations
today: one model for multiple languages
A special token specifies which language to translate to.
Seq2Seq Model Translation Init
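# Constructor of the translation model (a torch.nn.Module); the dictionaries are assumed to expose size()
def __init__(self, source_dictionary, target_dictionary, source_RNN_hidden_size, target_RNN_hidden_size, embedding_size):
    super().__init__()
    # encoder: source-language embeddings + RNN
    self.source_embeddings = torch.nn.Embedding(source_dictionary.size(), embedding_size)
    self.source_rnn = torch.nn.LSTM(embedding_size, source_RNN_hidden_size)
    # decoder: target-language embeddings + RNN + projection to the target vocabulary
    self.target_embeddings = torch.nn.Embedding(target_dictionary.size(), embedding_size)
    self.target_rnn = torch.nn.LSTM(embedding_size, target_RNN_hidden_size)
    self.prediction_head = torch.nn.Linear(target_RNN_hidden_size, target_dictionary.size())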
Language Modeling + Architecture Types
Language modeling is the task of predicting what word comes next.
Models
Statistical Models (n-grams)
NN-based (RNN, Transformer)
Language Modeling Applications
MT
Text Gen
Speech Recogn.
Language Model (general Def)
System that assigns probability to a piece of text.
e.g. Given some sequence of words, what are the probabilities for all other possible next tokens
LM - Probability of Text + Example Applications
LM assigns probabilities to sequences of words
effectively modeling the likelihood of textual statements
OCR: what token is most likely to be at a certain point
MT: Knowledge in the model helps translation
AutoComplete: Complete search query
LM Special Tokens
Mark e.g. end or beginning of doc / sentence
No word before, but some words more likely in beginning of doc
Introduce reserved symbols <START> / <STOP>
Atomic Units of Language + tradeoff
Word
Subtoken
Character
Tradeoff: Vocabulary size vs sequence length
Language Modelling Training data
Raw text enough.
Just predict the next token, given a sequence of tokens
essentially sliding window over raw text
Subtoken (idea)
Instead of modeling language as distribution over words, model as distribution over subtokens
Subtoken Level Advantages
Learns meaning of prefixes / endings (-ed, -ing)
Learns meaning of spelling variations (cool, cooool)
An unknown word might contain parts that allow for partial inference of its meaning (“guesstimate”)
Create Subtokens
No Tokenizer
Done by using heuristics
Tokenisation
Splitting a text into tokens
Input: text
output: list of tokens (each representing a word)
Tokenisation Challenges
Correctly tokenise apostrophes
Clitics: syntactically independent but phonologically dependent words
Compounds
Clitics
syntactically independent but phonologically dependent words
ex: gimme -> give me
Context: Tokenisation
Compounds + Ex
words formed by combining two or more independent words
open ex: “high school”, “must-have”
closed ex: “Abwasserbehandlungsanlage”
Multi-word units + Example
multi-words that function as a single syntactic unit
San Francisco
by and large
-> are split by tokenizer
Vocabulary
all unique words in corpus
replace rare words with UNK
(optionally) add special tokens like <START> and <STOP> for the beginning and end of documents
This vocabulary is used both to encode the inputs and the outputs
Character Level Models
Predict next character
Text Generation
Use generated words as input for next time step (opposite of teacher forcing)
This way, you sample one word at a time to generate text
Sample from a multinomial distribution at each step (i.e. generation is non-deterministic)
N-Grams + Types
contiguous sequences of n items from a given sample of text or speech
Types
Unigrams
Bigrams
Trigrams
Higher-order n-grams
Word2Vec Idea
Words that are used/occur in same contexts tend to purport similar meanings
Neural network with one hidden layer; one-hot encoding as input and output
Distributional Representation
Word2Vec Architecture
One hidden layer NN
two sets of weights
Input / Output one hot
Text Classification Evaluation Metrics
accuracy
Precision
F1
Accuracy
Fraction of predictions that are correct
Accuracy = (#correct preds/|Dataset|)
Precision
ratio of true positive predictions to the total number of positive predictions
High: few FP errors -> important when cost of FP is high
Precision = (TP/(TP+FP))
Recall
proportion of positive examples that were correctly classified
important when the cost of false negatives is high.
Recall = (TP/(TP+FN))
F1-Score
Balances Precision and Recall, single score that reflects both
reflects both the accuracy of positive predictions and the ability to identify all positive instances.
Useful for imbalanced datasets
Higher F1: Overall better Perf.
F1= (2 * Prec * Rec) / (Prec + Rec)
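The precision, recall, and F1 formulas can be checked with a few lines of Python. The counts are invented, and returning 0.0 for empty denominators is a convention chosen for this sketch.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up counts: 8 correct positives, 2 false alarms, 8 missed positives.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
```

Here precision is high but recall is low; F1, as a harmonic mean, lands closer to the weaker of the two.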
CE Loss
Measures the performance of a classification model whose output is a probability value between 0 and 1. Lower log loss indicates better performance.
POS Tagger Evaluation Metrics
Language Modelling Evaluation Metrics
Perplexity
BLEU
Perplexity + Intuition
It measures how well a probabilistic model predicts a sample of text.
Perplexity is a measure of uncertainty. A lower perplexity indicates that the model is less "perplexed" by the text and has a better understanding of the language patterns.
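A minimal sketch of the usual definition: perplexity is the exponential of the average negative log-probability the model assigns to each token. The probability lists are made up.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability of the observed tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

uniform_over_four = perplexity([0.25, 0.25, 0.25, 0.25])
confident_model = perplexity([0.9, 0.8, 0.95])
```

A model that always spreads its probability evenly over four options has perplexity 4, which is why perplexity is often read as an effective branching factor.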
Relation Extraction Evaluation Metric
Classification Metrics
Relation Extraction
Goal: identify semantic relationships between entity mentions
- relations can be directed
- there may be no relation (as defined by the application)
RE Approaches
Rule based: Manually create extraction features (tree-matching rule)
ML-based: For each pair of entities in a sentence, classify which (if any) relation holds
Rule based RE Limitations
Rulesets quickly become complex / unmanageable
Rules easily overfit data
ML Based RE Process
Get all possible entity pairs
Create new sentence for each pair
<X, Y> - ‘X’ was founded by ‘Y’ in Z
<X,Z> - ‘X’ was founded by Y in ‘Z’
Classify generated sentences
(founder, located-in, no-relation…)
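The first two steps of the process can be sketched as below: enumerate entity pairs and create one marked copy of the sentence per pair, which a classifier would then label. The function name `mark_pairs` and the quote-based marking are assumptions modelled on the example above.

```python
from itertools import combinations

def mark_pairs(tokens, entity_positions):
    """Create one marked sentence per unordered pair of entity positions."""
    examples = []
    for a, b in combinations(sorted(entity_positions), 2):
        marked = [f"'{t}'" if i in (a, b) else t for i, t in enumerate(tokens)]
        examples.append(((tokens[a], tokens[b]), " ".join(marked)))
    return examples

tokens = ["X", "was", "founded", "by", "Y", "in", "Z"]
examples = mark_pairs(tokens, [0, 4, 6])
```

For directed relations, `itertools.permutations` could be used instead so that both orders of each pair are classified.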
ArgMax + use in ML
The argmax function returns the argument or arguments (arg) for the target function that returns the maximum (max) value from the target function.
In ML: most commonly used in machine learning for finding the class with the largest predicted probability.
Softmax and argmax relation
Softmax: Normalises output values (logits) to add up to 1
Argmax: returns the index (e.g. the class) of the highest value in a set of numbers
Relation: Sequential use: First Softmax to normalise, then Argmax to pick the highest one
During training Argmax not used (but loss function),
during inference, argmax
Interpretability
Softmax: Probabilistic interpretation
Argmax: Clear decision
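A short sketch of the sequential use of softmax and argmax at inference time; the logits are made-up values.

```python
import math

def softmax(logits):
    """Normalise logits into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

logits = [2.0, 0.5, -1.0]
probs = softmax(logits)        # probabilistic interpretation
predicted = argmax(probs)      # clear decision: index of the most likely class
```

Softmax preserves the ordering of the logits, so taking the argmax of the logits directly selects the same class.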
Pooling Layer + Types
reduce the spatial dimensions of the input feature maps, thereby decreasing the number of parameters
Mean
Max
Min
Vector Concatenation vs Pooling
Concatenation - length is the sum of the lengths of all concatenated vectors
Pro: Preserves word order
Neg: Fixed size context only
Pooling - length is the same as that of each vector
Pro: Arbitrary number of vectors
Neg: Doesn't preserve word order
Concatenation vs Pooling for POS Tagging and Classification - What and why?
Text Classification - Pooling
Why: usually benefits from global scope.
POS Tagging - Concatenation
Why: specific context for each word important including word order (sequential information).
BoW Classifier
Count Vectorise Text
1 NN Layer
Activation
Why use UNK
Unknown words
Reduce Sparsity (long tail distribution)
Not using unk, most common downsides
No way to handle out of vocabulary words
Model can overfit on rare words / training not effective
BoW Classifier Conceptual Limits
Can't handle words that are not in the training set
Potentially infinite set of words
Creating labels for training is expensive
Word2Vec vs WordNet - Scalability
Word2Vec more scalable as it is trained on raw text
Lexical representation - Scaling
Very hard as created manually
ex: WordNet
Distributional Representation - Scaling
Needs near infinite data
lots of computational resources
WordNet vs Word2Vec - Ambiguity
WordNet: Can model the fact that words have multiple meanings through lemmas and lexemes
Word2Vec - cannot model the fact that words have multiple meanings
WordNet vs Word2Vec - Word Similarity
WordNet: Walks along tree (counting steps)
Word2Vec: Distance in high-dimensional space, either Euclidean or, most commonly, cosine distance
Word Similarity with Embeddings
Can view embedding as point in space
use distance metric as measure of similarity
Euclidean
Cosine
Why does FastText use N-Grams + Ex
It's BoW, so it cannot distinguish between
“Not bad at all - it's good”
“Not good at all - it's bad”
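The two example sentences can be compared directly: as bags of words they are identical, but their bigrams differ. This is a plain-Python sketch of word n-gram features, not FastText's own feature extraction.

```python
from collections import Counter

def word_ngrams(tokens, n):
    """All contiguous word n-grams, joined into strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

a = "not bad at all its good".split()
b = "not good at all its bad".split()

same_bag_of_words = Counter(a) == Counter(b)
bigrams_a = set(word_ngrams(a, 2))
bigrams_b = set(word_ngrams(b, 2))
```

Bigrams such as "not bad" and "not good" carry the local word order that the pooled unigram embeddings lose.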
FastText Init Pseudocode
NLP Pipeline Steps
Text Splitting
Tokenisation
POS-Tagging
Lemmatization
Dependency Parsing
Sentence parsing + Approaches
Segmenting raw text to sentences
Approaches
Rule-based (punctuation etc.)
Prediction based (binary classifier for punctuation symbol)
Lemmatisation + Ex
Lemmatisation determines the dictionary form of a token
mice -> mouse
cars -> car
found -> find
Morphological Features + Examples
Consists of three levels of representation (ex: “a solution was found”)
Lemma (dictionary form)
find
POS-Tag
VERB
set of features (lexical / grammatical Properties)
past tense, passive voice
Lexical and Grammatical properties + Ex
gram. number (“mice” is plural of “mouse”)
gram. gender (“Pilotin” is fem. of “Pilot”)
gram. voice (“was found” is passive of “find”)
gram. tense (“searched” is past tense of “search”)
Combine Encodings of Word and POS Tag
One hot + Embed word
One hot + Embed POS Tag
Concatenate both -> use as input
Vauquois Triangle
Concept how much analysis to perform on source text
Levels
Direct: Text to text
Transfer Method: Add some syntax
Interlingua: Fully analyze all semantics of source sentence
Vanilla RNN Mathematical Def.
What does LSTM solve?
introduce a memory cell that can selectively remember or forget information over time.
Makes it easier to preserve information over steps -> Solves vanishing gradients
LSTM Activation function for gating + Why
Forgetting: Sigmoid
Produces a vector with values between 0 and 1
Updating
Tanh - over current input + last hidden state (what to write)
Sigmoid - over current input + last hidden state (how much)
Read
Either use last state in Softmax Classifier
Pool over all hidden states and then use softmax classifier
RNN vs Self-Attention conceptual disadvantages
Information Bottleneck
No parallelisation (scaling issues)
No bidirectionality
Text generation Strategies
Sampling
Greedy Decoding
Beam Search
Exhaustive Search
Temperature parameter
For sampling text generation strategy
Controls randomness in sampling
Higher Temp: Model explore more randomly (can lead to more “creative” generations)
Lower Temp: Model exploits more own knowledge, chooses most likely words more consistently
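A sketch of temperature-scaled sampling, assuming the common formulation in which logits are divided by the temperature before the softmax. The logits and function names are made up.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample(logits, temperature, rng):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.0]
cold = softmax_with_temperature(logits, temperature=0.5)  # sharper distribution
base = softmax_with_temperature(logits, temperature=1.0)
hot = softmax_with_temperature(logits, temperature=2.0)   # flatter distribution
token = sample(logits, temperature=1.0, rng=random.Random(0))
```

As the temperature approaches 0 sampling behaves like greedy decoding; large temperatures approach uniform sampling.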
Beam Search Stopping Criteria
until timestep T reached
or n completed hypotheses
Beam Search Decoding Parameters
k - beam size (how many partial translation to keep track of)
T / n - cutoff, either a timestep or a number of finished hypotheses
BLEU Score
Compares the machine-written translation $ŷ$ to one or multiple gold translations
n-gram precision
separately computes how many n-grams of the prediction are found in each gold translation
choose the gold translation y with the most overlap
-> percentage of n-grams found in the reference translation
BLEU Limitations
Many ways to translate
a good translation can have a bad BLEU score due to low overlap
Cross-Attention Benefit
helps in understanding the context between different sequences. For example, in translation, it allows the model to align the source and target sentences effectively.
Cross attention - keys, values, queries
cross-attention uses the encoder states as keys and values, and the decoder states as queries.
Learnable Layers in attention head
Query Projection
Key Projection
Value Projection
Output Projection
Masked attention
restricting a transformer to only see a certain part of an input sequence
Masked Attention usecases
When transformer should only have access to past and present context, not future
a mask is applied to the attention scores to prevent the model from attending to certain positions in the sequence
Language Modelling
Sequence Generation
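A sketch of a causal mask: scores for future positions are set to negative infinity before the softmax, so their attention weight becomes zero. The all-zero score matrix is a made-up input chosen to make the result easy to read.

```python
import math

def causal_attention_weights(scores):
    """Row i may only attend to positions 0..i; future positions get weight 0."""
    weights = []
    for i, row in enumerate(scores):
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        m = max(masked)                       # finite, because position i is always visible
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.0, 0.0, 0.0] for _ in range(3)]  # uniform raw attention scores
masked_weights = causal_attention_weights(scores)
```

With uniform scores, position 0 attends only to itself, position 1 splits its weight over positions 0 and 1, and position 2 sees all three positions.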
MLM vs CLM
Context
MLM - Bidirectional
CLM - unidirectional
Training Tokens
MLM - randomly masked tokens
CLM - next token
MLM - Bidirectional Transformer (BERT)
CLM - Autoregressive Transformer (GPT)
Application
MLM - Not in sequence generation
CLM - Sequence generation
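The difference in training targets can be sketched by building examples from the same sentence. For brevity this masks a single token per sentence; both helper functions are hypothetical names for illustration.

```python
import random

def mlm_example(tokens, rng, mask_token="[MASK]"):
    """Mask one random position; the model sees context on both sides."""
    pos = rng.randrange(len(tokens))
    inputs = list(tokens)
    inputs[pos] = mask_token
    return inputs, pos, tokens[pos]

def clm_examples(tokens):
    """Each prefix predicts the next token; the model sees only the left context."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = "the cat sat on the mat".split()
mlm_inputs, mlm_pos, mlm_target = mlm_example(tokens, random.Random(0))
clm_pairs = clm_examples(tokens)
```

The MLM input keeps tokens to the left and right of the mask (bidirectional), while every CLM pair contains only a prefix and the token that follows it (unidirectional).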
MLM vs CLM for Sentence level semantics
MLM
Bidirectional Context
Global Context (leverages entire sentence to predict token)
NSP objective
objective used during the pre-training of models like BERT (MLM)
It helps the model understand the relationship between pairs of sentences
During training, the model is given pairs of sentences.
The task is to predict whether the second sentence in the pair is the actual next sentence in the original document (labeled as "IsNext") or a random sentence from the corpus (labeled as "NotNext").
Positive Example (IsNext):
Sentence A: "The cat sat on the mat."
Sentence B: "It was looking at the sun outside the window."
Label: IsNext (1)
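A sketch of how NSP training pairs could be built: consecutive sentences give IsNext pairs, and a random other sentence from the corpus gives a NotNext pair. The function name and the extra sentences are made up for illustration, and the corpus is assumed to contain at least one sentence besides the pair itself.

```python
import random

def nsp_pairs(document, corpus, rng):
    """Return (sentence_a, sentence_b, label) triples; label 1 = IsNext, 0 = NotNext."""
    pairs = []
    for a, b in zip(document, document[1:]):
        pairs.append((a, b, 1))                               # actual next sentence
        negatives = [s for s in corpus if s != a and s != b]  # any other sentence
        pairs.append((a, rng.choice(negatives), 0))           # random sentence
    return pairs

document = [
    "The cat sat on the mat.",
    "It was looking at the sun outside the window.",
    "Then it fell asleep.",                       # made-up continuation
]
corpus = document + ["Stock prices fell sharply on Monday."]  # made-up unrelated sentence
pairs = nsp_pairs(document, corpus, random.Random(0))
```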
Transformer Word Embedding Advantages
Description - Word is described using its context
Disambiguation - can capture different meanings of a word based on its context (word2vec: 1 word = 1 embedding)
Rich Semantic Information - incorporate information from the entire sentence
Ability to fine tune - Word2Vec embeddings are static
Word2Vec vs Transformer Embeddings
Static vs Dynamic
Word2Vec has one representation for one word regardless of context
Finetuning
Word2Vec is trained to predict a word from its surrounding context window (or the context from the word), so there is no way to fine-tune it contextually
Information Extraction Pipeline steps
RE (Relation Extraction)
EL (Entity Linking)
Entity Linking
Link entity mentions to entities from a knowledge base
Entity Linking Challenges
Ambiguity
Chicago has 87 entities on Wikipedia
Synonyms
Name Variations, Nicknames, Spelling variations
Novel entities
Cold start problem
IE Motivation
Info often not available in structured form
Info changes quickly (manual creation infeasible)
Inter-Annotator Agreement
Crowdsourcing Setup
Qualification stage - Small set with known answers
Inject samples with known answers -> Reject annotators if they get these wrong
Redundancy -> 5 annotations per data point
Majority vote
Statistical model
Experience counts
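A sketch of two steps from this setup: rejecting annotators who fail the injected control samples, and aggregating redundant annotations by majority vote. The `min_accuracy` threshold, the function names and the labels are assumptions for illustration.

```python
from collections import Counter

def passes_control(answers, known, min_accuracy=0.8):
    """Check an annotator's answers on the injected samples with known answers."""
    checked = [item for item in known if item in answers]
    if not checked:
        return False
    correct = sum(answers[item] == known[item] for item in checked)
    return correct / len(checked) >= min_accuracy

def majority_vote(annotations):
    """Return the most frequent label and the share of annotators who chose it."""
    label, count = Counter(annotations).most_common(1)[0]
    return label, count / len(annotations)

known = {"s1": "PER", "s2": "ORG"}                       # injected control samples
good_annotator = {"s1": "PER", "s2": "ORG", "s3": "LOC"}
bad_annotator = {"s1": "LOC", "s2": "ORG", "s3": "LOC"}

label, share = majority_vote(["PER", "PER", "ORG", "PER", "LOC"])  # 5 annotations
```

The vote share can serve as a rough confidence value; a statistical model that weights annotators by reliability is the more elaborate alternative to a plain vote.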
Weak Supervision + Types
Multiple different weak labeling functions
Different knowledge bases
Models trained on related domains
Heuristic functions
Distant Supervision + Process
Idea: existing knowledge bases give us a large list of examples for entities and relations
Process
Look up string “Apple” in KB
Derive NER Type from KB entity
Attach as label to span
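The three steps can be sketched with a dictionary standing in for the knowledge base. The KB contents and the sentence are made up, and only single-token spans are matched to keep the sketch short.

```python
def distant_labels(tokens, kb):
    """Label each token by looking it up in the KB; unmatched tokens get 'O'."""
    labels = ["O"] * len(tokens)
    for i, token in enumerate(tokens):
        if token in kb:              # 1. look up the string in the KB
            labels[i] = kb[token]    # 2. derive the NER type, 3. attach it as label
    return labels

kb = {"Apple": "ORG", "Cupertino": "LOC"}            # hypothetical KB entries
tokens = "I ate an Apple in Cupertino".split()
labels = distant_labels(tokens, kb)
```

Here "Apple" is tagged as an organisation although the sentence is about the fruit, which illustrates why distantly supervised labels are noisy.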
Weak Supervision Aggregation Ways
Expectation Maximisation
LLMs as Teacher + Problem
Concept of generating labeled training samples through an LLM
Problem:
Quality of data is prompt dependent
Lack of creativity
Accuracy of data
Improve Annotation Quality
Train annotators
2 independent annotations per data sample
Calculate inter-annotator agreement
Potentially exclude annotators
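One common way to quantify agreement between two independent annotations is Cohen's kappa, which corrects the observed agreement for agreement expected by chance. The notes do not name a specific measure, so kappa is an assumed choice here, and the label sequences are made up.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators who labeled the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability that both pick the same label independently.
    expected = sum(counts_a[l] * counts_b[l] for l in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0    # both annotators always use the same single label
    return (observed - expected) / (1 - expected)

annotator_1 = ["PER", "ORG", "PER", "LOC"]
annotator_2 = ["PER", "ORG", "LOC", "LOC"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

A kappa of 1 means perfect agreement and 0 means agreement at chance level; consistently low values for one annotator can be a reason to exclude them.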
Label Noise Sources
Human error - main source, just labeled incorrectly
Poor data - Data of poor quality (blurry image)
Human label variation - Multiple possible answers / ambiguity
Semi-Automatic annotation - When model picks up wrong signal (emoji)
Noise Robust Learning
Accept that some part of the training data will be incorrectly labeled
Design the machine learning approach to handle this
Noise Robust Learning Approaches
Delay Memorisation
Delay overfitting or stop training before it begins
Clean Subset
Filter suspicious data points from datasets
Additional Clean Data
Small set of additionally expert labeled data
All approaches struggle on real noise, filtering noise seems to be most promising
Problem with Noise Simulation
Simulated noise is used to test noise robust learning
But the noise is not realistic.
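A sketch of the kind of simulated noise meant here, assuming the simplest variant: each label is flipped to a different, uniformly chosen label with a fixed probability. The function name, labels and noise rate are illustrative.

```python
import random

def add_uniform_noise(labels, label_set, noise_rate, rng):
    """Flip each label to a different, uniformly chosen label with probability noise_rate."""
    noisy = []
    for y in labels:
        if rng.random() < noise_rate:
            noisy.append(rng.choice([l for l in label_set if l != y]))
        else:
            noisy.append(y)
    return noisy

label_set = ["PER", "ORG", "LOC", "MISC"]
clean = ["PER", "ORG", "LOC", "PER", "MISC", "ORG"]
noisy = add_uniform_noise(clean, label_set, 0.3, random.Random(0))
```

Flips like these are independent of the input text, whereas real annotation errors tend to cluster on ambiguous or difficult examples, which is why results on simulated noise may not transfer.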
NoiseBench
Evaluate 6 types of noise
NoiseBench Noise Splits
Expert - Original CoNLL-03
Crowd - cheap + non expert
Auto - weak + distant supervision
LLM