Text Classification Types
Single Label - exactly one label always predicted for each data point
Multi-Label - arbitrary number of labels may be predicted
FastText Architecture
2 Layer NN
First Layer: Embed each word individually
One hot encode
Linear Layer to map to embedding
Mean Pooling over all embeddings
Second layer: Linear Map from pooled to logits
Softmax Classifier
FastText Challenges
By default it's BoW -> use N-grams
Sequence Labeling + Applications
For a given sequence predict a label for each element
Applications
POS Tagging
NER
Text Chunking
POS Tagging + In/Out
Task of predicting PoS-tags for an already tokenized text
Input: tokenized Text
Output: POS Tag per token
Main POS Tags + examples
- Noun (“house”, “dog”)
- Verb (“to jump”, “to run”)
- Adjective (“cold”, “hot”)
- Preposition (“in”, “on”, “by”)
- Determiner (“the”, “a”)
- Conjunction (“and”, “or”)
Open vs Closed POS Tags
Closed Tags
new words rarely added
ex: conjunctions, prepositions
little semantics, express gramm. relationships
Open Tags
Commonly accept addition of words
ex: verb, nouns
“content words”
POS Challenge + Examples
There is no 1:1 mapping between words and tags
"can": auxiliary verb or noun
"will": auxiliary verb, noun or proper noun
Universal POS Tags
Each language has different tags
makes it hard to compare, work cross-lingual
Development of universal tagset for 22 different languages
Multilingual POS Tagging
Low-resource POS Tagging
Naive POS Tagger Architecture + Limitations
Embed each word of sentence
Linear map to embeddings
Linear map + softmax for classification
-> Word order not utilized “I can open the can”
POS Tagger Loss Computation
Get log-softmax activations for each token
Compare to index of the gold tag
Determine indices using tag dictionary
SingleWordTagger Init Pseudocode
import torch
from typing import List, Optional

class SingleWordTagger(torch.nn.Module):
    # 1
    def __init__(self, vocab_size: int, num_tags: int, embedding_dim: int):
        super().__init__()
        self.vocab_size = vocab_size
        self.num_tags = num_tags
        self.embedding_dim = embedding_dim
        # 2
        self.embedding = torch.nn.Embedding(vocab_size, embedding_dim)
        self.linear = torch.nn.Linear(embedding_dim, num_tags)
        self.loss_function = torch.nn.NLLLoss()
SingleWordTagger Forward Pseudocode
#1
def forward(self, tokens: List[int], pos_tags: Optional[List[int]] = None):
    log_probs_list = []
    #2
    for token in tokens:
        embedding = self.embedding(torch.tensor([token]))
        features = self.linear(embedding)
        log_probs_list.append(torch.nn.functional.log_softmax(features, dim=-1))
    #3
    log_probs = torch.cat(log_probs_list, dim=0)
    result = {"log_probs": log_probs}
    #4
    if pos_tags is not None:
        result["loss"] = self.loss_function(log_probs, torch.tensor(pos_tags))
    return result
1. Forward pass takes as input:
- list of token indices
- optionally, list of PoS tag indices
2. For each token in sentence:
- embed
- linear map
- activate
- append activation to list
3. Concatenate activation vectors into a matrix
4. Only during training: Compute loss by comparing
- activations to the gold tag indices
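A minimal usage sketch for the SingleWordTagger above; the sizes and index values are made up for illustration:
# toy usage of the SingleWordTagger defined above; sizes and indices are hypothetical
tagger = SingleWordTagger(vocab_size=10000, num_tags=17, embedding_dim=50)
sentence_indices = [12, 7, 431]   # token indices from a hypothetical token dictionary
gold_tag_indices = [4, 2, 1]      # tag indices from a hypothetical tag dictionary

output = tagger.forward(sentence_indices, gold_tag_indices)
print(output["log_probs"].shape)  # torch.Size([3, 17]): one distribution over tags per token
print(output["loss"])             # NLL loss against the gold tags (only during training)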
FixedWindow Tagger
Classify POS Tags based on word and context
word embedding + embedding of surrounding tokens
concatenate embedding vectors
Classification on concatenated vector
Window size = 1 (one word before and after)
FixedContextTagger Init Pseudocode
class FixedContextWordTagger(torch.nn.Module):
    def __init__(self, vocab_size: int, num_tags: int, embedding_dim: int, context_size: int):
        super().__init__()
        self.embedding = torch.nn.Embedding(vocab_size, embedding_dim)
        self.context_size = context_size  # size of the context window
        # The input for this layer is the concatenation of all relevant tokens (i.e. the token
        # itself + context_size tokens to the left + context_size tokens to the right)
        self.linear = torch.nn.Linear(embedding_dim * (1 + 2 * context_size), num_tags)
1. Linear map from context window to number of tags
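A possible forward pass for the FixedContextWordTagger above (a sketch, not from the lecture); it assumes index 0 is used as padding for positions outside the sentence:
    def forward(self, tokens, position):
        # collect the token itself plus context_size tokens to the left and right
        window_embeddings = []
        for offset in range(-self.context_size, self.context_size + 1):
            index = position + offset
            token = tokens[index] if 0 <= index < len(tokens) else 0  # assumed PAD index
            window_embeddings.append(self.embedding(torch.tensor([token])))
        # concatenate embeddings -> size embedding_dim * (1 + 2 * context_size)
        features = torch.cat(window_embeddings, dim=-1)
        return torch.nn.functional.log_softmax(self.linear(features), dim=-1)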
Limitations - Fixed Window POS Tagger
Classification may lack important context
Example:
“mean” is in most cases a VERB
but not in “a lean mean fighting machine”
With context size = 1 the most informative surrounding words are not in the window
Syntax
the study of how words and morphemes combine to form larger units such as phrases and sentences
Syntax is fundamentally language specific.
Syntax vs Semantics
Syntax: grammatically well formed
Semantics: the content makes sense
Ex: “Colorless green ideas sleep furiously”
Word Semantics
Meaning of words.
“great” and “good” are both positive terms
“movie” and “film” are synonyms
“dialogue” and “characterisation” are aspects of movies
Word Semantic Approaches
Lexical - meanings of individual words and their relationships (Synonyms, Antonyms)
Distributional - representing the meaning of words based on the contexts in which they appear
WordNet
Lexical database
Words manually grouped into sets of cognitive synonyms
WordNet Limitations conceptual + practical
Conceptual
Manually constructed
Discrete senses (what exactly is a sense? what is a synonym?)
limited relationships modeled (lunch -> restaurant)
Do perfect synonyms exist? (H2O - water)
Practical
How to connect a word in a sentence to structured knowledge in WordNet?
Ex: “the movie was good”, which good?
Similarity in WordNet
WordNet is hierarchical (tree structure)
Similarity: number of steps on shortest path between two words.
WordNet Hypernym
A hypernym is a categorical description that groups words.
Examples
Car and motorcycle are both “motor vehicle”
tie is a type of “neckwear”
WordNet - Synsets
Group lexemes that are quasi-synonyms
Ex: “tie” -> “necktie”
Lexemes
Cognitive synonym
Ex: “tie” and “ties”
Lemma: “tie”
Lexemes of “tie” (different senses):
neckwear (“necktie”)
equal score (a draw)
relationship (a bond)
Distributional Semantics + Motivation
representing the meaning of words based on the contexts in which they appear
Analyse a lot of text to derive latent representations of words
Motivation
Manual Specification is too expensive -> automatic
Discrete representation is problematic -> latent representation
Distributional Hypothesis
difference of meaning correlates with difference of distribution
Distributional Semantics Approaches
Count-based - Characterise a word through co-occurrence with other words (co-occurrence matrix / vectors)
Prediction-based - Predict co-occurring words in a context (CBOW)
Co-Occurrence Matrix
Capture frequency with which pairs of words appear together
rows / columns represent words
cell value (i,j) is number of times word i appears in the context of word j
Has a context window
Co-Occurrence Matrix Process
Build Vocabulary (length n)
Create nxn matrix
count how often two words co-occur in the corpus within a fixed-size word window of size k
Co-Occurrence Limitations
Frequent words
Distributions dominated by statistically insignificant co-occurrences (e.g. “the” is everywhere) -> solved by PMI
Huge vocabulary (easily >100k words) -> 100k-dim vectors -> need for compression (SVD)
PMI Matrix
Pointwise Mutual Information Matrix
statistical significance of two words appearing together
pmi(x; y) = log( p(x, y) / (p(x) p(y)) )
p(x), p(y): independent probabilities of occurrence
p(x, y): co-occurrence probability
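A toy PMI computation under the formula above, with made-up counts and simple relative-frequency estimates:
import math

total = 10000        # total number of counted (word, context) pairs
count_x = 200        # occurrences of word x
count_y = 300        # occurrences of word y
count_xy = 50        # co-occurrences of x and y within the window

p_x, p_y, p_xy = count_x / total, count_y / total, count_xy / total
pmi = math.log(p_xy / (p_x * p_y))
print(pmi)           # > 0: x and y co-occur more often than expected by chance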
SVD
Singular Value Decomposition
Transformation of high-dimensional into low-dimensional space
Singular value: best axes to project on with minimal projection errors
SVD for Co-Occurrence
Sparsity Reduction: Converts a large matrix with many zeros to a dense representation
Noise Reduction
High-order co-occurrence:
First-order co-occurrence means that two words directly co-occur
Higher-order co-occurrence means that two words co-occur with similar words
latent Meaning
mapping captures the latent (hidden) meaning in the words and the contexts (embedding)
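A minimal sketch of compressing a co-occurrence matrix with SVD; the toy matrix and the truncation to k = 2 dimensions are chosen arbitrarily:
import torch

cooc = torch.tensor([[0., 2., 1., 0.],
                     [2., 0., 0., 3.],
                     [1., 0., 0., 1.],
                     [0., 3., 1., 0.]])   # toy co-occurrence counts (vocab size 4)

U, S, Vh = torch.linalg.svd(cooc)         # cooc = U @ diag(S) @ Vh
k = 2
embeddings = U[:, :k] * S[:k]             # keep top-k singular directions -> dense word vectors
print(embeddings.shape)                   # torch.Size([4, 2])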
Similarity of Words
Cosine Similarity
cos(A, B) = (A · B) / (|A| |B|)
A, B are embedding vectors
similar = close to 1
not similar = close to 0
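A small sketch of cosine similarity between two embedding vectors (toy values):
import torch

a = torch.tensor([1.0, 2.0, 3.0])   # embedding of word A (toy values)
b = torch.tensor([1.5, 1.8, 3.2])   # embedding of word B (toy values)

cosine = torch.dot(a, b) / (a.norm() * b.norm())
# equivalent: torch.nn.functional.cosine_similarity(a, b, dim=0)
print(cosine.item())                # close to 1 -> similar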
Word Embedding + Properties
Numerical representation of a word in a lower-dimensional space
similar words tend to be closer
certain algebraic operations can produce meaningful results (king − man + woman ≈ queen)
Embedding Methods (name)
One Hot
Word2Vec
BERT
GPT
Embedding Evaluation
Intrinsic: Measure whether certain properties exist in embeddings
Word Similarity correlate to human judgement
Analogies
Extrinsic: See if the embedding improves performance on a downstream NLP task
Intrinsic Evaluation Pro / Cons
Pros: fast to test, checks specifically for certain properties
Cons: no clear correlation to usefulness in downstream tasks
Extrinsic Evaluation Pro / Cons
Pros: evaluates usefulness for actual task
Cons: slower to test, harder to isolate embeddings from rest of system
Naive Embedding
Each word through the same embedding layer individually
Task specific Embeddings
Any multilayer NN with embedding layer will learn dense representations of word semantics
Difference between Word2Vec and FastText
Word2Vec multipurpose representation
FastText trained for sentiment will learn embeddings exactly for that task
Pre-trained vs Random initialisation
“from scratch” -> randomly initialised
pre-initialise -> pre-training task (Word2Vec) -> Transfer Learning
Idea: Words that occur together tend to have similar meanings
NN with one hidden layer
One hot encoding as input and output
Pre-Training for NLP Prerequisites
Desiderata
task should be exceedingly difficult -> require general language understanding
Near endless amount of training data
Ex: Language Modelling
Skip-Gram + In/Out
Technique to learn word embeddings
Developed with Word2Vec
Basic Model
Input: A word
Output: a co-occurring word
Training: Fixed Sliding window over text
Skip-Gram Architecture
Two linear layers
First layer (embedding layer) takes one-hot and projects to smaller
Second layer takes the smaller vector (word embedding) and predicts a co-occurring word
Projection to vocab size + softmax
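A minimal PyTorch sketch of this two-layer skip-gram model; layer sizes are placeholders and training tricks such as negative sampling are left out:
import torch

class SkipGram(torch.nn.Module):
    def __init__(self, vocab_size: int, embedding_dim: int):
        super().__init__()
        # first layer: projects the (one-hot) centre word index to a small embedding
        self.embedding = torch.nn.Embedding(vocab_size, embedding_dim)
        # second layer: predicts a co-occurring word over the full vocabulary
        self.output = torch.nn.Linear(embedding_dim, vocab_size)

    def forward(self, center_word_index):
        embedding = self.embedding(center_word_index)
        return torch.nn.functional.log_softmax(self.output(embedding), dim=-1)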
Named Entity Recognition
Identifying and Classifying named entities of predefined categories such as people, organisations….
Goal: extraction of structured information
Skip-Gram Embeddings
Throw away the second layer and only use the first layer's output as embeddings
Naive NER Tagging Problems
Entity boundaries -> Encode entity boundaries within the tag (BIO-2)
First token of an entity starts with B, all others with I
NER Tagging schemes
BIO-2
BIOES
BIOES Tagging
- B: Beginning of an entity class
- I: Inside an entity class
- E: End of an entity class
- S: Single-token entity
- O: Out (no entity)
NER Evaluation
Benchmark Dataset
Metrics
Precision, Recall, F1-Score
Confusion Matrix in NER
TP: Correct Tag predicted
FP: Wrong Tag predicted
FN: O predicted, even though there is an entity
TN: O token correctly predicted (easy, most tokens are O)
Entity vs Entity Mention
Entity: Unique object which is referenced by one or multiple entity mentions
Mention examples
Apple Inc. is the world's largest…
The new Apple iPhone X…
-> All mention the entity Apple Inc.
FLERT
SOTA NER Tagger
Finetunes transformer on document level context
NLI
Natural Language Inference
label pairs of sentences as entailing/contradictory/neutral
Classification Task
NLI Approaches
Dual Encoder Architecture
Two encoders for the premise and hypothesis -> concat outputs -> softmax classifier
Limitation: both are encoded separately
Encode as a single string
Special separator token in the middle
same encoder for both
Softmax classifier on the output
GLUE
General Language Understanding Evaluation
9 different NLP tasks, several of them NLI
Text Summarisation + Types
Task: Summarise one (or multiple) texts in a short paragraph
Extractive:
Identify key passages in text and use those in summary
Can be implemented with Sequence Labeling
Abstractive
Summarise in new Words
Can be implemented with Seq2Seq
Seq2Seq + Applications
family of ML approaches for NLP
input: Text
Output: text
Machine Translation
Text Summarisation
Text/Code Generation
Seq2Seq Evaluation
BLEU Score
Compares the machine-written translation ŷ to one or multiple gold translation(s)
n-gram precision
separately compute how many n-grams (usually 1-4 grams) of ŷ are found in each gold translation
Choose the gold translation with the highest overlap as the reference
Percentage of n-grams found in reference
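A sketch of the n-gram precision component (with clipped counts; full BLEU additionally combines several n-gram orders and a brevity penalty):
from collections import Counter

def ngram_precision(prediction, reference, n):
    # fraction of n-grams in the prediction that are also found in the reference
    pred = Counter(tuple(prediction[i:i + n]) for i in range(len(prediction) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    overlap = sum(min(count, ref[ng]) for ng, count in pred.items())
    return overlap / max(sum(pred.values()), 1)

print(ngram_precision("the cat sat on the mat".split(),
                      "the cat is on the mat".split(), n=2))   # 3 of 5 bigrams -> 0.6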
RNN for Seq2Seq
Idea: use two different RNNs with separate vocabularies
encoder: only learns to predict a final encoder hidden state
decoder: trained with a language modeling objective, conditioned on final encoder state
Encoder produces hidden state as input for decoder
Decode step by step and use previous prediction as input into next step
BLEU Limitations
many ways to validly translate
a good translation can get a low BLEU score because of low n-gram overlap
Multi-Layer RNN for Seq2Seq
Several RNN Layers
each layer is independent RNN cell
Idea: lower RNNs compute lower-level features, and higher RNNs compute higher level features
Seq2Seq Decoding Approaches
Greedy: Step by step with no way to undo decisions
Exhaustive search decoding: Compute all possible sequences and pick the best one
Beam Search Decoding: Keep track of the k most probable partial translations until a stopping criterion is reached.
Beam Search Decoding
Core idea: On each step of Decoder, keep track of the k most probable partial translations (which we call hypotheses)
k is the beam size (in practice around 5 to 10)
Each hypothesis has score which is log probability
Stopping:
When a hypothesis produces the <STOP> token, put it aside and continue searching the other hypotheses
Continue until timestep T is reached or at least n hypotheses are completed
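A compact sketch of beam search over log-probabilities; step_log_probs is a hypothetical function returning the decoder's next-token log-probabilities for a given prefix:
def beam_search(step_log_probs, start_token, stop_token, beam_size=5, max_len=50):
    beams = [([start_token], 0.0)]          # (partial hypothesis, cumulative log probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, lp in step_log_probs(tokens).items():
                candidates.append((tokens + [token], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates:
            if tokens[-1] == stop_token:
                finished.append((tokens, score))   # put completed hypotheses aside
            elif len(beams) < beam_size:
                beams.append((tokens, score))      # keep the k most probable partial hypotheses
        if not beams or len(finished) >= beam_size:
            break
    return max(finished + beams, key=lambda c: c[1])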
Teacher Forcing
Works by using the actual or expected output from the training dataset at the current time step as input in the next time step, rather than the output generated by the network.
Ratio: teacher forcing is randomly deactivated for a percentage of decoder steps, in which the model's own predictions instead of the gold input are used
Information Bottleneck Problem
Can occur when Encoding of the source sentence is a single vector representation (last hidden state of RNN)
Entire semantics of arbitrary long text must be compressed into single hidden state
RNN cell might struggle (limited capacity in the hidden state)
Difficult learning problem
Seq2Seq Training
Loss computed for the task “next symbol prediction” on the target language sentence
Backpropagation through entire network
RNN Limitations for Language Modelling
RNNs pass hidden state from step to step (one direction only)
Often we want to model word in full context
A bidirectional RNN is also just two RNNs glued together… no deep bidirectionality
Motivation Attention for Seq2Seq
Attention provides Solution for information bottleneck problem
Idea: do not only decode from last hidden state of encoder
Instead: additionally use direct connection to the encoder to focus on a particular part (i.e. words) of the source sequence
Attention in Seq2seq
In Seq2Seq + Attention, information flows through two “channels” from encoder to decoder:
Recurrence: Information is passed from item to item in the sequence, first through the encoder, then the decoder
Attention: Information is passed more directly via the weighted sum / attention output
Attention Mechanism
NN component that helps to focus on relevant part of input data while making predictions
Instead of relying on a fixed-length representation, the model dynamically weights and combines relevant information from different positions in the input sequence
Attention Approach
Given a sequence, the model calculates attention scores -> represent the importance of each position
Attention scores are normalised -> creates a distribution over the input sequence
The model combines the input sequence elements (embeddings) with the corresponding attention weights
Self-Attention
Attention from one state to all states in same set
Computes contextualised representation of word given all other words in sentence
Attention Score Approaches
Basic dot-product attention
Multiplicative attention
additive attention
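The commonly given formulas (for decoder state s and encoder states h_i; W, W_1, W_2 and v are learnable parameters):
Basic dot-product: e_i = s^T h_i
Multiplicative (Luong): e_i = s^T W h_i
Additive (Bahdanau): e_i = v^T tanh(W_1 h_i + W_2 s)
Attention weights: alpha = softmax(e)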
Benefits of Attention
Improves MT
Solves the bottleneck problem
Helps with Vanishing Gradient
By inspecting attention distribution we can see what the decoder was focusing on.
Vanishing Gradients + Solutions
Occur during training of NNs, particular RNNs
when the gradients of the loss function become very small during backpropagation.
leads to very slow / effectively halting learning
arises from the repeated multiplication of gradients that are less than one, leading to exponential decay as they propagate backward through the network
Solutions
Activation functions that maintain gradient magnitude (ReLU)
Other Architectures: LSTMs, Transformers…
Attention Components
Query - transformed representation of an input element, usually input * weight matrix
Key - provides information about the element that can be compared or matched against (also input * weight matrix)
Value - information or content associated with the input element
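A minimal single-head scaled dot-product self-attention sketch built from these three components (the projection matrices are passed in as plain tensors for clarity):
import torch

def self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model); w_q / w_k / w_v: (d_model, d_k) projection matrices
    q, k, v = x @ w_q, x @ w_k, x @ w_v                 # queries, keys, values
    scores = q @ k.T / (k.shape[-1] ** 0.5)             # scaled dot-product attention scores
    weights = torch.softmax(scores, dim=-1)             # distribution over positions per query
    return weights @ v                                  # weighted sum of values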
Self-Attention vs RNN
Advantages
maximum interaction distance O(1)
Deeply bidirectional
All word representations can be computed in parallel
Disadvantages
Sometimes we don't want bidirectionality
Attention just does weighted averaging (no nonlinearities)
Word order is no longer encoded (it's BoW)
Problems of Vanilla Self-Attention
Sometimes want no bidirectionality -> Solution: Mask future states
No non-linearities -> Add small feed-forward network between layers (linear followed by ReLU)
No positional information -> Add position representation to the inputs
Self-Attention Positional Encodings
Idea: represent each position as positional vector
Add positional vector to Value, Key and Query
Self-Attention Block (Parts)
Self Attention
Positional Encodings
Feed Forward
Residual Connection
Layer Normalisation
Residual Connections
Additional direct connections in the network that enable the gradient to bypass attention blocks
- thought to make the loss landscape smoother
Layer Normalisation
Cut down on uninformative variation in hidden vector values by normalising to unit mean and standard deviation within each layer
Transformer Idea
Only use attention in encoder + decoder but no RNN
Transformer Architecture (components)
Encoder
Decoder
Masked Attention
Cross-Attention
Linear Layer + Softmax
Softmax
Turns a vector of K real values (our model predictions) into K probabilities that sum to 1: softmax(z)_i = exp(z_i) / Σ_j exp(z_j)
Generalisation of the sigmoid (S-shaped curve) to multiple classes
Self-Attention Learnable Positional Encoding +Pro/Cons
most used option
all p_i (positional vectors) are learnable parameters
one-hot encode each position -> learn an embedding for each position
Pros
each position gets to be learned to fit data
Cons
Can't generalise to indices outside of the defined range
Hard maximum sequence length
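A sketch of learnable positional encodings; the hard maximum sequence length is the max_len of the embedding table:
import torch

class LearnedPositionalEncoding(torch.nn.Module):
    def __init__(self, max_len: int, embedding_dim: int):
        super().__init__()
        # one learnable vector p_i per position index
        self.position_embedding = torch.nn.Embedding(max_len, embedding_dim)

    def forward(self, token_embeddings):
        # token_embeddings: (seq_len, embedding_dim); fails for seq_len > max_len
        positions = torch.arange(token_embeddings.shape[0])
        return token_embeddings + self.position_embedding(positions)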
RNN
Recurrent Neural Network
Family of NNs
Allow for information to flow in cycles
Multi-Head Attention
Idea: multiple attention heads per layer
Each attention head might focus on some other “aspect” and construct value vectors differently
Just create n independent attention mechanisms and combine output (concat + linear)
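For illustration, PyTorch's built-in module implements roughly this concat + linear combination of n heads; the sizes below are arbitrary:
import torch

mha = torch.nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
x = torch.randn(2, 10, 64)          # (batch, seq_len, embed_dim), arbitrary sizes
out, attn_weights = mha(x, x, x)    # self-attention: query = key = value = x
print(out.shape)                    # torch.Size([2, 10, 64])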
RNN intuition
RNNs basically read word by word (transformer see everything at once)
But: RNNs can't go back, only forward
Recurrence in RNNs
Important to remember prior information
Output: hidden state at time t
input 1: representation of input x at time t
input 2: previous hidden state
RNN Strength
model sequences of variable length
RNN Hyperparameters
Input sequence length
Neurons, Layers
Activation Function
Learning Rate
Class threshold
RNN Limits + reasons
Struggle to capture long term dependencies
vanishing gradients
exploding gradients
RNN vanishing gradient solutions
truncated BPTT
weight regularisation
Gradient clipping
Other activation functions (ReLU)
Gated Networks (LSTM)
RNN for Sequence Labeling
Each hidden state through Linear+Softmax to classify POS Tag for example
RNN for Text Classification
Either: final hidden state as input to a softmax classifier
Or: mean of all hidden states as input to a softmax classifier
Exploding Gradient
Happens when large error gradients accumulate during backprop -> excessively large updates/steps
arises from repeated multiplication of gradients through many layers -> exponential growth
Activation functions
Architecture
Seq2Seq task
translate one sentence to another language
MT Challenges
Source sentence needs to be fully understood (less forgiving than sentiment analysis)
Semantic differences (concepts that don't exist in every language)
Underspecification - Languages differ in what can be semantically underspecified
Ambiguities - one word, multiple translations (tie)
Multiple Languages MT
earlier: one model per translation direction (language pair)
today: one model for multiple languages
A special token specifies which language to translate to
Seq2Seq Model Translation Init
def __init__(self, source_dictionary, target_dictionary,
             source_RNN_hidden_size, target_RNN_hidden_size, embedding_size):
    self.source_embeddings = Embedding(source_dictionary.size(), embedding_size)
    self.source_rnn = LSTM(embedding_size, source_RNN_hidden_size)      # encoder RNN
    self.target_embeddings = Embedding(target_dictionary.size(), embedding_size)
    self.target_rnn = LSTM(embedding_size, target_RNN_hidden_size)      # decoder RNN
    self.prediction_head = Linear(target_RNN_hidden_size, target_dictionary.size())
Language Modeling + Architecture Types
Language modeling is the task of predicting what word comes next.
Models
Statistical Models (n-grams)
NN-based (RNN, Transformer)
Language Modeling Applications
MT
Text Gen
Speech Recogn.
Language Model (general Def)
System that assigns probability to a piece of text.
e.g. Given some sequence of words, what are the probabilities for all other possible next tokens
LM - Probability of Text + Example Applications
LM assigns probabilities to sequences of words
effectively modeling the likelihood of textual statements
OCR: what token is most likely to be at a certain point
MT: Knowledge in the model helps translation
AutoComplete: Complete search query
LM Special Tokens
Mark e.g. end or beginning of doc / sentence
No word before, but some words more likely in beginning of doc
Introduce reserved symbols <START> / <STOP>
Atomic Units of Language + tradeoff
Word
Subtoken
Character
Tradeoff: Vocabulary size vs sequence length
Language Modelling Training data
Raw text enough.
Just predict the next token, given a sequence of tokens
essentially sliding window over raw text
Subtoken (idea)
Instead of modeling language as distribution over words, model as distribution over subtokens
Subtoken Level Advantages
Learns meaning of prefixes / endings (-ed, -ing)
Learns meaning of spelling variations (cool, cooool)
An unknown word might contain parts that allow for partial inference of its meaning (“guesstimate”)
Create Subtokens
No Tokenizer
Done by using heuristics
Tokenisation
Splitting a text into tokens
Input: text
output: list of tokens (each representing a word)
Tokenisation Challenges
Correctly tokenise apostrophes
Clitics: syntactically independent but phonologically dependent words
Compounds
Clitics
syntactically independent but phonologically dependent words
ex: gimme -> give me
Context: Tokenisation
Compounds + Ex
words formed by combining two or more words
open ex: “high school”, “must-have”
closed ex: “Abwasserbehandlungsanlage”
Multi-word units + Example
multiple words that function as a single syntactic unit
San Francisco
by and large
-> are split by tokenizer
Vocabulary
all unique words in corpus
rare words replaced by UNK
(optionally) add special tokens like <START> and <STOP> for the beginning and end of documents
This vocabulary is used both to encode the inputs and the outputs
Character Level Models
Predict next character
Text Generation
Use generated words as input for next time step (opposite of teacher forcing)
This way, you sample one word at a time to generate text
Sample from a multinomial distribution at each step (i.e. generation is non-deterministic)
N-Grams + Types
contiguous sequences of n items from a given sample of text or speech
Types
Unigrams
Bigrams
Trigrams
Higher-order n-grams
Word2Vec Idea
Words that are used/occur in same contexts tend to purport similar meanings
Neural Network with one hidden layer; [[One-Hot Encoding]] input and output
Distributional Representation
Word2Vec Architecture
One hidden layer NN
two sets of weights
Input / Output one hot
Text Classification Evaluation Metrics
Accuracy
Precision
Recall
F1
Accuracy
Fraction of predictions that are correct
Accuracy = (#correct preds/|Dataset|)
Precision
ratio of true positive predictions to the total number of positive predictions
High: few FP errors -> important when cost of FP is high
Precision = (TP/(TP+FP))
Recall
proportion of positive examples that were correctly classified
important when the cost of false negatives is high.
Recall = (TP/(TP+FN))
F1-Score
Balances Precision and Recall, single score that reflects both
reflects both the accuracy of positive predictions and the ability to identify all positive instances.
Useful for imbalanced datasets
Higher F1: Overall better Perf.
F1= (2 * Prec * Rec) / (Prec + Rec)
CE Loss
Measures the performance of a classification model whose output is a probability value between 0 and 1. Lower log loss indicates better performance.
POS Tagger Evaluation Metrics
Accuracy (each token receives exactly one tag, so per-token accuracy is the standard metric)
Language Modelling Evaluation Metrics
Perplexity
BLEU
Perplexity + Intuition
It measures how well a probabilistic model predicts a sample of text.
Perplexity is a measure of uncertainty. A lower perplexity indicates that the model is less "perplexed" by the text and has a better understanding of the language patterns.
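A common formulation for a text w_1 … w_N:
PPL(w_1, …, w_N) = exp( -(1/N) * Σ_i log p(w_i | w_1, …, w_{i-1}) )
(the exponentiated average negative log-likelihood per token; lower is better)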
Relation Extraction Evaluation Metric
Classification Metrics
Relation Extraction
Goal: identify semantic relationships between entity mentions
- relations can be directed
- there may be no relation (as defined by the application)
RE Approaches
Rule-based: Manually create extraction rules (e.g. tree-matching rules)
ML-based: For each pair of entities in a sentence, classify which relation (if any) holds
Rule based RE Limitations
Rulesets quickly become complex / unmanageable
Rules easily overfit data
ML Based RE Process
Get all possible entity pairs
Create new sentence for each pair
<X, Y> - ‘X’ was founded by ‘Y’ in Z
<X,Z> - ‘X’ was founded by Y in ‘Z’
Classify generated sentences
(founder, located-in, no-relation…)
ArgMax + use in ML
The argmax function returns the argument or arguments (arg) for the target function that returns the maximum (max) value from the target function.
In ML: most commonly used for finding the class with the largest predicted probability.
Softmax and argmax relation
Softmax: Normalises output values (logits) to add up to 1
Argmax: returns highest value of a set of numbers
Relation: Sequential use: First Softmax to normalise, then Argmax to pick the highest one
During training Argmax not used (but loss function),
during inference, argmax
Interpretability
Softmax: Probabilistic interpretation
Argmax: Clear decision
Pooling Layer + Types
reduce the spatial dimensions of the input feature maps, thereby decreasing the number of parameters
Mean
Max
Min
Vector Concatenation vs Pooling
Concatenation - length is the sum of all concatenated vectors
Pro: Preserves word order
Neg: Fixed-size context only
Pooling - length is the same as each vector
Pro: Arbitrary number of vectors
Neg: Doesn't preserve word order
Concatenation vs Pooling for POS Tagging and Classification - What and why?
Text Classification - Pooling
Why: usually benefits from global scope.
POS Tagging - Concatenation
Why: specific context for each word important including word order (sequential information).
BoW Classifier
Count Vectorise Text
1 NN Layer
Activation
Why use UNK
Unknown words
Reduce sparsity (long-tail distribution)
Not using unk, most common downsides
No way to handle out of vocabulary words
Model can overfit on rare words / training not effective
BoW Classifier Conceptual Limits
Can't handle words that are not in the training set
Potentially infinite set of words
Creating labels for training is expensive
Word2Vec vs WordNet - Scalability
Word2Vec more scalable as it is trained on raw text
Lexical representation - Scaling
Very hard as created manually
ex: WordNet
Distributional Representation - Scaling
Needs near-infinite data
a lot of computational resources
WordNet vs Word2Vec - Ambiguity
WordNet: Can model the fact that words have multiple meanings through lemmas and lexemes
Word2Vec - cannot model the fact that words have multiple meanings
WordNet vs Word2Vec - Word Similarity
WordNet: Walks along tree (counting steps)
Word2Vec: Distance in high-dimensional space, either Euclidean or (most commonly) cosine distance
Word Similarity with Embeddings
Can view embedding as point in space
use distance metric as measure of similarity
Euclidean
Cosine
Why does FastText use N-Grams + Ex
It's BoW, cannot distinguish between
Not bad at all - its good
Not good at all - its bad.
FastText Init Pseudocode
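A minimal sketch following the architecture described earlier (embed each word/n-gram, mean-pool, linear map, softmax); names and sizes are placeholders:
import torch
from typing import List

class FastTextClassifier(torch.nn.Module):
    def __init__(self, vocab_size: int, num_classes: int, embedding_dim: int):
        super().__init__()
        self.embedding = torch.nn.Embedding(vocab_size, embedding_dim)   # first layer
        self.linear = torch.nn.Linear(embedding_dim, num_classes)        # second layer

    def forward(self, tokens: List[int]):
        embeddings = self.embedding(torch.tensor(tokens))   # embed each word / n-gram
        pooled = embeddings.mean(dim=0)                      # mean pooling over all embeddings
        return torch.nn.functional.log_softmax(self.linear(pooled), dim=-1)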
NLP Pipeline Steps
Text Splitting
Tokenisation
POS-Tagging
Lemmatization
Dependency Parsing
Sentence parsing + Approaches
Segmenting raw text to sentences
Approaches
Rule-based (punctuation etc.)
Prediction based (binary classifier for punctuation symbol)
Lemmatisation + Ex
Lemmatisation determines the dictionary form of a token
mice -> mouse
cars -> car
found -> find
Morphological Features + Examples
Consists of three levels of representation (ex: “a solution was found”)
Lemma (dictionary form)
find
POS-Tag
VERB
set of features (lexical / grammatical Properties)
past tense, passive voice
Lexical and Grammatical properties + Ex
grammatical number (“mice” is the plural of “mouse”)
gram. gender (“Pilotin” is the feminine of “Pilot”)
gram. voice (“was found” is the passive of “find”)
gram. tense (“searched” is the past tense of “search”)
Combine Encodings of Word and POS Tag
One hot +Embed word
One hot + Embed POS Tag
Concatenate both -> use as input
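A small sketch of this combination; the dimensions are placeholders:
import torch

class WordPosEncoder(torch.nn.Module):
    def __init__(self, vocab_size: int, num_tags: int, word_dim: int, tag_dim: int):
        super().__init__()
        self.word_embedding = torch.nn.Embedding(vocab_size, word_dim)   # one-hot word -> embedding
        self.tag_embedding = torch.nn.Embedding(num_tags, tag_dim)       # one-hot POS tag -> embedding

    def forward(self, word_index: int, tag_index: int):
        word_vec = self.word_embedding(torch.tensor([word_index]))
        tag_vec = self.tag_embedding(torch.tensor([tag_index]))
        return torch.cat([word_vec, tag_vec], dim=-1)   # concatenated input of size word_dim + tag_dim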
Vauquois Triangle
Concept how much analysis to perform on source text
Levels
Direct: Text to text
Transfer Method: Add some syntax
Interlingua: Fully analyze all semantics of source sentence
Vanilla RNN Mathematical Def.
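A common formulation (W matrices and biases b are learnable, tanh as activation):
h_t = tanh(W_hh * h_{t-1} + W_xh * x_t + b_h)
y_t = W_hy * h_t + b_y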
What does LSTM solve?
introduce a memory cell that can selectively remember or forget information over time.
Makes it easier to preserve information over steps -> Solves vanishing gradients
LSTM Activation function for gating + Why
Forgetting: Sigmoid
Produces a vector of values between 0 and 1
Updating
Tanh - over current input + last hidden state (what to write)
Sigmoid - over current input + last hidden state (how much)
Read
Either use the last state in a softmax classifier
or pool over all hidden states and then use a softmax classifier
RNN vs Self-Attention conceptual disadvantages
Information Bottleneck
No parallelisation (scaling issues)
No bidirectionality
Text generation Strategies
Sampling
Greedy Decoding
Beam Search
Exhaustive Search
Temperature parameter
For sampling text generation strategy
Controls randomness in sampling
Higher Temp: model explores more randomly (can lead to more “creative” generations)
Lower Temp: model exploits its own knowledge more, chooses the most likely words more consistently
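A minimal sampling sketch showing where the temperature enters (toy logits):
import torch

def sample_next_token(logits, temperature=1.0):
    # higher temperature -> flatter distribution -> more random choices
    # lower temperature  -> sharper distribution -> most likely tokens chosen more consistently
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

logits = torch.tensor([2.0, 1.0, 0.1])   # toy model predictions over 3 tokens
print(sample_next_token(logits, temperature=0.7))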
Beam Search Stopping Criteria
until timestep T reached
or n completed hypotheses
Beam Search Decoding Parameters
k - beam size (how many partial translation to keep track of)
t / n - cutoff, either timestep T or number of finished hypotheses
Cross-Attention Benefit
helps in understanding the context between different sequences. For example, in translation, it allows the model to align the source and target sentences effectively.
Cross attention - keys, values, queries
cross-attention uses the encoder states as keys and values, and the decoder states as queries.
Learnable Layers in attention head
Query Projection
Key Projection
Value Projection
Output Projection
Masked attention
restricting a transformer to only see a certain part of an input sequence
Masked Attention usecases
When transformer should only have access to past and present context, not future
a mask is applied to the attention scores to prevent the model from attending to certain positions in the sequence
Language Modelling
Sequence Generation
MLM vs CLM
Context
MLM - Bidirectional
CLM - unidirectional
Training Tokens
MLM- random masked token
CLM - next token
MLM - Bidirectional Transformer (BERT)
CLM - Autoregressive trans. (GPT)
Application
MLM - Not in sequence generation
CLM - Sequence generation
MLM vs CLM for Sentence level semantics
MLM
Bidirectional Context
Global Context (leverages entire sentence to predict token)
NSP objective
objective used during the pre-training of models like BERT (MLM)
It helps the model understand the relationship between pairs of sentences
During training, the model is given pairs of sentences.
The task is to predict whether the second sentence in the pair is the actual next sentence in the original document (labeled as "IsNext") or a random sentence from the corpus (labeled as "NotNext").
Positive Example (IsNext):
Sentence A: "The cat sat on the mat."
Sentence B: "It was looking at the sun outside the window."
Label: IsNext (1)
Transformer Word Embedding Advantages
Description - Word is described using its context
Disambiguation - can capture different meanings of a word based on its context (word2vec: 1 word = 1 embedding)
Rich Semantic Information - incorporate information from the entire sentence
Ability to fine tune - Word2Vec embeddings are static
Word2Vec vs Transformer Embeddings
Static vs Dynamic
Word2Vec has one representation for one word regardless of context
Finetuning
Word2Vec is trained on predicting co-occurring words; no way to contextually finetune
Information Extraction Pipeline steps
RE
EL
Entity Linking
Link entity mentions to entities from a knowledge base
Entity Linking Challenges
Ambiguity
Chicago has 87 entities on wikipedia
Synonyms
Name Variations, Nicknames, Spelling variations
Novel entities
Cold start problem
IE Motivation
Info often not available in structured form
Info changes quickly (manual creation unfeasible)
Inter-Annotator Agreement
Degree to which independent annotators assign the same label to the same data point
Crowdsourcing Setup
Qualification stage - small set with known answers
Inject samples with known answers -> reject annotators if they get these wrong
Redundancy -> 5 annotations per data point
Majority vote
Statistical model
Experience counts
Weak Supervision. + Types
Multiple different weak labeling functions
Different knowledge bases
Models trained on related domains
Heuristic functions
Distant Supervision + Process
Idea: existing knowledge bases give us a large list of examples for entities and relations
Process
Look up the string “Apple” in the KB
Derive NER Type from KB entity
Attach as label to span
Weak Supervision Aggregation Ways
expectation Maximisation
LLMs as Teacher + Problem
Concept of generating labeled training samples through an LLM
Problem:
Quality of data is prompt dependent
Lack of creativity
Accuracy of data
Improve Annotation Quality
Train annotators
2 independent annotations per data sample
Calculate inter-annotator agreement
Potentially exclude annotators
Label Noise Sources
Human error - main source, just labeled incorrectly
Poor data - Data of poor quality (blurry image)
Human label variation - Multiple possible answers / ambiguity
Semi-automatic annotation - When the model picks up a wrong signal (emoji)
Noise Robust Learning
Accept that some part of the training data will be incorrectly labeled
Design machine learning approach to handle
Noise Robust Learning Approaches
Delay Memorisation
Delay overfitting or stop training before it begins
Clean Subset
Filter suspicious data points from datasets
Additional Clean Data
Small set of additionally expert labeled data
All approaches struggle on real noise, filtering noise seems to be most promising
Problem with Noise Simulation
Simulated noise is used to test noise-robust learning
But simulated noise is not realistic.
NoiseBench
Evaluate 6 types of noise
NoiseBench Noise Splits
Expert - Original CoNLL-03
Crowd - cheap + non-expert
Auto - weak + distant supervision
LLM