Name the analysis levels for data cleansing
Phonetics and Phonology -> Components of the word and their pronunciation
Segmentation -> Splitting the text into words -> Tokenize
Morphology -> Segment
Syntax
Semantics
Pragmatics and Discourse
Name examples for tokenization ambiguities
Period: Final sentence punctuation or part of a number or abbreviation
Whitespace character: token delimiter or part of a number or name (New York)
Comma: delimiter or part of a number
Single quotes: mark contractions and elisions (Don’t) or enclose quoted groups of words
Dash: delimiter or part of the token (pages 100-101, multi-word)
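The ambiguities above can be illustrated with a small regex tokenizer. This is a minimal sketch, not a complete tokenizer: the name `tokenize` and the pattern are assumptions made for illustration, and names containing whitespace (New York) are deliberately not handled because they need a lexicon.

```python
import re

# Alternatives are tried left to right, so the specific patterns come first.
# The pattern is an illustrative assumption, not a standard.
TOKEN_PATTERN = re.compile(
    r"""
    \d+(?:[.,]\d+)*(?:-\d+)?       # numbers with inner . or , and ranges like 100-101
  | (?:[A-Za-z]\.){2,}             # abbreviations with inner periods like U.S.
  | [A-Za-z]+(?:['’][A-Za-z]+)?    # words, optionally with one inner apostrophe
  | \S                             # any other single non-space character
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return the list of tokens found in text."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("Don't pay 1,000.50 for pages 100-101 in the U.S. today."))
```

The period after "today" becomes its own token, while the periods inside "1,000.50" and "U.S." stay part of their tokens.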
What is the word stem?
Minimal free morphemes
for cats the stem is cat
-> Stems carry the main meaning
What is an affix?
Affixes are bound morphemes (s in cats)
Name 4 Types of Affixes
Suffixes -> After the base
Prefixes -> Before the base
Infixes -> Inside the base
Circumfixes -> On both sides of the base
Explain Stemming
Algorithmic approach for stripping off the endings of words.
e.g. sitting -> sitt
Objective: Transform words belonging to the same morphological family into the same stemmed representation
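A minimal sketch of suffix stripping, assuming a toy rule list. `SUFFIXES` and `stem` are hypothetical names for illustration. Real stemmers such as the Porter stemmer use many more rules and conditions.

```python
# Assumed toy rule list; longer suffixes are checked before shorter ones.
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ("sitting", "walks", "walked", "walking"):
    print(w, "->", stem(w))
```

Here "walks", "walked" and "walking" all map to "walk", while "sitting" becomes the non-word "sitt", as in the example above.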
Name 2 Stemming Errors
Under-stemming: Remove too little
Over-stemming: Remove too much
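Both errors can be reproduced with a toy suffix stripper and two assumed rule lists. The names `strip_suffix`, `CAUTIOUS` and `AGGRESSIVE` are hypothetical and only serve this sketch.

```python
def strip_suffix(word, suffixes):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


CAUTIOUS = ("s",)                             # assumed: removes too little
AGGRESSIVE = ("ity", "ing", "al", "e", "s")   # assumed: removes too much

# Under-stemming: related words end up with different stems.
print(strip_suffix("connects", CAUTIOUS), strip_suffix("connecting", CAUTIOUS))

# Over-stemming: words with different meanings end up with the same stem.
print(strip_suffix("universe", AGGRESSIVE), strip_suffix("university", AGGRESSIVE))
```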
Problem with stemming
Homographs / Syntactic ambiguity
-> Words which have the same spelling but different meanings
What is Lemmatization?
“Undo” the inflectional changes of a word to recover its base form (lemma)
e.g. cats (NOUN) -> cat
saw (VERB) -> see
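A minimal sketch of dictionary-based lemmatization, assuming a tiny hand-made lexicon. `IRREGULAR` and `lemmatize` are hypothetical names. The sketch shows why the POS matters: "saw" has a different lemma as a verb than as a noun.

```python
# Assumed toy lexicon of irregular forms, keyed by (word, POS).
IRREGULAR = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("mice", "NOUN"): "mouse",
}


def lemmatize(word, pos):
    """Look up irregular forms first, then apply one simple plural rule."""
    word = word.lower()
    if (word, pos) in IRREGULAR:
        return IRREGULAR[(word, pos)]
    if pos == "NOUN" and word.endswith("s") and len(word) > 3:
        return word[:-1]  # regular plural: cats -> cat
    return word


print(lemmatize("cats", "NOUN"))
print(lemmatize("saw", "VERB"))
print(lemmatize("saw", "NOUN"))
```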
Explain POS Tagging
Process of assigning a POS or lexical class marker to each word in a corpus.
Example: English has 8 POS: Noun, Verb, Adjective, Adverb, Preposition, Pronoun, Determiner, Interjection
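A minimal sketch of a lookup tagger, assuming a tiny hand-made lexicon. `LEXICON` and `pos_tag` are hypothetical names. Practical taggers also use the context of each word to resolve ambiguous cases, which this sketch does not attempt.

```python
# Assumed toy lexicon mapping lowercase words to a POS.
LEXICON = {
    "the": "Determiner",
    "dog": "Noun",
    "barks": "Verb",
    "loudly": "Adverb",
}


def pos_tag(tokens, default="Noun"):
    """Assign each token its lexicon tag, or the default for unknown words."""
    return [(token, LEXICON.get(token.lower(), default)) for token in tokens]


print(pos_tag(["The", "dog", "barks", "loudly"]))
```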
Explain Parsing
Process of determining the grammatical structure with respect to a given grammar
Represented as a parse tree
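A minimal sketch of a recursive-descent parser for an assumed toy grammar (S -> NP VP, NP -> Det N | N, VP -> V NP | V). The grammar, the tag names and `parse_sentence` are hypothetical. The input is a list of already tagged tokens, and the tree is returned as nested tuples.

```python
def parse_sentence(tagged):
    """Parse [(word, tag), ...] with the toy grammar; return a tree or None."""

    def parse_np(i):
        # NP -> Det N
        if i < len(tagged) and tagged[i][1] == "Det":
            if i + 1 < len(tagged) and tagged[i + 1][1] == "N":
                tree = ("NP", ("Det", tagged[i][0]), ("N", tagged[i + 1][0]))
                return tree, i + 2
            return None
        # NP -> N
        if i < len(tagged) and tagged[i][1] == "N":
            return ("NP", ("N", tagged[i][0])), i + 1
        return None

    def parse_vp(i):
        if i < len(tagged) and tagged[i][1] == "V":
            verb = ("V", tagged[i][0])
            obj = parse_np(i + 1)
            if obj is not None:          # VP -> V NP
                np, j = obj
                return ("VP", verb, np), j
            return ("VP", verb), i + 1   # VP -> V
        return None

    subject = parse_np(0)
    if subject is None:
        return None
    np, i = subject
    predicate = parse_vp(i)
    if predicate is None:
        return None
    vp, j = predicate
    # The sentence is accepted only if every token was consumed.
    return ("S", np, vp) if j == len(tagged) else None


print(parse_sentence([("John", "N"), ("likes", "V"), ("Mary", "N")]))
```

An input that the grammar cannot derive, such as a lone verb, yields `None`.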
Name the typical processing steps (3)
Tokenization for splitting texts into tokens
Stemming / Lemmatization to normalize tokens
PoS-Tagging and parsing analyze syntactic features
PoS-tags roughly represent word classes
Phrases group words to function as a single unit
What is tokenization?
Segmentation of an input string into an ordered sequence of units (tokens), including punctuation
e.g.
John likes Mary. -> {“John”, “likes”, “Mary”, “.”}
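The example can be reproduced with a one-line regex, as a minimal sketch: `simple_tokenize` is a hypothetical name, and the pattern treats every run of word characters as a token and every other non-space character as a token of its own.

```python
import re


def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens, in order."""
    return re.findall(r"\w+|[^\w\s]", text)


print(simple_tokenize("John likes Mary."))
```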