Name 3 approaches for text classification
Rule-Based – Uses handcrafted linguistic rules.
Supervised Machine Learning – Trains a model using labeled data.
Unsupervised Learning – Uses clustering or topic modeling without labels.
Pros and cons of rule-based text classification
+ Precision can be high; rules are human-readable
- Very expensive to build and maintain (see the sketch below)
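A minimal sketch of the rule-based approach; the keyword rules and category names below are made-up examples, not from any real system:

```python
# Rule-based text classification: handcrafted, human-readable keyword rules.
# RULES and its categories are hypothetical illustrations.
RULES = {
    "sports": ["match", "goal", "tournament"],
    "finance": ["stock", "market", "dividend"],
}

def classify(text: str) -> str:
    tokens = text.lower().split()
    for category, keywords in RULES.items():
        if any(kw in tokens for kw in keywords):
            return category
    return "unknown"  # no rule fired

print(classify("The stock market rallied today"))  # -> finance
```

Every new rule has to be written and debugged by hand, which is where the maintenance cost comes from.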
What are the pros and cons of supervised learning for text classification?
Pros: Typically more accurate and easier to maintain than rule-based systems.
Cons: Requires labeled training data (see the sketch below).
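A hedged sketch of the supervised route using scikit-learn; the toy texts and labels are invented for illustration:

```python
# Supervised text classification: learn a model from labeled examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["great goal in the match", "stock market fell sharply",
         "tournament final tonight", "dividends rose this quarter"]
labels = ["sports", "finance", "sports", "finance"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)                       # this step needs labeled data
print(model.predict(["stock market report"]))  # expected: ['finance']
```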
Formula: Bayes' rule
P(c | d) = P(d | c) · P(c) / P(d)
Formula: Naive Bayes classifier
ĉ = argmax_c P(c) · ∏_i P(w_i | c)   (assumes the words w_i are conditionally independent given the class c)
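A worked toy application of the Naive Bayes decision rule in log space; all probabilities below are made up:

```python
import math

# Naive Bayes by hand: pick argmax_c [ log P(c) + sum_i log P(w_i | c) ].
# The priors and likelihoods are hypothetical numbers.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.05, "meeting": 0.001},
    "ham":  {"free": 0.002, "meeting": 0.03},
}

def log_score(cls, words):
    return math.log(priors[cls]) + sum(math.log(likelihoods[cls][w]) for w in words)

words = ["free", "meeting"]
print(max(priors, key=lambda c: log_score(c, words)))  # -> ham
```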
What is a limitation of standard classification and how can we overcome these limitations?
Limitation: The assumption that individual cases are disconnected and independent
-> Hidden Markov Models
Explain Sequence labeling
Each token in a sequence is assigned a label
Labels of tokens are dependent on the labels of other tokens in the sequence, particularly their neighbors
Example: POS tagging (illustrated below)
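A tiny illustration of what a labeled sequence looks like; the sentence and tags are chosen for illustration:

```python
# Sequence labeling: exactly one label per token; neighboring labels matter
# (e.g. a determiner makes a following noun more likely).
tokens = ["Time", "flies", "like", "an", "arrow"]
tags   = ["NOUN", "VERB", "ADP", "DET", "NOUN"]
for token, tag in zip(tokens, tags):
    print(f"{token}/{tag}")
```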
2 problems of POS tagging as token-by-token classification
It is not easy to integrate information about the categories of tokens on both sides.
It is difficult to propagate uncertainty between decisions and “collectively” determine the most likely joint assignment of categories to all tokens in the sequence.
4 Key features of the HMM
Fixed set of states
State transition probabilities
Fixed set of possible outputs
For each state: a probability distribution over the possible outputs -> emission probabilities (see the toy model below)
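The four components spelled out as data for a toy two-state POS model; all numbers are hypothetical:

```python
# A toy HMM: the four key features as plain data structures.
states = ["NOUN", "VERB"]                       # fixed set of states
outputs = ["time", "flies"]                     # fixed set of possible outputs
transition = {                                  # P(next state | current state)
    "NOUN": {"NOUN": 0.3, "VERB": 0.7},
    "VERB": {"NOUN": 0.6, "VERB": 0.4},
}
emission = {                                    # per-state output distribution
    "NOUN": {"time": 0.8, "flies": 0.2},
    "VERB": {"time": 0.1, "flies": 0.9},
}
```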
Formulas for the HMM transition and emission probabilities (POS-tagging notation)
Transition probabilities: P(t_i | t_{i-1}) — the probability of a state (tag) given the previous state
Emission probabilities: P(w_i | t_i) — the probability of an output (word) given the current state
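Both distributions can be estimated by maximum-likelihood counts from a tagged corpus; a minimal sketch with a made-up two-sentence corpus:

```python
from collections import Counter

# MLE estimates: P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})
#                P(w_i | t_i)     = C(t_i, w_i)     / C(t_i)
corpus = [[("time", "NOUN"), ("flies", "VERB")],
          [("fruit", "NOUN"), ("flies", "NOUN")]]

tag_count, bigram, emit = Counter(), Counter(), Counter()
for sentence in corpus:
    prev = "<s>"                       # sentence-start pseudo-tag
    for word, tag in sentence:
        bigram[(prev, tag)] += 1
        emit[(tag, word)] += 1
        tag_count[tag] += 1
        prev = tag

print(bigram[("NOUN", "VERB")] / tag_count["NOUN"])  # P(VERB|NOUN) = 1/3
print(emit[("VERB", "flies")] / tag_count["VERB"])   # P(flies|VERB) = 1.0
```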
Name of the dynamic programming solution for HMM
Viterbi algorithm
What is the Viterbi algorithm, and how does it improve HMM decoding?
The Viterbi algorithm efficiently finds the most likely sequence of hidden states by:
Combining transition and emission probabilities with dynamic programming, keeping only the best path into each state at each step.
Backtracking through stored back-pointers to reconstruct the best sequence.
Reducing complexity to O(m·s^2) (for m tokens and s states) compared to brute-force O(s^m) over all tag sequences (see the sketch below).
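A compact Viterbi sketch over a toy two-state tagger; the states and probabilities are hypothetical:

```python
# Viterbi: dynamic programming over (time step, state) cells, then backtrack.
states = ["NOUN", "VERB"]
start = {"NOUN": 0.8, "VERB": 0.2}
trans = {"NOUN": {"NOUN": 0.3, "VERB": 0.7}, "VERB": {"NOUN": 0.6, "VERB": 0.4}}
emit = {"NOUN": {"time": 0.8, "flies": 0.2}, "VERB": {"time": 0.1, "flies": 0.9}}

def viterbi(obs):
    best = [{s: start[s] * emit[s][obs[0]] for s in states}]  # time step 0
    back = [{}]
    for t in range(1, len(obs)):
        best.append({}); back.append({})
        for s in states:
            # only the best predecessor survives: O(m * s^2) work overall
            p = max(states, key=lambda q: best[t - 1][q] * trans[q][s])
            best[t][s] = best[t - 1][p] * trans[p][s] * emit[s][obs[t]]
            back[t][s] = p
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(obs) - 1, 0, -1):      # backtrack to recover the path
        path.insert(0, back[t][path[0]])
    return path

print(viterbi(["time", "flies"]))  # -> ['NOUN', 'VERB']
```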
What are common ways to represent words in text classification?
One-Hot Encoding – Sparse vector representation.
TF-IDF – Weighs words based on importance.
Word Embeddings (Word2Vec, GloVe) – Dense, semantic representations.
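A short comparison of the first two representations with scikit-learn (the two documents are toy examples); dense embeddings need a trained model, so they are only described in a comment:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat", "the dog barked"]

# Binary bag-of-words: sparse 0/1 presence vectors over the vocabulary.
print(CountVectorizer(binary=True).fit_transform(docs).toarray())

# TF-IDF: the same sparse space, but terms shared across documents
# (like "the") are down-weighted relative to rarer, more informative ones.
print(TfidfVectorizer().fit_transform(docs).toarray().round(2))

# Word embeddings (Word2Vec, GloVe) instead map each word to a dense learned
# vector so that semantically similar words end up close together.
```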