How can we obtain a trigram model for a language? Explain the probability distribution involved.
We need a corpus of text in the language L. We count how often each trigram occurs in the corpus and use the relative frequencies to estimate the probability distribution P(T = t) over trigrams t.
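A minimal sketch of how such a model could be estimated, assuming word trigrams and plain relative-frequency estimates (the course may instead use character trigrams or smoothing):

```python
from collections import Counter

def trigram_model(corpus_tokens):
    """Estimate P(T = t) for word trigrams from a tokenized corpus.

    Relative frequencies serve as the probability estimates;
    corpus_tokens is assumed to be a list of words.
    """
    trigrams = list(zip(corpus_tokens, corpus_tokens[1:], corpus_tokens[2:]))
    counts = Counter(trigrams)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

# Toy example
tokens = "the cat sat on the mat the cat slept".split()
model = trigram_model(tokens)
print(model[("the", "cat", "sat")])  # relative frequency of that trigram
```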
Explain informally how we can use trigram models to identify the language of a document D.
We build a trigram model for each candidate language. Then we use each model to compute the probability of D occurring in that language. We choose the language with the highest probability.
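A sketch of this idea, reusing the trigram_model function from the previous card; treating the trigrams as independent and giving unseen trigrams a small fixed probability are simplifying assumptions, not part of the card:

```python
import math

def document_log_prob(doc_tokens, model, unseen_prob=1e-8):
    """Log-probability of a document under one language's trigram model."""
    trigrams = zip(doc_tokens, doc_tokens[1:], doc_tokens[2:])
    return sum(math.log(model.get(t, unseen_prob)) for t in trigrams)

def identify_language(doc_tokens, models):
    """Pick the language whose model assigns the document the highest probability."""
    return max(models, key=lambda lang: document_log_prob(doc_tokens, models[lang]))

# Usage (assuming trigram_model from above and tokenized corpora per language):
# models = {"en": trigram_model(english_tokens), "de": trigram_model(german_tokens)}
# identify_language("the cat sat on the mat".split(), models)
```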
Explain briefly what named entity recognition is.
The task of finding names of things in a text and deciding which class they belong to.
How many trigrams does a language with x words have?
x^3
Applications of trigram models
Language identification
Genre classification
Named entity recognition
Give the tf(_, d2)
tf = term frequency
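One common definition of term frequency, assuming raw counts (the variant used in the exercise may differ, e.g. relative or log-scaled counts):

```latex
\mathrm{tf}(t, d) = \text{number of occurrences of term } t \text{ in document } d
```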
Give idf(_, D)
(the 3 is wrong, it should be 4)
idf = inverse document frequency
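A common definition of inverse document frequency, assuming the standard logarithmic variant (the exercise may use a slightly different formula):

```latex
\mathrm{idf}(t, D) = \log \frac{|D|}{|\{\, d \in D : t \in d \,\}|}
```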
What is the idea of cosine similarity for comparing a query against the documents in D?
The query and each document are represented as vectors of word frequencies. Vectors pointing in the same direction are considered similar, so the documents can be ranked by the angle between them and the query.
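A minimal sketch of cosine similarity over sparse term-frequency vectors; the dict-based representation and the toy query/document are illustrative assumptions:

```python
import math
from collections import Counter

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term-frequency vectors (dicts)."""
    dot = sum(vec_a[t] * vec_b.get(t, 0) for t in vec_a)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

query = Counter("cat on mat".split())
doc = Counter("the cat sat on the mat".split())
print(cosine_similarity(query, doc))  # closer to 1 means more similar
```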
What is the benefit of using tf-idf instead of tf when using cosine similarity?
tf-idf gives more weight to words that occur in fewer documents. Otherwise, many documents would falsely appear similar just because the most common words appear in most of them.
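A sketch of tf-idf weighting that could feed into the cosine similarity above; raw-count tf and log(|D| / df) idf are assumptions about the exact variant used in the course:

```python
import math
from collections import Counter

def tf_idf_vector(doc_tokens, documents):
    """tf-idf weights for one document, given the whole collection as sets of terms."""
    tf = Counter(doc_tokens)
    n_docs = len(documents)
    return {
        term: count * math.log(n_docs / sum(1 for d in documents if term in d))
        for term, count in tf.items()
    }

docs = [set("the cat sat".split()), set("the dog ran".split()), set("a cat slept".split())]
print(tf_idf_vector("the cat sat".split(), docs))  # "the" gets a lower weight than "sat"
```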
What is a statistical language model?
A probability distribution over words or n-grams occurring in a corpus of the language.