What is the goal of Information Retrieval (IR)?
The goal of IR is to determine the relevance of a document in terms of a specific topic.
Def. Document
Can be anything: a web page, a Word file, a text file, etc.
Def. Collection
A set of documents
Def. Relevance
Does a document satisfy the information need of the user or does it help complete the user’s task?
Document frequency:
How many documents contain the term
Term frequency per document:
How often the term appears in a given document
What are the four main types of queries in IR?
Exact matching (concatenated with or)
Boolean queries (and / or / not operators)
Expanded queries (incorporate synonyms)
Wildcard queries, phonetic queries, phrase queries
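A minimal sketch of how some of these query types could be evaluated, assuming each term already maps to a set of document IDs. The data, the `synonyms` table, and the helper names (`docs`, `any_of`, `expanded`, `wildcard`) are hypothetical and only for illustration; phonetic and phrase queries are not covered here.

```python
from fnmatch import fnmatchcase

# Hypothetical toy data: term -> set of IDs of documents containing it.
postings = {
    "cat": {1, 2, 4},
    "cats": {3},
    "dog": {2, 3},
    "puppy": {5},
}
# Hypothetical synonym table used for query expansion.
synonyms = {"dog": ["puppy"]}

def docs(term):
    """Documents containing `term` (empty set for unknown terms)."""
    return postings.get(term, set())

def any_of(terms):
    """Union of the document sets of all given terms (OR)."""
    result = set()
    for term in terms:
        result |= docs(term)
    return result

def expanded(term):
    """Expanded query: the term OR any of its synonyms."""
    return any_of([term] + synonyms.get(term, []))

def wildcard(pattern):
    """Wildcard query: OR over all indexed terms matching the pattern."""
    return any_of(t for t in postings if fnmatchcase(t, pattern))

exact_or = any_of(["cat", "dog"])          # exact terms joined with OR
boolean_and = docs("cat") & docs("dog")    # cat AND dog
boolean_not = docs("cat") - docs("dog")    # cat AND NOT dog
expanded_dog = expanded("dog")             # dog OR puppy
wild = wildcard("cat*")                    # matches "cat" and "cats"
```

Using sets makes AND, OR, and NOT map directly onto intersection, union, and difference.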
What is an inverted index, and why is it useful?
An inverted index maps each term to the documents that contain it, together with term statistics such as document frequency and term frequency. Queries are answered from the index instead of by reading full documents, which makes retrieval efficient and enables fast ranking of relevant results.
What are the key components of an inverted index?
Document Frequency (df) – Number of documents containing a term.
Term Frequency (tf) – Number of times a term appears in a document.
Posting List – A list of document IDs where the term appears.
Document Length – Length of each document.
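The four components can be illustrated with a small in-memory index. This is only a sketch: it assumes whitespace tokenization and lowercasing, and the function name `build_index` and the sample documents are made up for the example.

```python
from collections import Counter, defaultdict

def build_index(documents):
    """Build a toy inverted index.

    documents: dict mapping doc_id -> text.
    Returns (postings, doc_len) where
      postings[term] = {doc_id: tf}   (posting list with term frequencies)
      doc_len[doc_id] = number of tokens in the document
    """
    postings = defaultdict(dict)
    doc_len = {}
    for doc_id, text in documents.items():
        tokens = text.lower().split()      # naive tokenizer (assumption)
        doc_len[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            postings[term][doc_id] = tf
    return dict(postings), doc_len

documents = {
    1: "the cat sat on the mat",
    2: "the dog chased the cat",
    3: "fish swim",
}
postings, doc_len = build_index(documents)

# Document frequency is the length of each posting list.
df = {term: len(plist) for term, plist in postings.items()}
```

Here `postings["the"]` holds the posting list for "the" with its per-document tf, `df["the"]` is its document frequency, and `doc_len` stores the document lengths.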
Explain the functionality of a dictionary
The Dictionary<T> maps text (a term) to a value of type T
T is a posting list or potentially other data about the term, depending on the index
What are the 3 desired properties of dictionaries?
Random lookup
Fast (creation & lookup)
Memory efficient
What are the limitations of relevance?
Relevance is to the information need rather than to the query
The query is shorthand for an instance of an information need: its initial verbalized presentation by the user
Relevance is assumed to be a binary attribute
A document is either relevant to a query / need or it is not
What is term frequency?
Number of occurrences of term t in document d
Why is TF-IDF useful in ranking search results?
TF-IDF helps rank documents by giving higher weight to terms that are rare in the collection but appear frequently in a document, so these terms count more toward relevance than common words.
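A minimal sketch of one common TF-IDF variant, tf · log(N / df). The notes do not fix a specific formula, so the weighting used here and the function name `tf_idf` are assumptions; the collection size and frequencies are invented numbers.

```python
import math

def tf_idf(tf, df, n_docs):
    """TF-IDF weight of a term in one document.

    tf: occurrences of the term in the document
    df: number of documents containing the term
    n_docs: total number of documents in the collection
    Uses raw tf and idf = ln(n_docs / df) (one variant among several).
    """
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(n_docs / df)

n_docs = 1000
rare = tf_idf(tf=3, df=10, n_docs=n_docs)     # rare term, appears 3 times
common = tf_idf(tf=3, df=900, n_docs=n_docs)  # common term, appears 3 times
everywhere = tf_idf(tf=3, df=1000, n_docs=n_docs)  # term in every document
```

With equal term frequency, the rare term gets a much larger weight than the common one, and a term that occurs in every document contributes nothing.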
What is BM25, and how does it improve TF-IDF?
BM25 improves TF-IDF by introducing term frequency saturation and document length normalization, making retrieval more effective for long documents and queries.
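Both effects can be seen in a sketch of the per-term BM25 score. It assumes the widely used form with parameters k1 and b and a smoothed, non-negative idf; the exact idf variant, the default values k1 = 1.2 and b = 0.75, and the name `bm25_term` are assumptions, not taken from the notes.

```python
import math

def bm25_term(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25 contribution of a single query term to one document.

    k1 controls term frequency saturation.
    b controls how strongly document length is normalized (0 = off).
    """
    # Smoothed idf that stays non-negative (one common variant).
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    # Length normalization: longer-than-average documents are penalized.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + norm)

common_args = dict(df=10, n_docs=1000, avg_doc_len=100)

# Saturation: each additional occurrence adds less than the previous one.
gain_low = (bm25_term(tf=2, doc_len=100, **common_args)
            - bm25_term(tf=1, doc_len=100, **common_args))
gain_high = (bm25_term(tf=20, doc_len=100, **common_args)
             - bm25_term(tf=19, doc_len=100, **common_args))

# Length normalization: same tf, but a longer document scores lower.
short_doc = bm25_term(tf=3, doc_len=50, **common_args)
long_doc = bm25_term(tf=3, doc_len=400, **common_args)
```

Unlike raw tf in TF-IDF, the score is bounded by idf · (k1 + 1) no matter how often the term occurs.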
What is the difference between precision and recall?
Precision measures the fraction of retrieved documents that are actually relevant.
Recall measures the fraction of all relevant documents that were retrieved.
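A short worked example, assuming binary relevance and set-based evaluation; the function name `precision_recall` and the document IDs are invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: IDs of documents returned by the system
    relevant: IDs of all documents judged relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)   # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 4 documents retrieved, 2 of them relevant; 5 relevant documents exist.
precision, recall = precision_recall(
    retrieved=[1, 2, 3, 4],
    relevant=[2, 4, 5, 6, 7],
)
```

In this example precision is 2/4 and recall is 2/5: the same two hits are divided by the number of retrieved documents in one case and by the number of relevant documents in the other.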
How does Mean Reciprocal Rank (MRR) evaluate ranking quality?
MRR focuses on the position of the first relevant document in search results. It calculates the reciprocal of its rank and averages it over multiple queries.
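The calculation can be sketched as follows. The sketch assumes that a query with no relevant document in its ranking contributes 0; the function name `mean_reciprocal_rank` and the sample rankings are hypothetical.

```python
def mean_reciprocal_rank(rankings, relevant_sets):
    """MRR over several queries.

    rankings: list of ranked document ID lists, one per query
    relevant_sets: list of sets of relevant document IDs, one per query
    """
    total = 0.0
    for ranking, relevant in zip(rankings, relevant_sets):
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in relevant:
                total += 1.0 / rank   # only the first relevant hit counts
                break
    return total / len(rankings) if rankings else 0.0

rankings = [
    ["a", "b", "c"],   # first relevant document at rank 1 -> 1
    ["d", "e", "f"],   # first relevant document at rank 3 -> 1/3
    ["g", "h", "i"],   # no relevant document retrieved -> 0
]
relevant_sets = [{"a"}, {"f"}, {"z"}]

mrr = mean_reciprocal_rank(rankings, relevant_sets)  # (1 + 1/3 + 0) / 3
```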
What makes nDCG a better ranking evaluation metric than precision?
Unlike precision, nDCG considers different levels of relevance (perfect, highly relevant, relevant, irrelevant) and discounts relevance based on document position.
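A minimal sketch of nDCG with graded relevance. It assumes the grades 3 = perfect, 2 = highly relevant, 1 = relevant, 0 = irrelevant, a linear gain with a log2 position discount, and an ideal ranking built from the same judged list; other formulations (for example an exponential gain) exist. The function names are illustrative.

```python
import math

def dcg(gains):
    """Discounted cumulative gain: gain at rank i is divided by log2(i + 1)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))

def ndcg(gains):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Relevance grades of the returned documents, in ranked order.
best_first = [3, 2, 1, 0]
worst_first = [0, 1, 2, 3]

ndcg_best = ndcg(best_first)
ndcg_worst = ndcg(worst_first)
```

Both rankings contain the same three relevant documents, so precision is identical, but nDCG is lower when the best documents appear at the bottom of the list.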
What are the main challenges in evaluating Information Retrieval systems?
Defining Relevance – Different users may perceive relevance differently.
Handling Ambiguity – Queries may have multiple interpretations.
Limited Judgment Data – Not all documents can be manually assessed.
Computational Cost – Some evaluation metrics are expensive to compute.
What is the key difference between TF-IDF and BM25?
BM25 modifies TF-IDF by introducing term frequency saturation and document length normalization, improving retrieval effectiveness for long documents and queries.