What is Dense Retrieval, and how does it differ from traditional BM25?
Dense Retrieval (DR) replaces traditional inverted index-based retrieval (e.g., BM25) with neural embeddings.
Differences from BM25:
Uses vector representations instead of exact term matching.
Improves retrieval for semantic similarity rather than keyword overlap.
What are 3 neural approaches for improving first-stage retrieval?
Neural methods for document expansion & indexing:
Doc2Query → Expands documents with queries that match them semantically (BERT/T5 variants).
DeepCT → Assigns term weights using BERT output during indexing.
COIL → Fuses contextual vectors into an inverted index structure for faster lookup.
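A minimal Doc2Query-style expansion sketch (assuming the Hugging Face transformers library and a publicly released doc2query T5 checkpoint; the model name and generation settings are illustrative, not taken from the notes above):
```python
# Doc2Query sketch: generate likely queries for a passage and append them
# to the passage text before building the inverted index.
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "castorini/doc2query-t5-base-msmarco"  # assumed checkpoint name
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

passage = "The Eiffel Tower was completed in 1889 and stands in Paris."
inputs = tokenizer(passage, return_tensors="pt", truncation=True, max_length=512)

# Sample a few plausible queries for this passage.
outputs = model.generate(**inputs, max_length=32, do_sample=True,
                         top_k=10, num_return_sequences=3)
queries = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

# The expanded passage (original text + generated queries) is what BM25 indexes.
expanded_passage = passage + " " + " ".join(queries)
```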
How does neural re-ranking differ from dense retrieval?
Neural Re-Ranking:
Reorders a pre-selected list of documents.
Uses score(q,d) interface (like classical ranking).
Dense Retrieval:
Replaces the traditional first stage.
Uses a neural encoder & nearest neighbor vector index to rank documents.
When can dense retrieval be used without re-ranking?
If dense retrieval is effective enough, it can be used alone without a re-ranking step.
Benefits:
Faster retrieval (removes additional ranking stage).
Lower complexity in production search systems.
How are dense retrieval models trained?
Training process:
Triplet-based training → Uses (query, relevant document, non-relevant document).
Encode → Generate dense vector representations for the query and both documents of each triplet.
Loss function → Maximize margin between relevant & non-relevant docs.
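A minimal training-step sketch (assuming PyTorch + Hugging Face transformers, a bert-base encoder with [CLS] pooling, and a margin of 1; model name and hyperparameters are illustrative):
```python
# One triplet training step for a BERT_dot-style bi-encoder (sketch).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
optimizer = torch.optim.AdamW(encoder.parameters(), lr=2e-5)

def encode(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    # Use the [CLS] vector as the single dense representation (one common pooling choice).
    return encoder(**batch).last_hidden_state[:, 0, :]

queries = ["what is dense retrieval"]
pos_docs = ["Dense retrieval encodes queries and passages into vectors."]  # relevant
neg_docs = ["BM25 ranks documents using term frequency statistics."]       # non-relevant

q, d_pos, d_neg = encode(queries), encode(pos_docs), encode(neg_docs)
s_rel = (q * d_pos).sum(dim=-1)     # dot-product score, relevant passage
s_nonrel = (q * d_neg).sum(dim=-1)  # dot-product score, non-relevant passage

loss = torch.clamp(s_nonrel - s_rel + 1.0, min=0).mean()  # plain margin loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```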
How are training batches formed for dense retrieval models?
Batch formation:
Sample as many triplets as allowed by GPU memory.
Common batch sizes: 16-128.
Batches mix different queries for better generalization.
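A minimal batch-formation sketch (assuming PyTorch's DataLoader over pre-built triplets; the batch size and example triplets are illustrative):
```python
# Batches of triplets drawn from many different queries (sketch).
from torch.utils.data import DataLoader

triplets = [
    ("what is dense retrieval",
     "Dense retrieval encodes queries and passages into vectors.",
     "BM25 ranks documents using term frequency statistics."),
    ("who built the eiffel tower",
     "Gustave Eiffel's company built the tower for the 1889 World's Fair.",
     "The Louvre is a famous museum in Paris."),
    # ... in practice, many triplets are sampled per training query
]

# Shuffling mixes triplets from different queries into the same batch.
loader = DataLoader(triplets, batch_size=32, shuffle=True)  # 16-128 typical, bounded by GPU memory
for queries, pos_docs, neg_docs in loader:
    ...  # encode each list and compute the pairwise loss (see the training sketch above)
```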
How are non-relevant passages selected for training?
Selection process:
Run BM25 to get the top 1000 results for each training query.
Randomly select some of those as negative (non-relevant) passages.
The sampled negatives still provide some signal: since BM25 retrieved them, there is at least some lexical overlap with the query.
These negatives add realism to training but may contain noise (which can itself be useful).
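A minimal negative-sampling sketch (assuming the rank_bm25 package as the BM25 implementation; the tiny corpus and the judged-relevant set are illustrative):
```python
# Sample non-relevant passages for a training query from its BM25 top results (sketch).
import random
from rank_bm25 import BM25Okapi

corpus = [
    "Dense retrieval encodes queries and passages into vectors.",
    "BM25 ranks documents using term frequency statistics.",
    "Neural re-ranking reorders a candidate list of documents.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

query = "what is dense retrieval"
relevant_ids = {0}  # passages judged relevant for this query

# Take the BM25 top results (top 1000 as above) ...
scores = bm25.get_scores(query.split())
top_ids = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:1000]

# ... and randomly pick non-judged passages as negatives; BM25 guarantees some lexical overlap.
candidates = [i for i in top_ids if i not in relevant_ids]
negative_ids = random.sample(candidates, k=min(2, len(candidates)))
```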
What loss functions are commonly used in dense retrieval training?
Two main loss functions:
Plain Margin Loss → Maximizes separation between relevant & non-relevant docs: loss = max(0, s_nonrel - s_rel + 1)
RankNet Loss → A pairwise ranking loss assuming binary relevance: loss = BinaryCrossEntropy(sigmoid(s_rel - s_nonrel), 1)
Both assume binary relevance
→ Aim to maximize the margin between relevant & non-relevant documents
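A minimal sketch of both losses in PyTorch (assuming s_rel / s_nonrel are tensors of dot-product scores from the bi-encoder; the numbers are made up):
```python
import torch
import torch.nn.functional as F

s_rel = torch.tensor([12.3, 10.1])     # scores of relevant passages
s_nonrel = torch.tensor([11.9, 7.4])   # scores of non-relevant passages

# Plain margin loss: max(0, s_nonrel - s_rel + 1)
margin_loss = torch.clamp(s_nonrel - s_rel + 1.0, min=0).mean()

# RankNet loss: binary cross-entropy on the score difference, target "relevant wins" (= 1)
ranknet_loss = F.binary_cross_entropy_with_logits(s_rel - s_nonrel,
                                                  torch.ones_like(s_rel))
```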
What are the 3 main phases in the dense retrieval lifecycle?
Step 1 - Model Training → Learn query/document embeddings.
Step 2 - Indexing → Encode all passages into dense vectors and store them.
Step 3 - Searching → Use nearest neighbor search to find relevant passages.
How does the BERT_dot model work in dense retrieval (Step 1 - Training)?
BERT_dot compresses the query and each passage into one vector each:
Each passage is encoded only once during indexing.
At query time, only the query embedding needs to be computed.
Relevance is scored using a dot-product between query & passage vectors.
✅ This allows fast lookup using a nearest neighbor index.
Encoding and Matching Formulas for BERT_dot
q_vec = BERT(q)[CLS], p_vec = BERT(p)[CLS] (query and passage are encoded independently; the [CLS] vector is the representation)
score(q, p) = q_vec · p_vec (dot product of the two vectors)
How does nearest neighbor search work in dense retrieval (BERT_dot Steps 2 & 3)?
Process:
Encode all passages in the collection.
Save them in a nearest neighbor index.
At query time, encode the query on the fly and retrieve the nearest neighbor vectors from the passage index.
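A minimal indexing-and-search sketch (assuming the FAISS library, a flat inner-product index, and random vectors standing in for the encoder output):
```python
# Step 2: encode & store all passages; Step 3: encode the query and search (sketch).
import faiss
import numpy as np

dim = 768
passage_vecs = np.random.rand(10_000, dim).astype("float32")  # stand-in for encode(passages)

index = faiss.IndexFlatIP(dim)   # exact dot-product (inner product) index
index.add(passage_vecs)          # store all passage vectors (done once, offline)

query_vec = np.random.rand(1, dim).astype("float32")  # stand-in for encode([query]) at query time
scores, ids = index.search(query_vec, 10)              # ids of the 10 nearest passages
```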
Why is GPU brute-force search effective for dense retrieval?
GPU-based search can efficiently compute the top 1000 results from 9 million vectors using fast dot-product calculations.
✅ Advantage: Extremely fast for smaller datasets.
🚫 Limitation: Becomes costly for very large-scale retrieval.
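A minimal GPU brute-force sketch (PyTorch; a CUDA device is assumed and the collection size is scaled down here - with enough GPU memory the same two lines cover the full ~9M-vector collection):
```python
# Brute-force dot-product search on the GPU: one matmul + top-k (sketch).
import torch

passage_vecs = torch.randn(1_000_000, 768, device="cuda", dtype=torch.float16)  # all encoded passages
query_vecs = torch.randn(32, 768, device="cuda", dtype=torch.float16)           # a batch of encoded queries

scores = query_vecs @ passage_vecs.T            # (32, 1M) dot-product scores
top_scores, top_ids = scores.topk(1000, dim=1)  # top 1000 passages per query
```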
What are 4 common indexing techniques used in dense retrieval?
Flat Index → Brute force search over all embeddings (best accuracy but slowest).
Inverted File Index → Clusters documents, reducing search space.
Product Quantization → Compresses vectors into lower-dimensional integer representations.
Graph-Based Indices → Builds a graph where similar vectors are linked for fast traversal; the search space is split into hierarchical layers.
Approximate Nearest Neighbor search → Returns a good nearest neighbor, but not necessarily the nearest (trade-off between latency and effectiveness).
Explain Flat index and its advantages / disadvantages
Use the raw vector embeddings and brute-force search.
+ Exhaustive search → best accuracy
- Computes the distance to every single vector → slow
Explain Inverted File Index and its advantage / disadvantage
Partition the dataset into clusters using a clustering algorithm (e.g., k-means) and compute the centroid of each cluster.
Compute the similarity between the query and the centroids, select the top n clusters, then compute the similarity to all vectors within those clusters.
+ Large reduction of search space
- More overhead (the clustering step must be trained) & results become approximate if the true neighbor lies in an unvisited cluster
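A minimal IVF sketch with FAISS (nlist, nprobe and the stand-in vectors are illustrative assumptions):
```python
# Inverted File Index: k-means clusters + search only the closest clusters (sketch).
import faiss
import numpy as np

dim, nlist = 768, 1024
passage_vecs = np.random.rand(100_000, dim).astype("float32")

quantizer = faiss.IndexFlatIP(dim)  # assigns vectors/queries to centroids
index = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(passage_vecs)           # runs k-means to learn the centroids
index.add(passage_vecs)

index.nprobe = 10                   # only the 10 closest clusters are searched
query_vec = np.random.rand(1, dim).astype("float32")
scores, ids = index.search(query_vec, 100)
```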
Explain Product Quantization and its advantages / disadvantages
Replace the original float vectors with lower-dimensional integer codes: split each vector into sub-vectors and map every sub-vector to the ID of its nearest cluster centroid.
+ Memory efficient & fast
- Results are approximate & quality depends on the split and clustering parameters
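A minimal product-quantization sketch with FAISS (m and nbits are illustrative; retrieval quality depends on exactly these split/clustering parameters):
```python
# Product Quantization: split each vector into m sub-vectors, store each as a small code (sketch).
import faiss
import numpy as np

dim, m, nbits = 768, 96, 8          # 96 sub-vectors of 8 dims, 1 byte (2^8 centroids) each
passage_vecs = np.random.rand(100_000, dim).astype("float32")

index = faiss.IndexPQ(dim, m, nbits)
index.train(passage_vecs)           # learns one codebook per sub-vector
index.add(passage_vecs)             # ~3 KB of floats shrink to 96 bytes per vector

query_vec = np.random.rand(1, dim).astype("float32")
scores, ids = index.search(query_vec, 100)  # distances are approximated from the codes
```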
Explain Graph Indices and its advantages / disadvantages
Proximity graph: vectors are linked to similar “friend” vectors.
Search starts at predefined entry points and visits friends until no nearer vertex is found, then moves one layer down and continues the search.
+ Scales much better to huge datasets
- Needs additional pre-processing & memory
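A minimal graph-index (HNSW) sketch with FAISS (connectivity and search-effort parameters are illustrative; this example keeps FAISS's default L2 distance):
```python
# HNSW: hierarchical proximity graph, greedy traversal over "friend" links (sketch).
import faiss
import numpy as np

dim = 768
passage_vecs = np.random.rand(100_000, dim).astype("float32")

index = faiss.IndexHNSWFlat(dim, 32)   # link each vector to up to 32 friends per layer
index.hnsw.efConstruction = 200        # graph-building effort (the extra pre-processing cost)
index.add(passage_vecs)                # builds the hierarchical graph while adding

index.hnsw.efSearch = 64               # how many candidates to visit during traversal
query_vec = np.random.rand(1, dim).astype("float32")
scores, ids = index.search(query_vec, 100)
```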
Explain Approximate Nearest Neighbor Search and its advantages / disadvantages
Search for a close neighbor (but not necessarily the optimum) → trade-off between latency and effectiveness
+ Faster than brute-force search; scales better to massive datasets → necessary for low-latency CPU serving
- Adds a lot of complexity to the search system
What is Approximate Nearest Neighbor Search (ANN), and why is it used?
ANN trades off accuracy for speed by reducing search complexity.
Advantages:
Faster than brute force search.
Scales better to massive datasets.
Disadvantage:
Retrieval quality depends on index structure and approximation method.