ColPali MaxSim vs. LanceDB Indexing
1. How ColPali Native MaxSim Works
ColPali employs a MaxSim retrieval approach. Each query is represented as a set of sub-vectors $Q = \{q_1, \dots, q_n\}$ (one per query token), and each document as a set of sub-vectors $D = \{d_1, \dots, d_m\}$ (one per image patch). The relevance score is the sum, over all query sub-vectors, of the maximum cosine similarity against the document's sub-vectors.
This brute-force method requires computing all pairwise cosine similarities ($n \times m$ per query-document pair), then taking the maximum per query sub-vector and summing the results.
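As an illustration only (not ColPali's actual implementation), a brute-force MaxSim for a single query-document pair can be sketched in NumPy; the sizes below are arbitrary placeholders:

```python
import numpy as np

def maxsim(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Brute-force MaxSim: for every query sub-vector, take the maximum cosine
    similarity over all document sub-vectors, then sum those maxima."""
    # Normalize rows so a dot product equals cosine similarity.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = q @ d.T                         # (n, m) pairwise cosine similarities
    return float(sims.max(axis=1).sum())   # max per query sub-vector, then sum

# Example: 20 query sub-vectors vs. 1,030 document sub-vectors, 256-d each.
query = np.random.rand(20, 256).astype(np.float32)
doc = np.random.rand(1030, 256).astype(np.float32)
print(maxsim(query, doc))
```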
2. How LanceDB MaxSim Works
LanceDB optimizes MaxSim retrieval using multi-vector storage and approximate nearest neighbor (ANN) search. Documents are stored in a structured format where each row contains a multi-vector column (e.g., a list of 256-dimensional sub-vectors). The system builds an HNSW (Hierarchical Navigable Small World) index, which precomputes connections between vectors to enable efficient traversal.
Data Schema:
pa.field("vector", pa.list_(pa.list_(pa.float32(), 256)))
During retrieval, LanceDB avoids exhaustive comparisons. For each query sub-vector, the index returns only the top-K most similar document sub-vectors; the maximum similarity for that sub-vector is then taken from this small candidate set, and the per-sub-vector maxima are summed into the final MaxSim score.
Note: LanceDB currently uses ANN with Product Quantization (PQ); the detailed math for our case is covered in ANN using PQ on multivector.
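By way of illustration, a minimal sketch of creating and querying such a table with the LanceDB Python client might look like the following. It assumes a recent client version with multivector (MaxSim) search support; the table name, corpus size, index settings, and random embeddings are placeholders rather than real ColPali outputs:

```python
import lancedb
import numpy as np
import pyarrow as pa

# Each row stores one document as a list of 256-d sub-vectors (a multi-vector column).
schema = pa.schema([
    pa.field("id", pa.string()),
    pa.field("vector", pa.list_(pa.list_(pa.float32(), 256))),
])

db = lancedb.connect("./colpali_demo_db")        # local, file-based database
table = db.create_table("docs", schema=schema)

# Random stand-ins for ColPali patch embeddings: 1,000 documents x 32 sub-vectors each.
table.add([
    {"id": f"doc-{i}", "vector": np.random.rand(32, 256).astype(np.float32).tolist()}
    for i in range(1000)
])

# Multi-vector columns are searched with the cosine metric; other index settings left at defaults.
table.create_index(metric="cosine", vector_column_name="vector")

# The query is itself a multi-vector (one sub-vector per query token);
# LanceDB retrieves candidates per sub-vector and aggregates them with MaxSim.
query = np.random.rand(20, 256).astype(np.float32)
results = table.search(query).limit(5).to_pandas()
print(results)
```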
3. Comparison of ColPali and LanceDB
ColPali’s brute-force approach stores multi-vectors in memory and computes all pairwise similarities, resulting in $n \times m$ cosine computations for every query-document pair; this is exact but scales poorly as the corpus grows. LanceDB, in contrast, uses its index to fetch only the closest document sub-vectors for each query sub-vector before aggregating.
Both methods produce identical or nearly identical MaxSim scores because LanceDB’s ANN prioritizes finding the closest matches. Errors in lower-ranked candidates do not affect the final score, as only the maximum similarity per query sub-vector matters. The trade-off is minimal: LanceDB sacrifices exactness in intermediate steps but preserves accuracy in the final aggregation.
Feature | ColPali (Brute-Force) | LanceDB (ANN-based) |
---|---|---|
Storage | Stores multi-vectors in memory | Stores multi-vectors in a LanceDB table |
Computation | Computes all pairwise similarities | Uses ANN to retrieve only the closest vectors |
Efficiency | $O(n \times m)$ similarity computations per document | Roughly $O(n \log m)$ via index traversal |
MaxSim Aggregation | Explicitly sums the max similarity for each query sub-vector | ANN top-K retrieval followed by max-sim aggregation |
Final Score | Sum of max cosine similarities | Same sum of max cosine similarities |
Speed | Slow for large datasets | Much faster for large datasets |
Accuracy | Exact | Nearly exact (often identical to brute-force) |
4. Efficiency and Accuracy of LanceDB’s Indexing
LanceDB’s speed advantage stems from the HNSW algorithm. HNSW constructs a hierarchical graph where top layers enable coarse, rapid navigation, and lower layers refine the search. During queries, the algorithm starts at the top layer, greedily traversing toward the nearest neighbors, bypassing irrelevant vectors.
This structure ensures that only a small fraction of the stored sub-vectors is visited for each query sub-vector, so search time grows roughly logarithmically with the number of stored vectors instead of linearly, while the true nearest neighbors are still found with high probability.
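LanceDB's internal index is not exposed directly here, so as a stand-in the same idea can be demonstrated with the standalone hnswlib library (used purely for illustration, not LanceDB's implementation): build an HNSW graph over all document sub-vectors once, then fetch only the top-K neighbors for each query sub-vector. All sizes and parameters below are illustrative:

```python
import hnswlib
import numpy as np

dim, num_doc_vecs = 256, 100_000
doc_vecs = np.random.rand(num_doc_vecs, dim).astype(np.float32)

# Build the HNSW graph once over all stored sub-vectors (cosine space).
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_doc_vecs, ef_construction=200, M=16)
index.add_items(doc_vecs, np.arange(num_doc_vecs))
index.set_ef(64)  # search-time breadth: higher = more accurate but slower

# For each query sub-vector, visit only a small part of the graph to get its top-K.
query_vecs = np.random.rand(20, dim).astype(np.float32)
labels, distances = index.knn_query(query_vecs, k=10)

# hnswlib returns cosine *distances*; convert to similarities, then MaxSim-aggregate.
similarities = 1.0 - distances
print(similarities.max(axis=1).sum())
```

In a full retrieval pipeline the retrieved labels would also be mapped back to their documents so scores can be aggregated per document; the snippet only shows the per-sub-vector top-K step.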
5. Mathematical Formulation of MaxSim
5.1 MaxSim with Brute Force (ColPali)
The brute-force MaxSim score is defined as:

$$\text{MaxSim}(Q, D) = \sum_{i=1}^{n} \max_{j \in \{1, \dots, m\}} \cos(q_i, d_j)$$

where $Q = \{q_1, \dots, q_n\}$ are the query sub-vectors, $D = \{d_1, \dots, d_m\}$ are the document sub-vectors, and $\cos(q_i, d_j)$ is the cosine similarity between a query sub-vector and a document sub-vector.
5.2 MaxSim with ANN (LanceDB)
LanceDB’s ANN-based approach approximates this by limiting comparisons to the top-K candidates per query sub-vector:

$$\text{MaxSim}_{\text{ANN}}(Q, D) = \sum_{i=1}^{n} \max_{d_j \in \text{TopK}(q_i)} \cos(q_i, d_j)$$
Since the top-K set almost always includes the true maximum similarity (due to HNSW’s design), the aggregated score matches the brute-force result.
Note: how the HNSW-backed ANN search replaces exhaustive comparison:
Instead of computing all similarities:
1. ANN retrieves the top-K nearest neighbors for each query sub-vector: $\text{TopK}(q_i) \subseteq \{d_1, \dots, d_m\}$.
2. Cosine similarity is computed only for those $K$ candidates: $\cos(q_i, d_j)$ for $d_j \in \text{TopK}(q_i)$.
3. The final MaxSim score is $\sum_{i=1}^{n} \max_{d_j \in \text{TopK}(q_i)} \cos(q_i, d_j)$.
4. Since ANN finds the best match with high probability, the result is the same as brute force (see the sketch below).
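A minimal NumPy sketch of this top-K aggregation follows; the candidate set is computed exactly here, purely to stand in for what an ANN index would return without a full scan:

```python
import numpy as np

def maxsim_topk(query_vecs: np.ndarray, doc_vecs: np.ndarray, k: int = 10) -> float:
    """Approximate MaxSim: only the K best candidates per query sub-vector are kept."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = q @ d.T
    # Keep the K largest similarities per query sub-vector (what ANN would return).
    topk = np.sort(sims, axis=1)[:, -k:]
    # If the true maximum is inside the top-K set, this equals the brute-force score.
    return float(topk.max(axis=1).sum())

query = np.random.rand(20, 256).astype(np.float32)
doc = np.random.rand(1030, 256).astype(np.float32)
# With an exact top-K the scores match; a real ANN index matches with high probability.
print(maxsim_topk(query, doc, k=10))
```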
6. Numerical Example
6.1 Setup
Consider a query with two sub-vectors $q_1, q_2$ and a document with three sub-vectors $d_1, d_2, d_3$, with pairwise cosine similarities as shown below.
6.2 Brute-Force Calculation
The pairwise cosine similarities between the query and document sub-vectors are:

| | $d_1$ | $d_2$ | $d_3$ | Max |
|---|---|---|---|---|
| $q_1$ | 0.99 | 0.48 | 0.48 | 0.99 |
| $q_2$ | 0.42 | 0.64 | 0.96 | 0.96 |

- For $q_1$, the maximum similarity is 0.99.
- For $q_2$, the maximum similarity is 0.96.
Final MaxSim Score: $0.99 + 0.96 = 1.95$.
LanceDB’s ANN search:
- HNSW retrieves $d_1$ as the top match for $q_1$ and $d_3$ for $q_2$, yielding the same score of $0.99 + 0.96 = 1.95$.
Both methods produce identical results, but LanceDB avoids computing all 6 pairwise similarities.
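The arithmetic can be verified in a few lines of NumPy, plugging in the similarity values from the table above:

```python
import numpy as np

# Rows = q1, q2; columns = d1, d2, d3 (values from the example above).
sims = np.array([
    [0.99, 0.48, 0.48],
    [0.42, 0.64, 0.96],
])
print(sims.max(axis=1))        # [0.99, 0.96] -> best match per query sub-vector
print(sims.max(axis=1).sum())  # 1.95 -> final MaxSim score
```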
Key Takeaways
ColPali’s brute-force MaxSim is accurate but computationally prohibitive for large datasets. LanceDB addresses this with ANN-based retrieval, reducing complexity from $O(n \times m)$ exhaustive comparisons to roughly $O(n \log m)$ index lookups per query, while the final MaxSim scores remain the same or nearly the same.