ColPali MaxSim vs. LanceDB Indexing

1. How ColPali Native MaxSim Works

ColPali employs a MaxSim retrieval approach. Each query is represented as a set of sub-vectors $Q = \{q_1, q_2, \ldots, q_M\}$, and each document is similarly decomposed into sub-vectors $D = \{d_1, d_2, \ldots, d_N\}$. The scoring function calculates the sum of the maximum cosine similarities between each query sub-vector and its most similar document sub-vector:

$$\text{score}(Q, D) = \sum_{i=1}^{M} \max_{1 \le j \le N} \text{cos\_sim}(q_i, d_j).$$

This brute-force method requires computing all pairwise cosine similarities, $O(M \cdot N)$ in total. For each query sub-vector $q_i$, the system identifies the document sub-vector $d_j$ with the highest similarity and aggregates these maxima. While straightforward, this approach becomes computationally expensive for large datasets, since the cost grows with the product of the query and document sub-vector counts.
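As a concrete sketch, the brute-force scoring can be written in a few lines of NumPy. The function name and array shapes are illustrative, not ColPali's actual implementation:

```python
import numpy as np

def maxsim_brute_force(Q: np.ndarray, D: np.ndarray) -> float:
    """Brute-force MaxSim: sum over query sub-vectors of the max
    cosine similarity against all document sub-vectors.

    Q: (M, dim) query sub-vectors; D: (N, dim) document sub-vectors.
    """
    # Normalize rows so that dot products equal cosine similarities.
    Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    Dn = D / np.linalg.norm(D, axis=1, keepdims=True)
    sims = Qn @ Dn.T                       # (M, N): all pairwise similarities
    return float(sims.max(axis=1).sum())   # max over docs, summed over queries
```

Note that `sims` materializes all $M \times N$ pairwise similarities, which is exactly the cost the ANN approach below avoids.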

2. How LanceDB MaxSim Works

LanceDB optimizes MaxSim retrieval using multi-vector storage and approximate nearest neighbor (ANN) search. Documents are stored in a structured format where each row contains a multi-vector column (e.g., a list of 256-dimensional sub-vectors). The system builds an HNSW (Hierarchical Navigable Small World) index, which precomputes connections between vectors to enable efficient traversal.

Data Schema:

pa.field("vector", pa.list_(pa.list_(pa.float32(), 256)))

During retrieval, LanceDB avoids exhaustive comparisons. For each query sub-vector $q_i$, the HNSW index retrieves the top-K nearest document sub-vectors. Cosine similarities are computed only for these candidates, and the maximum similarity per query sub-vector is aggregated. This reduces the computational complexity to roughly $O(M \log N)$, as the search space is pruned by the ANN index. Despite the approximation, the final MaxSim score often matches the brute-force result because the method prioritizes retaining the highest-similarity pairs.

Note: LanceDB currently implements its ANN search with Product Quantization (PQ); the detailed math for our case is given in "ANN using PQ on multivector".

3. Comparison of ColPali and LanceDB

ColPali’s brute-force approach stores multi-vectors in memory and computes all pairwise similarities, resulting in $O(M \cdot N)$ complexity. This makes it impractical for large-scale applications. In contrast, LanceDB stores multi-vectors in a dedicated table and uses ANN to limit comparisons to the most promising candidates. Its $O(M \log N)$ complexity ensures scalability.

Both methods produce identical or nearly identical MaxSim scores because LanceDB’s ANN prioritizes finding the closest matches. Errors in lower-ranked candidates do not affect the final score, as only the maximum similarity per query sub-vector matters. The trade-off is minimal: LanceDB sacrifices exactness in intermediate steps but preserves accuracy in the final aggregation.

| Feature | ColPali (Brute-Force) | LanceDB (ANN-based) |
| --- | --- | --- |
| Storage | Stores multi-vectors in memory | Stores multi-vectors in a LanceDB table |
| Computation | Computes all pairwise similarities | Uses ANN to retrieve only the closest vectors |
| Efficiency | $O(M \cdot N)$ complexity (slow) | $O(M \log N)$ complexity (fast) |
| MaxSim aggregation | Explicitly sums max-sim for each query sub-vector | Uses ANN + max-sim aggregation |
| Final score | Sum of max cosine similarities | Same sum of max cosine similarities |
| Speed | Slow for large datasets | Much faster for large datasets |
| Accuracy | Exact | Nearly exact (often identical to brute-force) |

4. Efficiency and Accuracy of LanceDB’s Indexing

LanceDB’s speed advantage stems from the HNSW algorithm. HNSW constructs a hierarchical graph where top layers enable coarse, rapid navigation, and lower layers refine the search. During queries, the algorithm starts at the top layer, greedily traversing toward the nearest neighbors, bypassing irrelevant vectors.

This structure ensures that only $K \ll N$ similarities are computed per query sub-vector. Accuracy is maintained because MaxSim depends critically on identifying the single best match for each $q_i$. Empirical results show that HNSW reliably retrieves these top candidates, especially when the index is well-tuned. The hierarchical design minimizes the risk of missing high-similarity pairs, ensuring the final score remains accurate.
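A back-of-the-envelope count makes the saving concrete. The sizes below are illustrative assumptions, not benchmarks:

```python
M = 32         # query sub-vectors (e.g., one per query token)
N = 1_000_000  # document sub-vectors across the corpus
K = 100        # ANN candidates scored per query sub-vector

brute_force_sims = M * N  # every (q_i, d_j) pair
ann_sims = M * K          # only the top-K candidates per q_i

print(brute_force_sims)  # 32000000
print(ann_sims)          # 3200
```

Under these assumptions, the ANN path scores four orders of magnitude fewer pairs, which is where the practical speedup comes from.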

5. Mathematical Formulation of MaxSim

5.1 Brute-Force MaxSim (ColPali)

The brute-force MaxSim score is defined as:

$$\text{score}(Q, D) = \sum_{i=1}^{M} \max_{1 \le j \le N} \text{cos\_sim}(q_i, d_j),$$

where $\text{cos\_sim}(a, b) = \dfrac{a \cdot b}{\|a\|\,\|b\|}$. This requires $M \times N$ similarity computations.

5.2 MaxSim with ANN (LanceDB)

LanceDB’s ANN-based approach approximates this by limiting comparisons to the top-K candidates per query sub-vector:

$$\text{score}(Q, D) \approx \sum_{i=1}^{M} \max_{d_j \in \text{top-}K(q_i)} \text{cos\_sim}(q_i, d_j).$$

Since the top-K set almost always includes the true maximum similarity (due to HNSW’s design), the aggregated score matches the brute-force result.

Note: how HNSW avoids exhaustive comparison.

Instead of computing all similarities:

1. The ANN index retrieves the top-K nearest neighbors per query sub-vector:

$$\{\, d_j \mid d_j \in \text{top-}K(q_i) \,\}$$

2. Cosine similarity is computed only for those K candidates:

$$\max_{d_j \in \text{top-}K(q_i)} \text{cos\_sim}(q_i, d_j)$$

3. The final MaxSim score is:

$$\sum_{i=1}^{M} \max_{d_j \in \text{top-}K(q_i)} \text{cos\_sim}(q_i, d_j)$$

4. Since ANN finds the true best match with high probability, the result matches the brute-force score.
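The steps above can be sketched in NumPy. Here an exact top-k search stands in for the HNSW/PQ index, since the aggregation logic is the same; the function name and the `k` default are illustrative:

```python
import numpy as np

def maxsim_topk(Q: np.ndarray, D: np.ndarray, k: int = 2) -> float:
    """Approximate MaxSim: keep only the top-k candidates per query
    sub-vector, then take the max similarity in each set and sum.

    A real system would obtain the k candidates from the ANN index
    without scoring all N sub-vectors; exact search stands in here.
    """
    Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    Dn = D / np.linalg.norm(D, axis=1, keepdims=True)
    sims = Qn @ Dn.T                      # (M, N) cosine similarities
    topk = np.sort(sims, axis=1)[:, -k:]  # k largest per query sub-vector
    return float(topk[:, -1].sum())       # max of each top-k set, summed
```

As long as every top-k candidate set contains the true best match, this returns exactly the brute-force score.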

6. Numerical Example

6.1 Setup

Consider a query $Q = \{q_1 = (0.8, 0.6, 0),\ q_2 = (0, 0.6, 0.8)\}$ and a document with sub-vectors $D = \{d_1 = (0.7, 0.7, 0),\ d_2 = (0.6, 0, 0.8),\ d_3 = (0, 0.8, 0.6)\}$.

6.2 Brute-Force Calculation

The pairwise cosine similarities are:

| $q_i$ vs $d_j$ | $d_1$ | $d_2$ | $d_3$ | Max |
| --- | --- | --- | --- | --- |
| $q_1$ | 0.99 | 0.48 | 0.48 | 0.99 |
| $q_2$ | 0.42 | 0.64 | 0.96 | 0.96 |

Final MaxSim Score:

$$0.99 + 0.96 = 1.95$$

6.3 LanceDB ANN Calculation

LanceDB’s ANN search instead retrieves, for each query sub-vector, only its top-K nearest document sub-vectors (with $K = 1$, that is $d_1$ for $q_1$ and $d_3$ for $q_2$) and computes similarities for just those candidates. Both methods therefore produce the identical score of 1.95, but LanceDB avoids computing all 6 pairwise similarities.
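The whole example can be checked with a few lines of NumPy, reproducing the similarity table and the 1.95 total:

```python
import numpy as np

# Sub-vectors from the example above.
Q = np.array([[0.8, 0.6, 0.0],
              [0.0, 0.6, 0.8]])
D = np.array([[0.7, 0.7, 0.0],
              [0.6, 0.0, 0.8],
              [0.0, 0.8, 0.6]])

# Normalize rows so that dot products equal cosine similarities.
Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
Dn = D / np.linalg.norm(D, axis=1, keepdims=True)

sims = Qn @ Dn.T                # 2x3 similarity table
score = sims.max(axis=1).sum()  # MaxSim aggregation
print(round(float(score), 2))   # 1.95
```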

Key Takeaways

ColPali’s brute-force MaxSim is accurate but computationally prohibitive for large datasets. LanceDB addresses this with ANN-based retrieval, reducing complexity from $O(M \cdot N)$ to $O(M \log N)$ while preserving accuracy. By focusing on the highest similarities and leveraging HNSW’s hierarchical graph, LanceDB achieves scalability without compromising the final score. This makes it suitable for real-world applications where both speed and precision are critical.