ANN using PQ on multivectors

Using an Approximate Nearest Neighbor (ANN) index, such as one built with product quantization (PQ), introduces an approximation compared to the exact (or native) maxsim computation. In other words, there is a trade-off between accuracy and speed—some precision is sacrificed for significantly faster search speeds. Let's break this down mathematically.

Native Maxsim Computation

For a query multivector q and a stored item’s multivector x, the native similarity is defined as:

$$\text{maxsim}(q, x) = \sum_{i} \max_{j} \, \text{sim}(q_i, x_j)$$

where $q_i$ ranges over the query's vectors, $x_j$ over the stored item's vectors, and the cosine similarity between two vectors $q_i$ and $x_j$ is computed exactly as:

$$\text{sim}(q_i, x_j) = \frac{q_i \cdot x_j}{\|q_i\| \, \|x_j\|}$$

This computation is exact, meaning there is no approximation at this stage.
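
To make the definition concrete, here is a minimal NumPy sketch of the exact computation. The function name `exact_maxsim` and the toy shapes are purely illustrative; they are not tied to any particular library.

```python
import numpy as np

def exact_maxsim(q: np.ndarray, x: np.ndarray) -> float:
    """Exact MaxSim: for every query vector q_i, take the best cosine
    similarity over all stored vectors x_j, then sum over i."""
    q_norm = q / np.linalg.norm(q, axis=1, keepdims=True)   # unit-normalize rows
    x_norm = x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = q_norm @ x_norm.T            # sims[i, j] = sim(q_i, x_j)
    return float(sims.max(axis=1).sum())

# Toy multivectors: 16 query vectors and 100 stored vectors, 64 dimensions each.
rng = np.random.default_rng(0)
query = rng.normal(size=(16, 64))
doc = rng.normal(size=(100, 64))
print(exact_maxsim(query, doc))
```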

Approximate Similarity via PQ-Based Indexing

In many ANN methods, including those based on product quantization, each vector (both the query vectors $q_i$ and the stored vectors $x_j$; we drop the multivector subscript here for readability) is split into $M$ sub-vectors. A vector $x$ can then be written as:

$$x = [\, x^{(1)}, x^{(2)}, \ldots, x^{(M)} \,]$$

Each sub-vector $x^{(m)}$ is then approximated by a centroid from a learned codebook:

$$x^{(m)} \approx c^{(m)}$$

where $c^{(m)}$ is the codebook centroid approximating the $m$-th sub-vector of $x$. As a result, each sub-vector similarity between a query vector $q$ and a stored vector $x$ is approximated by:

$$\text{sim}(q^{(m)}, x^{(m)}) \approx \text{sim}(q^{(m)}, c^{(m)})$$
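
To make the sub-vector notation concrete, the sketch below splits one vector into $M$ sub-vectors and snaps each one to its nearest centroid. The codebooks here are random stand-ins (a real PQ index learns them, typically with k-means), and all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

D, M, K = 64, 8, 256        # vector dimension, sub-vectors per vector, centroids per codebook
d_sub = D // M              # each sub-vector has D / M dimensions

# One codebook per sub-space. Random here; a real PQ index learns these with k-means.
codebooks = rng.normal(size=(M, K, d_sub))

x = rng.normal(size=D)
x_sub = x.reshape(M, d_sub)                     # x = [x^(1), ..., x^(M)]

# Encode: the PQ code of x is the index of the nearest centroid in each sub-space.
codes = [int(np.argmin(np.linalg.norm(codebooks[m] - x_sub[m], axis=1))) for m in range(M)]

# Reconstruct: replace every sub-vector x^(m) by its chosen centroid c^(m).
x_tilde = np.concatenate([codebooks[m][codes[m]] for m in range(M)])

print("codes:", codes)
print("reconstruction error:", np.linalg.norm(x - x_tilde))
```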

When computing the aggregated multivector similarity using indexing, we get:

$$\text{maxsim}(q, x) \approx \sum_{i} \max_{j} \, \text{sim}(q_i, \tilde{x}_j)$$

where $\tilde{x}_j$ denotes the PQ reconstruction of $x_j$ from its centroids.
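
Putting the pieces together, the following sketch learns a tiny set of per-sub-space codebooks with k-means, reconstructs every stored vector from its centroids, and compares the aggregated score with the exact one. It is a simplified illustration of the idea, not how LanceDB's index is actually implemented; helpers such as `train_pq` and `pq_reconstruct` are invented for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def maxsim(q, x):
    """Sum over query vectors of the best cosine similarity among stored vectors."""
    qn = q / np.linalg.norm(q, axis=1, keepdims=True)
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    return float((qn @ xn.T).max(axis=1).sum())

def train_pq(vectors, M, K):
    """Fit one k-means codebook per sub-space (a miniature, illustrative PQ)."""
    d_sub = vectors.shape[1] // M
    sub = vectors.reshape(len(vectors), M, d_sub)
    return [KMeans(n_clusters=K, n_init=4, random_state=0).fit(sub[:, m]) for m in range(M)]

def pq_reconstruct(vectors, codebooks):
    """Snap each sub-vector to its nearest centroid and stitch the pieces back together."""
    M = len(codebooks)
    d_sub = vectors.shape[1] // M
    sub = vectors.reshape(len(vectors), M, d_sub)
    parts = [cb.cluster_centers_[cb.predict(sub[:, m])] for m, cb in enumerate(codebooks)]
    return np.concatenate(parts, axis=1)

# One query multivector and one stored multivector (toy sizes).
query = rng.normal(size=(16, 64))
doc = rng.normal(size=(200, 64))

codebooks = train_pq(doc, M=8, K=16)
doc_tilde = pq_reconstruct(doc, codebooks)      # reconstructed stored vectors

exact = maxsim(query, doc)
approx = maxsim(query, doc_tilde)
print(f"exact maxsim:  {exact:.4f}")
print(f"approx maxsim: {approx:.4f}   (difference {abs(exact - approx):.4f})")
```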

Since each sub-vector is approximated, an error is introduced:

$$e^{(m)} = \text{sim}(q^{(m)}, x^{(m)}) - \text{sim}(q^{(m)}, c^{(m)})$$

where $e^{(m)}$ represents the quantization error for each $(q^{(m)}, x^{(m)})$ pair.

In the worst case, if each sub-vector approximation introduces a small error $e^{(m)}$, then over $M$ sub-vectors the total error may accumulate to an order of approximately:

$$\sum_{m=1}^{M} e^{(m)}$$

or, more realistically, to a smaller value, since errors of mixed sign partially (but not fully) cancel out. Thus, the total approximation error for one cosine similarity computation is on the order of:

$$E = O(M \cdot \bar{e})$$

where $\bar{e}$ is a typical per-sub-vector error magnitude, so $E$ depends on the quantization precision and the number of sub-vectors used.
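
To get a rough sense of scale (with purely illustrative numbers), suppose $M = 16$ and each sub-vector contributes an error of about $e^{(m)} \approx 0.002$. In the worst case, where all errors share the same sign, they accumulate to roughly $16 \times 0.002 = 0.032$; if the errors have mixed signs and partially cancel, the total behaves more like $\sqrt{16} \times 0.002 = 0.008$. Either way, $E$ grows with $M$ and shrinks as the codebooks become finer (more centroids per sub-space).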

Aggregating over the multivector, the approximate maxsim becomes:

$$\text{maxsim}(q, x) \approx \sum_{i} \max_{j} \left( \text{sim}(q_i, x_j) + e_{ij} \right)$$

where $e_{ij}$ is the net error of the approximated similarity for the pair $(q_i, x_j)$, i.e. the quantity that is carried into the maximum over the approximated cosine similarities.

Trade-off: Accuracy vs. Speed

There is a fundamental trade-off between computational accuracy and efficiency. Mathematically, while

$$\text{sim}(q^{(m)}, x^{(m)}) \approx \text{sim}(q^{(m)}, c^{(m)}),$$

the error introduced is typically small relative to the differences in similarity scores between truly similar and dissimilar items. In practical retrieval scenarios, the most important factor is maintaining the correct ranking of items rather than the exact similarity values. Even if the absolute similarity values are slightly shifted or compressed (resulting in a smaller dynamic range), as long as the relative order is preserved, the ANN approach remains effective.
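
This ranking argument is easy to check on toy data. The sketch below scores several stored items both exactly and from coarsely rounded vectors (a crude, illustrative stand-in for PQ reconstruction) and compares the two orderings; all names and numbers are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

def maxsim(q, x):
    qn = q / np.linalg.norm(q, axis=1, keepdims=True)
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    return float((qn @ xn.T).max(axis=1).sum())

def coarse(x, step=0.25):
    """Crude stand-in for quantization: snap every coordinate to a grid."""
    return np.round(x / step) * step

query = rng.normal(size=(16, 64))

# Ten stored items whose true relevance to the query decreases with k:
# each mixes copies of query vectors with noise, so exact scores are well separated.
items = []
for k in range(10):
    signal = query[rng.integers(0, len(query), size=100)]
    noise = rng.normal(size=(100, 64))
    w = 1.0 - k / 10.0
    items.append(w * signal + (1.0 - w) * noise)

exact_scores = np.array([maxsim(query, it) for it in items])
approx_scores = np.array([maxsim(query, coarse(it)) for it in items])

exact_order = np.argsort(-exact_scores)
approx_order = np.argsort(-approx_scores)
print("exact ranking :", exact_order)
print("approx ranking:", approx_order)
print("same order    :", bool(np.array_equal(exact_order, approx_order)))
```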

Conclusion

Using an ANN index like PQ in LanceDB does introduce an approximation: each sub-vector is replaced by a codebook centroid, so the individual similarities, and with them the aggregated maxsim score, can drift slightly from their exact values. In return you get much faster search, and because retrieval quality depends mainly on the ranking of items rather than on exact scores, that ranking is usually preserved.

My note

This is the mathematical picture, but not necessarily the real world. In the real world, we might ask the following question:

Is it possible for the error to be nearly zero in this case? The key operation here is MaxSim, which finds the maximum dot product for each query vector and sums those maxima to compute the final score, so the situation differs from a typical approximate search. If the ANN method consistently identifies the true maximum for each query vector, the approximation would still yield an accurate final score with minimal loss. This should, in theory, result in much higher accuracy than the usual error associated with approximate cosine similarity search.

The short answer is:

Yes, if your ANN indexing method (e.g., using PQ) almost always returns the exact maximum candidate for each query vector, then the aggregated maxsim score will be very close to the native (exact) score. In this scenario, the error introduced by the approximation is nearly zero—especially when compared to methods that do not use the max operation or when the ANN method has lower recall.
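
One way to see why is to check, for each query vector, whether the approximated similarities still point at the same best stored vector, and then rescore those winners with exact similarities. The sketch below does this with a coarse-rounding stand-in for PQ; it illustrates the argument and is not LanceDB's actual retrieval path.

```python
import numpy as np

rng = np.random.default_rng(2)

def cosine_matrix(q, x):
    qn = q / np.linalg.norm(q, axis=1, keepdims=True)
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    return qn @ xn.T                               # sims[i, j] = sim(q_i, x_j)

query = rng.normal(size=(16, 64))
doc = rng.normal(size=(200, 64))
doc_tilde = np.round(doc / 0.25) * 0.25            # coarse stand-in for a PQ reconstruction

exact_sims = cosine_matrix(query, doc)
approx_sims = cosine_matrix(query, doc_tilde)

exact_best = exact_sims.argmax(axis=1)             # true best stored vector per query vector
approx_best = approx_sims.argmax(axis=1)           # best candidate found via the approximation

# How often does the approximate search still point at the true maximum?
agreement = float(np.mean(exact_best == approx_best))

native_score = float(exact_sims.max(axis=1).sum())
# Rescore the candidates chosen by the approximate search with exact similarities.
# Whenever the argmax was found correctly, the rescored total matches the native score;
# each miss can only make the rescored total slightly smaller.
rescored = float(exact_sims[np.arange(len(query)), approx_best].sum())

print(f"argmax agreement rate: {agreement:.2f}")
print(f"native maxsim:         {native_score:.4f}")
print(f"rescored maxsim:       {rescored:.4f}")
```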

See the details if you want to know why: "Is it possible for the ANN indexing and MaxSim's error to be nearly zero in this case?"