Inverted File Index (IVF)

Let’s explore IVF in depth, using our toy dataset to illustrate how it reduces search space while maintaining accuracy.

1. The Problem: Why IVF?

Imagine you’re searching for a book in a library with 1 million books. Brute-force would mean scanning every shelf. Instead, you:

  1. Categorize books into sections (e.g., "Sci-Fi," "History").
  2. Search only relevant sections for your query.

IVF does this for vectors. Let’s formalize it.

2. IVF Workflow

Step 1: Clustering with K-means

Objective: Partition the dataset into K clusters (sections) to avoid scanning all vectors.

  1. K-means Formulation:
    Given $n$ vectors $X = \{x_1, \dots, x_n\} \subset \mathbb{R}^d$, find $K$ centroids $\{\mu_1, \dots, \mu_K\}$ that minimize:

    $$\min_{\{\mu_k\}} \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2$$

    where $C_k$ is the $k$-th cluster.

  2. Example:
    For our toy dataset (K=3):

    • Cluster 1 (Red): [1.1,0.9], [1.0,1.2] → Centroid μ1=[1.05,1.05].
    • Cluster 2 (Green): [5.2,5.0], [5.1,4.8], [5.3,5.1] → Centroid μ2=[5.2,4.97].
    • Cluster 3 (Blue): [9.0,1.0], [9.5,0.8], [8.9,1.2] → Centroid μ3=[9.13,1.0].
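The clustering step can be sketched in plain Python. Below is a minimal Lloyd's-algorithm k-means on the toy dataset; the initial centroids are hand-picked for a deterministic illustration (real libraries use smarter seeding such as k-means++):

```python
# Toy dataset from the example: 8 vectors in R^2.
points = [
    [1.1, 0.9], [1.0, 1.2],              # red cluster
    [5.2, 5.0], [5.1, 4.8], [5.3, 5.1],  # green cluster
    [9.0, 1.0], [9.5, 0.8], [8.9, 1.2],  # blue cluster
]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, centroids, iters=10):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in centroids]
        for p in points:
            k = min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            clusters[k].append(p)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            [sum(c) / len(cluster) for c in zip(*cluster)] if cluster else mu
            for cluster, mu in zip(clusters, centroids)
        ]
    return centroids, clusters

# One point from each visual group as the initial seeds (hand-picked).
centroids, clusters = kmeans(points, [[1.1, 0.9], [5.2, 5.0], [9.0, 1.0]])
print([[round(c, 2) for c in mu] for mu in centroids])
# → [[1.05, 1.05], [5.2, 4.97], [9.13, 1.0]]
```

With these seeds the assignment stabilizes after one iteration and reproduces the three centroids listed above.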

Step 2: Building Inverted Lists

For each centroid μk, maintain a list of the vectors (in practice, their ids) assigned to its cluster:

  • μ1 → {[1.1,0.9], [1.0,1.2]}
  • μ2 → {[5.2,5.0], [5.1,4.8], [5.3,5.1]}
  • μ3 → {[9.0,1.0], [9.5,0.8], [8.9,1.2]}
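Concretely, the inverted lists for the toy dataset map each centroid's index to the ids of its member vectors. A minimal sketch (production indexes store ids, and often compressed codes, rather than raw vectors):

```python
points = [
    [1.1, 0.9], [1.0, 1.2],
    [5.2, 5.0], [5.1, 4.8], [5.3, 5.1],
    [9.0, 1.0], [9.5, 0.8], [8.9, 1.2],
]
centroids = [[1.05, 1.05], [5.2, 4.97], [9.13, 1.0]]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# One posting list per centroid, holding the ids of assigned vectors.
inverted_lists = {k: [] for k in range(len(centroids))}
for i, p in enumerate(points):
    k = min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
    inverted_lists[k].append(i)

print(inverted_lists)  # → {0: [0, 1], 1: [2, 3, 4], 2: [5, 6, 7]}
```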

Step 3: Querying with IVF

For a query q=[5.4,5.2]:

  1. Find Nearest Centroids:
    Compute squared distances to all K centroids:
    $\lVert q-\mu_1\rVert^2 \approx 36.15$, $\lVert q-\mu_2\rVert^2 \approx 0.09$, $\lVert q-\mu_3\rVert^2 \approx 31.58$
  2. Probe Clusters:
    Select the closest nprobe=1 cluster (Cluster 2).
  3. Search Only in Cluster 2:
    Compare q to vectors {[5.2,5.0],[5.1,4.8],[5.3,5.1]}.
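The three query steps can be sketched end to end, reusing the toy centroids and inverted lists:

```python
# IVF query for q = [5.4, 5.2]: compare against the K centroids first,
# then scan only the probed cluster's vectors.
points = [
    [1.1, 0.9], [1.0, 1.2],
    [5.2, 5.0], [5.1, 4.8], [5.3, 5.1],
    [9.0, 1.0], [9.5, 0.8], [8.9, 1.2],
]
centroids = [[1.05, 1.05], [5.2, 4.97], [9.13, 1.0]]
inverted_lists = {0: [0, 1], 1: [2, 3, 4], 2: [5, 6, 7]}

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

q = [5.4, 5.2]

# Step 1: squared distances to all K centroids.
cdists = [sq_dist(q, mu) for mu in centroids]
# Step 2: probe the single closest cluster (nprobe = 1).
nearest = min(range(len(centroids)), key=cdists.__getitem__)
# Step 3: exhaustive search inside that cluster only.
best = min(inverted_lists[nearest], key=lambda i: sq_dist(q, points[i]))
print(nearest, points[best])  # → 1 [5.3, 5.1]
```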

3. Key Parameters & Trade-offs

| Parameter | Effect | Example |
|---|---|---|
| num_partitions (K) | Higher K → smaller clusters, fewer vectors scanned per probe. | K=3 partitions the data into 3 clusters. |
| nprobe | Higher nprobe → more clusters searched, higher recall. | nprobe=1 probes 1 cluster. |
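The nprobe trade-off is easy to see by counting distance computations on the toy dataset. The `ivf_search` helper below is a hypothetical function written for this sketch, not a library API:

```python
points = [
    [1.1, 0.9], [1.0, 1.2],
    [5.2, 5.0], [5.1, 4.8], [5.3, 5.1],
    [9.0, 1.0], [9.5, 0.8], [8.9, 1.2],
]
centroids = [[1.05, 1.05], [5.2, 4.97], [9.13, 1.0]]
inverted_lists = {0: [0, 1], 1: [2, 3, 4], 2: [5, 6, 7]}

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def ivf_search(q, nprobe):
    """Return (nearest stored vector, number of distance computations:
    centroid comparisons plus scanned vectors)."""
    order = sorted(range(len(centroids)), key=lambda k: sq_dist(q, centroids[k]))
    candidates = [i for k in order[:nprobe] for i in inverted_lists[k]]
    best = min(candidates, key=lambda i: sq_dist(q, points[i]))
    return points[best], len(centroids) + len(candidates)

q = [5.4, 5.2]
for nprobe in (1, 2, 3):
    print(nprobe, ivf_search(q, nprobe))
# nprobe=3 scans everything: 3 + 8 = 11 computations, more than the 8
# of brute force -- on tiny data the centroid overhead dominates.
```

Each extra probe raises recall but adds an entire cluster's worth of distance computations, which is exactly the trade-off the table describes.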
Mathematical Trade-off

The probability of missing the true nearest neighbor decreases roughly exponentially with the number of probed clusters; a common heuristic bound is:

$P_{\text{miss}} \le \exp\!\left(-\frac{n_{\text{probe}}}{K}\right)$

4. Example Walkthrough

  1. Clustering:

    • Input: 8 vectors in $\mathbb{R}^2$.
    • Output: 3 clusters with centroids μ1, μ2, μ3.
  2. Query q=[5.4,5.2]:

    • Step 1: Compare q to centroids.
    • Step 2: Narrow search to Cluster 2 (green).
    • Step 3: Compute distances to 3 vectors in Cluster 2 instead of 8.
  3. Result:

    • Brute-force: 8 distance calculations.
    • IVF: 3 centroid distances + 3 vector distances = 6 distance computations (25% fewer than brute force).

5. Why This Matters for Real-World Data

For a dataset with 1B vectors and K=256:

  • Each cluster holds roughly $10^9 / 256 \approx 3.9$M vectors.
  • With nprobe=1, a query computes 256 centroid distances plus ~3.9M vector distances instead of $10^9$, roughly a 256× reduction in work.
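The back-of-the-envelope arithmetic for this scale, with nprobe = 1 assumed for illustration:

```python
# Cost of one IVF query vs. brute force at the scale quoted in the text.
n, K, nprobe = 1_000_000_000, 256, 1

brute_force = n                     # brute force scans every vector
ivf = K + nprobe * n // K           # centroid distances + scanned vectors

print(ivf, brute_force // ivf)      # → 3906506 255
```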

6. Summary

| Aspect | IVF |
|---|---|
| Purpose | Reduces search scope via clustering. |
| Key Parameters | num_partitions (K), nprobe |
| Accuracy Trade-off | Higher nprobe improves recall but slows search. |
| Efficiency | Sub-linear search time: $O(n_{\text{probe}} \cdot n/K)$. |

Next, we’ll explore Product Quantization (PQ)—the second pillar of IVF-PQ!