Inverted File Index (IVF)

Let’s explore IVF in depth, using our toy dataset to illustrate how it reduces search space while maintaining accuracy.

1. The Problem: Why IVF?

Imagine you’re searching for a book in a library with 1 million books. Brute-force would mean scanning every shelf. Instead, you:

  1. Categorize books into sections (e.g., "Sci-Fi," "History").
  2. Search only relevant sections for your query.

IVF does this for vectors. Let’s formalize it.

2. IVF Workflow

Step 1: Clustering with K-means

Objective: Partition the dataset into K clusters (sections) to avoid scanning all vectors.

  1. K-means Formulation:
    Given $n$ vectors $X = \{x_1, \dots, x_n\} \subset \mathbb{R}^d$, find $K$ centroids $\{\mu_1, \dots, \mu_K\}$ that minimize:

    $$\min_{\{\mu_k\}} \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2$$

    where $C_k$ is the $k$-th cluster.

  2. Example:
    For our toy dataset (K=3):

    • Cluster 1 (Red): [1.1,0.9], [1.0,1.2] → Centroid μ1=[1.05,1.05].
    • Cluster 2 (Green): [5.2,5.0], [5.1,4.8], [5.3,5.1] → Centroid μ2=[5.2,4.97].
    • Cluster 3 (Blue): [9.0,1.0], [9.5,0.8], [8.9,1.2] → Centroid μ3=[9.13,1.0].
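The clustering step can be sketched in plain Python. Below is a minimal Lloyd's-algorithm k-means on the toy dataset; the initial centroids are hand-picked for a deterministic illustration (real libraries use smarter seeding such as k-means++):

```python
# Toy dataset from the example: 8 vectors in R^2.
points = [
    [1.1, 0.9], [1.0, 1.2],              # red cluster
    [5.2, 5.0], [5.1, 4.8], [5.3, 5.1],  # green cluster
    [9.0, 1.0], [9.5, 0.8], [8.9, 1.2],  # blue cluster
]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, centroids, iters=10):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in centroids]
        for p in points:
            k = min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            clusters[k].append(p)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            [sum(c) / len(cluster) for c in zip(*cluster)] if cluster else mu
            for cluster, mu in zip(clusters, centroids)
        ]
    return centroids, clusters

# One point from each visual group as the initial seeds (hand-picked).
centroids, clusters = kmeans(points, [[1.1, 0.9], [5.2, 5.0], [9.0, 1.0]])
print([[round(c, 2) for c in mu] for mu in centroids])
# → [[1.05, 1.05], [5.2, 4.97], [9.13, 1.0]]
```

With these seeds the assignment stabilizes after one iteration and reproduces the three centroids listed above.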

Step 2: Building Inverted Lists

For each centroid μk, maintain a list of the vectors (in practice, their ids) assigned to its cluster:

  • μ1 → {[1.1,0.9], [1.0,1.2]}
  • μ2 → {[5.2,5.0], [5.1,4.8], [5.3,5.1]}
  • μ3 → {[9.0,1.0], [9.5,0.8], [8.9,1.2]}
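Concretely, the inverted lists for the toy dataset map each centroid's index to the ids of its member vectors. A minimal sketch (production indexes store ids, and often compressed codes, rather than raw vectors):

```python
points = [
    [1.1, 0.9], [1.0, 1.2],
    [5.2, 5.0], [5.1, 4.8], [5.3, 5.1],
    [9.0, 1.0], [9.5, 0.8], [8.9, 1.2],
]
centroids = [[1.05, 1.05], [5.2, 4.97], [9.13, 1.0]]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# One posting list per centroid, holding the ids of assigned vectors.
inverted_lists = {k: [] for k in range(len(centroids))}
for i, p in enumerate(points):
    k = min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
    inverted_lists[k].append(i)

print(inverted_lists)  # → {0: [0, 1], 1: [2, 3, 4], 2: [5, 6, 7]}
```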

Step 3: Querying with IVF

For a query q=[5.4,5.2]:

  1. Find Nearest Centroids:
    Compute squared distances to all K centroids:
    $\lVert q-\mu_1\rVert^2 \approx 36.15$, $\lVert q-\mu_2\rVert^2 \approx 0.09$, $\lVert q-\mu_3\rVert^2 \approx 31.58$
  2. Probe Clusters:
    Select the closest nprobe=1 cluster (Cluster 2).
  3. Search Only in Cluster 2:
    Compare q to vectors {[5.2,5.0],[5.1,4.8],[5.3,5.1]}.
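The three query steps can be sketched end to end, reusing the toy centroids and inverted lists:

```python
# IVF query for q = [5.4, 5.2]: compare against the K centroids first,
# then scan only the probed cluster's vectors.
points = [
    [1.1, 0.9], [1.0, 1.2],
    [5.2, 5.0], [5.1, 4.8], [5.3, 5.1],
    [9.0, 1.0], [9.5, 0.8], [8.9, 1.2],
]
centroids = [[1.05, 1.05], [5.2, 4.97], [9.13, 1.0]]
inverted_lists = {0: [0, 1], 1: [2, 3, 4], 2: [5, 6, 7]}

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

q = [5.4, 5.2]

# Step 1: squared distances to all K centroids.
cdists = [sq_dist(q, mu) for mu in centroids]
# Step 2: probe the single closest cluster (nprobe = 1).
nearest = min(range(len(centroids)), key=cdists.__getitem__)
# Step 3: exhaustive search inside that cluster only.
best = min(inverted_lists[nearest], key=lambda i: sq_dist(q, points[i]))
print(nearest, points[best])  # → 1 [5.3, 5.1]
```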

3. Key Parameters & Trade-offs

| Parameter | Effect | Example |
|---|---|---|
| num_partitions (K) | Higher K → smaller clusters, fewer vectors scanned per probe. | K=3 partitions the data into 3 clusters. |
| nprobe | Higher nprobe → more clusters searched, higher recall. | nprobe=1 probes 1 cluster. |
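The nprobe trade-off is easy to see by counting distance computations on the toy dataset. The `ivf_search` helper below is a hypothetical function written for this sketch, not a library API:

```python
points = [
    [1.1, 0.9], [1.0, 1.2],
    [5.2, 5.0], [5.1, 4.8], [5.3, 5.1],
    [9.0, 1.0], [9.5, 0.8], [8.9, 1.2],
]
centroids = [[1.05, 1.05], [5.2, 4.97], [9.13, 1.0]]
inverted_lists = {0: [0, 1], 1: [2, 3, 4], 2: [5, 6, 7]}

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def ivf_search(q, nprobe):
    """Return (nearest stored vector, number of distance computations:
    centroid comparisons plus scanned vectors)."""
    order = sorted(range(len(centroids)), key=lambda k: sq_dist(q, centroids[k]))
    candidates = [i for k in order[:nprobe] for i in inverted_lists[k]]
    best = min(candidates, key=lambda i: sq_dist(q, points[i]))
    return points[best], len(centroids) + len(candidates)

q = [5.4, 5.2]
for nprobe in (1, 2, 3):
    print(nprobe, ivf_search(q, nprobe))
# nprobe=3 scans everything: 3 + 8 = 11 computations, more than the 8
# of brute force -- on tiny data the centroid overhead dominates.
```

Each extra probe raises recall but adds an entire cluster's worth of distance computations, which is exactly the trade-off the table describes.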
Mathematical Trade-off

The probability of missing the true nearest neighbor decreases roughly exponentially with the number of probed clusters; a common heuristic bound is:

$P_{\text{miss}} \le \exp\!\left(-\frac{n_{\text{probe}}}{K}\right)$

4. Example Walkthrough

  1. Clustering:

    • Input: 8 vectors in $\mathbb{R}^2$.
    • Output: 3 clusters with centroids μ1, μ2, μ3.
  2. Query q=[5.4,5.2]:

    • Step 1: Compare q to centroids.
    • Step 2: Narrow search to Cluster 2 (green).
    • Step 3: Compute distances to 3 vectors in Cluster 2 instead of 8.
  3. Result:

    • Brute-force: 8 distance calculations.
    • IVF: 3 centroid distances + 3 vector distances = 6 distance computations (25% fewer than brute force).

5. Why This Matters for Real-World Data

For a dataset with 1B vectors and K=256:

  • Each cluster holds roughly $10^9 / 256 \approx 3.9$M vectors.
  • With nprobe=1, a query computes 256 centroid distances plus ~3.9M vector distances instead of $10^9$, roughly a 256× reduction in work.
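The back-of-the-envelope arithmetic for this scale, with nprobe = 1 assumed for illustration:

```python
# Cost of one IVF query vs. brute force at the scale quoted in the text.
n, K, nprobe = 1_000_000_000, 256, 1

brute_force = n                     # brute force scans every vector
ivf = K + nprobe * n // K           # centroid distances + scanned vectors

print(ivf, brute_force // ivf)      # → 3906506 255
```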

6. Summary

| Aspect | IVF |
|---|---|
| Purpose | Reduces search scope via clustering. |
| Key Parameters | num_partitions (K), nprobe |
| Accuracy Trade-off | Higher nprobe improves recall but slows search. |
| Efficiency | Sub-linear search time: $O(n_{\text{probe}} \cdot n/K)$. |

Next, we’ll explore Product Quantization (PQ)—the second pillar of IVF-PQ!