IVF-PQ Outline

Comprehensive Example Walkthrough: IVF-PQ Process

Let’s use a toy dataset to illustrate IVF-PQ end-to-end, while weaving in the story of why indexing is critical.

1. The Problem: Why Indexing?

Imagine you’re building a facial recognition system for a social media app with 1 billion users. Each face is represented by a 1536-dimensional embedding. To find similar faces for a query, a brute-force scan would compute distances to all 1B vectors:

Complexity = O(n·d) = 10⁹ × 1536 = 1.536 × 10¹² distance operations per query.

Even on fast hardware, this is far too slow for interactive search.

Solution: Approximate Nearest Neighbor (ANN) indexing like IVF-PQ trades a small accuracy loss for 1,000x speedup. Let’s see how it works with a toy example.
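To make the baseline concrete, here is a minimal NumPy sketch of the exact brute-force scan (illustrative names and toy data; at n = 10⁹ and d = 1536 this same computation is the 1.5 × 10¹²-operation scan above):

```python
import numpy as np

def brute_force_knn(X, q, k=1):
    """Exact k-NN: squared L2 distance from q to every row of X — O(n*d) work, no index."""
    d2 = np.sum((X - q) ** 2, axis=1)
    order = np.argsort(d2)[:k]
    return order, d2[order]

# Toy-scale demo with random data.
X = np.random.rand(1000, 8).astype(np.float32)
q = np.random.rand(8).astype(np.float32)
idx, dist = brute_force_knn(X, q, k=3)
```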

2. Dataset & Query

Dataset: n=8 two-dimensional vectors (listed below grouped by the clusters they will fall into):
X = {[1.1,0.9], [1.0,1.2], [5.2,5.0], [5.1,4.8], [5.3,5.1], [9.0,1.0], [9.5,0.8], [8.9,1.2]}
Query: q = [5.4,5.2]. Goal: find the nearest neighbor of q in X.

3. Step 1: IVF Clustering (Reduce Search Space)

Why Indexing?
Brute-force would compare q to all 8 vectors. But IVF-PQ narrows the search to a subset.

  1. K-means Clustering (K=3):
    • Partition X into clusters to avoid scanning all vectors.
    • Final Clusters:
      • Cluster 1 (Red): [1.1,0.9], [1.0,1.2] → Centroid μ1=[1.05,1.05].
      • Cluster 2 (Green): [5.2,5.0], [5.1,4.8], [5.3,5.1] → Centroid μ2=[5.2,4.97].
      • Cluster 3 (Blue): [9.0,1.0], [9.5,0.8], [8.9,1.2] → Centroid μ3=[9.13,1.0].
  2. Key Goal Achieved:
    Instead of searching all 8 vectors, IVF restricts the search to Cluster 2 (green), reducing the scope by 62.5%.
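The clustering step above can be sketched with plain Lloyd's-algorithm k-means in NumPy (the seed centroids are illustrative; on this well-separated toy data it converges immediately):

```python
import numpy as np

# The 8 toy vectors from the clusters above.
X = np.array([[1.1, 0.9], [1.0, 1.2],
              [5.2, 5.0], [5.1, 4.8], [5.3, 5.1],
              [9.0, 1.0], [9.5, 0.8], [8.9, 1.2]])

def kmeans(X, centroids, n_iter=10):
    """Lloyd's algorithm: assign each point to its nearest centroid, then re-average."""
    for _ in range(n_iter):
        # Squared distance from every point to every centroid: shape (n, K).
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        centroids = np.array([X[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return centroids, labels

init = np.array([[1.0, 1.0], [5.0, 5.0], [9.0, 1.0]])  # illustrative seeds
centroids, labels = kmeans(X, init)
# centroids ≈ [[1.05, 1.05], [5.2, 4.97], [9.13, 1.0]]
```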

4. Step 2: PQ Compression (Shrink Data)

Why Indexing?
Storing raw vectors is memory-heavy. PQ compresses them into compact codes.

  1. Subspace Decomposition:
    Split 2D vectors into m=2 subspaces (1D each):

    x = [x₁, x₂], where x₁ falls in subspace 1 and x₂ in subspace 2.
  2. Codebook Training (h=3 centroids per subspace):

    • Subspace 1: Centroids C1={1.05,5.2,9.13}.
    • Subspace 2: Centroids C2={1.05,4.97,1.0}.
  3. PQ Encoding:

    • Example: Vector [5.2,5.0] becomes code (2,2), reconstructed as [5.2,4.97].
    • Key Goal Achieved: Each vector is stored as 2 small integers (1 byte each) instead of 2 float32 values (4 bytes each) — a 75% memory reduction.
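The encode/decode step can be sketched with the codebooks above (0-based indices here, so the walkthrough's 1-based code (2,2) appears as (1,1)):

```python
import numpy as np

# Per-subspace codebooks from the text: index 0, 1, 2 map to the three centroids.
C1 = np.array([1.05, 5.2, 9.13])   # subspace 1 (first coordinate)
C2 = np.array([1.05, 4.97, 1.0])   # subspace 2 (second coordinate)

def pq_encode(x):
    """Map each coordinate to the index of its nearest subspace centroid."""
    return (int(np.abs(C1 - x[0]).argmin()), int(np.abs(C2 - x[1]).argmin()))

def pq_decode(code):
    """Reconstruct an approximate vector from its PQ code."""
    return np.array([C1[code[0]], C2[code[1]]])

code = pq_encode(np.array([5.2, 5.0]))   # → (1, 1), i.e. (2, 2) 1-based
approx = pq_decode(code)                  # → [5.2, 4.97]
```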

5. Step 3: IVF-PQ Query (Balance Speed & Accuracy)

Why Indexing?
PQ enables fast distance approximations, while IVF ensures we only search relevant data.

  1. IVF Phase:
    • Compute squared distances from q to each centroid:
      ‖q−μ1‖² = 36.14,  ‖q−μ2‖² = 0.09,  ‖q−μ3‖² = 31.55
    • μ2 is closest, so only Cluster 2 is searched.
  2. PQ Phase:
    • Lookup Tables (LUTs): Precompute per-subspace squared distances for q=[5.4,5.2]:

      • Subspace 1: LUT1 = [18.92, 0.04, 13.91]
      • Subspace 2: LUT2 = [17.22, 0.05, 17.64]
    • Search in Cluster 2:

      Vector (PQ Code)    Approx. distance d̃²
      (2,2)               0.04 + 0.05 = 0.09
  3. Result:
    • ANN Result: All Cluster 2 vectors share code (2,2), so each has d̃² = 0.09.
    • True Distance: ‖q − [5.2,5.0]‖² = 0.08 (error = 0.01).
    • Key Goal Achieved: ~2.7x speedup over brute-force (3 of 8 vectors scanned) with minimal accuracy loss.
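Putting both phases together for the toy query (a sketch with 0-based codes; values match the walkthrough up to rounding):

```python
import numpy as np

q = np.array([5.4, 5.2])
centroids = np.array([[1.05, 1.05], [5.2, 4.97], [9.13, 1.0]])
C1 = np.array([1.05, 5.2, 9.13])   # subspace-1 codebook
C2 = np.array([1.05, 4.97, 1.0])   # subspace-2 codebook

# IVF phase: pick the nearest coarse centroid (which cluster to probe).
cluster = ((centroids - q) ** 2).sum(axis=1).argmin()   # → 1, i.e. Cluster 2

# PQ phase: one lookup table per subspace, built once per query.
LUT1 = (C1 - q[0]) ** 2    # ≈ [18.92, 0.04, 13.91]
LUT2 = (C2 - q[1]) ** 2    # ≈ [17.22, 0.05, 17.64]

# Approximate distance to any stored vector is just two table lookups.
codes_in_cluster = [(1, 1), (1, 1), (1, 1)]   # all Cluster 2 vectors share this code
approx_d2 = [LUT1[i] + LUT2[j] for i, j in codes_in_cluster]   # each ≈ 0.09
```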

6. The Bigger Picture

Why IVF-PQ Works:
IVF reduces how many vectors are scanned (shrinking n), while PQ reduces the memory and compute cost of each comparison (shrinking the per-vector work). The two savings compose.
Trade-offs:

Factor     IVF-PQ                                      Brute-Force
Speed      Sub-linear (scan only probed clusters)      Linear, O(nd)
Memory     Compact PQ codes (m·log₂h bits per vector)  Raw vectors (O(nd))
Accuracy   ~85-95% recall                              100% recall
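A quick back-of-envelope for the billion-scale setting from the introduction, assuming float32 vectors and 16 one-byte PQ codes per vector (the m=16 configuration used in the index snippet in the next section):

```python
n, d = 1_000_000_000, 1536          # vectors and dimensionality
raw_bytes = n * d * 4               # float32 = 4 bytes per dimension
pq_bytes = n * 16 * 1               # m=16 sub-vectors, one 8-bit code each

print(f"raw: {raw_bytes / 1e12:.2f} TB")   # raw: 6.14 TB
print(f"pq:  {pq_bytes / 1e9:.0f} GB")     # pq:  16 GB
print(f"ratio: {raw_bytes // pq_bytes}x")  # ratio: 384x
```

(Real indexes also store centroids and cluster assignments, but those are small next to the codes.)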

7. Back to Your Code

def create_indexed_table(...):
    # ...
    target_table.create_index(
        metric="cosine",              # Optimized for text embeddings
        num_partitions=256,           # Balance cluster granularity
        num_sub_vectors=16,           # Split 1536D into 16 subspaces (96D each)
        accelerator="cuda"            # Speed up training
    )

Final Takeaway

IVF-PQ tackles the curse of dimensionality by:

  1. Reducing search space (IVF clusters).
  2. Compressing data (PQ codes).
  3. Balancing speed, memory, and accuracy—making billion-scale ANN feasible!

Let's dive into details:

  1. Inverted File Index (IVF)
  2. Product Quantization (PQ)