IVF-PQ Outline
Comprehensive Example Walkthrough: IVF-PQ Process
Let’s use a toy dataset to illustrate IVF-PQ end-to-end, while weaving in the story of why indexing is critical.
1. The Problem: Why Indexing?
Imagine you’re building a facial recognition system for a social media app with 1 billion users. Each face is represented by a 1536-dimensional embedding. To find similar faces for a query, a brute-force scan would compute distances to all 1B vectors: roughly $10^9 \times 1536 \approx 1.5 \times 10^{12}$ multiply-adds per query.
This is impossible in real time (imagine waiting hours for a search result!).
Solution: Approximate Nearest Neighbor (ANN) indexing like IVF-PQ trades a small accuracy loss for 1,000x speedup. Let’s see how it works with a toy example.
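To make that cost concrete, here is a minimal NumPy sketch of the brute-force scan IVF-PQ avoids (the dataset is a random stand-in, scaled down from 1B vectors so it actually runs):

```python
import numpy as np

rng = np.random.default_rng(0)
db = rng.standard_normal((10_000, 1536)).astype(np.float32)  # stand-in for 1B face embeddings
query = rng.standard_normal(1536).astype(np.float32)

# One query costs N * d multiply-adds: at 1e9 vectors x 1536 dims,
# that is ~1.5e12 operations -- far too slow for real-time search.
dists = np.linalg.norm(db - query, axis=1)  # full scan over every vector
top10 = np.argsort(dists)[:10]              # exact nearest neighbors
```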
2. Dataset & Query
- Toy Dataset: 8 vectors in $\mathbb{R}^2$ (simplified for illustration).
- Query: a vector $q \in \mathbb{R}^2$ (find its nearest neighbors).
3. Step 1: IVF Clustering (Reduce Search Space)
Why Indexing?
Brute-force would compare $q$ against all 8 vectors; clustering lets us skip most of them.
- K-means Clustering ($k = 3$): partition the 8 vectors into 3 clusters so a query only scans the most promising one(s) instead of all vectors.
- Final Clusters:
  - Cluster 1 (Red): 2 vectors → centroid $c_1$.
  - Cluster 2 (Green): 3 vectors → centroid $c_2$.
  - Cluster 3 (Blue): 3 vectors → centroid $c_3$.
- Key Goal Achieved:
Instead of searching all 8 vectors, IVF restricts the search to Cluster 2 (green)’s 3 vectors, reducing the scope by 62.5%. A minimal sketch follows this list.
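Here is a minimal sketch of this IVF step, assuming a made-up 8-vector 2D dataset (the original toy values are not shown above, so these numbers are purely illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: 8 vectors in R^2, values invented for illustration.
vectors = np.array([[1.0, 1.2], [0.9, 0.8],               # one tight group
                    [5.0, 5.1], [5.2, 4.9], [4.8, 5.0],   # another
                    [9.0, 1.0], [9.1, 0.8], [8.9, 1.2]],  # a third
                   dtype=np.float32)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(vectors)

# Inverted lists: cluster id -> indices of member vectors
inverted_lists = {c: np.where(kmeans.labels_ == c)[0] for c in range(3)}

query = np.array([5.1, 5.0], dtype=np.float32)
nearest = int(np.argmin(np.linalg.norm(kmeans.cluster_centers_ - query, axis=1)))
candidates = inverted_lists[nearest]  # only ~3 of 8 vectors get scanned
```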
4. Step 2: PQ Compression (Shrink Data)
Why Indexing?
Storing raw vectors is memory-heavy. PQ compresses them into compact codes.
- Subspace Decomposition:
Split each 2D vector into $m = 2$ subspaces (1D each).
- Codebook Training (a small codebook of centroids per subspace):
  - Subspace 1: centroids learned by k-means over the first coordinates.
  - Subspace 2: centroids learned by k-means over the second coordinates.
- PQ Encoding:
  - Example: each vector becomes a 2-ID code (the nearest centroid in each subspace) and is reconstructed by concatenating those two centroids.
  - Key Goal Achieved: each vector is stored as 2 small integers instead of 2 floats (75% memory reduction with 1-byte codes vs. 4-byte floats). A PQ sketch follows this list.
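A minimal PQ sketch on the same hypothetical toy data, assuming 4 centroids per subspace (the codebook size above was not specified):

```python
import numpy as np
from sklearn.cluster import KMeans

vectors = np.array([[1.0, 1.2], [0.9, 0.8], [5.0, 5.1], [5.2, 4.9],
                    [4.8, 5.0], [9.0, 1.0], [9.1, 0.8], [8.9, 1.2]],
                   dtype=np.float32)
m, k_sub = 2, 4  # subspaces, centroids per subspace (assumed values)

codebooks, codes = [], np.empty((len(vectors), m), dtype=np.uint8)
for s in range(m):
    sub = vectors[:, s].reshape(-1, 1)  # 1-D subvectors for subspace s
    km = KMeans(n_clusters=k_sub, n_init=10, random_state=0).fit(sub)
    codebooks.append(km.cluster_centers_.ravel())
    codes[:, s] = km.labels_            # 1-byte centroid IDs

# Reconstruction: look up each code in its subspace codebook.
recon = np.stack([codebooks[s][codes[:, s]] for s in range(m)], axis=1)
# Storage: 2 uint8 codes (2 bytes) vs. 2 float32 values (8 bytes) -> 75% smaller.
```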
5. Step 3: IVF-PQ Query (Balance Speed & Accuracy)
Why Indexing?
PQ enables fast distance approximations, while IVF ensures we only search relevant data.
- IVF Phase:
  - Compute distances from $q$ to the 3 centroids; $c_2$ is the closest.
  - Probe Cluster 2 (green) only.
- PQ Phase:
  - Lookup Tables (LUTs): precompute the distance from each subvector of $q$ to every codebook centroid in that subspace:
    - Subspace 1: one small table of distances for $q$’s first coordinate.
    - Subspace 2: one small table of distances for $q$’s second coordinate.
  - Search in Cluster 2: for each candidate’s PQ code, read one LUT entry per subspace and sum them to get an approximate total distance, without ever touching the raw vectors.
- Result:
  - ANN Result: every Cluster 2 vector gets an approximate distance from two table lookups, and the closest candidates are returned.
  - True Distance: the exact distance to the best match differs from the approximation only by a small quantization error.
  - Key Goal Achieved: 2.6x speedup (vs. brute-force: 3 candidates scanned instead of 8) with minimal accuracy loss. See the query sketch after this list.
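A sketch of the query phase, assuming the hypothetical codebooks, codes, and query from the previous sketches:

```python
import numpy as np

# Assumed artifacts from the sketches above (values illustrative only).
codebooks = [np.array([0.95, 5.0, 9.0, 2.0]),  # subspace-1 centroids
             np.array([1.0, 5.0, 0.9, 3.0])]   # subspace-2 centroids
codes = np.array([[1, 1], [1, 1], [1, 1]], dtype=np.uint8)  # PQ codes of Cluster 2
query = np.array([5.1, 5.0], dtype=np.float32)

# LUTs: one row of squared distances per subspace, computed once per query.
luts = [(codebooks[s] - query[s]) ** 2 for s in range(2)]

# Approximate distance = sum of the LUT entries selected by each vector's code.
approx = luts[0][codes[:, 0]] + luts[1][codes[:, 1]]
ranked = np.argsort(approx)  # candidates ranked without touching raw vectors
```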
6. The Bigger Picture
Why IVF-PQ Works:
- IVF: Reduces search scope from all $N$ vectors to only the probed clusters (here, from 8 vectors to 3).
- PQ: Slashes per-vector distance computation from $O(d)$ floating-point operations to $m$ table lookups (here, 2 lookups).
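To ground these savings, here is a rough back-of-envelope count at the billion-vector scale from the intro, assuming the index parameters from the code below (256 partitions, 16 subspaces, 1 probed cluster). It ignores LUT construction, memory bandwidth, and re-ranking, which is why practical speedups are closer to the ~1,000x cited earlier:

```python
# Hypothetical per-query op counts; parameters assumed, not measured.
N, d, nlist, nprobe, m = 10**9, 1536, 256, 1, 16

brute_force = N * d                             # ~1.5e12 multiply-adds
ivf_pq = nlist * d + nprobe * (N // nlist) * m  # centroid scan + LUT sums
print(f"speedup ~{brute_force / ivf_pq:,.0f}x") # on the order of 10^4
```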
Trade-offs:
| Factor | IVF-PQ | Brute-Force |
|---|---|---|
| Speed | Sub-linear (probe a few clusters) | Linear time ($O(N \cdot d)$) |
| Memory | Compact PQ codes ($m$ bytes per vector) | Raw vectors ($4d$ bytes per vector) |
| Accuracy | ~85-95% recall | 100% recall |
7. Back to Your Code
```python
def create_indexed_table(...):
    # ...
    target_table.create_index(
        metric="cosine",      # Optimized for text embeddings
        num_partitions=256,   # Balance cluster granularity
        num_sub_vectors=16,   # Compress 1536D into 16 subspaces
        accelerator="cuda",   # Speed up training
    )
```
- Why These Choices? (A query-side usage sketch follows this list.)
  - `num_partitions=256`: balances cluster quality and search scope.
  - `num_sub_vectors=16`: splits 1536D vectors into 16 subspaces of 96 dimensions each for effective PQ compression.
  - `accelerator="cuda"`: GPU accelerates k-means training for large datasets.
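For completeness, a hedged usage sketch of querying the indexed table; `query_embedding` is an assumed 1536-D input, and `nprobes`/`refine_factor` follow LanceDB’s Python API for trading recall against latency (verify the names against your LanceDB version):

```python
# Hedged usage sketch -- not from the original code, parameters illustrative.
results = (
    target_table.search(query_embedding)
    .limit(10)          # top-10 neighbors
    .nprobes(20)        # probe 20 of the 256 IVF partitions (recall vs. speed)
    .refine_factor(10)  # re-rank 10x candidates with exact distances
    .to_pandas()
)
```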
Final Takeaway
IVF-PQ tackles the curse of dimensionality by:
- Reducing search space (IVF clusters).
- Compressing data (PQ codes).
- Balancing speed, memory, and accuracy—making billion-scale ANN feasible!
Let's dive into details: