IVF-PQ Outline
Comprehensive Example Walkthrough: IVF-PQ Process
Let’s use a toy dataset to illustrate IVF-PQ end-to-end, while weaving in the story of why indexing is critical.
1. The Problem: Why Indexing?
Imagine you’re building a facial recognition system for a social media app with 1 billion users. Each face is represented by a 1536-dimensional embedding. To find similar faces for a query, a brute-force scan would compute distances to all 1B vectors: roughly $10^9 \times 1536 \approx 1.5 \times 10^{12}$ multiply-adds per query.
This is impossible in real time (imagine waiting hours for a search result!).
Solution: Approximate Nearest Neighbor (ANN) indexing like IVF-PQ trades a small accuracy loss for 1,000x speedup. Let’s see how it works with a toy example.
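To make that cost concrete, here is a minimal NumPy sketch of the brute-force scan IVF-PQ avoids (the dataset is a random stand-in, scaled down from 1B vectors so it actually runs):

```python
import numpy as np

rng = np.random.default_rng(0)
db = rng.standard_normal((10_000, 1536)).astype(np.float32)  # stand-in for 1B face embeddings
query = rng.standard_normal(1536).astype(np.float32)

# One query costs N * d multiply-adds: at 1e9 vectors x 1536 dims,
# that is ~1.5e12 operations -- far too slow for real-time search.
dists = np.linalg.norm(db - query, axis=1)  # full scan over every vector
top10 = np.argsort(dists)[:10]              # exact nearest neighbors
```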
2. Dataset & Query
- Toy Dataset: 8 vectors in $\mathbb{R}^2$ (simplified for illustration).
- Query: a vector $q \in \mathbb{R}^2$ (find its nearest neighbors).
3. Step 1: IVF Clustering (Reduce Search Space)
Why Indexing?
Brute-force would compare $q$ against all 8 vectors; clustering lets us skip most of them.
- K-means Clustering ($k = 3$): partition the 8 vectors into 3 clusters so a query only scans the most promising one(s) instead of all vectors.
- Final Clusters:
  - Cluster 1 (Red): 2 vectors → centroid $c_1$.
  - Cluster 2 (Green): 3 vectors → centroid $c_2$.
  - Cluster 3 (Blue): 3 vectors → centroid $c_3$.
- Key Goal Achieved:
Instead of searching all 8 vectors, IVF restricts the search to Cluster 2 (green)’s 3 vectors, reducing the scope by 62.5%. A minimal sketch follows this list.
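Here is a minimal sketch of this IVF step, assuming a made-up 8-vector 2D dataset (the original toy values are not shown above, so these numbers are purely illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: 8 vectors in R^2, values invented for illustration.
vectors = np.array([[1.0, 1.2], [0.9, 0.8],               # one tight group
                    [5.0, 5.1], [5.2, 4.9], [4.8, 5.0],   # another
                    [9.0, 1.0], [9.1, 0.8], [8.9, 1.2]],  # a third
                   dtype=np.float32)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(vectors)

# Inverted lists: cluster id -> indices of member vectors
inverted_lists = {c: np.where(kmeans.labels_ == c)[0] for c in range(3)}

query = np.array([5.1, 5.0], dtype=np.float32)
nearest = int(np.argmin(np.linalg.norm(kmeans.cluster_centers_ - query, axis=1)))
candidates = inverted_lists[nearest]  # only ~3 of 8 vectors get scanned
```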
4. Step 2: PQ Compression (Shrink Data)
Why Indexing?
Storing raw vectors is memory-heavy. PQ compresses them into compact codes.
- Subspace Decomposition:
Split each 2D vector into $m = 2$ subspaces (1D each).
- Codebook Training (a small codebook of centroids per subspace):
  - Subspace 1: centroids learned by k-means over the first coordinates.
  - Subspace 2: centroids learned by k-means over the second coordinates.
- PQ Encoding:
  - Example: each vector becomes a 2-ID code (the nearest centroid in each subspace) and is reconstructed by concatenating those two centroids.
  - Key Goal Achieved: each vector is stored as 2 small integers instead of 2 floats (75% memory reduction with 1-byte codes vs. 4-byte floats). A PQ sketch follows this list.
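A minimal PQ sketch on the same hypothetical toy data, assuming 4 centroids per subspace (the codebook size above was not specified):

```python
import numpy as np
from sklearn.cluster import KMeans

vectors = np.array([[1.0, 1.2], [0.9, 0.8], [5.0, 5.1], [5.2, 4.9],
                    [4.8, 5.0], [9.0, 1.0], [9.1, 0.8], [8.9, 1.2]],
                   dtype=np.float32)
m, k_sub = 2, 4  # subspaces, centroids per subspace (assumed values)

codebooks, codes = [], np.empty((len(vectors), m), dtype=np.uint8)
for s in range(m):
    sub = vectors[:, s].reshape(-1, 1)  # 1-D subvectors for subspace s
    km = KMeans(n_clusters=k_sub, n_init=10, random_state=0).fit(sub)
    codebooks.append(km.cluster_centers_.ravel())
    codes[:, s] = km.labels_            # 1-byte centroid IDs

# Reconstruction: look up each code in its subspace codebook.
recon = np.stack([codebooks[s][codes[:, s]] for s in range(m)], axis=1)
# Storage: 2 uint8 codes (2 bytes) vs. 2 float32 values (8 bytes) -> 75% smaller.
```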
5. Step 3: IVF-PQ Query (Balance Speed & Accuracy)
Why Indexing?
PQ enables fast distance approximations, while IVF ensures we only search relevant data.
- IVF Phase:
  - Compute distances from $q$ to the 3 centroids; $c_2$ is the closest.
  - Probe Cluster 2 (green) only.
- PQ Phase:
  - Lookup Tables (LUTs): precompute the distance from each subvector of $q$ to every codebook centroid in that subspace:
    - Subspace 1: one small table of distances for $q$’s first coordinate.
    - Subspace 2: one small table of distances for $q$’s second coordinate.
  - Search in Cluster 2: for each candidate’s PQ code, read one LUT entry per subspace and sum them to get an approximate total distance, without ever touching the raw vectors.
- Result:
  - ANN Result: every Cluster 2 vector gets an approximate distance from two table lookups, and the closest candidates are returned.
  - True Distance: the exact distance to the best match differs from the approximation only by a small quantization error.
  - Key Goal Achieved: 2.6x speedup (vs. brute-force: 3 candidates scanned instead of 8) with minimal accuracy loss. See the query sketch after this list.
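A sketch of the query phase, assuming the hypothetical codebooks, codes, and query from the previous sketches:

```python
import numpy as np

# Assumed artifacts from the sketches above (values illustrative only).
codebooks = [np.array([0.95, 5.0, 9.0, 2.0]),  # subspace-1 centroids
             np.array([1.0, 5.0, 0.9, 3.0])]   # subspace-2 centroids
codes = np.array([[1, 1], [1, 1], [1, 1]], dtype=np.uint8)  # PQ codes of Cluster 2
query = np.array([5.1, 5.0], dtype=np.float32)

# LUTs: one row of squared distances per subspace, computed once per query.
luts = [(codebooks[s] - query[s]) ** 2 for s in range(2)]

# Approximate distance = sum of the LUT entries selected by each vector's code.
approx = luts[0][codes[:, 0]] + luts[1][codes[:, 1]]
ranked = np.argsort(approx)  # candidates ranked without touching raw vectors
```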
6. The Bigger Picture
Why IVF-PQ Works:
- IVF: Reduces search scope from all $N$ vectors to only the probed clusters (here, from 8 vectors to 3).
- PQ: Slashes per-vector distance computation from $O(d)$ floating-point operations to $m$ table lookups (here, 2 lookups).
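To ground these savings, here is a rough back-of-envelope count at the billion-vector scale from the intro, assuming the index parameters from the code below (256 partitions, 16 subspaces, 1 probed cluster). It ignores LUT construction, memory bandwidth, and re-ranking, which is why practical speedups are closer to the ~1,000x cited earlier:

```python
# Hypothetical per-query op counts; parameters assumed, not measured.
N, d, nlist, nprobe, m = 10**9, 1536, 256, 1, 16

brute_force = N * d                             # ~1.5e12 multiply-adds
ivf_pq = nlist * d + nprobe * (N // nlist) * m  # centroid scan + LUT sums
print(f"speedup ~{brute_force / ivf_pq:,.0f}x") # on the order of 10^4
```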
Trade-offs:
| Factor | IVF-PQ | Brute-Force |
|---|---|---|
| Speed | Sub-linear (probe a few clusters) | Linear time ($O(N \cdot d)$) |
| Memory | Compact PQ codes ($m$ bytes per vector) | Raw vectors ($4d$ bytes per vector) |
| Accuracy | ~85-95% recall | 100% recall |
7. Back to Your Code
```python
def create_indexed_table(...):
    # ...
    target_table.create_index(
        metric="cosine",      # Optimized for text embeddings
        num_partitions=256,   # Balance cluster granularity
        num_sub_vectors=16,   # Compress 1536D into 16 subspaces
        accelerator="cuda",   # Speed up training
    )
```
- Why These Choices? (A query-side usage sketch follows this list.)
  - `num_partitions=256`: balances cluster quality and search scope.
  - `num_sub_vectors=16`: splits 1536D vectors into 16 subspaces of 96 dimensions each for effective PQ compression.
  - `accelerator="cuda"`: GPU accelerates k-means training for large datasets.
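For completeness, a hedged usage sketch of querying the indexed table; `query_embedding` is an assumed 1536-D input, and `nprobes`/`refine_factor` follow LanceDB’s Python API for trading recall against latency (verify the names against your LanceDB version):

```python
# Hedged usage sketch -- not from the original code, parameters illustrative.
results = (
    target_table.search(query_embedding)
    .limit(10)          # top-10 neighbors
    .nprobes(20)        # probe 20 of the 256 IVF partitions (recall vs. speed)
    .refine_factor(10)  # re-rank 10x candidates with exact distances
    .to_pandas()
)
```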
Final Takeaway
IVF-PQ tackles the curse of dimensionality by:
- Reducing search space (IVF clusters).
- Compressing data (PQ codes).
- Balancing speed, memory, and accuracy—making billion-scale ANN feasible!
Let's dive into details: