Product Quantization (PQ)
Product Quantization (PQ) tackles the curse of dimensionality by compressing high-dimensional vectors into compact codes while preserving approximate distances. Let’s break it down using our toy dataset and formal math.
1. The Problem: Why PQ?
Storing and searching 1536-dimensional vectors (common in NLP/vision embeddings) is expensive. For 1B vectors:
- Raw storage: $10^9 \times 1536 \times 4$ bytes (float32) $\approx$ 6.144 TB.
- Brute-force search: $O(N \cdot d) = 10^9 \times 1536$ distance operations per query is infeasible.
Solution: PQ compresses each vector into 8–16 bytes and accelerates distance computations.
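A quick sanity check on those numbers (a minimal sketch, assuming float32 vectors and 16-byte PQ codes):

```python
# Back-of-the-envelope storage math for 1B float32 vectors of dimension 1536.
num_vectors = 1_000_000_000
dim = 1536
bytes_per_float = 4        # float32
code_size_bytes = 16       # one PQ code per vector (e.g. m = 16 subspaces, 1 byte each)

raw_bytes = num_vectors * dim * bytes_per_float
pq_bytes = num_vectors * code_size_bytes

print(f"raw vectors: {raw_bytes / 1e12:.3f} TB")    # ~6.144 TB
print(f"PQ codes:    {pq_bytes / 1e9:.0f} GB")       # ~16 GB
print(f"compression: {raw_bytes / pq_bytes:.0f}x")   # ~384x
```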
2. PQ Workflow
Step 1: Subspace Decomposition
Split each $d$-dimensional vector $x \in \mathbb{R}^d$ into $m$ subvectors of dimension $d/m$ each: $x = (x^{(1)}, x^{(2)}, \dots, x^{(m)})$.
Example: Our 2D toy vectors are split into $m = 2$ subspaces of one dimension each: $x^{(1)} = x_1$ and $x^{(2)} = x_2$, as in the sketch below.
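A minimal sketch of the split step (the second coordinate, 2.3, is an illustrative placeholder, since the toy dataset's $x_2$ values aren't listed here):

```python
import numpy as np

def split_into_subvectors(x: np.ndarray, m: int) -> list[np.ndarray]:
    """Split a d-dimensional vector into m contiguous subvectors of dimension d/m."""
    assert x.shape[0] % m == 0, "d must be divisible by m"
    return np.split(x, m)

# A 2D toy vector split into m = 2 one-dimensional subspaces
x = np.array([5.1, 2.3])              # 2.3 is a made-up x2 value
print(split_into_subvectors(x, m=2))  # [array([5.1]), array([2.3])]
```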
Step 2: Codebook Training (Mathematical Deep Dive)
Let’s formalize the codebook training process with equations, using the example subspaces from our toy dataset.
1. K-means in Subspaces
For each subspace $j = 1, \dots, m$, run k-means on the training subvectors to learn a codebook of $K$ centroids:

$$C^{(j)} = \{c_1^{(j)}, \dots, c_K^{(j)}\} = \arg\min_{C} \sum_i \min_{c \in C} \left\| x_i^{(j)} - c \right\|^2$$

where:
- $x_i^{(j)}$: Subvectors of the training data in subspace $j$.
- $c_k^{(j)}$: Centroids (codebook entries) for subspace $j$.
2. Example: Subspace 1 (x₁)
Data: $x^{(1)} = \{1.1, 1.0, 5.2, 5.1, 5.3, 9.0, 9.5, 8.9\}$
Goal: Train $K = 3$ centroids.
Phase 1: Initialization (k-means++)
- Step 1: Randomly pick the first centroid: $c_1 = 1.1$.
- Step 2: Compute distances $D(x) = \| x - c_1 \|^2$ from all points to $c_1$.
- Step 3: Sample the next centroid with probability proportional to $D(x)$: $c_2 = 9.5$ (far from $c_1$).
- Step 4: Update $D(x) = \min_k \| x - c_k \|^2$ with both centroids.
- Step 5: Sample $c_3 = 5.2$ (roughly the midpoint between 1.1 and 9.5).
Initial Centroids: $\{1.1, 5.2, 9.5\}$
Phase 2: Lloyd’s Algorithm
Iteration 1:
- Assignment:
Assign each subvector to the nearest centroid:
Subvector | Nearest Centroid |
---|---|
1.1 | 1.1 |
1.0 | 1.1 |
5.2 | 5.2 |
5.1 | 5.2 |
5.3 | 5.2 |
9.0 | 9.5 |
9.5 | 9.5 |
8.9 | 9.5 |
- Update:
Recompute centroids as cluster means:
- $c_1 = \frac{1.1 + 1.0}{2} = 1.05$
- $c_2 = \frac{5.2 + 5.1 + 5.3}{3} = 5.2$
- $c_3 = \frac{9.0 + 9.5 + 8.9}{3} \approx 9.13$
Iteration 2:
- Assignment: Reassign subvectors to the new centroids $\{1.05, 5.2, 9.13\}$.
- Assignments remain unchanged → convergence.
Final Centroids: $C^{(1)} = \{1.05, 5.2, 9.13\}$
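A compact NumPy sketch of this training loop for subspace 1, seeded with the k-means++ picks above:

```python
import numpy as np

# Subspace-1 training data (the x1 components of the toy dataset)
x1 = np.array([1.1, 1.0, 5.2, 5.1, 5.3, 9.0, 9.5, 8.9])

# Initial centroids from the k-means++ phase
centroids = np.array([1.1, 5.2, 9.5])

for _ in range(10):                           # Lloyd's algorithm: assign, then update
    # Assignment: index of the nearest centroid for every subvector
    assign = np.abs(x1[:, None] - centroids[None, :]).argmin(axis=1)
    # Update: recompute each centroid as the mean of its cluster
    updated = np.array([x1[assign == k].mean() for k in range(len(centroids))])
    if np.allclose(updated, centroids):       # assignments stable -> converged
        break
    centroids = updated

print(centroids)                              # ~[1.05, 5.2, 9.13]
```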
3. Example: Subspace 2 (x₂)
Data: the $x_2$ components of the toy dataset.
Goal: Train $K = 3$ centroids.
Phase 1: Initialization (k-means++)
- Step 1: Randomly pick the first centroid $c_1$.
- Step 2: Compute distances $D(x)$ from all points to $c_1$.
- Steps 3–5: Sample $c_2$ and $c_3$ with probability proportional to $D(x)$, updating $D(x)$ after each pick.
Phase 2: Lloyd’s Algorithm
After a few assign/update iterations, the centroids converge to the three cluster means of the $x_2$ values, exactly as in subspace 1.
4. PQ Encoding (Mathematical Formulation)
For a vector $x = (x_1, x_2)$:
- Split: $x^{(1)} = x_1$, $x^{(2)} = x_2$.
- Quantize each subvector to the index of its nearest centroid:
  - Subspace 1: $\text{code}_1 = \arg\min_k \| x^{(1)} - c_k^{(1)} \|$
  - Subspace 2: $\text{code}_2 = \arg\min_k \| x^{(2)} - c_k^{(2)} \|$
- PQ Code: $(\text{code}_1, \text{code}_2)$.
Example:
For the toy vector with $x_1 = 5.1$:
- Subspace 1: distances to $\{1.05, 5.2, 9.13\}$ are $\{4.05, 0.1, 4.03\}$ → nearest centroid is $5.2$ → $\text{code}_1 = 1$.
- Subspace 2: the same procedure applied to $x_2$ against $C^{(2)}$ yields $\text{code}_2$.
- PQ Code: $(1, \text{code}_2)$, i.e. two small integers (one byte each) instead of two floats.
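A sketch of the encoder under these toy codebooks; the subspace-2 centroids below are illustrative placeholders, since the $x_2$ values of the dataset aren't given in this section:

```python
import numpy as np

codebooks = [
    np.array([1.05, 5.2, 9.13]),   # trained centroids for subspace 1 (x1)
    np.array([2.0, 6.0, 10.0]),    # placeholder centroids for subspace 2 (x2)
]

def pq_encode(x: np.ndarray) -> tuple[int, ...]:
    """Encode a vector as the tuple of nearest-centroid indices, one per subspace."""
    subvectors = np.split(x, len(codebooks))
    # 1-D subspaces in this toy example, so |difference| is the distance
    return tuple(
        int(np.abs(codebook - sub).argmin())
        for sub, codebook in zip(subvectors, codebooks)
    )

print(pq_encode(np.array([5.1, 6.2])))  # (1, 1) with these codebooks
```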
5. Lookup Table (LUT) Construction
For a query $q = (q_1, q_2)$, precompute the squared distance from each query subvector to every centroid in the corresponding codebook:
- Subspace 1: $\text{LUT}_1[k] = \| q^{(1)} - c_k^{(1)} \|^2$ for $k = 0, 1, 2$
- Subspace 2: $\text{LUT}_2[k] = \| q^{(2)} - c_k^{(2)} \|^2$ for $k = 0, 1, 2$
Example: For an illustrative query with $q_1 = 5.0$:
- Subspace 1 LUT: squared distances to $\{1.05, 5.2, 9.13\}$ → $\text{LUT}_1 = [15.60, 0.04, 17.06]$
- Subspace 2 LUT: computed the same way from $q_2$ and $C^{(2)}$.
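A sketch of the LUT step, reusing the same toy codebooks (the subspace-2 centroids remain placeholders):

```python
import numpy as np

codebooks = [np.array([1.05, 5.2, 9.13]), np.array([2.0, 6.0, 10.0])]

def build_luts(query: np.ndarray, codebooks: list[np.ndarray]) -> list[np.ndarray]:
    """One squared-distance table per subspace: LUT_j[k] = ||q^(j) - c_k^(j)||^2."""
    subqueries = np.split(query, len(codebooks))
    return [np.square(codebook - sub) for sub, codebook in zip(subqueries, codebooks)]

luts = build_luts(np.array([5.0, 6.2]), codebooks)
print(luts[0])  # [15.6025  0.04  17.0569]
```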
6. Distance Approximation
For a database vector stored as PQ code $(\text{code}_1, \text{code}_2)$, the approximate (asymmetric) squared distance to the query is just a sum of table lookups:

$$d(q, x)^2 \approx \text{LUT}_1[\text{code}_1] + \text{LUT}_2[\text{code}_2]$$

Example:
For the PQ code $(1, \text{code}_2)$ and the query above, the subspace-1 contribution is $\text{LUT}_1[1] = 0.04$; adding the subspace-2 lookup gives the full approximate distance, with no access to the original vector needed.
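And the ADC step itself, a minimal sketch using the LUTs from the previous snippet:

```python
import numpy as np

def adc_distance(pq_code: tuple[int, ...], luts: list[np.ndarray]) -> float:
    """Asymmetric distance computation: sum one LUT entry per subspace."""
    return float(sum(lut[code] for lut, code in zip(luts, pq_code)))

# LUTs for the illustrative query (5.0, 6.2) against the toy codebooks above
luts = [np.array([15.6025, 0.04, 17.0569]), np.array([17.64, 0.04, 14.44])]
print(adc_distance((1, 1), luts))  # 0.08 -- two lookups and one addition
```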
7. Why This Works
- Training: Codebooks adapt to data distribution, minimizing reconstruction error.
- Encoding: Subspace independence allows parallel computation.
- Querying: LUTs reduce each distance computation to $m$ table lookups and additions vs. $O(d)$ floating-point operations.
This mathematically rigorous process ensures IVF-PQ’s efficiency-accuracy balance!
3. Key Parameters & Trade-offs
Parameter | Effect | Example |
---|---|---|
num_sub_vectors ($m$) | Higher $m$ → finer-grained quantization and better accuracy, but larger codes and more LUT additions per query. | $m = 16$ for 1536-dim vectors → 16 subvectors of 96 dims each. |
Centroids per subspace ($K$) | Higher $K$ → lower quantization error, but slower training and larger codebooks. | $K = 256$ (1 byte per sub-code); $K = 3$ in our toy example. |
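A tiny helper makes the size side of this trade-off concrete (a sketch, assuming sub-codes are packed at exactly log2(K) bits each):

```python
import math

def pq_code_size_bytes(m: int, k: int) -> float:
    """Bytes needed to store one PQ code: m sub-codes of log2(K) bits each."""
    return m * math.log2(k) / 8

print(pq_code_size_bytes(m=8, k=256))    # 8.0  bytes per vector
print(pq_code_size_bytes(m=16, k=256))   # 16.0 bytes per vector
```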
Mathematical Trade-off
The quantization error for PQ is:

$$E = \sum_{j=1}^{m} \sum_{i} \left\| x_i^{(j)} - c_{k_i}^{(j)} \right\|^2,$$

where $c_{k_i}^{(j)}$ is the centroid assigned to subvector $x_i^{(j)}$.
- Lower $E$ means ADC distances better approximate true distances.
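A sketch of this error term on subspace 1 of the toy data, using the trained centroids:

```python
import numpy as np

# Quantization (reconstruction) error for subspace 1 of the toy dataset
x1 = np.array([1.1, 1.0, 5.2, 5.1, 5.3, 9.0, 9.5, 8.9])
centroids = np.array([1.05, 5.2, 9.13])

assign = np.abs(x1[:, None] - centroids[None, :]).argmin(axis=1)
error = np.sum((x1 - centroids[assign]) ** 2)
print(error)  # ~0.23: total squared error of replacing each subvector by its centroid
```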
4. Example Walkthrough
- Training:
  - Split the toy dataset into $m = 2$ subspaces.
  - Train codebooks with $K = 3$ centroids per subspace.
- Encoding:
  - Compress each vector into a PQ code of two centroid indices (e.g., $(1, \text{code}_2)$ for the vector with $x_1 = 5.1$).
- Querying:
  - Precompute LUTs for the query $q$.
  - Compute approximate distances with a single addition of two LUT entries per PQ code vs. 2 squared-difference operations per vector for brute-force.
5. Why This Matters for Real-World Data
For 1B vectors in 1536 dimensions (with $m = 16$ and $K = 256$):
- Storage: PQ reduces memory from 6.144 TB to 16 GB (a 384× compression).
- Speed: Per-vector distance computations drop from 1536 operations to 16 LUT additions.
6. Summary
Aspect | PQ |
---|---|
Purpose | Compresses vectors and accelerates distance computations. |
Key Parameters | num_sub_vectors ($m$), centroids per subspace ($K$). |
Accuracy Trade-off | Lower $m$ or $K$ → smaller codes but higher quantization error. |
Efficiency | Distance computations reduced from $O(d)$ operations to $m$ LUT additions. |
Putting It All Together: IVF-PQ
- IVF narrows the search to relevant clusters.
- PQ approximates distances within those clusters.
- Result: Billion-scale search in milliseconds!
Next, let’s explore IVF-PQ Integration and parameter tuning!