Product Quantization (PQ)
Product Quantization (PQ) tackles the curse of dimensionality by compressing high-dimensional vectors into compact codes while preserving approximate distances. Let’s break it down using our toy dataset and formal math.
1. The Problem: Why PQ?
Storing and searching 1536-dimensional vectors (common in NLP/vision embeddings) is expensive. For 1B vectors:
- Raw storage: $10^9 \times 1536 \times 4$ bytes (float32) $\approx$ 6.144 TB.
- Brute-force search: $O(N \cdot d) = 10^9 \times 1536$ distance operations per query is infeasible.
Solution: PQ compresses each vector into 8–16 bytes and accelerates distance computations.
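A quick sanity check on those numbers (a minimal sketch, assuming float32 vectors and 16-byte PQ codes):

```python
# Back-of-the-envelope storage math for 1B float32 vectors of dimension 1536.
num_vectors = 1_000_000_000
dim = 1536
bytes_per_float = 4        # float32
code_size_bytes = 16       # one PQ code per vector (e.g. m = 16 subspaces, 1 byte each)

raw_bytes = num_vectors * dim * bytes_per_float
pq_bytes = num_vectors * code_size_bytes

print(f"raw vectors: {raw_bytes / 1e12:.3f} TB")    # ~6.144 TB
print(f"PQ codes:    {pq_bytes / 1e9:.0f} GB")       # ~16 GB
print(f"compression: {raw_bytes / pq_bytes:.0f}x")   # ~384x
```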
2. PQ Workflow
Step 1: Subspace Decomposition
Split each $d$-dimensional vector $x \in \mathbb{R}^d$ into $m$ subvectors of dimension $d/m$ each: $x = (x^{(1)}, x^{(2)}, \dots, x^{(m)})$.
Example: Our 2D toy vectors are split into $m = 2$ subspaces of one dimension each: $x^{(1)} = x_1$ and $x^{(2)} = x_2$, as in the sketch below.
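A minimal sketch of the split step (the second coordinate, 2.3, is an illustrative placeholder, since the toy dataset's $x_2$ values aren't listed here):

```python
import numpy as np

def split_into_subvectors(x: np.ndarray, m: int) -> list[np.ndarray]:
    """Split a d-dimensional vector into m contiguous subvectors of dimension d/m."""
    assert x.shape[0] % m == 0, "d must be divisible by m"
    return np.split(x, m)

# A 2D toy vector split into m = 2 one-dimensional subspaces
x = np.array([5.1, 2.3])              # 2.3 is a made-up x2 value
print(split_into_subvectors(x, m=2))  # [array([5.1]), array([2.3])]
```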
Step 2: Codebook Training (Mathematical Deep Dive)
Let’s formalize the codebook training process with equations, using the example subspaces from our toy dataset.
1. K-means in Subspaces
For each subspace $j = 1, \dots, m$, run k-means on the training subvectors to learn a codebook of $K$ centroids:

$$C^{(j)} = \{c_1^{(j)}, \dots, c_K^{(j)}\} = \arg\min_{C} \sum_i \min_{c \in C} \left\| x_i^{(j)} - c \right\|^2$$

where:
- $x_i^{(j)}$: Subvectors of the training data in subspace $j$.
- $c_k^{(j)}$: Centroids (codebook entries) for subspace $j$.
2. Example: Subspace 1 (x₁)
Data: $x^{(1)} = \{1.1, 1.0, 5.2, 5.1, 5.3, 9.0, 9.5, 8.9\}$
Goal: Train $K = 3$ centroids.
Phase 1: Initialization (k-means++)
- Step 1: Randomly pick the first centroid: $c_1 = 1.1$.
- Step 2: Compute distances $D(x) = \| x - c_1 \|^2$ from all points to $c_1$.
- Step 3: Sample the next centroid with probability proportional to $D(x)$: $c_2 = 9.5$ (far from $c_1$).
- Step 4: Update $D(x) = \min_k \| x - c_k \|^2$ with both centroids.
- Step 5: Sample $c_3 = 5.2$ (roughly the midpoint between 1.1 and 9.5).
Initial Centroids: $\{1.1, 5.2, 9.5\}$
Phase 2: Lloyd’s Algorithm
Iteration 1:
- Assignment:
Assign each subvector to the nearest centroid:
Subvector | Nearest Centroid |
---|---|
1.1 | 1.1 |
1.0 | 1.1 |
5.2 | 5.2 |
5.1 | 5.2 |
5.3 | 5.2 |
9.0 | 9.5 |
9.5 | 9.5 |
8.9 | 9.5 |
- Update:
Recompute centroids as cluster means:
- $c_1 = \frac{1.1 + 1.0}{2} = 1.05$
- $c_2 = \frac{5.2 + 5.1 + 5.3}{3} = 5.2$
- $c_3 = \frac{9.0 + 9.5 + 8.9}{3} \approx 9.13$
Iteration 2:
- Assignment: Reassign subvectors to the new centroids $\{1.05, 5.2, 9.13\}$.
- Assignments remain unchanged → convergence.
Final Centroids: $C^{(1)} = \{1.05, 5.2, 9.13\}$
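A compact NumPy sketch of this training loop for subspace 1, seeded with the k-means++ picks above:

```python
import numpy as np

# Subspace-1 training data (the x1 components of the toy dataset)
x1 = np.array([1.1, 1.0, 5.2, 5.1, 5.3, 9.0, 9.5, 8.9])

# Initial centroids from the k-means++ phase
centroids = np.array([1.1, 5.2, 9.5])

for _ in range(10):                           # Lloyd's algorithm: assign, then update
    # Assignment: index of the nearest centroid for every subvector
    assign = np.abs(x1[:, None] - centroids[None, :]).argmin(axis=1)
    # Update: recompute each centroid as the mean of its cluster
    updated = np.array([x1[assign == k].mean() for k in range(len(centroids))])
    if np.allclose(updated, centroids):       # assignments stable -> converged
        break
    centroids = updated

print(centroids)                              # ~[1.05, 5.2, 9.13]
```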
3. Example: Subspace 2 (x₂)
Data: the $x_2$ components of the toy dataset.
Goal: Train $K = 3$ centroids.
Phase 1: Initialization (k-means++)
- Step 1: Randomly pick the first centroid $c_1$.
- Step 2: Compute distances $D(x)$ from all points to $c_1$.
- Steps 3–5: Sample $c_2$ and $c_3$ with probability proportional to $D(x)$, updating $D(x)$ after each pick.
Phase 2: Lloyd’s Algorithm
After a few assign/update iterations, the centroids converge to the three cluster means of the $x_2$ values, exactly as in subspace 1.
4. PQ Encoding (Mathematical Formulation)
For a vector $x = (x_1, x_2)$:
- Split: $x^{(1)} = x_1$, $x^{(2)} = x_2$.
- Quantize each subvector to the index of its nearest centroid:
  - Subspace 1: $\text{code}_1 = \arg\min_k \| x^{(1)} - c_k^{(1)} \|$
  - Subspace 2: $\text{code}_2 = \arg\min_k \| x^{(2)} - c_k^{(2)} \|$
- PQ Code: $(\text{code}_1, \text{code}_2)$.
Example:
For the toy vector with $x_1 = 5.1$:
- Subspace 1: distances to $\{1.05, 5.2, 9.13\}$ are $\{4.05, 0.1, 4.03\}$ → nearest centroid is $5.2$ → $\text{code}_1 = 1$.
- Subspace 2: the same procedure applied to $x_2$ against $C^{(2)}$ yields $\text{code}_2$.
- PQ Code: $(1, \text{code}_2)$, i.e. two small integers (one byte each) instead of two floats.
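A sketch of the encoder under these toy codebooks; the subspace-2 centroids below are illustrative placeholders, since the $x_2$ values of the dataset aren't given in this section:

```python
import numpy as np

codebooks = [
    np.array([1.05, 5.2, 9.13]),   # trained centroids for subspace 1 (x1)
    np.array([2.0, 6.0, 10.0]),    # placeholder centroids for subspace 2 (x2)
]

def pq_encode(x: np.ndarray) -> tuple[int, ...]:
    """Encode a vector as the tuple of nearest-centroid indices, one per subspace."""
    subvectors = np.split(x, len(codebooks))
    # 1-D subspaces in this toy example, so |difference| is the distance
    return tuple(
        int(np.abs(codebook - sub).argmin())
        for sub, codebook in zip(subvectors, codebooks)
    )

print(pq_encode(np.array([5.1, 6.2])))  # (1, 1) with these codebooks
```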
5. Lookup Table (LUT) Construction
For a query $q = (q_1, q_2)$, precompute the squared distance from each query subvector to every centroid in the corresponding codebook:
- Subspace 1: $\text{LUT}_1[k] = \| q^{(1)} - c_k^{(1)} \|^2$ for $k = 0, 1, 2$
- Subspace 2: $\text{LUT}_2[k] = \| q^{(2)} - c_k^{(2)} \|^2$ for $k = 0, 1, 2$
Example: For an illustrative query with $q_1 = 5.0$:
- Subspace 1 LUT: squared distances to $\{1.05, 5.2, 9.13\}$ → $\text{LUT}_1 = [15.60, 0.04, 17.06]$
- Subspace 2 LUT: computed the same way from $q_2$ and $C^{(2)}$.
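A sketch of the LUT step, reusing the same toy codebooks (the subspace-2 centroids remain placeholders):

```python
import numpy as np

codebooks = [np.array([1.05, 5.2, 9.13]), np.array([2.0, 6.0, 10.0])]

def build_luts(query: np.ndarray, codebooks: list[np.ndarray]) -> list[np.ndarray]:
    """One squared-distance table per subspace: LUT_j[k] = ||q^(j) - c_k^(j)||^2."""
    subqueries = np.split(query, len(codebooks))
    return [np.square(codebook - sub) for sub, codebook in zip(subqueries, codebooks)]

luts = build_luts(np.array([5.0, 6.2]), codebooks)
print(luts[0])  # [15.6025  0.04  17.0569]
```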
6. Distance Approximation
For a database vector stored as PQ code $(\text{code}_1, \text{code}_2)$, the approximate (asymmetric) squared distance to the query is just a sum of table lookups:

$$d(q, x)^2 \approx \text{LUT}_1[\text{code}_1] + \text{LUT}_2[\text{code}_2]$$

Example:
For the PQ code $(1, \text{code}_2)$ and the query above, the subspace-1 contribution is $\text{LUT}_1[1] = 0.04$; adding the subspace-2 lookup gives the full approximate distance, with no access to the original vector needed.
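And the ADC step itself, a minimal sketch using the LUTs from the previous snippet:

```python
import numpy as np

def adc_distance(pq_code: tuple[int, ...], luts: list[np.ndarray]) -> float:
    """Asymmetric distance computation: sum one LUT entry per subspace."""
    return float(sum(lut[code] for lut, code in zip(luts, pq_code)))

# LUTs for the illustrative query (5.0, 6.2) against the toy codebooks above
luts = [np.array([15.6025, 0.04, 17.0569]), np.array([17.64, 0.04, 14.44])]
print(adc_distance((1, 1), luts))  # 0.08 -- two lookups and one addition
```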
7. Why This Works
- Training: Codebooks adapt to data distribution, minimizing reconstruction error.
- Encoding: Subspace independence allows parallel computation.
- Querying: LUTs reduce each distance computation to $m$ table lookups and additions vs. $O(d)$ floating-point operations.
This mathematically rigorous process ensures IVF-PQ’s efficiency-accuracy balance!
3. Key Parameters & Trade-offs
Parameter | Effect | Example |
---|---|---|
num_sub_vectors ($m$) | Higher $m$ → finer-grained quantization and better accuracy, but larger codes and more LUT additions per query. | $m = 16$ for 1536-dim vectors → 16 subvectors of 96 dims each. |
Centroids per subspace ($K$) | Higher $K$ → lower quantization error, but slower training and larger codebooks. | $K = 256$ (1 byte per sub-code); $K = 3$ in our toy example. |
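A tiny helper makes the size side of this trade-off concrete (a sketch, assuming sub-codes are packed at exactly log2(K) bits each):

```python
import math

def pq_code_size_bytes(m: int, k: int) -> float:
    """Bytes needed to store one PQ code: m sub-codes of log2(K) bits each."""
    return m * math.log2(k) / 8

print(pq_code_size_bytes(m=8, k=256))    # 8.0  bytes per vector
print(pq_code_size_bytes(m=16, k=256))   # 16.0 bytes per vector
```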
Mathematical Trade-off
The quantization error for PQ is:

$$E = \sum_{j=1}^{m} \sum_{i} \left\| x_i^{(j)} - c_{k_i}^{(j)} \right\|^2,$$

where $c_{k_i}^{(j)}$ is the centroid assigned to subvector $x_i^{(j)}$.
- Lower $E$ means ADC distances better approximate true distances.
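A sketch of this error term on subspace 1 of the toy data, using the trained centroids:

```python
import numpy as np

# Quantization (reconstruction) error for subspace 1 of the toy dataset
x1 = np.array([1.1, 1.0, 5.2, 5.1, 5.3, 9.0, 9.5, 8.9])
centroids = np.array([1.05, 5.2, 9.13])

assign = np.abs(x1[:, None] - centroids[None, :]).argmin(axis=1)
error = np.sum((x1 - centroids[assign]) ** 2)
print(error)  # ~0.23: total squared error of replacing each subvector by its centroid
```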
4. Example Walkthrough
- Training:
  - Split the toy dataset into $m = 2$ subspaces.
  - Train codebooks with $K = 3$ centroids per subspace.
- Encoding:
  - Compress each vector into a PQ code of two centroid indices (e.g., $(1, \text{code}_2)$ for the vector with $x_1 = 5.1$).
- Querying:
  - Precompute LUTs for the query $q$.
  - Compute approximate distances with a single addition of two LUT entries per PQ code vs. 2 squared-difference operations per vector for brute-force.
5. Why This Matters for Real-World Data
For 1B vectors in 1536 dimensions (with $m = 16$ and $K = 256$):
- Storage: PQ reduces memory from 6.144 TB to 16 GB (a 384× compression).
- Speed: Per-vector distance computations drop from 1536 operations to 16 LUT additions.
6. Summary
Aspect | PQ |
---|---|
Purpose | Compresses vectors and accelerates distance computations. |
Key Parameters | num_sub_vectors ($m$), centroids per subspace ($K$). |
Accuracy Trade-off | Lower $m$ or $K$ → smaller codes but higher quantization error. |
Efficiency | Distance computations reduced from $O(d)$ operations to $m$ LUT additions. |
Putting It All Together: IVF-PQ
- IVF narrows the search to relevant clusters.
- PQ approximates distances within those clusters.
- Result: Billion-scale search in milliseconds!
Next, let’s explore IVF-PQ Integration and parameter tuning!