Product Quantization (PQ)

Product Quantization (PQ) tackles the curse of dimensionality by compressing high-dimensional vectors into compact codes while preserving approximate distances. Let’s break it down using our toy dataset and formal math.

1. The Problem: Why PQ?

Storing and searching 1536-dimensional vectors (common in NLP/vision embeddings) is expensive. For 1B vectors:

  • Memory: 1B vectors × 1536 dims × 4 bytes (float32) ≈ 6.1 TB of raw storage.
  • Compute: every exact distance evaluation costs on the order of d = 1536 arithmetic operations per vector.

Solution: PQ compresses vectors into 8–16 bytes each and accelerates distance computations.
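A quick back-of-the-envelope check of those numbers; a minimal sketch, where the 16-byte code assumes m = 16 subspaces at 1 byte each (parameters discussed later in this section):

```python
# Back-of-the-envelope sizes for 1B vectors in 1536 dimensions.
n, d = 1_000_000_000, 1536

raw_bytes = n * d * 4   # float32: 4 bytes per dimension
pq_bytes = n * 16       # PQ: assumed 16-byte code (m=16 subspaces, 1 byte each)

print(f"raw float32: {raw_bytes / 1e12:.1f} TB")  # ~6.1 TB
print(f"PQ codes:    {pq_bytes / 1e9:.1f} GB")    # ~16.0 GB
```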

2. PQ Workflow

Step 1: Subspace Decomposition

Split a $d$-dimensional vector into $m$ disjoint subspaces:

$$x = \big[\underbrace{x_1, \dots, x_{d/m}}_{\text{Subspace } 1},\ \dots,\ \underbrace{x_{d - d/m + 1}, \dots, x_d}_{\text{Subspace } m}\big]$$

Example: For our 2D toy vectors split into $m = 2$ subspaces, each vector $x = [x_1, x_2]$ decomposes into two 1-D subvectors, $x^{(1)} = x_1$ and $x^{(2)} = x_2$.
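A minimal numpy sketch of the split. The toy dataset is an assumption here, built by pairing, in order, the subspace-1 and subspace-2 values used in the codebook-training examples below:

```python
import numpy as np

# Toy dataset (assumed): rows pair the subspace-1 and subspace-2
# values listed in the codebook-training examples below.
X = np.array([[1.1, 0.9], [1.0, 1.2], [5.2, 5.0], [5.1, 4.8],
              [5.3, 5.1], [9.0, 1.0], [9.5, 0.8], [8.9, 1.2]])

m = 2
d = X.shape[1]
subvectors = X.reshape(len(X), m, d // m)  # shape (n, m, d/m) = (8, 2, 1)

print(subvectors[:, 0, :].ravel())  # subspace 1: the x1 values
print(subvectors[:, 1, :].ravel())  # subspace 2: the x2 values
```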

Step 2: Codebook Training (Mathematical Deep Dive)

Let’s formalize the codebook training process with equations, using the example subspaces from our toy dataset.

1. K-means in Subspaces

For each subspace j, we solve:

$$\min_{C_j} \sum_{x^{(j)} \in X^{(j)}} \ \min_{k \in \{1, \dots, h\}} \left\| x^{(j)} - c_{j,k} \right\|^2$$

where:

  • $X^{(j)}$ is the set of training subvectors for subspace $j$,
  • $C_j = \{c_{j,1}, \dots, c_{j,h}\}$ is the codebook (set of centroids) for subspace $j$,
  • $h$ is the number of centroids per subspace.

2. Example: Subspace 1 (x₁)

Data: $X^{(1)} = \{1.1, 1.0, 5.2, 5.1, 5.3, 9.0, 9.5, 8.9\}$ (1-D subvectors).
Goal: Train $h = 3$ centroids $C_1 = \{c_{1,1}, c_{1,2}, c_{1,3}\}$.

Phase 1: Initialization (k-means++)
  1. Step 1: Randomly pick $c_{1,1}^{(0)} = 1.1$.
  2. Step 2: Compute distances $D(x)$ from all points to $c_{1,1}^{(0)}$: $$D = \{(1.1-1.1)^2, (1.0-1.1)^2, (5.2-1.1)^2, \dots, (8.9-1.1)^2\} = [0, 0.01, 16.81, \dots, 60.84]$$
  3. Step 3: Sample the next centroid $c_{1,2}^{(0)} = 9.5$ (far from $c_{1,1}^{(0)}$).
  4. Step 4: Update $D(x) = \min\big(\|x - c_{1,1}^{(0)}\|^2,\ \|x - c_{1,2}^{(0)}\|^2\big)$.
  5. Step 5: Sample $c_{1,3}^{(0)} = 5.2$ (the midpoint between 1.1 and 9.5).

Initial Centroids: $C_1^{(0)} = \{1.1, 9.5, 5.2\}$.
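A minimal numpy sketch of the seeding step. To mirror the walkthrough, the first centroid is fixed at 1.1; real k-means++ picks it uniformly at random:

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = np.array([1.1, 1.0, 5.2, 5.1, 5.3, 9.0, 9.5, 8.9])

def kmeans_pp_init(X, h, first, rng):
    centroids = [first]
    for _ in range(h - 1):
        # D(x)^2: squared distance to the closest centroid chosen so far.
        d2 = np.min((X[:, None] - np.array(centroids)[None, :]) ** 2, axis=1)
        # Sample the next centroid proportionally to D(x)^2 (far points win).
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

print(kmeans_pp_init(X1, 3, 1.1, rng))  # typically far-apart seeds, e.g. 1.1, 9.x, 5.x
```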

Phase 2: Lloyd’s Algorithm

Iteration 1:

  1. Assignment:
    Assign each $x^{(1)}$ to the nearest centroid:

| Subvector | Nearest centroid |
|---|---|
| 1.1 | 1.1 |
| 1.0 | 1.1 |
| 5.2 | 5.2 |
| 5.1 | 5.2 |
| 5.3 | 5.2 |
| 9.0 | 9.5 |
| 9.5 | 9.5 |
| 8.9 | 9.5 |

  2. Update:
    Recompute centroids as cluster means:
$$c_{1,1}^{(1)} = \frac{1.1 + 1.0}{2} = 1.05, \quad c_{1,2}^{(1)} = \frac{5.2 + 5.1 + 5.3}{3} = 5.2, \quad c_{1,3}^{(1)} = \frac{9.0 + 9.5 + 8.9}{3} = 9.13$$

Iteration 2:
Reassign subvectors to the new centroids $\{1.05, 5.2, 9.13\}$: every subvector keeps its cluster, so the centroids no longer move and Lloyd’s algorithm has converged.

Final Centroids: $C_1 = \{1.05, 5.2, 9.13\}$.
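A minimal numpy sketch of these Lloyd iterations, starting from the seeds in the walkthrough:

```python
import numpy as np

X1 = np.array([1.1, 1.0, 5.2, 5.1, 5.3, 9.0, 9.5, 8.9])
C = np.array([1.1, 9.5, 5.2])   # k-means++ seeds from the walkthrough

for _ in range(10):
    # Assignment: index of the nearest centroid for every subvector.
    assign = np.argmin((X1[:, None] - C[None, :]) ** 2, axis=1)
    # Update: each centroid becomes the mean of its assigned points.
    new_C = np.array([X1[assign == k].mean() for k in range(len(C))])
    if np.allclose(new_C, C):   # centroids stopped moving -> converged
        break
    C = new_C

print(np.sort(C))   # ~[1.05, 5.2, 9.13], matching the walkthrough
```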

3. Example: Subspace 2 (x₂)

Data: $X^{(2)} = \{0.9, 1.2, 5.0, 4.8, 5.1, 1.0, 0.8, 1.2\}$.
Goal: Train $h = 3$ centroids $C_2 = \{c_{2,1}, c_{2,2}, c_{2,3}\}$.

Phase 1: Initialization (k-means++)
  1. Step 1: Randomly pick $c_{2,1}^{(0)} = 1.2$.
  2. Step 2: Compute distances $D(x)$ from all points to $c_{2,1}^{(0)}$.
  3. Steps 3–5: Sample $c_{2,2}^{(0)} = 5.1$ and $c_{2,3}^{(0)} = 0.8$.
Phase 2: Lloyd’s Algorithm

After iterations, the centroids converge to $C_2 = \{1.05, 4.97, 1.0\}$.
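A minimal scikit-learn sketch that trains both codebooks on the subvector sets above. Because k-means++ seeding is randomized, the exact centroids can differ slightly from the hand-worked values:

```python
import numpy as np
from sklearn.cluster import KMeans

# Subvector training sets from the two walkthrough examples.
X1 = np.array([1.1, 1.0, 5.2, 5.1, 5.3, 9.0, 9.5, 8.9]).reshape(-1, 1)
X2 = np.array([0.9, 1.2, 5.0, 4.8, 5.1, 1.0, 0.8, 1.2]).reshape(-1, 1)

for name, data in (("C1", X1), ("C2", X2)):
    km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
    km.fit(data)
    print(name, np.sort(km.cluster_centers_.ravel()))
```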

4. PQ Encoding (Mathematical Formulation)

For a vector $x = [x_1, x_2]$:

  1. Split: $x^{(1)} = x_1$, $x^{(2)} = x_2$.
  2. Quantize:
    • Subspace 1: $k_1 = \arg\min_k \|x_1 - c_{1,k}\|^2$
    • Subspace 2: $k_2 = \arg\min_k \|x_2 - c_{2,k}\|^2$
  3. PQ Code: $(k_1, k_2)$.

Example:
For $x = [5.2, 5.0]$:

  • Subspace 1: 5.2 is closest to $c_{1,2} = 5.2$, so $k_1 = 2$.
  • Subspace 2: 5.0 is closest to $c_{2,2} = 4.97$, so $k_2 = 2$.
  • PQ code: $(2, 2)$. The vector is stored as just two small integers.
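A minimal sketch of the encoding step, using the converged codebooks from the walkthrough. Indices are 1-based here to match the text; production code would use 0-based indices:

```python
import numpy as np

# Converged codebooks from the walkthrough (assumed values).
C1 = np.array([1.05, 5.20, 9.13])
C2 = np.array([1.05, 4.97, 1.00])

def encode(x, codebooks):
    """Return the PQ code: nearest-centroid index per subspace
    (1-based to match the text's numbering)."""
    return tuple(int(np.argmin((x_j - C) ** 2)) + 1
                 for x_j, C in zip(x, codebooks))

print(encode([5.2, 5.0], [C1, C2]))   # -> (2, 2)
```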

5. Lookup Table (LUT) Construction

For a query $q = [q_1, q_2]$:

  1. Subspace 1: $\mathrm{LUT}_1[k] = \|q_1 - c_{1,k}\|^2$ for $k = 1, \dots, h$
  2. Subspace 2: $\mathrm{LUT}_2[k] = \|q_2 - c_{2,k}\|^2$ for $k = 1, \dots, h$

Example: For $q = [5.4, 5.2]$:

$\mathrm{LUT}_1 = [(5.4-1.05)^2,\ (5.4-5.2)^2,\ (5.4-9.13)^2] = [18.92,\ 0.04,\ 13.91]$
$\mathrm{LUT}_2 = [(5.2-1.05)^2,\ (5.2-4.97)^2,\ (5.2-1.0)^2] = [17.22,\ 0.05,\ 17.64]$
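A minimal sketch of building these tables, with the codebooks assumed from the walkthrough. Each table is built once per query and then reused for every stored PQ code:

```python
import numpy as np

C1 = np.array([1.05, 5.20, 9.13])   # codebooks from the walkthrough
C2 = np.array([1.05, 4.97, 1.00])
q = np.array([5.4, 5.2])

# One entry per centroid: squared distance from the query subvector.
LUT1 = (q[0] - C1) ** 2
LUT2 = (q[1] - C2) ** 2

print(LUT1.round(2))   # [18.92  0.04 13.91]
print(LUT2.round(2))   # [17.22  0.05 17.64]
```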

6. Distance Approximation

For a PQ code $(k_1, k_2)$:

$$\tilde{d}^2(q, x) = \mathrm{LUT}_1[k_1] + \mathrm{LUT}_2[k_2]$$

Example:
For the PQ code $(2, 2)$, i.e. our encoded $x = [5.2, 5.0]$:

$$\tilde{d}^2 = \mathrm{LUT}_1[2] + \mathrm{LUT}_2[2] = 0.04 + 0.05 = 0.09$$

For comparison, the exact distance is $d^2(q, x) = (5.4-5.2)^2 + (5.2-5.0)^2 = 0.08$, so the PQ estimate is off by only 0.01.
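A minimal sketch of this table-based scoring (asymmetric distance computation), comparing the estimate against the exact squared distance:

```python
import numpy as np

C1 = np.array([1.05, 5.20, 9.13])   # codebooks from the walkthrough
C2 = np.array([1.05, 4.97, 1.00])
q = np.array([5.4, 5.2])
LUT1, LUT2 = (q[0] - C1) ** 2, (q[1] - C2) ** 2

code = (2, 2)                          # PQ code for x = [5.2, 5.0] (1-based)
approx = LUT1[code[0] - 1] + LUT2[code[1] - 1]
exact = np.sum((q - np.array([5.2, 5.0])) ** 2)

print(f"approx: {approx:.2f}")         # 0.09
print(f"exact:  {exact:.2f}")          # 0.08
```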

7. Why This Works

The squared Euclidean distance decomposes across subspaces:

$$d^2(q, x) = \sum_{j=1}^{m} \|q^{(j)} - x^{(j)}\|^2 \approx \sum_{j=1}^{m} \|q^{(j)} - c_{j,k_j}\|^2 = \tilde{d}^2(q, x)$$

Replacing each subvector with its nearest centroid perturbs each term by at most the per-subspace quantization error, so $m$ table lookups and additions closely approximate the full $d$-dimensional distance. This mathematically rigorous process ensures IVF-PQ’s efficiency-accuracy balance!

3. Key Parameters & Trade-offs

| Parameter | Effect | Example |
|---|---|---|
| num_sub_vectors ($m$) | Higher $m$ → lower quantization error, but more LUT operations. | $m = 16$ splits a 1536-D vector into 16 subspaces of 96 dimensions each. |
| Centroids per subspace ($h$) | Higher $h$ → better reconstruction, but larger codebooks. | $h = 256$ (8-bit codes) balances memory and accuracy. |

Mathematical Trade-off

The quantization error for PQ is:

$$\epsilon_{\mathrm{PQ}} = \sum_{j=1}^{m} \sum_{x^{(j)} \in X^{(j)}} \left\| x^{(j)} - c_{j,k_j} \right\|^2$$

Increasing $m$ or $h$ drives this error down, at the cost of more lookup operations per query and larger codebooks.
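An empirical check of the trade-off on the toy dataset (assumed pairing as above). scikit-learn's `inertia_` is exactly the per-subspace sum of squared distances, so summing it over subspaces gives $\epsilon_{\mathrm{PQ}}$:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.1, 0.9], [1.0, 1.2], [5.2, 5.0], [5.1, 4.8],
              [5.3, 5.1], [9.0, 1.0], [9.5, 0.8], [8.9, 1.2]])

for h in (1, 2, 3, 4):
    err = 0.0
    for j in range(2):                     # m = 2 subspaces
        km = KMeans(n_clusters=h, n_init=10, random_state=0)
        km.fit(X[:, j:j+1])
        err += km.inertia_                 # sum of squared distances
    print(f"h={h}: total quantization error {err:.3f}")  # decreases with h
```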

4. Example Walkthrough

  1. Training:

    • Split toy dataset into subspaces.
    • Train codebooks with $h = 3$ centroids per subspace.
  2. Encoding:

    • Compress vectors into PQ codes (e.g., $(2, 2)$ for $x = [5.2, 5.0]$).
  3. Querying:

    • Precompute LUTs for $q = [5.4, 5.2]$.
    • Compute approximate distances with $m = 2$ table lookups and one addition per PQ code, instead of the $d$ multiply-adds of brute force; the gap is small in this 2-D toy case but dramatic at $d = 1536$ with $m = 16$. See the end-to-end sketch after this list.
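A compact end-to-end sketch of the whole walkthrough, with the toy dataset assumed as above and scikit-learn for training. Centroids, and hence codes, may differ slightly from the hand-worked values:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.1, 0.9], [1.0, 1.2], [5.2, 5.0], [5.1, 4.8],
              [5.3, 5.1], [9.0, 1.0], [9.5, 0.8], [8.9, 1.2]])
m, h = 2, 3

# 1. Training: one codebook per subspace.
codebooks = [KMeans(n_clusters=h, n_init=10, random_state=0)
             .fit(X[:, j:j+1]).cluster_centers_.ravel() for j in range(m)]

# 2. Encoding: nearest-centroid index per subspace (0-based here).
codes = np.stack([np.argmin((X[:, j:j+1] - codebooks[j]) ** 2, axis=1)
                  for j in range(m)], axis=1)

# 3. Querying: build LUTs once, then score every code with m lookups.
q = np.array([5.4, 5.2])
luts = [(q[j] - codebooks[j]) ** 2 for j in range(m)]
dists = sum(luts[j][codes[:, j]] for j in range(m))
print(X[np.argmin(dists)])   # a mid-cluster vector, e.g. [5.2, 5.0]
```

Note that the three mid-cluster vectors share one PQ code and therefore tie in approximate distance; that loss of resolution is exactly the quantization error that $m$ and $h$ control.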

5. Why This Matters for Real-World Data

For 1B vectors in $\mathbb{R}^{1536}$:

  • Memory: ≈ 6.1 TB of raw float32 storage shrinks to ≈ 16 GB of 16-byte PQ codes.
  • Compute: each candidate costs $m = 16$ table lookups instead of 1536 multiply-adds.

6. Summary

| Aspect | PQ |
|---|---|
| Purpose | Compresses vectors and accelerates distance computations. |
| Key parameters | num_sub_vectors ($m$), centroids per subspace ($h$). |
| Accuracy trade-off | Lower $m$ or $h$ increases quantization error. |
| Efficiency | Distance computations reduced from $O(d)$ to $O(m)$. |

Putting It All Together: IVF-PQ

  1. IVF narrows the search to relevant clusters.
  2. PQ approximates distances within those clusters.
  3. Result: Billion-scale search in milliseconds!

Next, let’s explore IVF-PQ Integration and parameter tuning!