Inverted File Index (IVF)
Let’s explore IVF in depth, using our toy dataset to illustrate how it reduces search space while maintaining accuracy.
1. The Problem: Why IVF?
Imagine you’re searching for a book in a library with 1 million books. Brute-force would mean scanning every shelf. Instead, you:
- Categorize books into sections (e.g., "Sci-Fi," "History").
- Search only relevant sections for your query.
IVF does this for vectors. Let’s formalize it.
2. IVF Workflow
Step 1: Clustering with K-means
Objective: Partition the dataset into
-
K-means Formulation:
Givenvectors , find centroids that minimize: where
is the -th cluster. -
Example:
For our toy dataset (): - Cluster 1 (Red):
, → Centroid . - Cluster 2 (Green):
, , → Centroid . - Cluster 3 (Blue):
, , → Centroid .
- Cluster 1 (Red):
Step 2: Building Inverted Lists
For each centroid
- Cluster 1:
(indices of vectors and ). - Cluster 2:
(indices of green vectors). - Cluster 3:
(indices of blue vectors).
Step 3: Querying with IVF
For a query
- Find Nearest Centroids:
Compute distances to allcentroids: - Probe Clusters:
Select the closestcluster (Cluster 2). - Search Only in Cluster 2:
Compareto vectors .
3. Key Parameters & Trade-offs
Parameter | Effect | Example |
---|---|---|
num_partitions ( |
Higher |
|
nprobe |
Higher |
Mathematical Trade-off
The probability of missing the true nearest neighbor is bounded by:
- Example: For
, : This means a ~8% chance of missing the true neighbor, but search speed improves by 12.8x (since ).
4. Example Walkthrough
-
Clustering:
- Input: 8 vectors in
. - Output: 3 clusters with centroids
, , .
- Input: 8 vectors in
-
Query
: - Step 1: Compare
to centroids. - Step 2: Narrow search to Cluster 2 (green).
- Step 3: Compute distances to 3 vectors in Cluster 2 instead of 8.
- Step 1: Compare
-
Result:
- Brute-force: 8 distance calculations.
- IVF: 3 centroid distances + 3 vector distances = 6 calculations (25% faster).
5. Why This Matters for Real-World Data
For a dataset with 1B vectors and
- Brute-force:
operations. - IVF:
operations (12.8x faster).
6. Summary
Aspect | IVF |
---|---|
Purpose | Reduces search scope via clustering. |
Key Parameters | num_partitions (nprobe . |
Accuracy Trade-off | Higher |
Efficiency | Sub-linear search time ( |
Next, we’ll explore Product Quantization (PQ)—the second pillar of IVF-PQ!