ColPali Log
2025-2-8
Implemented LanceDB indexing.
Big issue with LanceDB in the Modal container:
- If we do the indexing on Modal with LanceDB, we get the indices but then encounter:
RuntimeError: lance error: LanceError(IO): Execution error: ExecNode(Take): thread panicked: task 25 panicked with message "called Result::unwrap() on an Err value: JoinError::Panic(Id(150),\"called Option::unwrap() on a None value\", ...)"
We had no choice but to file an issue: bug(python): tbl.create_index(metric="cosine") causes Rust panic in Modal container, but works locally · Issue #2105 · lancedb/lancedb
2025-2-9
Reorganized the code. Now the code is clean.
Tested the RuntimeError again by copying the locally built indices into the Modal container. Still get the same error.
Next, try quantization.
Insight: Indexing is crucial here. In the ColPali case, indexing does not reduce accuracy. Additionally, even a single image can be indexed effectively, since it generates 1,030 vectors, which is enough data for PQ (Product Quantization) to learn features. The more images available, the better the indexing performs, approaching 99.999% of native MaxSim accuracy.
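For reference, the "native MaxSim" score mentioned above can be sketched in a few lines. This is a minimal, dependency-free illustration on toy 2-D vectors, not the pipeline's actual code; the names dot and maxsim are made up for this note.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vecs: Sequence[Vector], page_vecs: Sequence[Vector]) -> float:
    """Late-interaction score: for each query token, take its best-matching
    page vector, then sum those maxima over all query tokens."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


# Toy 2-D example (real pages in this log have ~1,030 vectors each).
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
page_b = [[0.1, 0.1], [0.3, 0.2]]

score_a = maxsim(query, page_a)  # best matches: 0.9 and 0.8
score_b = maxsim(query, page_b)  # best matches: 0.3 and 0.2
```

An index (PQ or otherwise) only approximates the inner dot products, so the comparison of interest is how often the approximate top results match the ones this exact computation returns.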
2025-2-10
Tested indexing performance on RTX 4090
Results for query: 'What is Sushan Wild party?'
Indexing | Time | Top results |
---|---|---|
No | 0.035 | data_test_1.png, Distance: 21.97092056274414; Exploring_the_Limits_of_Language_Modeling_page_10.png, Distance: 22.033401489257812 |
Yes | 0.019 | data_test_1.png, Distance: 10.978950500488281; Generating Sequences_With_Recurrent_Neural_Networks_page_31.png, Distance: 13.490697860717773 |
This looks good, but the index comes out different on every build, and most of the time indexing degrades the results. We need a solution for this.
2025-2-11
Went through the theoretical pipeline for building the best ColPali setup.
Made a plan:
Stage 1
Stage 2
Stage 3
2025-2-12
Native ColPali finished. Test results:
dataset: test-pdfs
Query 1: What is Sushan Wild party?
Query 2: Which party got more women in 112th?
Query 3: Who are transformers paper's authors?
Method | Query | Top results | Time |
---|---|---|---|
Native | Q1 | data_test_1.png (score: 10.004); gao-25-900570_page_74.png (score: 7.956) | 0.28765s |
Native | Q2 | data_test_2.png (score: 17.327); data_test_1.png (score: 14.918) | 0.08996s |
Native | Q3 | gao-25-900570_page_24.png (score: 9.214); gao-25-900570_page_7.png (score: 7.922) | 0.09184s |
2025-2-13
Problem 1: Qdrant's upsert() has an upload limit of 17, so we upsert the embeddings in a for loop.
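A rough sketch of that loop, assuming the limit means at most 17 points per call (the log does not say what the 17 counts). chunked, upsert_in_batches and upsert_fn are hypothetical names; in the real pipeline upsert_fn would wrap the Qdrant client call.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def upsert_in_batches(points: Sequence[T],
                      upsert_fn: Callable[[List[T]], None],
                      batch_size: int = 17) -> int:
    """Call `upsert_fn` once per batch; return the number of calls made."""
    calls = 0
    for batch in chunked(points, batch_size):
        upsert_fn(batch)
        calls += 1
    return calls


# Stand-in for a real client call: just record the batch sizes.
sent: List[int] = []
n_calls = upsert_in_batches(list(range(40)), lambda b: sent.append(len(b)))
```

With 40 dummy points this makes three calls of 17, 17 and 6 points.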
Today I implemented HNSW, mean_pooling_columns, and mean_pooling_rows, and got prefetch working.
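A minimal sketch of the row/column mean pooling, in plain Python. Assumptions not stated in this log: the first 32 × 32 = 1,024 vectors of a page are image patches in row-major order, and the remaining 6 vectors are extra tokens that are kept unpooled (this gives 32 + 6 = 38 vectors, which matches the pooled count used later in the log). pool_page is a hypothetical name.

```python
from typing import List, Tuple

Vector = List[float]


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def pool_page(embeddings: List[Vector],
              grid: int = 32) -> Tuple[List[Vector], List[Vector]]:
    """Return (pooled_rows, pooled_cols) for one page.

    Assumes the first grid*grid vectors are image patches in row-major
    order; any remaining vectors are appended unchanged.
    """
    n_patches = grid * grid
    patches, extra = embeddings[:n_patches], embeddings[n_patches:]
    rows = [patches[r * grid:(r + 1) * grid] for r in range(grid)]
    cols = [patches[c::grid] for c in range(grid)]
    pooled_rows = [mean(r) for r in rows] + extra
    pooled_cols = [mean(c) for c in cols] + extra
    return pooled_rows, pooled_cols


# 1,030 dummy 4-D vectors; vector i is filled with the value i.
page = [[float(i)] * 4 for i in range(1030)]
pooled_rows, pooled_cols = pool_page(page)
```

Each pooled list has 38 vectors here, so a first-stage search over it touches roughly 27 times fewer vectors per page than the full 1,030.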
Some questions have been answered:
Q1: How prefetch Works in Qdrant?
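    from qdrant_client.models import Prefetch, QueryRequest
    # search_queries, q_embedding and top_k come from the surrounding pipeline code.
    search_queries.append(
        QueryRequest(
            query=q_embedding,
            prefetch=[
                Prefetch(query=q_embedding, limit=200, using="mean_pooling_columns"),
                Prefetch(query=q_embedding, limit=200, using="mean_pooling_rows"),
            ],
            limit=top_k,
            with_payload=True,
            with_vector=False,
            using="original",
        )
    )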
This means:
Qdrant runs the prefetch queries first: it retrieves up to 200 candidates from "mean_pooling_columns" and up to 200 from "mean_pooling_rows". The main query then re-scores only those candidates against "original" and returns the top top_k.
- Prefetching (prefetch=[...])
Each prefetch is a sub-query that runs before the main query. Searching the small mean-pooled vectors is cheap, so this stage quickly narrows the collection down to a candidate set.
- Main query (using="original")
The search on "original" is restricted to the prefetched candidates, so the full multivector comparison is only computed for that set. The prefetch results decide which points can appear in the final ranking; the final scores come from "original".
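As a mental model, a query with prefetch behaves like a two-stage search: the prefetch sub-queries run first and produce candidates, and the main query then scores only those candidates. The sketch below mimics that flow on toy data in plain Python. It is not Qdrant code, and the helper names (top_ids, search) are invented for illustration.

```python
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim(query: List[Vector], doc: List[Vector]) -> float:
    """Sum over query vectors of the best dot product with any doc vector."""
    return sum(max(dot(q, d) for d in doc) for q in query)


def top_ids(query, index: Dict[str, List[Vector]], limit: int) -> List[str]:
    """IDs of the `limit` best-scoring entries in one named vector space."""
    ranked = sorted(index, key=lambda i: maxsim(query, index[i]), reverse=True)
    return ranked[:limit]


def search(query, pooled_cols, pooled_rows, original, prefetch_limit, top_k):
    # Stage 1 (prefetch): cheap searches over the small pooled vectors.
    candidates = set(top_ids(query, pooled_cols, prefetch_limit))
    candidates |= set(top_ids(query, pooled_rows, prefetch_limit))
    # Stage 2 (main query): full MaxSim, but only over the candidates.
    subset = {i: original[i] for i in candidates}
    return top_ids(query, subset, top_k)


original = {
    "a": [[1.0, 0.0], [0.0, 1.0]],
    "b": [[0.6, 0.6], [0.5, 0.5]],
    "c": [[-1.0, 0.0], [0.0, -1.0]],
}
# Toy "pooled" spaces: a single averaged vector per page.
pooled = {k: [[sum(c) / len(v) for c in zip(*v)]] for k, v in original.items()}
query = [[1.0, 0.0], [0.0, 1.0]]
hits = search(query, pooled, pooled, original, prefetch_limit=2, top_k=1)
```

In this toy case the pooled stage ranks "b" above "a", while the full vectors rank "a" first. That is the point of reranking on "original": the pooled stage only has to keep the right page somewhere in the candidate set.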
Q2: Why Use prefetch with mean_pooling_columns and mean_pooling_rows?
2025-2-14
81 Images
Search with prefetch
pipeline.search_with_text_queries.remote(queries, prefetch_size=20, top_k=3)
Query: What is Sushan Wild party?
- data_test_1.png (score: 10.005)
- Unifying_Multimodal_Retrieval_via_DSE_page_9.png (score: 9.797)
- Exploring_the_Limits_of_Language_Modeling_page_11.png (score: 9.752)
Query: Which party got more women in 112th?
- data_test_2.png (score: 17.327)
- data_test_1.png (score: 14.920)
- Generating Sequences_With_Recurrent_Neural_Networks_page_29.png (score: 9.291)
Query: Who are transformers paper's authors?
- attention_is_all_you_need_page_7.png (score: 12.354)
- attention_is_all_you_need_page_9.png (score: 11.396)
- attention_is_all_you_need_page_4.png (score: 11.261)
Search time: 0.12687s
Search without prefetch
pipeline.search_without_prefetch.remote(queries, top_k=3)
Query: What is Sushan Wild party?
- data_test_1.png (score: 10.005)
- Unifying_Multimodal_Retrieval_via_DSE_page_9.png (score: 9.797)
- Exploring_the_Limits_of_Language_Modeling_page_11.png (score: 9.752)
Query: Which party got more women in 112th?
- data_test_2.png (score: 17.327)
- data_test_1.png (score: 14.920)
- in_context_scheming_reasoning_paper_page_13.png (score: 9.379)
Query: Who are transformers paper's authors?
- attention_is_all_you_need_page_7.png (score: 12.354)
- attention_is_all_you_need_page_9.png (score: 11.396)
- attention_is_all_you_need_page_4.png (score: 11.261)
HNSW
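hnsw_config=HnswConfigDiff(m=0) # HNSW switched off
m: Number of edges per node in the index graph. Setting m=0 disables HNSW graph construction.
ef_construct: Number of neighbours to consider during index building. The larger the value, the more accurate the search and the more time required to build the index.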
Binary Quantization
Query: What is Sushan Wild party?
- Exploring_the_Limits_of_Language_Modeling_page_10.png (score: 1272.000)
- Exploring_the_Limits_of_Language_Modeling_page_11.png (score: 1272.000)
Query: Which party got more women in 112th?
- data_test_2.png (score: 1840.000)
- data_test_1.png (score: 1522.000)
Query: Who are transformers paper's authors?
- attention_is_all_you_need_page_7.png (score: 1396.000)
- attention_is_all_you_need_page_9.png (score: 1396.000)
Search time: 0.19166s
The results are not perfect: BQ reduces the accuracy of MaxSim search. Q1 no longer returns data_test_1.png in the top two, and several scores are tied.
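One way to see why binary quantization produces ties: once every component is reduced to a single bit, vectors with the same sign pattern become indistinguishable. The sketch below is only an illustration on toy 4-D vectors. The bit-agreement score is an assumption chosen for simplicity and is not Qdrant's internal scoring.

```python
from typing import List

Vector = List[float]
Bits = List[int]


def binarize(v: Vector) -> Bits:
    """1 bit per dimension: 1 if the component is positive, else 0."""
    return [1 if x > 0 else 0 for x in v]


def bit_agreement(a: Bits, b: Bits) -> int:
    """Number of positions where the two codes agree."""
    return sum(1 for x, y in zip(a, b) if x == y)


def maxsim_binary(query: List[Vector], page: List[Vector]) -> int:
    """MaxSim where each pairwise similarity is computed on binary codes."""
    q_bits = [binarize(q) for q in query]
    p_bits = [binarize(p) for p in page]
    return sum(max(bit_agreement(q, p) for p in p_bits) for q in q_bits)


def maxsim_float(query: List[Vector], page: List[Vector]) -> float:
    """Exact MaxSim on the original float vectors."""
    return sum(max(sum(x * y for x, y in zip(q, p)) for p in page)
               for q in query)


query = [[0.9, -0.2, 0.4, -0.7]]
page_1 = [[0.8, -0.1, 0.5, -0.6]]   # close to the query
page_2 = [[0.1, -0.9, 0.1, -0.1]]   # same signs, different magnitudes

float_scores = (maxsim_float(query, page_1), maxsim_float(query, page_2))
binary_scores = (maxsim_binary(query, page_1), maxsim_binary(query, page_2))
```

The float scores separate the two pages clearly, while the binary scores are identical. A common mitigation is to rescore the top binary candidates with the original vectors.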
2025-2-25
“Full Model vs. State Dict” in PyTorch
Finished classifier training.
2025-3-2
curl -X POST "https://tu-zhenzhao--stylemi-app-v2-api-service.modal.run/search?amount=5" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@nail/183491c3-519c-446a-92d2-0e0955ff1eba.jpg"
2025-4-6
Compared two versions of classify_images_async on 1600 images:
- Embedding version: 30 min
- Pre-embedded version: 10 s
2025-4-23
This code can handle a large image dataset. Try to upload 20k images.
For a batch of 1000: Memory 6.97GB, Low 3.41GB
Each hour costs $2.03 and processes ~4k images
$5 for 8k images
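A quick back-of-the-envelope check of these numbers, assuming cost scales linearly with image count at the logged rate. estimate_cost is a hypothetical helper, and the throughput is the rough ~4k images per hour from above.

```python
PRICE_PER_HOUR = 2.03      # USD per hour, from the log
IMAGES_PER_HOUR = 4_000    # approximate throughput, from the log


def estimate_cost(n_images: int) -> float:
    """Estimated cost in USD, assuming throughput stays constant."""
    hours = n_images / IMAGES_PER_HOUR
    return round(hours * PRICE_PER_HOUR, 2)


cost_8k = estimate_cost(8_000)     # 2 hours of runtime
cost_20k = estimate_cost(20_000)   # 5 hours of runtime
```

At this rate 8k images come to about $4.06, so the $5 figure above reads as a rounded-up budget, and the 20k upload would land near $10.15.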
2025-5-1
Iteration 1
Using the original method for reranking, with prefilter on for final_df (.where(f"id IN {tuple_ids}", prefilter=True)).
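One detail worth guarding against in that filter string: f"id IN {tuple_ids}" relies on Python's tuple repr, which renders a single-element tuple as (7,) with a trailing comma. Whether the filter parser accepts that was not checked here, so the sketch below builds the list explicitly. in_filter is a hypothetical helper, not part of LanceDB.

```python
from typing import Iterable, Union

Id = Union[int, str]


def in_filter(column: str, ids: Iterable[Id]) -> str:
    """Build `column IN (...)` without relying on Python's tuple repr."""
    def literal(value: Id) -> str:
        if isinstance(value, str):
            # SQL string literal: single quotes, embedded quotes doubled.
            return "'" + value.replace("'", "''") + "'"
        return str(int(value))

    parts = [literal(v) for v in ids]
    if not parts:
        raise ValueError("ids must not be empty")
    return f"{column} IN ({', '.join(parts)})"


naive_single = f"id IN {tuple([7])}"   # relies on tuple repr
single = in_filter("id", [7])
several = in_filter("id", [1, 2, 3])
names = in_filter("id", ["a", "o'b"])
```

The result can be passed to .where(..., prefilter=True) exactly as before. The empty-list case raises instead of producing an invalid IN () clause.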
Query: batman say 'Good shot robin, and now we'll see who our masked mystery man is' (Duration: Direct 21.34 s, Reranking 29.04 s)
Rank | Direct Search | Score | Reranking Search | Score |
---|---|---|---|---|
1 | 1_page_272.png | 0.6503 | 1_page_303.png | 0.6543 |
2 | 2_page_60.png | 0.6483 | 4_page_186.png | 0.6540 |
3 | 6_page_224.png | 0.6477 | 1_page_315.png | 0.6514 |
4 | 1_page_268.png | 0.6462 | 1_page_268.png | 0.6462 |
5 | 1_page_234.png | 0.6431 | 1_page_234.png | 0.6431 |
Query: Limit of LLM (Duration: Direct 20.16 s, Reranking 31.92 s)
Rank | Direct Search | Score | Reranking Search | Score |
---|---|---|---|---|
1 | Captain America vol 1 383 (1991) (c2ce-dcp)_page_3.png | 0.8310 | Marvel Universe v1 004_page_3.png | 0.9017 |
2 | Captain America vol 1 415 (1993) (c2ce-dcp)_page_8.png | 0.8257 | Marvel Universe v1 005_page_4.png | 0.8899 |
3 | Captain America vol 1 412 (1993) (c2ce-dcp)_page_12.png | 0.8245 | Marvel Universe v1 001_page_3.png | 0.8888 |
4 | Captain America vol 1 407 (1992) (c2ce-dcp)_page_15.png | 0.8162 | 352.08 Spider-Man V1 #19 (Digital)_page_14.png | 0.8878 |
5 | Captain America vol 1 382 (1991) (c2ce-dcp)_page_3.png | 0.8138 | Captain America vol 1 263 (1981) (c2ce) (Mazen-DCP)_page_22.png | 0.8454 |
Iteration 2
Now indexed pooling_rows and pooling_cols, but no scalar index and no original indexing.
Query: batman say 'Good shot robin, and now we'll see who our masked mystery man is' (Duration: Direct 21.34 s, Reranking 25.72 s)
Rank | Direct Search | Score | Reranking Search | Score |
---|---|---|---|---|
1 | 1_page_272.png | 0.6503 | 3_page_235.png | 0.6637 |
2 | 2_page_60.png | 0.6483 | 3_page_250.png | 0.6624 |
3 | 6_page_224.png | 0.6477 | 3_page_20.png | 0.6591 |
4 | 1_page_268.png | 0.6462 | 3_page_234.png | 0.6581 |
5 | 1_page_234.png | 0.6431 | 1_page_234.png | 0.6431 |
Query: Limit of LLM (Duration: Direct 20.16 s, Reranking 22.05 s)
Rank | Direct Search | Score | Reranking Search | Score |
---|---|---|---|---|
1 | Captain America vol 1 383 (1991) (c2ce-dcp)_page_3.png | 0.8310 | Captain America vol 1 283 (c2ce-dcp)_page_35.png | 0.9489 |
2 | Captain America vol 1 415 (1993) (c2ce-dcp)_page_8.png | 0.8257 | Marvel Universe v1 002_page_2.png | 0.9303 |
3 | Captain America vol 1 412 (1993) (c2ce-dcp)_page_12.png | 0.8245 | Marvel Universe v1 002_page_29.png | 0.9256 |
4 | Captain America vol 1 407 (1992) (c2ce-dcp)_page_15.png | 0.8162 | Captain America vol 1 284 (c2ce-dcp)_page_31.png | 0.9251 |
5 | Captain America vol 1 382 (1991) (c2ce-dcp)_page_3.png | 0.8138 | Marvel Universe v1 004_page_3.png | 0.9017 |
Iteration 3
Now indexed pooling_rows, pooling_cols, and the scalar index, but no original indexing.
Query: batman say 'Good shot robin, and now we'll see who our masked mystery man is' (Duration: Direct 21.34 s, Reranking 1.60 s)
Rank | Direct Search | Score | Reranking Search | Score |
---|---|---|---|---|
1 | 1_page_272.png | 0.6503 | 3_page_235.png | 0.6637 |
2 | 2_page_60.png | 0.6483 | 3_page_250.png | 0.6624 |
3 | 6_page_224.png | 0.6477 | 3_page_20.png | 0.6591 |
4 | 1_page_268.png | 0.6462 | 3_page_234.png | 0.6581 |
5 | 1_page_234.png | 0.6431 | 1_page_234.png | 0.6431 |
Query: Limit of LLM (Duration: Direct 20.16 s, Reranking 2.72 s)
Rank | Direct Search | Score | Reranking Search | Score |
---|---|---|---|---|
1 | Captain America vol 1 383 (1991) (c2ce-dcp)_page_3.png | 0.8310 | Captain America vol 1 283 (c2ce-dcp)_page_35.png | 0.9489 |
2 | Captain America vol 1 415 (1993) (c2ce-dcp)_page_8.png | 0.8257 | Marvel Universe v1 002_page_2.png | 0.9303 |
3 | Captain America vol 1 412 (1993) (c2ce-dcp)_page_12.png | 0.8245 | Marvel Universe v1 002_page_29.png | 0.9256 |
4 | Captain America vol 1 407 (1992) (c2ce-dcp)_page_15.png | 0.8162 | Captain America vol 1 284 (c2ce-dcp)_page_31.png | 0.9251 |
5 | Captain America vol 1 382 (1991) (c2ce-dcp)_page_3.png | 0.8138 | Marvel Universe v1 004_page_3.png | 0.9017 |
Iteration 4
Now indexed pooling_rows, pooling_cols, the scalar index, and original.
original indexing time
Query: batman say 'Good shot robin, and now we'll see who our masked mystery man is' (Duration: Direct 0.87 s, Reranking 1.60 s)
Rank | Direct Search | Score | Reranking Search | Score |
---|---|---|---|---|
1 | 5_page_156.png | 0.6910 | 3_page_235.png | 0.6637 |
2 | 3_page_173.png | 0.6894 | 3_page_250.png | 0.6624 |
3 | 2_page_159.png | 0.6868 | 3_page_20.png | 0.6591 |
4 | 2_page_385.png | 0.6832 | 3_page_234.png | 0.6581 |
5 | Captain America vol 1 400 (1992) (c2ce-dcp)_page_30.png | 0.6819 | 1_page_234.png | 0.6431 |
Query: Limit of LLM (Duration: Direct 0.36 s, Reranking 2.72 s)
Rank | Direct Search | Score | Reranking Search | Score |
---|---|---|---|---|
1 | Captain America vol 1 382 (1991) (c2ce-dcp)_page_3.png | 0.9471 | Captain America vol 1 283 (c2ce-dcp)_page_35.png | 0.9489 |
2 | Captain America vol 1 406 (1992) (c2ce-dcp)_page_24.png | 0.9397 | Marvel Universe v1 002_page_2.png | 0.9303 |
3 | Captain America vol 1 407 (1992) (c2ce-dcp)_page_15.png | 0.9275 | Marvel Universe v1 002_page_29.png | 0.9256 |
4 | Daredevil 157 (03-1979)(HD)(C2C)(RexTyler-DCP)_page_11.png | 0.9242 | Captain America vol 1 284 (c2ce-dcp)_page_31.png | 0.9251 |
5 | 06. Thor 373_page_12.png | 0.9238 | Marvel Universe v1 004_page_3.png | 0.9017 |
2025-5-6
Key facts about our table
- 25 000 rows – each is a ColPali page.
- multivector columns (pooled_rows, pooled_cols, original) hold ≈1 030 token-vectors per row.
- Only IVF-PQ + cosine is available for multivectors in LanceDB Cloud.
- id is high-cardinality (mostly unique). BTREE is the recommended scalar index.
- All calls are remote (RemoteTable) → every search is a separate HTTPS round-trip.
Baseline: no indices
stage | server work | latency |
---|---|---|
pooled_cols search | flat scan (25 k × 1 030 dots) | ≈ 10 s |
pooled_rows search | flat scan again | ≈ 10 s |
original refine | flat scan, then discard rows not in id IN (…) (post-filter, because no scalar idx) | ≈ 10 s |
total | 3 scans × 25 k rows | 32 – 35 s |
Plan shows full_scan: true, prefilter: false.
Add IVF-PQ on the two pooled columns
- What changed – the first two scans become ANN look-ups; each touches only a few hundred centroids + PQ codes.
- What did not change:
  - The IN (…) filter still has to walk the whole table (no scalar index).
  - Distance for the refine step is still computed on all 25 k rows, because the pre-filter has no fast way to eliminate them.
stage | latency now |
---|---|
pooled_cols (IVF-PQ, cosine) | ≈ 40 ms |
pooled_rows (IVF-PQ, cosine) | ≈ 40 ms |
original refine (flat scan) | ≈ 22–25 s |
total | 22 – 26 s (my log) |
The 9–10 s gain we observed comes from the two full scans we removed.
Add BTREE scalar index on id
- The WHERE id IN (…) PREFILTER now executes via the BTREE in O(|candidates| log N) instead of a table walk.
- Pre-filter runs before distance computation, so only the ≈ 200 candidate rows returned by the two pooled searches survive to the refine step.
- Distance ops: 200 rows × 1 030 ≈ 206 k dot-products → a few ms.
- Network overhead still 3 RTTs, but each request body is only a few kB.
stage | latency now |
---|---|
pooled_cols IVF-PQ | 40 ms |
pooled_rows IVF-PQ | 40 ms |
original refine on 200 rows | ~ 500 ms |
total (including 3× HTTPS) | 1.6 – 2.7 s |
Query-to-query variation (1.6 vs 2.7 s) is mostly:
- size of the IN (…) list (Batman query produced ≈180 IDs, LLM query ≈310),
- job queuing on the shared Cloud worker,
- network round-trip jitter.
Execution plan now shows
VectorSearch
index: IVF_PQ(cosine) -- for pooled vectors
prefilter: true
row_count: 197 -- refine touches only ~200 rows
full_scan: false -- ✅ no table-wide scan
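The index setup described in this entry could look roughly like the sketch below. The keyword arguments (metric, vector_column_name, index_type) follow the LanceDB Python API as I understand it and should be checked against the installed version; build_indices and RecordingTable are hypothetical names, and the stand-in table only records calls so the snippet runs without a database.

```python
from typing import Iterable


def build_indices(tbl,
                  pooled_columns: Iterable[str] = ("pooled_rows", "pooled_cols"),
                  id_column: str = "id") -> None:
    """Create a cosine vector index per pooled column and a BTREE on the id."""
    for column in pooled_columns:
        # Vector index on the pooled multivector column.
        tbl.create_index(metric="cosine", vector_column_name=column)
    # Scalar index so `id IN (...)` can run as a pre-filter.
    tbl.create_scalar_index(id_column, index_type="BTREE")


class RecordingTable:
    """Stand-in for a LanceDB table; records calls instead of running them."""

    def __init__(self):
        self.calls = []

    def create_index(self, **kwargs):
        self.calls.append(("create_index", kwargs))

    def create_scalar_index(self, column, **kwargs):
        self.calls.append(("create_scalar_index", column, kwargs))


fake = RecordingTable()
build_indices(fake)
```

Passing a real table object instead of RecordingTable would issue the same three calls; the "original" column is deliberately left out of pooled_columns.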
Why we didn’t index the “original” column
- At 25 k pages it was already < 1 s after rerank; indexing “original” would shave only a few hundred ms.
- Keeping it flat lets us A/B the quality difference between full token-set vs. pooled ANN embeddings.
If we need even lower latency later (or many more pages), we can build an HNSW on a single-vector surrogate (e.g., CLS-token embedding) or switch prefetch_limit down to 50.
Take-away table
configuration | pooled vectors index | scalar id index | refine rows | total latency |
---|---|---|---|---|
None | ✗ | ✗ | 25 000 | 32–35 s |
IVF-PQ on pooled | ✓ | ✗ | 25 000 | 22–26 s |
IVF-PQ + BTREE (final) | ✓ | ✓ | ~200 | 1.6–2.7 s |
Size of Multi-vector
Original | Pooled |
---|---|
1030 vectors × 128 dimensions = 131,840 float32 numbers | 38 vectors × 128 dimensions = 4,864 float32 numbers |
~527.36 KB per image | ~19 KB per image |
25,000 images = 13,184,000,000 bytes | 25,000 images = 486,400,000 bytes |
1030 × 25,000 = 25,750,000 vectors ≈ 25M | 38 × 25,000 = 950,000 vectors < 1M |
12.28 GB | 0.45 GB |
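The arithmetic behind this table can be reproduced with a few lines. This sketch assumes raw float32 storage only (4 bytes per number, no index or metadata overhead) and takes 38 vectors per pooled image from the "38 × 25,000" row; storage is a hypothetical helper.

```python
BYTES_PER_FLOAT32 = 4
DIM = 128
N_IMAGES = 25_000


def storage(vectors_per_image: int) -> dict:
    """Raw float32 storage for one multivector column (no index overhead)."""
    floats = vectors_per_image * DIM
    per_image = floats * BYTES_PER_FLOAT32
    total = per_image * N_IMAGES
    return {
        "floats_per_image": floats,
        "bytes_per_image": per_image,
        "total_bytes": total,
        "total_gib": round(total / 2**30, 2),
        "total_vectors": vectors_per_image * N_IMAGES,
    }


original = storage(1030)
pooled = storage(38)
```

The totals of 12.28 and 0.45 only come out when dividing by 2^30, so the GB figures in the table are binary gigabytes, while the per-image KB figures use 1 KB = 1,000 bytes.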
Eg:
Reranker. rerank number: 200