ColPali Log
2025-2-8
Implemented LanceDB indexing.
Big issue with LanceDB in a Modal container:
- If we run the indexing on Modal with LanceDB, the indices are created, but we encounter:

```
RuntimeError: lance error: LanceError(IO): Execution error: ExecNode(Take): thread panicked: task 25 panicked with message "called Result::unwrap() on an Err value: JoinError::Panic(Id(150),\"called Option::unwrap() on a None value\", ...)"
```

We had no choice, so we filed an issue: bug(python): tbl.create_index(metric="cosine") causes Rust panic in Modal container, but works locally · Issue #2105 · lancedb/lancedb
2025-2-9
Reorganized the code. Now the code is clean.
Tested the RuntimeError again by copying the locally built indices to the Modal container. Still got the same error.
Next, try quantization.
Insight: Indexing is crucial here. In the Colpali case, indexing does not reduce accuracy. Additionally, even a single image can be indexed effectively since it generates 1,030 vectors, providing sufficient data for PQ (Product Quantization) to learn features. The more images available, the better the indexing performance, approaching 99.999% of native MaxSim computation accuracy.
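For reference, the "native MaxSim computation" that the index approximates can be sketched in a few lines of NumPy. This is a minimal sketch with made-up shapes (the 128-dim vectors and ~1,030 patch vectors per page follow the figures above; the toy inputs are assumptions, not real embeddings):

```python
import numpy as np

def maxsim(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """ColBERT/ColPali-style late interaction: for each query token
    vector, take the similarity to its best-matching document vector,
    then sum over all query tokens."""
    # (n_query_tokens, n_doc_tokens) similarity matrix
    sims = query_vecs @ doc_vecs.T
    return float(sims.max(axis=1).sum())

# Toy example (shapes are assumptions: ~20 query tokens,
# ~1030 patch vectors per page, 128 dims).
rng = np.random.default_rng(0)
q = rng.standard_normal((20, 128))
d = rng.standard_normal((1030, 128))
score = maxsim(q, d)
```

Every page contributing ~1,030 such vectors is why PQ has enough training data even from a single image.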
2025-2-10
Test indexing performance on RTX 4090
Results for query: 'What is Sushan Wild party?'
| Indexing | Time (s) | Result |
|---|---|---|
| No | 0.035 | data_test_1.png (distance: 21.97092056274414), Exploring_the_Limits_of_Language_Modeling_page_10.png (distance: 22.033401489257812) |
| Yes | 0.019 | data_test_1.png (distance: 10.978950500488281), Generating Sequences_With_Recurrent_Neural_Networks_page_31.png (distance: 13.490697860717773) |
This looks good, but the index differs between runs, and most of the time indexing degrades the results. We need a solution for this.
2025-2-11
Went through the theoretical pipeline for building the best ColPali retrieval setup.
Make a plan:
Stage 1
Stage 2
Stage 3
2025-2-12
Native ColPali finished. Test results:
dataset: test-pdfs
Query 1: What is Sushan Wild party?
Query 2: Which party got more women in 112th?
Query 3: Who are transformers paper's authors?
| Method | Query | Result (score) | Time |
|---|---|---|---|
| Native | Q1 | data_test_1.png (score: 10.004), gao-25-900570_page_74.png (score: 7.956) | 0.28765s |
| Native | Q2 | data_test_2.png (score: 17.327), data_test_1.png (score: 14.918) | 0.08996s |
| Native | Q3 | gao-25-900570_page_24.png (score: 9.214), gao-25-900570_page_7.png (score: 7.922) | 0.09184s |
2025-2-13
Problem 1: Qdrant's upsert() has an upload limit (17 in our setup), so we upsert the embeddings in batches with a for loop.
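The batching itself is simple to sketch. The helper below is a minimal chunking sketch; the commented qdrant-client call and the names `points`, `client`, and `"colpali"` are assumptions for illustration, not the actual pipeline code:

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

def batches(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield fixed-size chunks so each upsert() call stays
    under the per-request upload limit."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])

# Hypothetical usage with qdrant-client (names are assumptions):
# for chunk in batches(points, size=16):
#     client.upsert(collection_name="colpali", points=chunk)
```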
Today I implemented HNSW, mean_pooling_columns, and mean_pooling_rows, and got prefetch working.
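The row/column mean pooling can be sketched in NumPy, assuming the page's patch embeddings come out row-major over a known grid (the grid size and layout are assumptions here; the real patch count and ordering depend on the ColPali model config):

```python
import numpy as np

def pool_rows_and_cols(patch_vecs: np.ndarray, n_rows: int, n_cols: int):
    """Mean-pool a page's patch embeddings along the two grid axes.

    patch_vecs: (n_rows * n_cols, dim) patch embeddings, row-major.
    Returns one pooled vector per row and one per column, i.e. far
    fewer vectors than the original multivector, which is what makes
    them cheap to search as a first stage.
    """
    dim = patch_vecs.shape[1]
    grid = patch_vecs.reshape(n_rows, n_cols, dim)
    mean_rows = grid.mean(axis=1)  # average across columns -> (n_rows, dim)
    mean_cols = grid.mean(axis=0)  # average across rows    -> (n_cols, dim)
    return mean_rows, mean_cols
```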
Some questions have been answered:
Q1: How does prefetch work in Qdrant?
```python
search_queries.append(
    QueryRequest(
        query=q_embedding,
        prefetch=[
            Prefetch(query=q_embedding, limit=200, using="mean_pooling_columns"),
            Prefetch(query=q_embedding, limit=200, using="mean_pooling_rows"),
        ],
        limit=top_k,
        with_payload=True,
        with_vector=False,
        using="original",
    )
)
```
This means:
The prefetch queries run first: Qdrant fetches up to 200 candidates each from "mean_pooling_columns" and "mean_pooling_rows". The outer query then scores only those candidates against "original" and returns the top top_k.
- Prefetch (prefetch=[...]): a cheap first stage over the pooled vectors that narrows the candidate set.
- Rerank (using="original"): the final ranking is computed with full MaxSim against "original", but only over the prefetched candidates instead of the whole collection.
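The prefetch-then-rerank pattern can be sketched outside Qdrant in plain NumPy. This is a simplified sketch: it uses a single mean-pooled vector per document for the coarse stage (the real setup uses row- and column-pooled multivectors), and all names and shapes are assumptions:

```python
import numpy as np

def maxsim(q, d):
    """Exact late-interaction score: best doc vector per query token, summed."""
    return float((q @ d.T).max(axis=1).sum())

def two_stage_search(q_vecs, docs_full, docs_pooled, prefetch=200, top_k=3):
    """Stage 1: cheap scoring against one pooled vector per document.
    Stage 2: exact MaxSim rerank, run only over the prefetched candidates."""
    q_pooled = q_vecs.mean(axis=0)
    coarse = docs_pooled @ q_pooled                # one dot product per document
    candidates = np.argsort(-coarse)[:prefetch]    # keep top `prefetch` docs
    exact = [(int(i), maxsim(q_vecs, docs_full[i])) for i in candidates]
    exact.sort(key=lambda t: -t[1])
    return exact[:top_k]
```

As long as the pooled stage keeps the true best documents inside the candidate set, the final ranking matches full MaxSim over the whole collection at a fraction of the cost.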
Q2: Why use prefetch with mean_pooling_columns and mean_pooling_rows? The pooled vectors are far fewer than the full multivectors, so the first-stage search is cheap; the expensive MaxSim scoring then only runs over the prefetched candidates.
2025-2-14
81 images
Search with prefetch
```python
pipeline.search_with_text_queries.remote(queries, prefetch_size=20, top_k=3)
```
Query: What is Sushan Wild party?
- data_test_1.png (score: 10.005)
- Unifying_Multimodal_Retrieval_via_DSE_page_9.png (score: 9.797)
- Exploring_the_Limits_of_Language_Modeling_page_11.png (score: 9.752)
Query: Which party got more women in 112th?
- data_test_2.png (score: 17.327)
- data_test_1.png (score: 14.920)
- Generating Sequences_With_Recurrent_Neural_Networks_page_29.png (score: 9.291)
Query: Who are transformers paper's authors?
- attention_is_all_you_need_page_7.png (score: 12.354)
- attention_is_all_you_need_page_9.png (score: 11.396)
- attention_is_all_you_need_page_4.png (score: 11.261)
Search time: 0.12687s
Search without prefetch
```python
pipeline.search_without_prefetch.remote(queries, top_k=3)
```
Query: What is Sushan Wild party?
- data_test_1.png (score: 10.005)
- Unifying_Multimodal_Retrieval_via_DSE_page_9.png (score: 9.797)
- Exploring_the_Limits_of_Language_Modeling_page_11.png (score: 9.752)
Query: Which party got more women in 112th?
- data_test_2.png (score: 17.327)
- data_test_1.png (score: 14.920)
- in_context_scheming_reasoning_paper_page_13.png (score: 9.379)
Query: Who are transformers paper's authors?
- attention_is_all_you_need_page_7.png (score: 12.354)
- attention_is_all_you_need_page_9.png (score: 11.396)
- attention_is_all_you_need_page_4.png (score: 11.261)
HNSW

```python
hnsw_config=HnswConfigDiff(m=0)  # HNSW switched off
```

m is the number of edges per node in the HNSW graph: the larger the value, the more accurate the search, but the more time is required to build the index. Setting m=0 disables HNSW entirely.
Binary Quantization
Query: What is Sushan Wild party?
- Exploring_the_Limits_of_Language_Modeling_page_10.png (score: 1272.000)
- Exploring_the_Limits_of_Language_Modeling_page_11.png (score: 1272.000)
Query: Which party got more women in 112th?
- data_test_2.png (score: 1840.000)
- data_test_1.png (score: 1522.000)
Query: Who are transformers paper's authors?
- attention_is_all_you_need_page_7.png (score: 1396.000)
- attention_is_all_you_need_page_9.png (score: 1396.000)
Search time: 0.19166s
The results are not perfect: binary quantization degrades the accuracy of MaxSim search, since similarities are computed on binarized vectors.
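This also explains the integer-valued scores above (e.g. the two documents tied at 1272.000). The sketch below is a simplified model of binary quantization, binarizing each dimension to ±1 by sign and scoring with dot products (Qdrant's actual BQ packs bits and uses bitwise operations, so this is an assumption-laden illustration, not its implementation):

```python
import numpy as np

def binarize(v: np.ndarray) -> np.ndarray:
    """1 bit per dimension: +1 if the component is positive, else -1."""
    return np.where(v > 0, 1.0, -1.0)

def bq_maxsim(q_vecs: np.ndarray, d_vecs: np.ndarray) -> float:
    """MaxSim computed on sign-binarized vectors. Scores become
    integers, and distinct documents whose vectors share the same
    sign pattern collapse to identical scores, which is where the
    accuracy loss comes from."""
    sims = binarize(q_vecs) @ binarize(d_vecs).T
    return float(sims.max(axis=1).sum())
```

The integer scores and ties in the binary-quantization results above are exactly this collapsing effect.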