Technical Report on ColPali
Efficient Document Retrieval with Vision Language Models
Introduction
ColPali is an innovative document retrieval system that directly leverages visual information within documents, specifically images of document pages, bypassing the traditional dependency on text-based extraction pipelines. By employing Vision Language Models (VLMs) and a late interaction matching mechanism, ColPali achieves efficient and accurate document retrieval. This report provides a detailed technical overview of ColPali's architecture and functioning.
Architectural Overview
ColPali's architecture consists of two primary phases, offline indexing and online querying, with late interaction scoring performed as the final step of querying:
- Indexing (Offline): Documents are indexed using a pre-trained VLM to generate rich contextual embeddings from document page images, which are stored for later retrieval.
- Querying (Online): User queries are embedded using the same VLM.
- Late Interaction Matching and Document Scoring (Online): Relevance scores between the query embeddings and the indexed document embeddings are calculated via ColPali's late interaction mechanism, and documents are ranked by these scores.
ColPali draws inspiration from the ColBERT architecture and applies the same kind of projection layer to reduce the embedding dimensionality to $D = 128$.
Here is the architecture of ColPali:
```mermaid
flowchart LR
    subgraph Input
        I1[Document Image<br>384x384] --> PE[Patch Embedding<br>28x28 patches]
        Q[Query] --> QE[Query Tokens]
    end
    subgraph PaliGemma
        PE --> VE[Vision Encoder<br>SigLIP]
        QE --> TE[Text Encoder<br>Gemma-2B]
        VE & TE --> Proj[Projection Layer<br>D=128]
    end
    subgraph Output
        Proj --> LI[Late Interaction<br>Matching]
        LI --> S[Similarity Score]
    end
    style Input fill:#f5f5f5,stroke:#333,stroke-width:2px
    style PaliGemma fill:#f5f5f5,stroke:#333,stroke-width:2px
    style Output fill:#f5f5f5,stroke:#333,stroke-width:2px
```
Indexing Phase
The indexing process begins with document images. Each page of a document is treated as a single unit. These images are of size $384 \times 384$ pixels.
Vision Encoder (PaliGemma-3B's Vision Component)
ColPali employs the vision encoder component of the PaliGemma-3B model architecture to extract meaningful features from document images. The process starts by dividing the input document image, which is of size $384 \times 384$ pixels, into a $28 \times 28$ grid of patches, 784 patches in total.
Specifically, the vision encoder is designed to output a set of patch embeddings $V = \{v_1, v_2, \dots, v_{784}\}$, where each $v_i$ is a 1024-dimensional vector encoding the visual content of the $i$-th patch.
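To make the shapes concrete, here is a minimal NumPy sketch of the patching step, with a small random linear map standing in for the SigLIP encoder. The patch grid (28x28) and embedding width (1024) follow the figures used in this report; the real encoder's preprocessing, patch size, and trained weights differ.

```python
import numpy as np

GRID = 28          # 28 x 28 patch grid, as in the architecture diagram
EMBED_DIM = 1024   # raw embedding width used throughout this report

def patchify(image: np.ndarray, grid: int = GRID) -> np.ndarray:
    """Split an (H, W, 3) image into grid*grid flattened patches."""
    h, w, c = image.shape
    ph, pw = h // grid, w // grid
    # 384 // 28 = 13, so this sketch crops to 364x364; the real model's
    # preprocessing divides evenly and does not crop.
    cropped = image[:ph * grid, :pw * grid]
    patches = (cropped
               .reshape(grid, ph, grid, pw, c)
               .transpose(0, 2, 1, 3, 4)
               .reshape(grid * grid, ph * pw * c))
    return patches                             # shape: (784, ph * pw * 3)

image = np.random.rand(384, 384, 3)            # a document page image
patches = patchify(image)                      # (784, 507)

# Stand-in for the SigLIP vision encoder: a fixed random projection.
W_enc = np.random.randn(patches.shape[1], EMBED_DIM) * 0.02
raw_embeddings = patches @ W_enc               # V = {v_1, ..., v_784}
print(raw_embeddings.shape)                    # (784, 1024)
```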
Contextualization: Language Model Integration (PaliGemma-3B's Language Component)
The contextualization of the image features begins by passing the vision encoder's output embeddings, denoted as $V = \{v_1, v_2, \dots, v_{784}\}$, into the language model component of PaliGemma-3B.
Note: Raw Embeddings vs. Contextualized Embeddings
It is critical to understand the transformation that is happening here. The embeddings $v_i$ produced by the vision encoder capture purely local visual information, while the contextualized embeddings $h_i$ produced by the language model (introduced below) integrate information from the whole page. The table below contrasts the two:
| Feature | Raw Embeddings ($v_i$) | Contextualized Embeddings ($h_i$) |
|---|---|---|
| Source | Vision Encoder (SigLIP) | Language Model (PaliGemma) |
| Primary Information | Local visual features: individual patch characteristics such as color, shape, and texture | Visual features plus spatial and layout context, with semantic influence |
| Context Awareness | Independent (unaware of any other patch or its location) | Interdependent and context-aware (through attention) |
| Dimensionality | 1024 | 1024 |
| Semantic Understanding | Limited visual recognition, without semantic influence | Influenced by the language model's knowledge of semantics |
| Patch Relationship | Each embedding is independent of the others | Each embedding is influenced by all the others |
| Usage | Raw input to the language model | Input to the projection layer, then used in the late interaction mechanism |
Specifically, the transformer layers within the language model, which include self-attention and feed-forward networks, operate on all of the patch embeddings jointly, allowing each patch's representation to be influenced by every other patch in the document image. This self-attention mechanism is what provides the essential contextualization of visual content: each patch embedding is replaced by a new high-dimensional vector that integrates not only the visual characteristics of that patch but also information about the spatial arrangement of the patches and their visual relationships within the document, as understood by the language model. The contextualized embeddings maintain the high dimensionality of the input; their size is determined by the internal dimension of the language model, which is 1024.
To denote these new contextualized embeddings, we will use $H = \{h_1, h_2, \dots, h_{784}\}$, where each $h_i$ is a 1024-dimensional contextualized vector corresponding to the $i$-th patch of the document image.
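To illustrate how contextualization mixes information across patches, here is a minimal single-head self-attention step in NumPy. It is a sketch of the generic attention mechanism, with small random matrices standing in for the language model's trained parameters; PaliGemma's actual layers are multi-headed and include feed-forward blocks.

```python
import numpy as np

def self_attention(H: np.ndarray, Wq, Wk, Wv) -> np.ndarray:
    """One single-head self-attention step: every row of the output
    is a weighted mix of *all* input rows (patches)."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])          # (784, 784) patch-to-patch affinities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over patches
    return weights @ V                              # each output row mixes all patches

d = 1024
H = np.random.randn(784, d)            # raw patch embeddings v_1..v_784
Wq, Wk, Wv = (np.random.randn(d, d) * 0.02 for _ in range(3))
H_ctx = self_attention(H, Wq, Wk, Wv)  # contextualized embeddings: (784, 1024)
```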
Projection Layer
Following the contextualization step, the 1024-dimensional contextualized embeddings $h_i$ are passed through a projection layer that reduces their dimensionality to $D = 128$.
This makes ColPali embeddings compatible with fast similarity search techniques. The projection layer itself performs a straightforward linear transformation: it involves a learnable weight matrix $W \in \mathbb{R}^{128 \times 1024}$, and each projected embedding is computed as $d_i = W h_i$.

After this operation, we have the final document patch embeddings $\{d_1, d_2, \dots, d_{784}\}$, where each $d_i$ is a 128-dimensional vector. These are the embeddings stored in the index for later retrieval.
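Here is a minimal sketch of this step, assuming a plain linear projection with no bias (the exact parameterization of the trained layer may differ):

```python
import numpy as np

H_ctx = np.random.randn(784, 1024)           # contextualized patch embeddings h_i
W_proj = np.random.randn(128, 1024) * 0.02   # learnable projection matrix W

doc_embeddings = H_ctx @ W_proj.T            # d_i = W h_i  ->  shape (784, 128)
print(doc_embeddings.shape)                  # (784, 128), ready to be indexed
```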
Querying Phase
The online querying phase of ColPali is where user queries are transformed into a set of embeddings and then matched against the document embeddings previously stored in the index. This process involves several key steps, which we'll explore in detail below.
Text Encoder (PaliGemma-3B's Language Component)
The user query is passed through the language model component of PaliGemma-3B. This stage begins with tokenization: the user query text is first broken down into individual words or sub-word units (tokens). Each of these tokens is then processed by the language model, producing a high-dimensional vector representation. Because the language model has been trained on vast corpora of text, it can associate meaningful vector embeddings with words and sub-word units. The language model output, which we will denote as $\{t_1, t_2, \dots, t_{N_q}\}$, is a set of 1024-dimensional token embeddings, one for each of the $N_q$ query tokens.
Note: Tokenization
Tokenization is a fundamental step in Natural Language Processing (NLP): it breaks text down into smaller units, known as tokens. These tokens can be words, sub-words (e.g., parts of words), or even characters, depending on the tokenizer's design. The tokenizer is a crucial component of the language model; in this case, PaliGemma-3B's language model tokenizer is employed. This tokenizer is a byte-pair encoding (BPE) tokenizer, which works in the following steps:
- Character Mapping: The tokenizer begins by mapping every Unicode character to an integer, so the input text is first converted to its corresponding integer representation.
- Vocabulary Creation: The tokenizer then creates a vocabulary set based on the frequencies of text patterns observed during its training. The vocabulary includes frequently used words and sub-word units; less frequent words may be split into smaller meaningful sub-word units. For example, a word like "unbelievable" can be split into the tokens "un", "believe", and "able".
- Byte-Pair Encoding (BPE): The core of the tokenization process is the byte-pair encoding algorithm. It starts with each character as a token and iteratively merges the most frequently observed adjacent token pairs. For example, given "low", "lower", and "lowest", BPE will observe that "low", "er", and "est" occur frequently, so each is likely to become a token. This creates sub-word units that can represent unseen words by combining previously learned tokens. The iterative merging continues until the token set covers most of the text in the training data; the BPE merges are learned only when the tokenizer is trained.
- Token Lookup: When tokenizing new text, the tokenizer converts the input to its integer representation and then greedily matches the longest possible sequences of characters to tokens, so the output consists of the longest frequent patterns.
- Token Representation: Finally, each token, whether a word or a sub-word unit, is represented by an integer ID from the vocabulary set.
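To make the lookup step concrete, here is a toy greedy longest-match tokenizer over a small hand-written vocabulary. It illustrates only the lookup behavior described above; the actual PaliGemma tokenizer uses a large vocabulary learned by BPE training.

```python
# Toy vocabulary: a hand-picked stand-in for a trained BPE vocabulary.
VOCAB = {"low": 0, "er": 1, "est": 2, "new": 3, "s": 4}

def tokenize(text: str, vocab: dict) -> list:
    """Greedy longest-match tokenization against a fixed vocabulary."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest substring starting at position i first.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append((piece, vocab[piece]))
                i = j
                break
        else:
            # Character not covered by the toy vocabulary: emit it raw.
            tokens.append((text[i], ord(text[i])))
            i += 1
    return tokens

print(tokenize("lowest", VOCAB))   # [('low', 0), ('est', 2)]
print(tokenize("newer", VOCAB))    # [('new', 3), ('er', 1)]
```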
Projection Layer
The high-dimensional token embeddings produced by the language model are now passed through the same linear projection layer used during indexing. Its primary purpose is to reduce the dimensionality of the embeddings to $D = 128$, so that the resulting query embeddings $\{q_1, q_2, \dots, q_{N_q}\}$ live in the same 128-dimensional space as the document patch embeddings stored in the index.
Late Interaction Matching and Document Scoring (Online)
Late Interaction Matching
The projected query embeddings $\{q_1, \dots, q_{N_q}\}$ are matched against the document patch embeddings $\{d_1, \dots, d_{784}\}$ previously stored in the index. For each query token embedding $q_j$, ColPali computes the dot product with every patch embedding $d_i$ and retains only the maximum value. This per-token maximum-similarity operation is the core of the late interaction mechanism.
Example

Query Token Embeddings ($N_q = 3$, from the query "energy consumption data"):
- $q_1$: embedding for "energy" (a 128-dimensional vector)
- $q_2$: embedding for "consumption" (a 128-dimensional vector)
- $q_3$: embedding for "data" (a 128-dimensional vector)

Document Patch Embeddings ($\{d_1, d_2, \dots, d_{784}\}$): each $d_i$ is a 128-dimensional vector.
Document Scoring
The relevance score of a document, given the user's query, is calculated by summing the maximum dot product values extracted in the previous step. This means each query token contributes to the overall score, and the document score reflects how well the document's content, as captured by its patch embeddings, aligns with the query tokens. This score is used as the final relevance score, by which documents are ranked and retrieved from the database.
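For instance, suppose the per-token maximum similarities from the example above were $s_1 = 0.92$ for "energy", $s_2 = 0.85$ for "consumption", and $s_3 = 0.78$ for "data" (illustrative numbers only). The document score would then be:

$$S = s_1 + s_2 + s_3 = 0.92 + 0.85 + 0.78 = 2.55$$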
Algorithm for Late Interaction Matching and Document Scoring
```python
import numpy as np

def late_interaction_matching(query_embeddings, document_embeddings):
    """
    Calculates the document score using late interaction matching.

    Args:
        query_embeddings (np.ndarray): Array of shape (Q, D) containing query token embeddings.
        document_embeddings (np.ndarray): Array of shape (P, D) containing document patch embeddings.

    Returns:
        float: The computed document score.
    """
    Q, D = query_embeddings.shape      # Number of query tokens and embedding dimension
    P, _ = document_embeddings.shape   # Number of patches
    document_score = 0.0               # Initialize document score
    for j in range(Q):
        query_embedding = query_embeddings[j]   # Current query token embedding
        max_dot_product_j = float('-inf')       # Max dot product for this token
        for i in range(P):
            patch_embedding = document_embeddings[i]                 # Current patch embedding
            dot_product = np.dot(query_embedding, patch_embedding)   # Dot product
            if dot_product > max_dot_product_j:
                max_dot_product_j = dot_product  # Update the running maximum
        document_score += max_dot_product_j      # Add this token's maximum to the score
    return document_score

def dot(vector1, vector2):
    """
    Calculate the dot product of two vectors (an explicit equivalent of np.dot).

    Args:
        vector1 (np.ndarray): Input vector.
        vector2 (np.ndarray): Input vector.

    Returns:
        float: The dot product of the two vectors.
    """
    return np.sum(vector1 * vector2)

# --- Example Usage ---
if __name__ == '__main__':
    # Example embeddings (replace with your actual embeddings)
    Q = 3     # Number of query tokens
    P = 784   # Number of document patches
    D = 128   # Embedding dimension
    query_embeddings = np.random.rand(Q, D)      # Random query embeddings
    document_embeddings = np.random.rand(P, D)   # Random patch embeddings
    score = late_interaction_matching(query_embeddings, document_embeddings)
    print(f"Document Score: {score:.4f}")
```
Mathematical Expression
Here's the formal mathematical representation of the process:
Inputs:

- Query Token Embeddings: Let $Q = \{q_1, q_2, \dots, q_{N_q}\}$ be the set of query token embeddings, where $j = 1, \dots, N_q$. Each $q_j$ is a $D$-dimensional vector representing the $j$-th token of the query, and $N_q$ is the total number of query tokens. We can also write this as a matrix of size $N_q \times D$.
- Document Patch Embeddings: Let $P = \{d_1, d_2, \dots, d_{N_p}\}$ be the set of document patch embeddings, where $i = 1, \dots, N_p$. Each $d_i$ is a $D$-dimensional vector representing the $i$-th patch of the document, and $N_p$ is the total number of document patches (in our example, $N_p = 784$). We can also write this as a matrix of size $N_p \times D$.
- $D$: the embedding dimension (set to $128$ in the ColPali paper).

1. Dot Product Calculation:

The dot product between a query token embedding $q_j$ and a document patch embedding $d_i$ is:

$$\langle q_j, d_i \rangle = \sum_{k=1}^{D} q_{j,k} \, d_{i,k}$$

where:
- $q_{j,k}$ is the $k$-th element of the vector $q_j$ for query token $j$
- $d_{i,k}$ is the $k$-th element of the vector $d_i$ for patch embedding $i$

2. Maximum Similarity Per Query Token:

For each query token $q_j$, we find the maximum dot product over all document patches:

$$s_j = \max\left(\langle q_j, d_1 \rangle, \langle q_j, d_2 \rangle, \dots, \langle q_j, d_{N_p} \rangle\right)$$

This expression can also be written with the max operator:

$$s_j = \max_{i \in \{1, \dots, N_p\}} \langle q_j, d_i \rangle$$

This finds the maximum of all the dot product values for the given query vector.

3. Document Scoring:

The final document score $S$ is the sum of the per-token maxima:

$$S = \sum_{j=1}^{N_q} s_j$$

This can also be written by substituting the expression for $s_j$:

$$S = \sum_{j=1}^{N_q} \max_{i \in \{1, \dots, N_p\}} \langle q_j, d_i \rangle$$

Combined Expression:

The full mathematical expression for Late Interaction Matching and Document Scoring can be written as:

$$S(Q, P) = \sum_{j=1}^{N_q} \max_{i \in \{1, \dots, N_p\}} \sum_{k=1}^{D} q_{j,k} \, d_{i,k}$$

Finally, these per-token maximum values are summed, creating the final document score.
Conclusion
This technical report has meticulously detailed the architecture and workings of ColPali, a novel document retrieval system that effectively bridges the gap between visual and textual understanding. By utilizing the inherent contextual awareness of a language model in processing visual content, ColPali captures the complex relationships between text, tables, figures, and layouts, something that is often overlooked by text-only systems. This design approach, inspired by the ColBERT architecture, enables a computationally efficient way for fast online querying and scaling to large document collections. ColPali not only matches queries to text content but also captures implicit visual cues, thereby enhancing retrieval relevance.
ColPali’s capability is not merely a technical advancement but a move towards more human-like information processing. By directly “looking” at a document as a human would, ColPali can unlock previously untapped information sources. Looking ahead, ColPali has great potential for further improvement and expansion.
Looking Ahead: Potential Enhancements

Beyond the current design, several concrete architectural and algorithmic enhancements could boost ColPali's performance. Here is a list of potential avenues, focusing on specific technical modifications:

- Vision Encoder Enhancements:
  - Exploring Newer Vision Models:
    - Rationale: Replacing the SigLIP-based vision encoder with more recent, higher-performing vision transformer models could improve raw visual feature extraction.
    - Specific Ideas:
      - Investigate models like DINOv2, or models trained at higher resolutions and on larger datasets.
      - Explore models that are specialized for document image understanding.
  - Multi-Resolution Feature Extraction:
    - Rationale: The current approach uses a single image resolution. Extracting features at multiple resolutions (e.g., with a feature pyramid network, or with features from intermediate layers of the vision encoder) could capture both fine-grained detail and overall layout structure.
    - Implementation: Experiment with feeding multi-resolution features into the contextualization layers.
- Contextualization Layer Improvements:
  - Advanced Attention Mechanisms:
    - Rationale: The current self-attention might be suboptimal; alternative attention mechanisms in the language model's transformer layers are worth exploring.
    - Specific Ideas:
      - Try sparse attention or linear attention to make self-attention more efficient.
      - Explore hierarchical attention that learns relationships between patches at multiple levels.
  - Incorporating Positional Information Explicitly:
    - Rationale: Although patch order provides some spatial information, more explicit positional encodings (e.g., learnable positional encodings, 2D positional embeddings) could further improve the model's ability to capture layout information and visual structure.
    - Implementation: Incorporate positional encodings before passing the visual embeddings to the language model.
- Projection Layer Optimization:
  - Non-Linear Projection:
    - Rationale: Explore the potential of non-linear projection layers instead of only a linear layer.
    - Implementation: Try non-linear projections such as an MLP or other small neural network blocks.
- Late Interaction Matching Refinements:
  - Learned Interaction Weights:
    - Rationale: The current approach sums the maximum dot products with equal weight for every query token. A better model could learn a weight for each query token based on its importance, so that not all tokens contribute equally to the document score.
    - Implementation: Introduce a learned weighting mechanism, i.e., trainable parameters that weight the contribution of each query token to the overall document score, for example by applying a learnable weight to each `max_dot_j` before summing (see the sketch after this list).
  - Contextualized Query-Patch Interactions:
    - Rationale: The current late interaction is a simple dot product. More complex interactions, for instance MLPs that consider the query token and patch embeddings jointly, could be introduced.
    - Implementation: Combine the representations of the query and patch embeddings and use an MLP for the similarity calculation.
- Loss Function Modifications:
  - Hard-Negative Mining:
    - Rationale: Better strategies for selecting negative samples, rather than random sampling, can improve training efficiency by focusing training on samples with large gradients.
    - Implementation: Investigate hard-negative mining techniques such as batch-hard or contrastive hard mining.
  - Margin-Based Losses:
    - Rationale: Employing a margin-based loss, instead of a cross-entropy loss, could help the model learn embeddings that are better separated in the vector space, leading to more refined retrieval.
    - Implementation: Try different margin losses in place of the softmaxed cross-entropy loss.
- Training Data Augmentation:
  - Synthetic Data Generation:
    - Rationale: Create large, diverse synthetic datasets by using language models to generate varied queries.
    - Implementation: Explore new methods for generating diverse synthetic data.
  - Image Data Augmentation:
    - Rationale: Apply augmentations such as rotation, noise, and random crops to document images to make the model more robust.
    - Implementation: Implement an image augmentation strategy for training.
- Architectural Modifications:
  - Multi-Vector Compression:
    - Rationale: Find better compression methods to reduce the index size of the multi-vector embeddings.
    - Implementation: Use methods such as clustering or centroid embeddings to reduce the index size.
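To make the learned interaction weights idea concrete, here is a minimal NumPy sketch in which a per-token weight vector, a stand-in for parameters that would be trained end-to-end, rescales each token's maximum similarity before summing. This is a hypothetical extension sketched for illustration, not part of the published ColPali model.

```python
import numpy as np

def weighted_late_interaction(query_embeddings, document_embeddings, token_weights):
    """Late interaction scoring with a learnable weight per query token.

    token_weights would be trained end-to-end in a real system; here it
    is simply a given vector of shape (Q,).
    """
    similarities = query_embeddings @ document_embeddings.T  # (Q, P)
    max_per_token = similarities.max(axis=1)                 # max_dot_j, shape (Q,)
    return float(np.dot(token_weights, max_per_token))       # weighted sum

Q, P, D = 3, 784, 128
query_embeddings = np.random.rand(Q, D)
document_embeddings = np.random.rand(P, D)
token_weights = np.array([0.5, 0.3, 0.2])  # hypothetical learned importances
print(weighted_late_interaction(query_embeddings, document_embeddings, token_weights))
```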
In conclusion, ColPali represents a significant stride towards document retrieval systems that can truly emulate human-like information processing. By combining the power of vision language models with late interaction mechanisms, ColPali not only addresses the limitations of prior systems but also sets the stage for new opportunities and applications in document retrieval. By understanding complex visual and textual context, document retrieval can have a crucial impact on how humans and machines collaborate in accessing and generating knowledge, and continued research into the collaboration of vision and language models will unlock untapped possibilities in the future of information processing.
References

Manuel Faysse et al., "ColPali: Efficient Document Retrieval with Vision Language Models," arXiv:2407.01449v3, 7 October 2024.