Technical Report of ColPali

Efficient Document Retrieval with Vision Language Models

Introduction

ColPali is an innovative document retrieval system that directly leverages visual information within documents, specifically images of document pages, bypassing the traditional dependency on text-based extraction pipelines. By employing Vision Language Models (VLMs) and a late interaction matching mechanism, ColPali achieves efficient and accurate document retrieval. This report provides a detailed technical overview of ColPali's architecture and functioning.

Architectural Overview

ColPali's architecture consists of two primary phases: indexing and querying.

ColPali draws inspiration from the ColBERT architecture and, in the same spirit, applies a projection layer that reduces the embedding dimensionality to 128.

Here is the architecture of ColPali:

flowchart LR
    subgraph Input
        I1[Document Image<br>384x384] --> PE[Patch Embedding<br>28x28 patches]
        Q[Query] --> QE[Query Tokens]
    end
    subgraph PaliGemma
        PE --> VE[Vision Encoder<br>SigLIP]
        QE --> TE[Text Encoder<br>Gemma-2B]
        VE & TE --> Proj[Projection Layer<br>D=128]
    end
    subgraph Output
        Proj --> LI[Late Interaction<br>Matching]
        LI --> S[Similarity Score]
    end
    style Input fill:#f5f5f5,stroke:#333,stroke-width:2px
    style PaliGemma fill:#f5f5f5,stroke:#333,stroke-width:2px
    style Output fill:#f5f5f5,stroke:#333,stroke-width:2px

Indexing Phase

The indexing process begins with document images. Each page of a document is treated as a single unit. Each page image is resized to 384×384 pixels, which the vision encoder divides into a 28×28 grid, i.e. 28×28 = 784 patches, each of which is converted into an embedding. We call these 784 embeddings the patch embeddings.

[Figure: Indexing pipeline. Input image (384×384 pixels) → patch grid (28×28 patches, 14×14 pixels each) → Vision Encoder (SigLIP) → raw patch embeddings (784 vectors, 1024 dimensions each, E_d^raw) → language model → contextualized embeddings (784 vectors, 1024 dimensions each, E_d^context) → projection layer → final embeddings (784 vectors, 128 dimensions each, E_d) → index.]

Vision Encoder (PaliGemma-3B's Vision Component)

ColPali employs the vision encoder component of the PaliGemma-3B model to extract meaningful features from document images. The process starts by dividing the input document image, which is of size 384×384 pixels, into a grid of non-overlapping patches of size 14×14 pixels. Since 384/14 ≈ 27.43 does not divide evenly, the model likely pads or adjusts the input internally so that the division results in an exact 28×28 patch grid, yielding 784 patches for each input image. Each of these 784 image patches is then processed through the vision encoder layers, and for each patch the encoder produces a high-dimensional vector representation, which we call its patch embedding.

[Figure: Patch grid. The 384×384-pixel input image is divided into 14×14-pixel patches; since 384/14 ≈ 27.43, padding is applied to obtain a complete 28×28 grid, i.e. 784 patch embeddings.]

Specifically, the vision encoder outputs a 1024-dimensional vector representation for each of the 784 patches. To denote these patch embeddings, we will use the notation $E_d^{raw}$, where $E_d^{raw}$ is a collection of embeddings with elements $e_{d_i}^{raw}$ for $i = 1, \ldots, 784$. Each individual $e_{d_i}^{raw}$ is a 1024-dimensional vector corresponding to the $i$th patch of the image.

$$E_d^{raw} = \{e_{d_1}^{raw}, e_{d_2}^{raw}, \ldots, e_{d_{784}}^{raw}\}$$

where each $e_{d_i}^{raw}$ is a 1024-dimensional vector.
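
As a shape sketch of this stage (a minimal illustration only: the SigLIP encoder itself is not reproduced here, and a random placeholder array stands in for its output):

import numpy as np

IMAGE_SIZE = 384                      # document images are resized to 384x384 pixels
PATCH_SIZE = 14                       # each patch covers 14x14 pixels
GRID_SIZE = 28                        # 384 / 14 ≈ 27.43, padded/adjusted to a 28x28 grid
NUM_PATCHES = GRID_SIZE * GRID_SIZE   # 784 patches per page
RAW_DIM = 1024                        # dimensionality of each raw patch embedding

# Placeholder for the vision encoder output: one 1024-d vector per patch.
E_d_raw = np.random.rand(NUM_PATCHES, RAW_DIM)
print(E_d_raw.shape)                  # (784, 1024)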

Contextualization: Language Model Integration (PaliGemma-3B's Language Component)

The contextualization of the image features begins by passing the output embeddings $E_d^{raw}$ from the vision encoder to the language model component of the PaliGemma-3B architecture. The language model, having been trained on vast text datasets, possesses the capability to understand semantic meanings and contextual relationships. This ability is key to contextualizing the visual features. The language model processes the input embeddings, treating each embedding as a token in a sequence.

Note: Raw Embeddings vs. Contextualized Embeddings

It is critical to understand the transformation that happens here. The embeddings $E_d^{raw}$, coming directly from the vision encoder (SigLIP), are representations of local visual features. Each such embedding contains information primarily about a single patch: it captures the color, texture, and shapes that exist within that patch, but it lacks information about how the patch spatially relates to other patches in the image. In contrast, the contextualized embeddings produced in this step integrate information about global and local relationships by leveraging the language model's attention mechanisms. This gives the model the ability not only to understand the shapes in an individual patch, but also the spatial relationships between patches and the relationships between layout elements such as titles and tables.

| Feature | Raw Embeddings ($E_d^{raw}$) | Contextualized Embeddings ($E_d^{context}$) |
|---|---|---|
| Source | Vision Encoder (SigLIP) | Language Model (PaliGemma) |
| Primary Information | Local visual features; individual patch characteristics such as color, shapes, and textures | Visual features plus spatial and layout context, with semantic influence |
| Context Awareness | Independent (no knowledge of other patches or their locations) | Interdependent and context-aware (through attention) |
| Dimensionality | 1024 | 1024 |
| Semantic Understanding | Limited visual recognition, without semantic influence | Influenced by the language model's knowledge of semantics |
| Patch Relationship | Each embedding is independent of the others | Each embedding is influenced by the others |
| Usage | Raw input to the language model | Input to the projection layer, then used in the late interaction mechanism |

Specifically, the transformer layers within the language model, which include self-attention and feed-forward networks, act on each of the embeddings, allowing each patch's representation to be influenced by all other patches within the document image. This self-attention mechanism is what provides the essential contextualization of visual content. Through it, each patch embedding is replaced with a new high-dimensional vector representation. Importantly, these new contextualized embeddings integrate not only the visual characteristics of a given patch but also information about the spatial arrangement of patches and their visual relationships within the document, as understood by the language model. The contextualized embeddings maintain the high dimensionality of the input; their dimensionality matches the internal dimension of the language model, which is 1024 in this case.

To notate these new contextualized embeddings, we will use $E_d^{context}$. This is a set of 784 patch embeddings, denoted as

$$E_d^{context} = \{e_{d_1}^{context}, e_{d_2}^{context}, \ldots, e_{d_{784}}^{context}\},$$

where each $e_{d_i}^{context}$ represents the $i$th patch and is a 1024-dimensional vector. It is important to mention here that the order of the patch embeddings is maintained throughout this process. The key impact of this step is to generate visual patch embeddings that encode the context of a given patch and its relationship to other patches in the same document. The dimension of these contextualized embeddings, set by the language model's internal dimension, is 1024 for PaliGemma-3B.
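
To illustrate the mechanism, the sketch below implements a single self-attention step in NumPy over a sequence of patch embeddings. This is a deliberately simplified stand-in: the actual PaliGemma language model stacks many multi-head attention and feed-forward layers, and its weights are learned, whereas here the projection matrices are random placeholders.

import numpy as np

def self_attention(E, W_q, W_k, W_v):
    """Single-head self-attention over a sequence of patch embeddings.

    E is an (N, D) array; the output is also (N, D), but every output
    vector now mixes information from all N input patches.
    """
    Q, K, V = E @ W_q, E @ W_k, E @ W_v              # query/key/value projections
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (N, N) attention logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over patches
    return weights @ V                               # (N, D) contextualized output

N, D = 784, 1024
E_d_raw = np.random.rand(N, D)                       # placeholder raw patch embeddings
W_q, W_k, W_v = (0.01 * np.random.rand(D, D) for _ in range(3))
E_d_context = self_attention(E_d_raw, W_q, W_k, W_v)
print(E_d_context.shape)                             # (784, 1024): order and size preserved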

Projection Layer

Following the contextualization step, the 1024-dimensional contextualized patch embeddings are passed through a linear projection layer designed to reduce their dimensionality to a lower dimension D, which is set to 128. The primary motivation for this dimensionality reduction is to increase computational efficiency and reduce the storage footprint, especially since the index needs to scale to large document corpora. This is a conscious design choice based on the ColBERT architecture, which also uses 128-dimensional output embeddings.

This makes ColPali embeddings compatible with fast similarity search techniques. The projection layer itself performs a straightforward linear transformation involving a learnable matrix $W$ and a bias vector $b$. The output of the projection is a new vector calculated by multiplying the input embedding by the matrix $W$ and then adding the bias vector $b$: $\text{new\_embedding} = W \cdot \text{input\_embedding} + b$. Through this transformation, each 1024-dimensional contextualized patch embedding is projected down to a 128-dimensional vector.

[Figure: Projection layer. Each of the 784 input embeddings (1024 dimensions) is multiplied by the matrix W (128×1024) and added to the bias b (128 dimensions), giving 784 output embeddings of 128 dimensions each: new_embedding = W × input_embedding + b.]

After this operation, we have 784 projected patch embeddings, denoted as $E_d = \{e_{d_1}, e_{d_2}, \ldots, e_{d_{784}}\}$, where each $e_{d_i}$ is now a 128-dimensional vector. These projected vectors are "ColBERT-style" embeddings, meaning they are dimensionally consistent with the outputs of ColBERT models. Finally, this set of projected embeddings $E_d$ (784 vectors of 128 dimensions each) for a given document page is stored in an index, which can be a vector database. This completes the indexing stage of document processing.
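
A minimal shape sketch of this projection step (the real $W$ and $b$ are learned during training; random placeholders are used here purely to show the dimensions involved):

import numpy as np

# Illustrative shapes only; W and b stand in for the learned parameters.
D_in, D_out, N = 1024, 128, 784
W = np.random.rand(D_out, D_in)        # (128, 1024) learnable projection matrix
b = np.random.rand(D_out)              # (128,) learnable bias

E_d_context = np.random.rand(N, D_in)  # 784 contextualized patch embeddings
E_d = E_d_context @ W.T + b            # new_embedding = W @ input_embedding + b
print(E_d.shape)                       # (784, 128) -> stored in the index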

Querying Phase

The online querying phase of ColPali is where user queries are transformed into a set of embeddings and then matched against the document embeddings previously stored in the index. This process involves several key steps, which we'll explore in detail below.

Text Encoder (PaliGemma-3B's Language Component)

The user query is passed through the language model component of PaliGemma-3B. This stage begins with tokenization: the query text is broken down into individual words or sub-word units (tokens). Each token is then processed by the language model, producing a high-dimensional vector representation. Because the language model has been trained on vast corpora of text, it can associate meaningful vector embeddings with words and sub-word units. The language model output, which we denote $E_q^{raw}$, is thus a set of embeddings whose elements $e_{q_j}^{raw}$ are high-dimensional vectors, each representing the $j$th query token. As in the indexing stage, the original dimension of these embeddings is 1024.

Note: Tokenization

Tokenization is a fundamental step in Natural Language Processing (NLP). It involves breaking down text into smaller units, known as tokens. These tokens can be words, sub-words (e.g., parts of words), or even characters, depending on the tokenizer's design. The tokenizer is a crucial component of the language model; in this case, the tokenizer of PaliGemma-3B's language model is employed. Such a tokenizer is typically a byte-pair encoding tokenizer, which works in the following steps.

  1. Character Mapping: The tokenizer begins by mapping every unicode character to an integer. At this stage, the input text is converted to its corresponding integer representation.
  2. Vocabulary Creation: The tokenizer then creates a vocabulary set based on the frequencies of text patterns observed during its training. The vocabulary includes frequently used words and sub-word units, while less frequent words may be split into smaller, meaningful sub-word units. For example, a word like "unbelievable" may be split into sub-word tokens such as "un", "believ", and "able".
  3. Byte-Pair Encoding (BPE): The core of the tokenization process lies in the byte-pair encoding algorithm. This algorithm starts with each character as a token and iteratively merges the most frequently observed token pairs. For example, given "low", "lower", and "lowest", the BPE algorithm will learn that the sequences "low", "er", and "est" occur frequently, so "low", "er", and "est" will likely become tokens. The algorithm thus creates sub-word units that can represent words by combining previously learned sub-word tokens, and the iterative merging continues until the resulting token set covers most of the text in the training data. The BPE merging is performed only when the tokenizer is being trained.
  4. Token Lookup: When tokenizing a text, the tokenizer converts the input text to its corresponding integer representation. Then it greedily looks for the longest possible sequence of characters to map to tokens. This ensures that the input tokens will consist of longest frequent patterns.
  5. Token Representation: Finally, each token, whether a word or a sub-word unit, is represented by an integer ID from the vocabulary set.
Input Text "Hello world!" Character Mapping Unicode IDs: [72, 101, 108, 108, 111] [119, 111, 114, 108, 100] [33] Greedy Token Lookup Tokens: ["Hello", "world", "!"] Token ID Mapping Vocabulary IDs: [15496, 1925, 0] Output IDs Tokenization Process: Text → Character IDs → Subword Tokens → Token IDs

Projection Layer

The high-dimensional token embeddings produced by the language model are now passed through a linear projection layer. The primary purpose of this projection layer is to reduce the dimensionality of the embeddings to 128, mirroring the approach used in the indexing phase, which makes the query embeddings compatible with the index and also reduces memory. This projection layer is identical to the one used in the indexing phase: the learnable parameters $W$ and $b$ are the same as those of the indexing-stage projection layer. Each 1024-dimensional token embedding from the previous step is transformed through the linear function, and the result is a new 128-dimensional vector. We denote the resulting set of embeddings as $E_q$, where each element $e_{q_j}$ is a 128-dimensional vector.
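
A small sketch of this weight sharing (shapes only; $W$ and $b$ stand in for the learned parameters, and the three-token query is hypothetical):

import numpy as np

D_in, D_out = 1024, 128
W = np.random.rand(D_out, D_in)    # same learned matrix as in the indexing phase
b = np.random.rand(D_out)          # same learned bias as in the indexing phase

def project(embeddings):
    """Shared linear projection applied to both document patches and query tokens."""
    return embeddings @ W.T + b

E_q_raw = np.random.rand(3, D_in)  # e.g. a 3-token query from the text encoder
E_q = project(E_q_raw)             # (3, 128), ready for late interaction matching
print(E_q.shape)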

Late Interaction Matching and Document Scoring (Online)

Late Interaction Matching

The projected query embeddings $E_q$, which are now 128-dimensional, are then matched against the document patch embeddings previously stored in the index. This part of the architecture implements a "late interaction" mechanism in which relationships between query tokens and document patches are computed. The method calculates a similarity score between each query token embedding and each document patch embedding: for each such pair, a simple dot product of the query and document embeddings is calculated. Then, for every query token, the maximum dot product value across all document patch embeddings is extracted. This effectively identifies the document patch most relevant to each specific query token.

[Figure: Late interaction process. Dot products are computed between each query token embedding (here 3 tokens × 128 dims) and every document patch embedding (784 patches × 128 dims), yielding a 3×784 similarity matrix; the maximum similarity per query token is taken, and these maxima are summed to obtain the document score S.]

Example

Query Token Embeddings ($E_q$): Let's assume a user query like "energy consumption data", which tokenizes into three tokens: "energy", "consumption", and "data". After the text encoder and projection layer, we obtain three 128-dimensional query embeddings, $e_{q_1}$, $e_{q_2}$, and $e_{q_3}$.

Document Patch Embeddings ($E_d$): Assume we have 784 patch embeddings for our chosen document page, each with 128 dimensions.
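
To make the computation concrete, here is a reduced toy calculation with made-up illustrative numbers, using 2-dimensional vectors and 4 patches instead of 128 dimensions and 784 patches:

import numpy as np

# Toy numbers for illustration only (2-d vectors, 4 patches).
E_q = np.array([[1.0, 0.0],   # "energy"
                [0.0, 1.0],   # "consumption"
                [1.0, 1.0]])  # "data"
E_d = np.array([[0.9, 0.1],
                [0.2, 0.8],
                [0.5, 0.5],
                [0.1, 0.1]])

similarity = E_q @ E_d.T                # 3 x 4 matrix of dot products
max_per_token = similarity.max(axis=1)  # [0.9, 0.8, 1.0]
score = max_per_token.sum()             # 2.7 -> document score for this page
print(similarity)
print(max_per_token, score)

Each query token picks out its best-matching patch (0.9, 0.8, and 1.0 respectively), and the document score is their sum, 2.7.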

Document Scoring

The relevance score of a document, given the user's query, is calculated by summing all the maximum dot product values that were extracted in the previous step. This means that each query token contributes to the overall score, and the document score reflects how well the document's content, as captured by the patch embeddings, aligns with the query tokens. This score is used as the final score by which documents are ranked and retrieved from the database.

Algorithm for Late Interaction Matching and Document Scoring

import numpy as np

def late_interaction_matching(query_embeddings, document_embeddings):
    """
    Calculates the document score using late interaction matching.

    Args:
        query_embeddings (np.array): Array of shape (Q, D) containing query token embeddings.
        document_embeddings (np.array): Array of shape (P, D) containing document patch embeddings.

    Returns:
        float: The computed document score.
    """
    Q, D = query_embeddings.shape # Number of query tokens and embedding dimension
    P, _ = document_embeddings.shape # Number of patches

    document_score = 0.0 # Initialize document score

    for j in range(Q):
        query_embedding = query_embeddings[j] # Select the current query token embedding
        max_dot_product_j = float('-inf')  # Initialize maximum dot product for this token to negative infinity

        for i in range(P):
            patch_embedding = document_embeddings[i] # Select the current patch embedding
            dot_product = np.dot(query_embedding, patch_embedding) # Calculate the dot product

            if dot_product > max_dot_product_j:
                max_dot_product_j = dot_product # Update maximum if needed

        document_score += max_dot_product_j # Add the maximum to the document score

    return document_score


def dot(vector1, vector2):
    """
    Calculate the dot product of two vectors (an explicit equivalent of np.dot).

    Args:
        vector1 (np.array): input vector
        vector2 (np.array): input vector

    Returns:
        float: the dot product of the two vectors
    """
    return np.sum(vector1 * vector2)

# --- Example Usage ---
if __name__ == '__main__':
    # Example Embeddings (replace with your actual embeddings)
    Q = 3   # Number of Query Tokens
    P = 784 # Number of Document Patches
    D = 128 # Embedding Dimension
    query_embeddings = np.random.rand(Q, D) # Generate Random query embeddings
    document_embeddings = np.random.rand(P, D) # Generate random patch embeddings

    score = late_interaction_matching(query_embeddings, document_embeddings) # Calculate the score
    print(f"Document Score: {score:.4f}")

Mathematical Expression

Here's the formal mathematical representation of the process:

Inputs:

  • Query embeddings $E_q = \{e_{q_1}, e_{q_2}, \ldots, e_{q_Q}\}$: one 128-dimensional vector per query token ($Q$ tokens in total).
  • Document patch embeddings $E_d = \{e_{d_1}, e_{d_2}, \ldots, e_{d_P}\}$: one 128-dimensional vector per patch ($P = 784$ patches per page).
  • $D = 128$: the embedding dimension.

1. Dot Product Calculation:

The dot product between a query token embedding eqj and a document patch embedding edi is given by:

$$\mathrm{dot}(e_{q_j}, e_{d_i}) = e_{q_j} \cdot e_{d_i} = \sum_{k=1}^{D} e_{q_j,k} \, e_{d_i,k}$$

where $e_{q_j,k}$ and $e_{d_i,k}$ denote the $k$th components of the query token embedding $e_{q_j}$ and the document patch embedding $e_{d_i}$, respectively, and $D = 128$ is the embedding dimension.

2. Maximum Similarity Per Query Token:

For each query token eqj, find the maximum dot product across all document patch embeddings:

$$\mathrm{maxdot}_j = \max_{i=1,\dots,P} \left( e_{q_j} \cdot e_{d_i} \right)$$

This expression can also be written explicitly with the max function:

$$\mathrm{maxdot}_j = \max\big(\mathrm{dot}(e_{q_j}, e_{d_1}),\ \mathrm{dot}(e_{q_j}, e_{d_2}),\ \ldots,\ \mathrm{dot}(e_{q_j}, e_{d_P})\big)$$

This selects the maximum of all the dot product values for a given query token.

3. Document Scoring:

The final document score $S(d,q)$ for document $d$ and query $q$ is calculated by summing the maximum dot products over all query tokens:

$$S(d,q) = \sum_{j=1}^{Q} \mathrm{maxdot}_j$$

This can also be written by substituting $\mathrm{maxdot}_j$:

$$S(d,q) = \sum_{j=1}^{Q} \max_{i=1,\dots,P} \left( e_{q_j} \cdot e_{d_i} \right)$$

Combined Expression:

The full mathematical expression for Late Interaction Matching and Document Scoring can be written as:

$$S(d,q) = \sum_{j=1}^{Q} \max_{i=1,\dots,P} \left( \sum_{k=1}^{D} e_{q_j,k} \, e_{d_i,k} \right)$$

Finally, these maximum values are summed up, creating the final document score.

Conclusion

This technical report has detailed the architecture and workings of ColPali, a novel document retrieval system that effectively bridges the gap between visual and textual understanding. By utilizing the inherent contextual awareness of a language model in processing visual content, ColPali captures the complex relationships between text, tables, figures, and layouts, something that is often overlooked by text-only systems. This design approach, inspired by the ColBERT architecture, enables computationally efficient online querying and scaling to large document collections. ColPali not only matches queries to text content but also captures implicit visual cues, thereby enhancing retrieval relevance.

ColPali’s capability is not merely a technical advancement but rather a move towards more human-like information processing. By directly “looking” at a document as a human would, ColPali can unlock previously untapped information sources. Looking ahead, ColPali has a great potential for further improvement and expansion.

Below is a list of potential avenues for enhancing ColPali, focusing on concrete architectural and algorithmic modifications:

  1. Vision Encoder Enhancements:

    • Exploring Newer Vision Models:

      • Rationale: Replacing the SigLIP-based vision encoder with more recent, higher-performing vision transformer models. This could improve the raw visual feature extraction.
      • Specific Ideas:
        • Investigate models like DINOv2, or models trained with higher resolutions and larger datasets.
        • Explore models that are specialized for document image understanding.
    • Multi-Resolution Feature Extraction:

      • Rationale: Current approach uses a single resolution of images. Extracting features at multiple resolutions (e.g., by using a feature pyramid network, or using features from intermediate layers of vision encoders) can capture more granular detail and overall layout structure.
      • Implementation: Experiment with multi-resolution approaches to feed to contextualization layers.
  2. Contextualization Layer Improvements:

    • Advanced Attention Mechanisms:

      • Rationale: Exploring different attention mechanisms in the language model's transformer layers. The current self-attention might be suboptimal.
      • Specific Ideas:
        • Try using sparse attention, or linear attention, to make self-attention more efficient.
        • Explore hierarchical attention that learns relationships between patches at multiple levels.
        • Integrate multi-head attention mechanism for improved attention and context.
    • Incorporating Positional Information Explicitly:

      • Rationale: Although patch order provides some spatial information, providing more explicit positional encodings (e.g. learnable positional encodings, 2D positional embeddings) to the language model could further improve its ability to capture layout information and visual structure.
      • Implementation: Incorporate position encoding before passing visual embeddings to the language model.
  3. Projection Layer Optimization:

    • Non-Linear Projection:
      • Rationale: Explore the potential of non-linear projection layers, instead of only the linear layer.
      • Implementation: Explore non-linear projection such as MLP or other neural network blocks in the projection layer.
  4. Late Interaction Matching Refinements:

    • Learned Interaction Weights:

      • Rationale: The current approach sums the maximum dot products with equal weight for each query token. A better model could learn a weight for each query token based on its importance, so that not all tokens contribute equally to the document score.
      • Implementation: Introduce a learned weighting mechanism or trainable parameters that weight the contribution of different query tokens to the overall document score, for example by adding learnable weights to every max_dot_j before summing (a minimal sketch of this idea is given after this list).
    • Contextualized Query-Patch Interactions:

      • Rationale: The current late interaction is a simple dot product. Introduce more complex interactions, for instance, by using MLPs, to consider both query token and patch embeddings together.
      • Implementation: Implement a method to combine representations of query and patch embeddings and use an MLP for similarity calculations.
  5. Loss Function Modifications:

    • Hard-Negative Mining:
      • Rationale: Using better strategies for selecting negative samples, rather than random sampling, can improve training efficiency by focusing training on samples that produce large gradients.
      • Implementation: Investigate different hard-negative mining techniques such as batch-hard or contrastive hard mining.
    • Margin-Based Losses:
      • Rationale: Employing a margin-based loss, instead of cross entropy loss, can help the model to learn embeddings that are better separated in the vector space. This will lead to more refined retrieval.
      • Implementation: Try different types of margin losses instead of softmaxed cross entropy loss.
  6. Training Data Augmentation:

    • Synthetic Data Generation:
      • Rationale: Create large diverse synthetic data using language models to generate various queries.
      • Implementation: Explore new methods for generating diverse synthetic data.
    • Data Augmentation:
      • Rationale: Use data augmentation such as rotation, noise, and random cropping on images to make the model more robust.
      • Implementation: Implement a strategy for augmenting the training images.
  7. Architectural Modifications:

    • Multi-Vector Compression:
      • Rationale: Find better compression methods to reduce the index size of multi-vector embeddings.
      • Implementation: Use methods such as clustering or centroid embeddings to reduce the index size.
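
As referenced in item 4 above, a minimal sketch of learned query-token weighting could look like the following. The weights here are passed in by hand for illustration; in a real system they would be produced by a small trained module (not shown).

import numpy as np

def weighted_late_interaction(query_embeddings, document_embeddings, token_weights):
    """Variant of late interaction where each query token's maximum similarity
    is scaled by a (hypothetically learned) importance weight before summing."""
    similarity = query_embeddings @ document_embeddings.T   # (Q, P) dot products
    max_per_token = similarity.max(axis=1)                  # best patch per query token
    return float(np.dot(token_weights, max_per_token))      # weighted sum instead of plain sum

rng = np.random.default_rng(0)
q = rng.random((3, 128))                 # 3 query token embeddings
d = rng.random((784, 128))               # 784 document patch embeddings
weights = np.array([1.0, 1.0, 0.2])      # e.g. down-weight a less informative token
print(weighted_late_interaction(q, d, weights))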

In conclusion, ColPali represents a significant stride towards document retrieval systems that can truly emulate human-like information processing. By combining the power of vision language models and late interaction mechanisms, ColPali not only addresses the limitations of prior systems but also sets the stage for new opportunities and applications in document retrieval. By understanding complex visual and textual context, systems like ColPali will have a crucial impact on how humans and machines collaborate in accessing and generating knowledge. Continued research into the collaboration of vision and language models will unlock untapped possibilities in the future of information processing.

Reference

Manuel Faysse et al., "ColPali: Efficient Document Retrieval with Vision Language Models," arXiv:2407.01449v3, 7 Oct 2024.