What is a Vision LLM?

Understanding Vision LLMs and Document Screenshot Embedding (DSE)

Initial Confusion Points

  1. Vision LLM Input Processing

    • Initially confused about whether a Vision LLM takes raw images or precomputed vectors
    • Unclear about how text and images are processed together
    • Wondered if vectors get converted back to tokens
  2. DSE Architecture

    • Wasn't sure how document screenshots are processed
    • Confused about whether a separate query encoder is needed
    • Questions about how vectors are matched for retrieval

Key Insights and Clarifications

1. Vision LLM Internal Workings

Text Processing Path

Text -> Tokenizer -> Token IDs -> Embedding Vectors -> Transformer
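
For concreteness, here is a minimal sketch of this path with the Hugging Face transformers library; the bert-base-uncased checkpoint is just an arbitrary stand-in:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Text -> Token IDs
inputs = tokenizer("What is shown in this image?", return_tensors="pt")
print(inputs["input_ids"])  # integer token IDs, shape (1, T)

# Token IDs -> Embedding Vectors (what the transformer actually consumes)
embeddings = model.get_input_embeddings()(inputs["input_ids"])
print(embeddings.shape)     # (1, T, hidden_dim), e.g. (1, 9, 768)
```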

Image Processing Path

Image -> Vision Encoder -> Vectors -> Transformer
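
And a matching sketch for the image path, using a CLIP vision encoder as a stand-in (the checkpoint and file name are illustrative, not what any specific Vision LLM uses):

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("document.png")  # hypothetical screenshot file
pixel_values = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    outputs = model(pixel_values=pixel_values)

# One vector per image patch (plus a CLS-style token), ready for a transformer
print(outputs.last_hidden_state.shape)  # (1, num_patches + 1, hidden_dim)
```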

Critical Understanding

Both paths converge on the same thing: a sequence of vectors in the model's hidden dimension. The transformer never sees raw text or raw pixels, and vectors are not converted back into tokens along the way; once embedded, text tokens and image patches are processed identically.
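
A toy illustration of that point with random stand-in tensors; a real Vision LLM projects patch embeddings into the language model's hidden size and splices them in at image-placeholder positions, but the transformer still receives just one sequence of vectors:

```python
import torch

hidden_dim = 768
token_embeds = torch.randn(1, 9, hidden_dim)   # stand-in for the text path output
patch_embeds = torch.randn(1, 50, hidden_dim)  # stand-in for the vision path output

# One combined sequence; the transformer cannot tell which positions
# started life as text and which as image patches.
inputs_embeds = torch.cat([patch_embeds, token_embeds], dim=1)
print(inputs_embeds.shape)  # (1, 59, 768)

# Hugging Face language models accept such sequences directly via
# model(inputs_embeds=inputs_embeds), bypassing the token-ID lookup.
```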

2. DSE Architecture Understanding

Document Processing

Screenshot -> 4 Sub-images (2x2) -> Patch Sequences (24x24 patches per sub-image)
  -> Vision Encoder -> Patch Embeddings
  -> Language Model (Patch Embeddings + Prompt: "What is shown in this image?")
  -> Final Document Embedding
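
A small sketch of the first step, tiling a screenshot into a 2x2 grid with PIL; the function and file name here are hypothetical:

```python
from PIL import Image

def split_into_subimages(screenshot: Image.Image, grid: int = 2) -> list:
    """Split a screenshot into grid x grid tiles (2x2 -> 4 sub-images)."""
    w, h = screenshot.size
    sub_w, sub_h = w // grid, h // grid
    return [
        screenshot.crop((c * sub_w, r * sub_h, (c + 1) * sub_w, (r + 1) * sub_h))
        for r in range(grid)
        for c in range(grid)
    ]

sub_images = split_into_subimages(Image.open("page.png"))  # hypothetical file
# The vision encoder then patchifies each tile (24x24 patches per sub-image).
```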

Query Processing

Text Query -> Tokenizer -> Language Model (shared weights) -> Query Embedding
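
A hedged sketch of query encoding, assuming a decoder-style embedder with last-token pooling (the pooling choice and the checkpoint name are assumptions, not confirmed by these notes):

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "my-org/dse-model"  # placeholder, not a real repository
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)  # same weights used for documents

inputs = tokenizer("climate change effects", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, T, hidden_dim)

# Pool the sequence into a single query vector; last-token pooling
# is a common choice for decoder-style embedders (assumption).
query_embedding = hidden[:, -1]  # (1, hidden_dim)
```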

Why This Works

  1. Unified representation

    • Both documents and queries become vectors
    • Same language model processes both
    • Enables direct similarity comparison
  2. Efficient Design

    • No parsing of different document formats is required
    • Preserves layout and visual information
    • Fast retrieval using vector similarity (see the sketch after this list)
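
A minimal sketch of that similarity search, with random stand-in embeddings:

```python
import torch
import torch.nn.functional as F

# Toy corpus: one embedding per document screenshot, plus one query embedding.
doc_embeddings = F.normalize(torch.randn(1000, 768), dim=-1)
query_embedding = F.normalize(torch.randn(1, 768), dim=-1)

# On normalized vectors, cosine similarity is just a dot product.
scores = query_embedding @ doc_embeddings.T  # (1, 1000)
top_k = scores.topk(5).indices               # the 5 best-matching documents
print(top_k)
```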

Learning Journey Summary

  1. Started with confusion about Vision LLM internals
  2. Breakthrough: Understanding everything becomes vectors
  3. Key realization: Transformer only sees vectors
  4. Final clarity: DSE leverages this for unified document processing

Practical Implications

  1. For Vision LLMs:

    • Can process any input that's convertible to vectors
    • Text and images can be processed uniformly
    • No fundamental difference once converted to vectors
  2. For Document Processing:

    • Screenshots can be universal input format
    • No need for complex parsing
    • Can handle mixed modality content naturally

Remember

Everything becomes vectors, and the transformer only ever sees vectors. DSE leverages exactly this: documents enter as screenshots, queries enter as text, and both come out of the same language model as embeddings in one shared space.