What is a Vision LLM?

Understanding Vision LLMs and Document Screenshot Embedding (DSE)

Initial Confusion Points

  1. Vision LLM Input Processing

    • Initially confused about whether a Vision LLM takes raw images or precomputed vectors
    • Unclear about how text and images are processed together
    • Wondered if vectors get converted back to tokens
  2. DSE Architecture

    • Wasn't sure how document screenshots are processed
    • Confused about whether a separate query encoder is needed
    • Questions about how vectors are matched for retrieval

Key Insights and Clarifications

1. Vision LLM Internal Workings

Text Processing Path

Text -> Tokenizer -> Token IDs -> Embedding Vectors -> Transformer
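
For concreteness, here is a minimal sketch of this path with the Hugging Face transformers library; the bert-base-uncased checkpoint is just an arbitrary stand-in:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Text -> Token IDs
inputs = tokenizer("What is shown in this image?", return_tensors="pt")
print(inputs["input_ids"])  # integer token IDs, shape (1, T)

# Token IDs -> Embedding Vectors (what the transformer actually consumes)
embeddings = model.get_input_embeddings()(inputs["input_ids"])
print(embeddings.shape)     # (1, T, hidden_dim), e.g. (1, 9, 768)
```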

Image Processing Path

Image -> Vision Encoder -> Vectors -> Transformer
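
And a matching sketch for the image path, using a CLIP vision encoder as a stand-in (the checkpoint and file name are illustrative, not what any specific Vision LLM uses):

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("document.png")  # hypothetical screenshot file
pixel_values = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    outputs = model(pixel_values=pixel_values)

# One vector per image patch (plus a CLS-style token), ready for a transformer
print(outputs.last_hidden_state.shape)  # (1, num_patches + 1, hidden_dim)
```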

Critical Understanding

Both paths converge on the same thing: a sequence of vectors in the model's hidden dimension. The transformer never sees raw text or raw pixels, and vectors are not converted back into tokens along the way; once embedded, text tokens and image patches are processed identically.
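
A toy illustration of that point with random stand-in tensors; a real Vision LLM projects patch embeddings into the language model's hidden size and splices them in at image-placeholder positions, but the transformer still receives just one sequence of vectors:

```python
import torch

hidden_dim = 768
token_embeds = torch.randn(1, 9, hidden_dim)   # stand-in for the text path output
patch_embeds = torch.randn(1, 50, hidden_dim)  # stand-in for the vision path output

# One combined sequence; the transformer cannot tell which positions
# started life as text and which as image patches.
inputs_embeds = torch.cat([patch_embeds, token_embeds], dim=1)
print(inputs_embeds.shape)  # (1, 59, 768)

# Hugging Face language models accept such sequences directly via
# model(inputs_embeds=inputs_embeds), bypassing the token-ID lookup.
```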

2. DSE Architecture Understanding

Document Processing

Screenshot -> 4 Sub-images (2x2) -> Patch Sequences (24x24 patches per sub-image)
  -> Vision Encoder -> Patch Embeddings
  -> Language Model (Patch Embeddings + Prompt: "What is shown in this image?")
  -> Final Document Embedding
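
A small sketch of the first step, tiling a screenshot into a 2x2 grid with PIL; the function and file name here are hypothetical:

```python
from PIL import Image

def split_into_subimages(screenshot: Image.Image, grid: int = 2) -> list:
    """Split a screenshot into grid x grid tiles (2x2 -> 4 sub-images)."""
    w, h = screenshot.size
    sub_w, sub_h = w // grid, h // grid
    return [
        screenshot.crop((c * sub_w, r * sub_h, (c + 1) * sub_w, (r + 1) * sub_h))
        for r in range(grid)
        for c in range(grid)
    ]

sub_images = split_into_subimages(Image.open("page.png"))  # hypothetical file
# The vision encoder then patchifies each tile (24x24 patches per sub-image).
```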

Query Processing

Text Query -> Tokenizer -> Language Model (shared weights) -> Query Embedding
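
A hedged sketch of query encoding, assuming a decoder-style embedder with last-token pooling (the pooling choice and the checkpoint name are assumptions, not confirmed by these notes):

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "my-org/dse-model"  # placeholder, not a real repository
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)  # same weights used for documents

inputs = tokenizer("climate change effects", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, T, hidden_dim)

# Pool the sequence into a single query vector; last-token pooling
# is a common choice for decoder-style embedders (assumption).
query_embedding = hidden[:, -1]  # (1, hidden_dim)
```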

Why This Works

  1. Unified representation

    • Both documents and queries become vectors
    • Same language model processes both
    • Enables direct similarity comparison
  2. Efficient Design

    • No parsing of different document formats is required
    • Preserves layout and visual information
    • Fast retrieval using vector similarity (see the sketch after this list)
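
A minimal sketch of that similarity search, with random stand-in embeddings:

```python
import torch
import torch.nn.functional as F

# Toy corpus: one embedding per document screenshot, plus one query embedding.
doc_embeddings = F.normalize(torch.randn(1000, 768), dim=-1)
query_embedding = F.normalize(torch.randn(1, 768), dim=-1)

# On normalized vectors, cosine similarity is just a dot product.
scores = query_embedding @ doc_embeddings.T  # (1, 1000)
top_k = scores.topk(5).indices               # the 5 best-matching documents
print(top_k)
```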

Learning Journey Summary

  1. Started with confusion about Vision LLM internals
  2. Breakthrough: Understanding everything becomes vectors
  3. Key realization: Transformer only sees vectors
  4. Final clarity: DSE leverages this for unified document processing

Practical Implications

  1. For Vision LLMs:

    • Can process any input that's convertible to vectors
    • Text and images can be processed uniformly
    • No fundamental difference once converted to vectors
  2. For Document Processing:

    • Screenshots can be universal input format
    • No need for complex parsing
    • Can handle mixed modality content naturally

Remember

Everything becomes vectors, and the transformer only ever sees vectors. DSE leverages exactly this: documents enter as screenshots, queries enter as text, and both come out of the same language model as embeddings in one shared space.