What Is a Vision LLM?
Understanding Vision LLMs and Document Screenshot Embedding (DSE)
Initial Confusion Points
Vision LLM Input Processing
- Initially confused about whether Vision LLMs take raw images or vectors
- Unclear about how text and images are processed together
- Wondered if vectors get converted back to tokens
DSE Architecture
- Wasn't sure how document screenshots are processed
- Confused about why a separate query encoder is (or isn't) needed
- Questions about how vectors are matched for retrieval
Key Insights and Clarifications
1. Vision LLM Internal Working
Text Processing Path
Text -> Tokenizer -> Token IDs -> Embedding Vectors -> Transformer
- Text needs to be tokenized first
- Tokens are converted to vectors (embeddings)
- Transformer processes these vectors
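The text path above can be sketched with a toy whitespace tokenizer and a random embedding table. Both are hypothetical stand-ins: a real model uses a subword tokenizer and trained embedding weights.

```python
import numpy as np

# Hypothetical toy vocabulary and embedding table, for illustration only.
vocab = {"what": 0, "is": 1, "a": 2, "vision": 3, "llm": 4}
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 8))  # vocab_size x hidden_dim

def tokenize(text):
    """Text -> Token IDs (whitespace split stands in for a real tokenizer)."""
    return [vocab[w] for w in text.lower().split()]

def embed(token_ids):
    """Token IDs -> Embedding vectors via a simple table lookup."""
    return embedding_table[token_ids]

token_ids = tokenize("what is a vision llm")  # [0, 1, 2, 3, 4]
vectors = embed(token_ids)                    # shape (5, 8): one vector per token
```

The transformer never sees the strings or the IDs directly; it only receives the `(sequence_length, hidden_dim)` array of vectors.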
Image Processing Path
Image -> Vision Encoder -> Vectors -> Transformer
- No tokenization needed for images
- Vision encoder (like CLIP ViT) directly creates vectors
- Each image patch becomes a vector
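A minimal sketch of the patching step, using a random linear projection as a stand-in for a trained vision encoder (the 224x224 image size and 16x16 patch size follow the common ViT setup; the hidden dimension here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))   # fake RGB image in place of a real one
patch = 16                          # ViT-style 16x16 patches
hidden_dim = 8

# Split the image into non-overlapping patches and flatten each one.
patches = image.reshape(224 // patch, patch, 224 // patch, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * 3)

# A single linear projection stands in for the full vision encoder.
projection = rng.normal(size=(patch * patch * 3, hidden_dim))
patch_vectors = patches @ projection  # shape (196, 8): one vector per patch
```

No tokenizer is involved: the image goes straight from pixels to a sequence of 196 (= 14 x 14) patch vectors.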
Critical Understanding
- Transformer always processes vectors, never raw images/text
- Both text and image end up as vectors in same space
- No conversion back to tokens inside transformer
- Vision encoder is like tokenizer for images
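Because both modalities end up as vectors of the same width, feeding them to the transformer is just concatenation into a single sequence (shapes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 8
text_vectors = rng.normal(size=(5, hidden_dim))     # 5 text-token vectors
image_vectors = rng.normal(size=(196, hidden_dim))  # 196 image-patch vectors

# The transformer sees one sequence of vectors; modality no longer matters.
sequence = np.concatenate([image_vectors, text_vectors], axis=0)
# sequence.shape == (201, hidden_dim)
```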
2. DSE Architecture Understanding
Document Processing
Document Screenshot -> Vision Encoder -> Language Model (shared weights) -> Document Embedding
Query Processing
Text Query -> Tokenizer -> Language Model (shared weights) -> Query Embedding
Why This Works
Unified representation
- Both documents and queries become vectors
- Same language model processes both
- Enables direct similarity comparison
Efficient Design
- No parsing of different document formats
- Preserves layout and visual information
- Fast retrieval using vector similarity
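Retrieval by vector similarity can be sketched as a cosine-similarity search over document embeddings. Random vectors stand in for real DSE embeddings; the query is constructed near document 42 so the search has a known answer.

```python
import numpy as np

def cosine_similarity(query, docs):
    """Score a query embedding against a matrix of document embeddings."""
    query = query / np.linalg.norm(query)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return docs @ query

rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(100, 8))  # 100 fake screenshot embeddings
# A query embedding very close to document 42 (illustrative construction).
query_embedding = doc_embeddings[42] + 0.01 * rng.normal(size=8)

scores = cosine_similarity(query_embedding, doc_embeddings)
best = int(np.argmax(scores))  # index of the best-matching document
```

Since documents and queries live in the same vector space, retrieval is a single matrix-vector product followed by an argmax, with no document parsing anywhere in the loop.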
Learning Journey Summary
- Started with confusion about Vision LLM internals
- Breakthrough: understanding that everything becomes vectors
- Key realization: Transformer only sees vectors
- Final clarity: DSE leverages this for unified document processing
Practical Implications
For Vision LLMs:
- Can process any input that's convertible to vectors
- Text and images can be processed uniformly
- No fundamental difference once converted to vectors
For Document Processing:
- Screenshots can be universal input format
- No need for complex parsing
- Can handle mixed modality content naturally
Remember
- Vision LLMs internally convert everything to vectors
- Transformers work with sequences of vectors
- DSE uses this principle for unified document processing
- There's no "magic" conversion between modalities - it's all vector operations