DSE Log
2024-12-11
Called the Gemini models and successfully built the code for running them.
Tested:
- gemini-1.5-flash-latest: works OK, 60
- gemini-1.5-flash: does not work at all, 0
- gemini-1.5-pro: works well, 90
Using gemini-1.5-pro first, with a good prompt.
2024-12-12
Successfully created the pipeline: ColPali for embedding and LanceDB for storage, then computed similarity scores with ColPali's original method.
Note: the embeddings from the model are bfloat16, but the datatype LanceDB supports is float32, so we need to convert the dtype before storing.
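A minimal sketch of that conversion, assuming the ColPali model returns a bfloat16 torch tensor (variable names and shapes are illustrative):

```python
import numpy as np
import torch

# embeddings: bfloat16 tensor from the ColPali model, e.g. shape [N, 1030, 128]
embeddings = torch.randn(2, 1030, 128, dtype=torch.bfloat16)

# NumPy has no bfloat16, so cast to float32 on the torch side first;
# LanceDB can then ingest the float32 arrays directly.
embeddings_f32 = embeddings.to(torch.float32).cpu().numpy()
assert embeddings_f32.dtype == np.float32
```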
2025-1-13
Testing the capability of batch-processing images with embed_images.
Record:
gpu="l4"
Images: 81
- batch = 9
CUDA out of memory. Tried to allocate **8.88** GiB. GPU **0** has a total capacity of **21.95** GiB of
which **8.38** GiB is free. Process **2097044** has **13.56** GiB memory in use. Of the allocated memory **12.76** GiB is
allocated by PyTorch, and **587.00** MiB is reserved by PyTorch but unallocated.
- batch = 8
CUDA out of memory. Tried to allocate **7.90** GiB. GPU **0** has a total capacity of **21.95** GiB of
which **756.12** MiB is free. Process **3816024** has **21.21** GiB memory in use. Of the allocated memory **12.91** GiB is
allocated by PyTorch, and **8.07** GiB is reserved by PyTorch but unallocated.
- batch = 7
Time taken for embedding: 148.9775948524475s
gpu="l40s"
Images: 81
- batch = 7
Time taken for embedding: 52.85590386390686
- batch = 8
Time taken for embedding: 47.12998604774475
- batch = 10
Time taken for embedding: 48.47442984580994
- batch = 13
Time taken for embedding: 45.66375708580017
- batch = 20
Time taken for embedding: 48.399348974227905
- batch = 23
Time taken for embedding: 46.04530715942383
- batch = 24
CUDA out of memory. Tried to allocate **23.69** GiB. GPU **0** has a total capacity of **44.52** GiB of
which **3.33** GiB is free. Process **320620** has **41.18** GiB memory in use. Of the allocated memory **16.91** GiB is
allocated by PyTorch, and **23.78** GiB is reserved by PyTorch but unallocated.
- batch = 25
CUDA out of memory. Tried to allocate **24.67** GiB. GPU **0** has a total capacity of **44.52** GiB of
which **2.09** GiB is free. Process **93744** has **42.43** GiB memory in use. Of the allocated memory **17.16** GiB is
allocated by PyTorch, and **24.77** GiB is reserved by PyTorch but unallocated.
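For reference, a minimal sketch of the kind of batched embedding loop being timed here; `embed_images` comes from this log, while the processor/model calls are generic HF-style assumptions rather than the exact ColPali API:

```python
import time
import torch

def embed_images(model, processor, images, batch_size: int):
    """Embed images in fixed-size batches; smaller batches trade speed for GPU memory."""
    all_embeddings = []
    start = time.time()
    for i in range(0, len(images), batch_size):
        batch = images[i : i + batch_size]
        inputs = processor(images=batch, return_tensors="pt").to(model.device)
        with torch.no_grad():                  # no gradients -> far less GPU memory
            embeddings = model(**inputs)       # assumed to return patch embeddings directly
        all_embeddings.append(embeddings.to(torch.float32).cpu())
        del inputs, embeddings                 # release per-batch GPU tensors
        torch.cuda.empty_cache()
    print(f"Time taken for embedding: {time.time() - start}s")
    return torch.cat(all_embeddings)
```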
The format of saving embeddings
Currently, we have two ways to store the embeddings: 3D (e.g. [N, 1030, 128]) vs. flattened (e.g. [N, 131840]), where N is the number of images.
Why do we care about these two formats?
processor.score_retrieval(queries, images_3d)
only accepts the 3D format. That means if we store embeddings flattened, we need extra code to reshape them back to 3D (the reshape is not complicated, BUT it is annoying).
- We want to know which format performs better with millions of images.
- We also want to know which service is best for storing those embeddings.
Luckily, I have a clear map to answer this question.
Memory & File-Size Considerations
Regardless of shape, you have the same total number of float values. For one image, [1030, 128] = 131,840 floats. So:
• 3D (Patchwise) Shape: [N, 1030, 128]
• Flattened Shape: [N, 131840]
In both cases, you’re storing the same 131,840 float values per image.
That said, the container format can differ slightly—e.g., if you store a 3D array ((N, 1030, 128)) in Arrow or NPZ vs. a 2D flattened array ((N, 131840)). The difference in overhead is typically small, but you might see minimal overhead from nested structures if you store a “list-of-lists” vs. a single list.
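Because both layouts hold the same values, converting between them is a cheap reshape (a minimal sketch, using the 1030 patches × 128 dims from this log):

```python
import numpy as np

N = 4
emb_3d = np.random.rand(N, 1030, 128).astype(np.float32)   # patchwise layout

emb_flat = emb_3d.reshape(N, -1)           # -> [N, 131840] for vector-DB ingestion
emb_back = emb_flat.reshape(N, 1030, 128)  # -> [N, 1030, 128] for score_retrieval

assert np.shares_memory(emb_3d, emb_flat)  # reshape is a view, not a copy
```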
Compatibility & Ease of Use
For score_retrieval:
If you plan to directly use a library or function that expects embeddings in [N, 1030, 128], then 3D storage is more convenient—no reshape step is needed on load. This is the typical scenario with certain ColPali or patch-based retrieval methods that want a dimension for “patches” and a dimension for the embedding size (128).
• No extra reshape; you can do processor.score_retrieval(queries, images_3d) directly.
• More intuitive if you or your library want to do patch-level manipulations or interpretations.
For a vector database:
If your next step is to put embeddings into a vector database (e.g., LanceDB, Faiss, Weaviate, Milvus) for approximate nearest neighbor (ANN) search, you typically need flattened vectors so each item is a 1D array. That means a shape [N, D], where D = 131,840 in our case.
Pros
• Straightforward to index in most vector DBs.
• Scoring or similarity search is easy because each item is just one embedding vector.
Cons
• Large dimension (131,840) is very high for typical ANN search—often you see 512, 768, 1024, or maybe a few thousand dimensions. Storing and searching across 131,840 dimensions can be both computationally heavy and memory-intensive for large N.
• You lose direct patch-level information unless your DB or pipeline is designed to handle sub-vectors.
Now we understand the real question. It is not about the datatype or the way we store the embeddings. What really matters is this:
Storing and searching millions of 131,840-dimensional embeddings is computationally heavy and memory-intensive.
Real Question
The only way forward is to compress the embeddings efficiently while keeping search quality high.
Todo list:
Conclusion (from o1)
• Storing in 3D [N, 1030, 128] is more convenient if your code or library directly needs patch-level shape for retrieval. You avoid reshaping on load.
• Storing in 1D [N, 131840] is standard for typical vector databases. It’s easier to upsert a 1D vector for each item.
• Performance for raw I/O is nearly the same in terms of memory usage and file size.
• Dimensionality (131,840) is quite large for large-scale retrieval, so for “millions of images,” you might do a dimensionality reduction step or store subsets of patches.
Ultimately, choose based on how you will query the embeddings:
• If you need the patch dimension at inference or with your retrieval function, go 3D.
• If you’re going into a vector DB, flatten.
Either way, for truly massive scale, chunk your data and parallelize as much as possible, since that’s usually the biggest factor in real-world performance.
2025-1-14
Tested the performance of LanceDB vs. native storage:
No difference in search if you end up scoring with MaxSim anyway.
Real Problem
We need a faster way to search: replace MaxSim with an ANN-style search method. A reference MaxSim implementation is sketched below.
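For context, a minimal MaxSim (late-interaction) scoring sketch, assuming query embeddings of shape [Q, 128] and image embeddings of shape [N, 1030, 128] as above; this exhaustive scoring is what an ANN index would replace:

```python
import torch

def maxsim_scores(query_emb: torch.Tensor, image_embs: torch.Tensor) -> torch.Tensor:
    """Late-interaction MaxSim: for each query token, take its best-matching
    image patch, then sum over query tokens.

    query_emb:  [Q, D]     (Q query tokens)
    image_embs: [N, P, D]  (N images, P patches each)
    returns:    [N] scores, one per image
    """
    # [N, Q, P] token-to-patch similarities
    sims = torch.einsum("qd,npd->nqp", query_emb, image_embs)
    # best patch per query token, summed over query tokens
    return sims.max(dim=-1).values.sum(dim=-1)
```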
2025-1-20
Tested all potential combinations of vector database and vector search.
2025-1-21
Deployed on Modal.
Note: always call the class method through Modal's remote pattern:
CLIPPipeline().get_embed.remote()
i.e. class + method + .remote().
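A minimal sketch of that pattern (the class and method names come from this log; the decorators, arguments, and return type are assumptions):

```python
import modal

app = modal.App("clip-pipeline")

@app.cls(gpu="L4")
class CLIPPipeline:
    @modal.method()
    def get_embed(self, image_paths: list[str]) -> list[list[float]]:
        ...  # embed the images inside the container and return the vectors

@app.local_entrypoint()
def main():
    # class + method + .remote(): runs get_embed in the Modal container
    embeddings = CLIPPipeline().get_embed.remote(["nail_images/image_24.jpg"])
```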
Write embeddings to R2 issue
If we use a regular write to store the embeddings to R2, we encounter a warning like this:
[2025-01-21T22:16:12Z WARN lance_table::io::commit] Using unsafe commit handler. Concurrent writes may result in data loss. Consider providing a commit handler that prevents conflicting writes.
The reason why we got this warning is: By default, S3 does not support concurrent writes. Having two or more processes writing to the same table at the same time can lead to data corruption. This is because S3, unlike other object stores, does not have any atomic put or copy operation.
OK, do we have a solution for this? Well, the answer is tricky.
Two solutions
- Implement a Commit Handler with DynamoDB:
To safely manage concurrent writes in environments like S3, LanceDB allows the configuration of an external commit handler using DynamoDB. This setup coordinates write operations, ensuring that only one process commits changes at a time, thereby preventing data corruption. To enable this feature, modify your connection URI to use the s3+ddb scheme and include the ddbTableName query parameter specifying your DynamoDB table.
This is the recommended way in the LanceDB documentation: Configuring Storage - LanceDB (see the connection sketch after this list).
- Use Cloudflare D1 or Workers KV (Cloudflare-native solution)
If we use R2 instead of native AWS S3, we hit a critical problem: we cannot use DynamoDB. Only AWS S3 supports the DynamoDB commit handler; R2 doesn't. So we would have to use Cloudflare D1, but I tried D1 and could not get it running successfully.
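A minimal sketch of the DynamoDB-backed connection from the first solution, following the LanceDB docs (bucket, prefix, and table names are placeholders):

```python
import lancedb

# The s3+ddb scheme tells LanceDB to coordinate commits through a DynamoDB
# table, so concurrent writers cannot corrupt the dataset.
db = lancedb.connect(
    "s3+ddb://my-bucket/embeddings?ddbTableName=my-lancedb-commit-table"
)
table = db.open_table("image_embeddings")
```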
Alternative solution
Store the embeddings in a Modal volume instead. This works, and the code now runs. The embedding, saving, and loading times are below for reference.
Loading models...
Models loaded successfully!
Found 38 images
Embedding time: 15.28s
Embedding save time: 3.41s
Embeddings saved to /embeddings/image_embeddings.parquet
Loaded 38 embeddings from /embeddings/image_embeddings.parquet
Embedding load time: 1.80s
Loaded 38 embeddings
Generated embedding for query: 'I want a deep and dark nail polish'
--- L2 Distance Metric Results ---
Search time: 2.7273s
1. Image: nail_images/image_24.jpg
2. Image: nail_images/image_38.jpg
3. Image: nail_images/image_56.jpg
4. Image: nail_images/image_75.jpg
5. Image: nail_images/image_50.jpg
--- COSINE Distance Metric Results ---
Search time: 2.6114s
1. Image: nail_images/image_24.jpg
2. Image: nail_images/image_38.jpg
3. Image: nail_images/image_56.jpg
4. Image: nail_images/image_75.jpg
5. Image: nail_images/image_50.jpg
--- DOT Distance Metric Results ---
Search time: 2.5944s
1. Image: nail_images/image_24.jpg
2. Image: nail_images/image_38.jpg
3. Image: nail_images/image_56.jpg
4. Image: nail_images/image_75.jpg
5. Image: nail_images/image_50.jpg
Total running time: 51s.
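For completeness, a minimal sketch of the Modal Volume setup used above (the volume name and the exact write logic are assumptions; the log shows the file landing at /embeddings/image_embeddings.parquet):

```python
import modal

app = modal.App("stylemi-embeddings")

# Persistent volume that survives across container runs.
embeddings_volume = modal.Volume.from_name("embeddings", create_if_missing=True)

@app.function(gpu="L4", volumes={"/embeddings": embeddings_volume})
def save_embeddings():
    path = "/embeddings/image_embeddings.parquet"
    # ... compute embeddings and write the parquet file to `path` ...
    embeddings_volume.commit()  # make the new file visible to other containers
```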
2025-1-22
Testing prompt:
- 'I want a deep and dark nail polish'
- 'I want to find a nail best fit for tonight's dinner party'
Model | Text Encoder | Tokenizer | Time | Result for P1 | Result for P2 |
---|---|---|---|---|---|
CLIP | CLIPTextModelWithProjection | AutoTokenizer: use_fast ❌ | 0.3~0.67s | Search time: 2.5058s; 1. nail_images/image_24.jpg 2. nail_images/image_38.jpg 3. nail_images/image_56.jpg 4. nail_images/image_75.jpg 5. nail_images/image_50.jpg | Search time: 1.9170s; 1. nail_images/image_36.jpg 2. nail_images/image_46.jpg 3. nail_images/image_24.jpg 4. nail_images/image_40.jpg 5. nail_images/image_38.jpg |
CLIP | CLIPTextModelWithProjection | AutoTokenizer: use_fast ✅ (faster Rust-based tokenizer) | 0.2~0.38s | Search time: 1.2107s; 1. nail_images/image_24.jpg 2. nail_images/image_38.jpg 3. nail_images/image_56.jpg 4. nail_images/image_75.jpg 5. nail_images/image_50.jpg | Search time: 0.9726s; 1. nail_images/image_36.jpg 2. nail_images/image_46.jpg 3. nail_images/image_24.jpg 4. nail_images/image_40.jpg 5. nail_images/image_38.jpg |
DistilBERT | DistilBertModel: distilbert-base-uncased | AutoTokenizer: use_fast ✅ | 0.22~0.45s | Search time: 2.7699s; 1. nail_images/nail2.jpeg 2. nail_images/nail3.jpeg 3. nail_images/image_54.jpg 4. nail_images/image_24.jpg 5. nail_images/image_63.jpg | Search time: 3.6317s; 1. nail_images/nail2.jpeg 2. nail_images/nail3.jpeg 3. nail_images/image_54.jpg 4. nail_images/image_24.jpg 5. nail_images/image_42.jpg |
laion/CLIP-ViT-H-14-laion2B-s32B-b79K | encode_text(inputs) | open_clip.get_tokenizer | 0.75s+ | Text Dim = 1024 does not match Image Dim = 768 | Text Dim = 1024 does not match Image Dim = 768 |
Two problems
There are two things to worry about when optimizing the text embedding: the text embedding dimension and the search results. Let's address them one by one:
- Usually the text encoder ships with a matching image encoder, so they are technically the best couple: they share identical dimensions (e.g. 768, 1024, ...). We can replace the text encoder if we want, but we take a big risk of misaligned dimensions. There is no way to compare vectors of different dimensions, and most importantly we cannot simply reduce the dimension, because that directly loses information. Thus we have to find text encoders that share the same dimension (this is actually easy, check here), like DistilBERT (check the example).
- Assuming we find a text encoder that shares the same dimension as the original one, there is another big problem: the results become unpredictable. I say "unpredictable" instead of "inaccurate" because we neither understand why the results differ from the original nor why they are wrong, since we don't understand the mechanism. That's a dead end.
Actual Solution
The DistilBERT testing tells us that even with the fastest text encoder we cannot get any noticeable improvement. The real problem is that we want the best-performing, lowest-latency model for the user. So our work pivots from "looking for a faster text encoder" to "looking for a faster and good-enough CLIP-like model".
Here are some models we can work on...
From Benchmarking Models for Multi-modal Search and CLIP Benchmarks - a Hugging Face Space by Marqo
Use case | Model | Pretrained | What it is best for |
---|---|---|---|
Fastest inference | ViT-B-32 | laion2b_s34b_b79k | When the best performance at the lowest latency/memory is required. |
Best balanced | ViT-L-14 | laion2b_s32b_b82k | When low latency is still required but with much better retrieval performance. GPU recommended. |
Best all-round | xlm-roberta-large-ViT-H-14 | frozen_laion5b_s13b_b90k | When the best performance is required. Latency is increased along with memory. GPU recommended. |
2025-1-23
Testing prompt:
- 'I want a deep and dark nail polish'
- 'I want to find a nail best fit for tonight's dinner party'
GPU_1 = RTX 4090
GPU_2 = l4
Model | Text Embedding Time | Result P1 | Result P2 |
---|---|---|---|
google/siglip-base-patch16-224 | GPU_1 = 0.03s, GPU_2 = 0.35s | 1. nail_images/image_34.jpg 2. nail_images/image_77.jpg 3. nail_images/nail4.jpeg 4. nail_images/image_46.jpg 5. nail_images/image_63.jpg | 1. nail_images/image_63.jpg 2. nail_images/image_67.jpg 3. nail_images/image_38.jpg 4. nail_images/image_32.jpg 5. nail_images/image_69.jpg |
openai/clip-vit-large-patch14 | GPU_1 = 0.03s, GPU_2 = 0.35s | 1. nail_images/image_24.jpg 2. nail_images/image_38.jpg 3. nail_images/image_56.jpg 4. nail_images/image_75.jpg 5. nail_images/image_50.jpg | 1. nail_images/image_36.jpg 2. nail_images/image_46.jpg 3. nail_images/image_24.jpg 4. nail_images/image_40.jpg 5. nail_images/image_38.jpg |
2025-1-25
Pulled down the nail images from AWS. The bucket name is nailedit-images.
There are 6707 images. Some of them are duplicates and some are irrelevant. I claim the irrelevant ones are fine because they will never be retrieved, but duplicates are tricky: they can be retrieved and are hard to remove by hand. So we need a tool to remove the duplicates.
Implemented raw search: embeddings.pkl + np.dot; search time 0.11-0.22s.
2698 images, embedding size: 23MB.
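A minimal sketch of this raw search, assuming embeddings.pkl holds a list of (path, vector) pairs with normalized vectors (the file layout is an assumption):

```python
import pickle
import numpy as np

with open("embeddings.pkl", "rb") as f:
    records = pickle.load(f)                  # assumed: list of (path, vector) pairs

paths = [p for p, _ in records]
matrix = np.stack([v for _, v in records]).astype(np.float32)   # [N, D]

def search(query_vec: np.ndarray, top_k: int = 5) -> list[str]:
    # One matrix-vector product scores every image at once.
    scores = matrix @ query_vec
    best = np.argsort(-scores)[:top_k]
    return [paths[i] for i in best]
```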
2025-1-26
Cleaned up duplicate images using image hashing. Total duplicates removed: 3945.
Total images left: 2758.
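A minimal sketch of the perceptual-hash dedup approach, using the imagehash library (the directory path and exact hashing choice are assumptions):

```python
from pathlib import Path

import imagehash
from PIL import Image

def find_duplicates(image_dir: str) -> list[Path]:
    """Return paths whose perceptual hash was already seen (i.e. duplicates)."""
    seen: dict[imagehash.ImageHash, Path] = {}
    duplicates: list[Path] = []
    for path in sorted(Path(image_dir).glob("*")):
        h = imagehash.phash(Image.open(path))   # perceptual hash, robust to re-encoding
        if h in seen:
            duplicates.append(path)
        else:
            seen[h] = path
    return duplicates

# duplicates = find_duplicates("nail_images")  # then delete or move them
```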
Images: 2698
GPU: RTX 4090
Method | Embedding | Search | Search Time | Indexing | Indexing Time |
---|---|---|---|---|---|
Raw | embeddings.pkl = 23MB | np.dot | 0.11-0.22s | No | No |
Lance | image_embeddings.lance = 11MB | lance cosine | 0.0092s-0.0158s | No | No |
Lance | image_embeddings_indexed.lance = 92MB | lance cosine | 0.0034s-0.0171s | Yes | 0.0034s-0.0171s |
2025-1-27
Figured out the principle of IVF_PQ; check the report here.
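For reference, a minimal sketch of building an IVF_PQ index on a LanceDB table (the database path, table name, and partition/sub-vector counts are illustrative, not tuned values from this log):

```python
import lancedb

db = lancedb.connect("/embeddings/lancedb")        # path is a placeholder
table = db.open_table("image_embeddings")

# IVF_PQ: IVF partitions the vectors into coarse clusters so queries only probe
# a few partitions; PQ compresses each vector into small sub-vector codes.
table.create_index(
    metric="cosine",
    num_partitions=256,     # number of IVF clusters
    num_sub_vectors=96,     # PQ sub-vectors (vector dim must be divisible by this)
)

# results = table.search(query_vec).metric("cosine").limit(5).to_list()
```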
2025-1-28
Figured out that AWS S3 + DynamoDB solves the concurrent-writes warning.
Works
2025-1-30
Improvement of stylemi
- Trying to fix the /dataset/ prefix in the search output paths.
Solution
# remove dataset prefix from paths before storing
def strip_dataset_prefix(fp: str) -> str:
    prefix = "/dataset/"
    return fp[len(prefix) :] if fp.startswith(prefix) else fp

embeddings_with_metadata = [
    (
        {
            "id": strip_dataset_prefix(str(path)),
            "path": strip_dataset_prefix(str(path)),
        },
        embedding,
    )
    for path, embedding in image_embeddings
]
This function takes effect during the embedding process: it strips the /dataset/ prefix from every path before storing to the volume.
- There is a potential problem: if we embed and save one image set, the next embedding run gets merged into the Modal volume. That causes a big problem: retrieval may return the old data.
Solution
Modify the store_embeddings function to:
def store_embeddings(
    self,
    embeddings: list[tuple[dict[str, str], np.ndarray]],
    table_name: str,
    primary_column="id",
) -> None:
    if not embeddings:
        raise ValueError("No embeddings provided")
    if primary_column not in embeddings[0][0]:
        raise ValueError(
            f"primary_column '{primary_column}' not found in embedding metadata"
        )

    data = [{**metadata, "vector": embedding} for metadata, embedding in embeddings]
    table = self.db.create_table(
        table_name,
        data,
        mode="overwrite",
        exist_ok=True,
    )
    logger.info(f"Overwrote {table_name} table with {len(embeddings)} embeddings.")
This just makes sure each new embedding run overwrites the existing table.
2025-2-1
Enabled the SigLIP model.
2025-2-2
Upgraded the model to SigLIP-large, which has a 1152-dim embedding.
Query decomposition with DeepSeek V3 works well but has high latency: ~5s.
Changed to Llama 3.1 8B (via Google), less than 1s.
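A minimal sketch of the decomposition call, assuming an OpenAI-compatible chat/completions endpoint like the one in the logs below (the endpoint URL, model id, and prompt wording are assumptions):

```python
import time

from openai import OpenAI

# Any OpenAI-compatible chat/completions endpoint works here; the Vertex AI
# endpoint in the logs is one example. URL and credentials are placeholders.
client = OpenAI(base_url="https://example-endpoint/v1", api_key="...")

def decompose_query(query: str) -> list[str]:
    start = time.time()
    response = client.chat.completions.create(
        model="meta/llama-3.1-8b-instruct",   # assumed model id
        messages=[
            {
                "role": "system",
                "content": "Decompose the user's nail-style request into short, "
                           "comma-separated visual search terms.",
            },
            {"role": "user", "content": query},
        ],
        temperature=0.2,
    )
    print(f"Decomposition time: {time.time() - start}")
    return [t.strip() for t in response.choices[0].message.content.split(",")]

# decompose_query("I want a colorful rainbow style")
# -> e.g. ["colorful", "rainbow", "style", ...]
```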
2025-2-3
Before using Multiple Large Files Concurrently
Optimizing query: I want a colorful rainbow style
2025-02-03 22:06:30 INFO HTTP Request: POST https://us-central1-aiplatform.googleapis.com/v1/projects/codex-437109/locations/us-central1/endpoints/openapi/chat/completions?/chat/completions "HTTP/1.1 200 OK"
Decomposition time: 29.82613253593445
2025-02-03 22:06:30 INFO Decomposed query: colorful, rainbow, style, colorful nail art, bright, vibrant, multicolored
2025-02-03 22:06:32 INFO generated text embedding: torch.Size([1152])
2025-02-03 22:06:32 INFO k nearest neighbors search time: 0.07186555862426758
GET /search -> 200 OK (duration: 34.4 s, execution: 31.9 s)
Optimizing query: I want a colorful rainbow style
2025-02-03 22:06:36 INFO HTTP Request: POST https://us-central1-aiplatform.googleapis.com/v1/projects/codex-437109/locations/us-central1/endpoints/openapi/chat/completions?/chat/completions "HTTP/1.1 200 OK"
Decomposition time: 1.6866669654846191
2025-02-03 22:06:36 INFO Decomposed query: colorful, rainbow, style, colorful nail art, bright, vibrant, multicolored
2025-02-03 22:06:37 INFO generated text embedding: torch.Size([1152])
2025-02-03 22:06:37 INFO k nearest neighbors search time: 0.024363279342651367
GET /search -> 200 OK (duration: 2.45 s, execution: 2.39 s)
Optimizing query: I want a colorful rainbow style
2025-02-03 22:06:40 INFO HTTP Request: POST https://us-central1-aiplatform.googleapis.com/v1/projects/codex-437109/locations/us-central1/endpoints/openapi/chat/completions?/chat/completions "HTTP/1.1 200 OK"
Decomposition time: 0.818831205368042
Using Multiple Large Files Concurrently
Optimizing query: I want a colorful rainbow style
2025-02-03 22:23:39 INFO HTTP Request: POST https://us-central1-aiplatform.googleapis.com/v1/projects/codex-437109/locations/us-central1/endpoints/openapi/chat/completions?/chat/completions "HTTP/1.1 200 OK"
2025-02-03 22:23:39 INFO Decomposed query: colorful, rainbow, style, colorful nail art, bright, vibrant, multicolored
Decomposition time: 20.199907064437866
2025-02-03 22:23:41 INFO generated text embedding: torch.Size([1152])
2025-02-03 22:23:41 INFO k nearest neighbors search time: 0.14199233055114746
GET /search -> 200 OK (duration: 24.5 s, execution: 21.8 s)
...
Optimizing query: I want a colorful rainbow style
2025-02-03 22:25:33 INFO HTTP Request: POST https://us-central1-aiplatform.googleapis.com/v1/projects/codex-437109/locations/us-central1/endpoints/openapi/chat/completions?/chat/completions "HTTP/1.1 200 OK"
Decomposition time: 25.662290334701538
2025-02-03 22:25:33 INFO Decomposed query: colorful, rainbow, style, colorful nail art, bright, vibrant, multicolored
Summary of the ~25s initial run
Below is a timeline that summarizes the costs you’re seeing on the very first remote call to decompose_query:
- Container Spin-Up (Cold Start – ~20 seconds):
What Happens:
The first time you call a remote method on EmbeddingsInference, Modal must start a new container from your pre-built inference_image. This involves pulling (if needed) and initializing the container runtime, which can take around 20 seconds on a cold start.
• Impact:
This overhead is incurred only on the first invocation (or whenever the container has been idled out).
- In-Container Initialization (≈6 seconds total):
load_models (≈4.7 seconds):
- Runs with @modal.enter(snap=True)
- Loads models (text model, tokenizer, processor) concurrently on the CPU.
setup (≈1.35 seconds):
- Runs with @modal.enter(snap=False)
- Moves the text model to GPU, sets up credentials, creates the LLM client, etc.
Combined Cost:
These steps add up to about 6 seconds. This initialization happens every time a fresh container is started.
- Execution of decompose_query (≈2–3 seconds):
What Happens:
After initialization, the actual logic in decompose_query runs, which includes making an HTTP POST to the LLM endpoint.
Observed Time:
The HTTP request returns with a 200 OK, and its execution (the network round-trip and processing by the remote LLM API) takes roughly 2–3 seconds.
- Overall First-Call Timeline:
- Container Spin-Up: ~20 seconds
- In-Container Initialization: ~6 seconds
- LLM API Call (inside decompose_query): ~2–3 seconds
- Total Observed Time: ~28–29 seconds
Subsequent Calls:
Once the container is up and running (i.e., warmed up), subsequent calls won’t incur the 20-second spin-up delay. They’ll only need to run the already-initialized code and the LLM call, resulting in much faster responses.
Why
The results you're seeing suggest that while concurrent loading and memory snapshots are helping, the improvements are not as significant as expected. Let's analyze why this might be happening and explore additional optimizations.
Why the Results Are Not as Expected
- Snapshot Overhead:
  - Memory snapshots reduce cold start times by caching the state of the container after initialization. However, creating a snapshot adds some overhead during the first run (when the snapshot is being created).
  - If your initialization time is dominated by GPU operations (e.g., moving models to GPU), snapshots won't help much because GPU memory cannot be snapshotted.
- Concurrent Loading Limitations:
  - Concurrent loading works best when the bottleneck is I/O-bound (e.g., reading files from disk or downloading models). If the bottleneck is CPU-bound or GPU-bound (e.g., model initialization or moving data to GPU), concurrent loading won't provide significant benefits.
- GPU Initialization:
  - Moving models to the GPU (model.to("cuda")) is a blocking operation and cannot be parallelized. This step often dominates the initialization time, especially for large models.
- Snapshot Compatibility:
  - If your initialization includes GPU-specific operations (e.g., checking CUDA availability or initializing GPU memory), these operations cannot be snapshotted. This limits the effectiveness of memory snapshots.
Additional Optimizations
Here are some additional techniques to further reduce cold start times:
1. Pre-Warm Containers
Use the keep_warm parameter to maintain a pool of warm containers. This ensures that some containers are always ready to handle requests, reducing the need for cold starts.
@app.cls(
    gpu="L4",
    enable_memory_snapshot=True,
    keep_warm=1,  # Keep 1 container warm at all times
    ...
)
class EmbeddingsInference:
    ...
2. Optimize GPU Initialization
If moving models to the GPU is the bottleneck, consider:
- Lazy Initialization: Only move models to the GPU when they are first used.
- Smaller Models: Use smaller models or quantized versions if possible.
Example of lazy initialization:
class EmbeddingsInference:
    @modal.enter(snap=True)
    def load_models(self):
        # Load models into CPU memory
        self.text_model = SiglipTextModel.from_pretrained(MODEL_ID, device="cpu")
        self.tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
        self.processor = AutoProcessor.from_pretrained(MODEL_ID)

    @modal.enter(snap=False)
    def setup(self):
        self.device = torch.device("cuda")
        torch.set_default_device(self.device)

    @modal.method()
    def get_text_embedding(self, text: str) -> np.ndarray:
        # Lazily move the text model to GPU on first use
        if not hasattr(self, "text_model_gpu"):
            self.text_model_gpu = self.text_model.to(self.device)
        inputs = self.tokenizer(text, return_tensors="pt", padding=True).to(self.device)
        outputs = self.text_model_gpu(**inputs)
        embedding = self.normalize(outputs.pooler_output).squeeze(0)
        return embedding.cpu().tolist()