DSE Log

2024-12-11

Called the Gemini models and got the runner code working.

Tested:

Used gemini-1.5-pro first, with a good prompt.

2024-12-12

Successfully created the pipeline: ColPali for embedding and LanceDB for storage, then computed similarity scores with ColPali's original method.

Note: The embeddings coming out of the model are bfloat16, but the datatype LanceDB supports here is float32, so we need to convert the dtype first.
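A minimal conversion sketch, assuming the ColPali output is a torch bfloat16 tensor:

import numpy as np
import torch

def to_float32(embeddings: torch.Tensor) -> np.ndarray:
    """Convert bfloat16 model output to float32 for LanceDB.

    NumPy has no bfloat16 dtype, so cast on the torch side first,
    then move to CPU and export as a float32 ndarray.
    """
    return embeddings.to(torch.float32).cpu().numpy()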

2025-1-13

Tested the capability to process images in batches; tested embed_images.

Record:

gpu="l4"

Images: 81

CUDA out of memory. Tried to allocate 8.88 GiB. GPU 0 has a total capacity of 21.95 GiB of which 8.38 GiB is free. Process 2097044 has 13.56 GiB memory in use. Of the allocated memory 12.76 GiB is allocated by PyTorch, and 587.00 MiB is reserved by PyTorch but unallocated.
CUDA out of memory. Tried to allocate 7.90 GiB. GPU 0 has a total capacity of 21.95 GiB of which 756.12 MiB is free. Process 3816024 has 21.21 GiB memory in use. Of the allocated memory 12.91 GiB is allocated by PyTorch, and 8.07 GiB is reserved by PyTorch but unallocated.
Time taken for embedding: 148.9775948524475s

gpu="l40s"

Images: 81

Time taken for embedding: 52.85590386390686
Time taken for embedding: 47.12998604774475
Time taken for embedding: 48.47442984580994
Time taken for embedding: 45.66375708580017
Time taken for embedding: 48.399348974227905
Time taken for embedding: 46.04530715942383
CUDA out of memory. Tried to allocate 23.69 GiB. GPU 0 has a total capacity of 44.52 GiB of which 3.33 GiB is free. Process 320620 has 41.18 GiB memory in use. Of the allocated memory 16.91 GiB is allocated by PyTorch, and 23.78 GiB is reserved by PyTorch but unallocated.
CUDA out of memory. Tried to allocate 24.67 GiB. GPU 0 has a total capacity of 44.52 GiB of which 2.09 GiB is free. Process 93744 has 42.43 GiB memory in use. Of the allocated memory 17.16 GiB is allocated by PyTorch, and 24.77 GiB is reserved by PyTorch but unallocated.

The format for saving embeddings

Currently, we have two ways to store the embeddings: 3D (e.g. [N, 1030, 128]) vs. flattened (e.g. [N, 131840], where 131840 = 1030 × 128).

Why do we care about these two formats?

  1. processor.score_retrieval(queries, images_3d) only accepts the 3D format. If we store embeddings flattened, we need extra code to reshape them back to 3D (the reshaping code is not complicated, but it is annoying; see the reshape sketch after this list).
  2. We want to know which format performs better at the scale of millions of images.
  3. We also want to know which storage service is best for these embeddings.
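For item 1, a minimal round-trip sketch, assuming the patch count (1030) and embedding dim (128) are fixed and known at load time:

import numpy as np

N_PATCHES, DIM = 1030, 128  # ColPali patch count and embedding size

def flatten_for_storage(emb_3d: np.ndarray) -> np.ndarray:
    """[N, 1030, 128] -> [N, 131840] for vector-DB style storage."""
    return emb_3d.reshape(emb_3d.shape[0], -1)

def restore_for_scoring(emb_flat: np.ndarray) -> np.ndarray:
    """[N, 131840] -> [N, 1030, 128] before processor.score_retrieval."""
    return emb_flat.reshape(-1, N_PATCHES, DIM)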

Luckily, I have a clear map for answering these questions.

Memory & File-Size Considerations

Regardless of shape, you have the same total number of float values. For one image, [1030, 128] = 131,840 floats. So:

3D (Patchwise) Shape: [N, 1030, 128]
Flattened Shape: [N, 131840]

In both cases, you’re storing N×131,840 total floats. Hence, in terms of raw memory or file size (on disk/S3) for a single data type (e.g., float16 vs. float32), there is no difference in total storage requirement.
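A quick back-of-the-envelope check in float32 (4 bytes per value), ignoring container overhead:

floats_per_image = 1030 * 128               # 131,840 values per image
bytes_per_image = floats_per_image * 4      # float32: 527,360 bytes, ~0.53 MB per image
total_gb = bytes_per_image * 1_000_000 / 1e9
print(bytes_per_image, total_gb)            # ~527 GB for one million images

Float16 halves these numbers; the layout (3D vs. flattened) does not change them.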

That said, the container format can differ slightly, e.g. if you store a 3D array (N, 1030, 128) in Arrow or NPZ vs. a 2D flattened array (N, 131840). The difference in overhead is typically small, but you might see minimal overhead from nested structures if you store a "list-of-lists" vs. a single list.

Compatibility & Ease of Use

For score_retrieval:
If you plan to directly use a library or function that expects embeddings in [N, 1030, 128], then 3D storage is more convenient—no reshape step is needed on load. This is the typical scenario with certain ColPali or patch-based retrieval methods that want a dimension for “patches” and a dimension for the embedding size (128).

• No extra reshape; you can do processor.score_retrieval(queries, images_3d) directly.
• More intuitive if you or your library want to do patch-level manipulations or interpretations.

For vector database:
If your next step is to put embeddings into a vector database (e.g., LanceDB, Faiss, Weaviate, Milvus) for approximate nearest neighbor (ANN) search, you typically need flattened vectors so each item is a 1D array. That means a shape [N, D], where D=131840 in your case.

Pros
• Straightforward to index in most vector DBs.
• Scoring or similarity search is easy because each item is just one embedding vector.

Cons
• Large dimension (131,840) is very high for typical ANN search—often you see 512, 768, 1024, or maybe a few thousand dimensions. Storing and searching across 131,840 dimensions can be both computationally heavy and memory-intensive for large N.
• You lose direct patch-level information unless your DB or pipeline is designed to handle sub-vectors.
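As a concrete sketch of the flattened route, assuming LanceDB with a local URI; embeddings_3d, query_vec, and the table name are illustrative:

import lancedb
import numpy as np

db = lancedb.connect("./lancedb")  # local path; could also be an S3/R2 URI

# One row per image: metadata plus a single flattened float32 vector.
rows = [
    {"id": f"img_{i}", "vector": emb.reshape(-1).astype(np.float32)}
    for i, emb in enumerate(embeddings_3d)  # embeddings_3d: [N, 1030, 128]
]
tbl = db.create_table("image_embeddings", rows, mode="overwrite")

# Similarity search over the 131,840-dim flattened vectors.
hits = (
    tbl.search(query_vec.reshape(-1).astype(np.float32))
    .metric("cosine")
    .limit(5)
    .to_list()
)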

By now we understand the real question. It is not about the datatype or the way we store the embeddings. What really matters is:

Storing and searching millions of 131,840-dimensional embeddings is computationally heavy and memory-intensive.

Real Question

The only way forward is to compress the embeddings efficiently while keeping search quality high.
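One simple, lossy option, shown only as a sketch (it is not the method we have committed to, and it gives up the patch matrix that MaxSim-style scoring needs): mean-pool the 1030 patch vectors into a single 128-dim vector per image, shrinking the index by roughly 1000x.

import numpy as np

def mean_pool(emb_3d: np.ndarray) -> np.ndarray:
    """[N, 1030, 128] -> [N, 128] by averaging over the patch axis,
    then L2-normalizing so cosine/dot search behaves sensibly."""
    pooled = emb_3d.mean(axis=1)
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.clip(norms, 1e-12, None)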

Todo list:

Conclusion (from o1)

• Storing in 3D [N, 1030, 128] is more convenient if your code or library directly needs the patch-level shape for retrieval; you avoid reshaping on load.
• Storing in 1D [N, 131840] is standard for typical vector databases; it's easier to upsert a 1D vector for each item.
• Raw I/O performance is nearly the same in terms of memory usage and file size.
• The dimensionality (131,840) is quite large for large-scale retrieval, so for "millions of images" you might add a dimensionality-reduction step or store subsets of patches.

Ultimately, choose based on how you will query the embeddings:

• If you need the patch dimension at inference or with your retrieval function, go 3D.
• If you’re going into a vector DB, flatten.

Either way, for truly massive scale, chunk your data and parallelize as much as possible, since that’s usually the biggest factor in real-world performance.

2025-1-14

Tested the performance of LanceDB vs. native storage:

No difference in search if you end up using MaxSim anyway.

Real Problem

We need a faster way to search: replace MaxSim with an approximate method such as ANN.

[Pipeline diagram: image input → upload & process → model generates embeddings → Lance-format vector storage → R2 storage (S3-compatible, embedding.lance) → process / convert / store / retrieve]

2025-1-20

Tested all potential combinations of vector database and vector search.

2025-1-21

Deployed on Modal.

Note: always call the inner function through the Modal method chain:

CLIPPipeline().get_embed().remote()

i.e. class + method + .remote()

Issue: writing embeddings to R2

If we store the embeddings to R2 in the regular way, we get a warning like this:

[2025-01-21T22:16:12Z WARN lance_table::io::commit] Using unsafe commit handler. Concurrent writes may result in data loss. Consider providing a commit handler that prevents conflicting writes.

The reason for this warning: by default, S3 does not support concurrent writes. Having two or more processes writing to the same table at the same time can lead to data corruption, because S3, unlike other object stores, has no atomic put or copy operation.

OK, do we have a solution for this? Well, the answer is tricky.

Two solutions

  1. Implement a commit handler with DynamoDB:
    To safely manage concurrent writes on S3, LanceDB allows configuring an external commit handler backed by DynamoDB. This setup coordinates write operations so that only one process commits changes at a time, preventing data corruption. To enable it, change the connection URI to the s3+ddb scheme and add the ddbTableName query parameter naming your DynamoDB table (see the URI sketch after this list).

This is the recommended approach in the LanceDB documentation: Configuring Storage - LanceDB

  2. Use Cloudflare D1 or Workers KV (Cloudflare-native solution):

If we use R2 instead of native AWS S3, we hit a critical problem: we cannot use DynamoDB. Only AWS S3 supports the DynamoDB commit handler; R2 does not. So we would have to use Cloudflare D1, but I could not get D1 running successfully.
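For reference, solution 1 boils down to changing the connection URI; a minimal sketch (bucket, path, and DynamoDB table name are placeholders):

import lancedb

# s3+ddb scheme: LanceDB commits go through the named DynamoDB table,
# which serializes concurrent writers (AWS S3 only; R2 cannot do this).
db = lancedb.connect(
    "s3+ddb://my-bucket/embeddings?ddbTableName=my-lancedb-commits"
)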

Alternative solution

Store the embeddings in a Modal Volume instead, and it works: the code now runs. A minimal sketch of the volume setup is below, followed by the embedding, saving, and loading times for reference.
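A sketch of the volume setup, assuming Modal's Volume API; the app, volume, and path names are illustrative, not the exact code:

import modal

app = modal.App("embeddings-pipeline")
vol = modal.Volume.from_name("embeddings", create_if_missing=True)

@app.function(volumes={"/embeddings": vol})
def save_embeddings(table_bytes: bytes) -> str:
    # Write the serialized embeddings inside the mounted volume.
    path = "/embeddings/image_embeddings.parquet"
    with open(path, "wb") as f:
        f.write(table_bytes)
    vol.commit()  # persist changes so other containers can see them
    return path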

Loading models...
Models loaded successfully!
Found 38 images
Embedding time: 15.28s
Embedding save time: 3.41s
Embeddings saved to /embeddings/image_embeddings.parquet
Loaded 38 embeddings from /embeddings/image_embeddings.parquet
Embedding load time: 1.80s
Loaded 38 embeddings

Generated embedding for query: 'I want a deep and dark nail polish'

--- L2 Distance Metric Results ---

Search time: 2.7273s
1. Image: nail_images/image_24.jpg
2. Image: nail_images/image_38.jpg
3. Image: nail_images/image_56.jpg
4. Image: nail_images/image_75.jpg
5. Image: nail_images/image_50.jpg


--- COSINE Distance Metric Results ---
Search time: 2.6114s
1. Image: nail_images/image_24.jpg
2. Image: nail_images/image_38.jpg
3. Image: nail_images/image_56.jpg
4. Image: nail_images/image_75.jpg
5. Image: nail_images/image_50.jpg

--- DOT Distance Metric Results ---
Search time: 2.5944s
1. Image: nail_images/image_24.jpg
2. Image: nail_images/image_38.jpg
3. Image: nail_images/image_56.jpg
4. Image: nail_images/image_75.jpg
5. Image: nail_images/image_50.jpg

Total running time: 51s.

2025-1-22

Testing prompts:

  1. 'I want a deep and dark nail polish'
  2. 'I want to find a nail best fit for tonight's dinner party'

All result paths are under nail_images/.

| Model | Text Encoder | Tokenizer | Time | Result for P1 (top 5) | Result for P2 (top 5) |
| --- | --- | --- | --- | --- | --- |
| CLIP | CLIPTextModelWithProjection | AutoTokenizer: use_fast ❌ | 0.3~0.67s | Search time 2.5058s: image_24.jpg, image_38.jpg, image_56.jpg, image_75.jpg, image_50.jpg | Search time 1.9170s: image_36.jpg, image_46.jpg, image_24.jpg, image_40.jpg, image_38.jpg |
| CLIP | CLIPTextModelWithProjection | AutoTokenizer: use_fast ✅ (faster Rust-based tokenizer) | 0.2~0.38s | Search time 1.2107s: image_24.jpg, image_38.jpg, image_56.jpg, image_75.jpg, image_50.jpg | Search time 0.9726s: image_36.jpg, image_46.jpg, image_24.jpg, image_40.jpg, image_38.jpg |
| DistilBERT | DistilBertModel: distilbert-base-uncased | AutoTokenizer: use_fast ✅ | 0.22~0.45s | Search time 2.7699s: nail2.jpeg, nail3.jpeg, image_54.jpg, image_24.jpg, image_63.jpg | Search time 3.6317s: nail2.jpeg, nail3.jpeg, image_54.jpg, image_24.jpg, image_42.jpg |
| laion/CLIP-ViT-H-14-laion2B-s32B-b79K | encode_text(inputs) | open_clip.get_tokenizer | 0.75s+ | Text Dim = 1024 does not match Image Dim = 768 | Text Dim = 1024 does not match Image Dim = 768 |

Two problems

There are two things to worry about when optimizing the text embedding: the text embedding dimension and the search results. Let's answer them one by one:

  1. The text encoder usually ships together with the image encoder, so by design they are the best match: they share identical dimensions (e.g. 768, 1024, ...). We can replace the text encoder if we want, but we take a big risk: misaligned dimensions. There is no way to compare vectors of different dimensions, and, most importantly, we cannot simply reduce the dimension because that directly loses information. So we have to find text encoders that share the same dimension (this is actually easy, check here), like DistilBERT (see the example above).
  2. Assuming we find a text encoder with the same dimension as the original, there is another big problem: the results become unpredictable. I say "unpredictable" rather than "inaccurate" because we neither understand why the results differ from the original nor why they are wrong; we don't understand the mechanism. That's a dead end.

Actual Solution

The DistilBERT test tells us that even with the fastest text encoder we get no noticeable improvement. The real problem is that we want the best-performing, lowest-latency model for the user. So our work pivots from "looking for a faster text encoder" to "looking for a faster, good-enough CLIP-like model".

Here is something we can work on, from Benchmarking Models for Multi-modal Search and CLIP Benchmarks - a Hugging Face Space by Marqo:

| Use case | Model | Pretrained | What it is best for |
| --- | --- | --- | --- |
| Fastest inference | ViT-B-32 | laion2b_s34b_b79k | When the best performance at the lowest latency/memory is required. |
| Best balanced | ViT-L-14 | laion2b_s32b_b82k | When low latency is still required but with much better retrieval performance. GPU recommended. |
| Best all-round | xlm-roberta-large-ViT-H-14 | frozen_laion5b_s13b_b90k | When the best performance is required. Latency is increased along with memory. GPU recommended. |

2025-1-23

Testing prompts:

  1. 'I want a deep and dark nail polish'
  2. 'I want to find a nail best fit for tonight's dinner party'

GPU_1 = RTX 4090
GPU_2 = l4

All result paths are under nail_images/.

| Model | Text embedding time | Result P1 (top 5) | Result P2 (top 5) |
| --- | --- | --- | --- |
| google/siglip-base-patch16-224 | GPU_1 = 0.03s, GPU_2 = 0.35s | image_34.jpg, image_77.jpg, nail4.jpeg, image_46.jpg, image_63.jpg | image_63.jpg, image_67.jpg, image_38.jpg, image_32.jpg, image_69.jpg |
| openai/clip-vit-large-patch14 | GPU_1 = 0.03s, GPU_2 = 0.35s | image_24.jpg, image_38.jpg, image_56.jpg, image_75.jpg, image_50.jpg | image_36.jpg, image_46.jpg, image_24.jpg, image_40.jpg, image_38.jpg |

2025-1-25

Pulled the nail images down from AWS. The bucket name is nailedit-images.

There are 6707 images. Some of them are duplicates, and some are irrelevant. I claim the irrelevant ones are fine because search will never surface them, but duplicates are tricky: they can be retrieved and are hard to remove by hand. So we need a tool to remove the duplicates.

Implemented raw search: embeddings.pkl + np.dot; search time 0.11-0.22s

2698 images, embedding size: 23MB
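A minimal sketch of what "raw search" means here; the embeddings.pkl layout shown (a (paths, matrix) pair of L2-normalized vectors) is an assumption:

import pickle
import numpy as np

with open("embeddings.pkl", "rb") as f:
    paths, vectors = pickle.load(f)   # vectors: [N, D] float32, L2-normalized

def raw_search(query_vec: np.ndarray, top_k: int = 5):
    """Brute-force cosine/dot search with a single matrix-vector product."""
    scores = vectors @ query_vec       # [N] similarity scores
    idx = np.argsort(-scores)[:top_k]
    return [(paths[i], float(scores[i])) for i in idx]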

2025-1-26

Cleaned up duplicate images using image hashing. Total duplicates removed: 3945

Total images left: 2758
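One way to do the hashing step, sketched with the imagehash package (not necessarily the exact tool used here):

from pathlib import Path
from PIL import Image
import imagehash

def find_duplicates(image_dir: str) -> list[Path]:
    """Keep the first image seen for each perceptual hash; return the rest."""
    seen, duplicates = {}, []
    for path in sorted(Path(image_dir).glob("*.jp*g")):
        h = imagehash.phash(Image.open(path))  # perceptual hash
        if h in seen:
            duplicates.append(path)
        else:
            seen[h] = path
    return duplicates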

Image: 2698
GPU: RTX 4090

| Method | Embedding | Search | Search Time | Indexing | Indexing Time |
| --- | --- | --- | --- | --- | --- |
| Raw | embeddings.pkl = 23MB | np.dot | 0.11-0.22s | No | - |
| Lance | image_embeddings.lance = 11MB | lance cosine | 0.0092s-0.0158s | No | - |
| Lance | image_embeddings_indexed.lance = 92MB | lance cosine | 0.0034s-0.0171s | Yes | 0.0034s-0.0171s |
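For the indexed Lance table, index creation looks roughly like this (a sketch of LanceDB's IVF_PQ API; the partition and sub-vector counts are illustrative, not the values used above, and query_vec is assumed):

import lancedb

db = lancedb.connect("./lancedb")
tbl = db.open_table("image_embeddings")

# Build an IVF_PQ index: IVF partitions the vectors into coarse clusters,
# PQ compresses each vector into small sub-vector codes.
tbl.create_index(
    metric="cosine",
    num_partitions=256,   # number of IVF clusters
    num_sub_vectors=96,   # PQ sub-vectors per embedding
)

# Searches now probe a subset of partitions instead of scanning everything.
results = tbl.search(query_vec).metric("cosine").limit(5).to_list()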

2025-1-27

Figured out the principle of IVF_PQ; check the report here:

IVF-PQ Outline
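Rough intuition for the size win (illustrative numbers, not our exact configuration): a 768-dim float32 vector takes 768 × 4 = 3072 bytes, while product quantization with 96 sub-vectors and 8-bit codes stores about 96 bytes per vector (plus shared codebooks and partition centroids), roughly a 32x reduction, at the cost of approximate, quantized distances; the IVF part then limits each query to a handful of partitions instead of scanning every vector.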

2025-1-28

Figured out that AWS S3 + DynamoDB solves the concurrent-writes warning.
Works.

2025-1-30

Improvement of stylemi

Solution

# remove dataset prefix from paths before storing
def strip_dataset_prefix(fp: str) -> str:
    prefix = "/dataset/"
    return fp[len(prefix):] if fp.startswith(prefix) else fp

embeddings_with_metadata = [
    (
        {
            "id": strip_dataset_prefix(str(path)),
            "path": strip_dataset_prefix(str(path)),
        },
        embedding
    )
    for path, embedding in image_embeddings
]

This function takes effect during the embedding process: it strips the /dataset/ prefix from every path before the embeddings are stored to the volume.

Solution

Modify the store_embeddings function to

def store_embeddings(
    self,
    embeddings: list[tuple[dict[str, str], np.ndarray]],
    table_name: str,
    primary_column="id",
) -> None:
    if not embeddings:
        raise ValueError("No embeddings provided")

    if primary_column not in embeddings[0][0]:
        raise ValueError(
            f"primary_column '{primary_column}' not found in embedding metadata"
        )

    data = [{**metadata, "vector": embedding} for metadata, embedding in embeddings]

    table = self.db.create_table(
        table_name,
        data,
        mode="overwrite",
        exist_ok=True,
    )
    logger.info(f"Overwrote {table_name} table with {len(embeddings)} embeddings.")

This just makes sure each new batch of embeddings overwrites the existing table.

2025-2-1

Enabled the SigLIP model.

2025-2-2

Upgraded the model to SigLIP-large, which has 1152-dim embeddings.

Query decomposition with DeepSeek V3 works well but has high latency: ~5s.

Switched to Llama 3.1 8B (served via Google Vertex AI); latency is less than 1s.
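A sketch of what the decomposition call looks like, assuming an OpenAI-compatible chat completions endpoint as the Vertex AI logs below suggest; the base URL, model id, and prompt are placeholders:

from openai import OpenAI

client = OpenAI(
    base_url="https://us-central1-aiplatform.googleapis.com/v1/projects/PROJECT/locations/us-central1/endpoints/openapi",
    api_key="...",  # access token for the endpoint
)

def decompose_query(query: str) -> list[str]:
    """Ask the LLM to expand a free-form query into short search terms."""
    resp = client.chat.completions.create(
        model="meta/llama-3.1-8b-instruct",  # placeholder model id
        messages=[
            {"role": "system", "content": "Rewrite the user's request as a comma-separated list of short visual search terms."},
            {"role": "user", "content": query},
        ],
    )
    return [term.strip() for term in resp.choices[0].message.content.split(",")]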

2025-2-3

Before loading multiple large files concurrently

Optimizing query: I want a colorful rainbow style

2025-02-03 22:06:30 INFO HTTP Request: POST https://us-central1-aiplatform.googleapis.com/v1/projects/codex-437109/locations/us-central1/endpoints/openapi/chat/completions?/chat/completions "HTTP/1.1 200 OK"

Decomposition time: 29.82613253593445

2025-02-03 22:06:30 INFO Decomposed query: colorful, rainbow, style, colorful nail art, bright, vibrant, multicolored

2025-02-03 22:06:32 INFO generated text embedding: torch.Size([1152])

2025-02-03 22:06:32 INFO k nearest neighbors search time: 0.07186555862426758

    GET /search -> 200 OK  (duration: 34.4 s, execution: 31.9 s)

Optimizing query: I want a colorful rainbow style

2025-02-03 22:06:36 INFO HTTP Request: POST https://us-central1-aiplatform.googleapis.com/v1/projects/codex-437109/locations/us-central1/endpoints/openapi/chat/completions?/chat/completions "HTTP/1.1 200 OK"

Decomposition time: 1.6866669654846191

2025-02-03 22:06:36 INFO Decomposed query: colorful, rainbow, style, colorful nail art, bright, vibrant, multicolored

2025-02-03 22:06:37 INFO generated text embedding: torch.Size([1152])

2025-02-03 22:06:37 INFO k nearest neighbors search time: 0.024363279342651367

    GET /search -> 200 OK  (duration: 2.45 s, execution: 2.39 s)

Optimizing query: I want a colorful rainbow style

2025-02-03 22:06:40 INFO HTTP Request: POST https://us-central1-aiplatform.googleapis.com/v1/projects/codex-437109/locations/us-central1/endpoints/openapi/chat/completions?/chat/completions "HTTP/1.1 200 OK"

Decomposition time: 0.818831205368042

With multiple large files loaded concurrently

Optimizing query: I want a colorful rainbow style

2025-02-03 22:23:39 INFO HTTP Request: POST https://us-central1-aiplatform.googleapis.com/v1/projects/codex-437109/locations/us-central1/endpoints/openapi/chat/completions?/chat/completions "HTTP/1.1 200 OK"

2025-02-03 22:23:39 INFO Decomposed query: colorful, rainbow, style, colorful nail art, bright, vibrant, multicolored

Decomposition time: 20.199907064437866

2025-02-03 22:23:41 INFO generated text embedding: torch.Size([1152])

2025-02-03 22:23:41 INFO k nearest neighbors search time: 0.14199233055114746

    GET /search -> 200 OK  (duration: 24.5 s, execution: 21.8 s)
    
...

Optimizing query: I want a colorful rainbow style

2025-02-03 22:25:33 INFO HTTP Request: POST https://us-central1-aiplatform.googleapis.com/v1/projects/codex-437109/locations/us-central1/endpoints/openapi/chat/completions?/chat/completions "HTTP/1.1 200 OK"

Decomposition time: 25.662290334701538

2025-02-03 22:25:33 INFO Decomposed query: colorful, rainbow, style, colorful nail art, bright, vibrant, multicolored

Summary of 25s for initial run

Below is a timeline that summarizes the costs you’re seeing on the very first remote call to decompose_query:

  1. Container Spin-Up (Cold Start – ~20 seconds):

What Happens:

The first time you call a remote method on EmbeddingsInference, Modal must start a new container from your pre-built inference_image. This involves pulling (if needed) and initializing the container runtime, which can take around 20 seconds on a cold start.

Impact:

This overhead is incurred only on the first invocation (or whenever the container has been idled out).

  2. In-Container Initialization (≈6 seconds total):

load_models (≈4.7 seconds):

setup (≈1.35 seconds):

Combined Cost:

These steps add up to about 6 seconds. This initialization happens every time a fresh container is started.

  3. Execution of decompose_query (≈2–3 seconds):

What Happens:

After initialization, the actual logic in decompose_query runs, which includes making an HTTP POST to the LLM endpoint.

Observed Time:

The HTTP request returns with a 200 OK, and its execution (the network round-trip and processing by the remote LLM API) takes roughly 2–3 seconds.

  4. Overall First-Call Timeline:

Subsequent Calls:

Once the container is up and running (i.e., warmed up), subsequent calls won’t incur the 20-second spin-up delay. They’ll only need to run the already-initialized code and the LLM call, resulting in much faster responses.

Why

The results you're seeing suggest that while concurrent loading and memory snapshots are helping, the improvements are not as significant as expected. Let's analyze why this might be happening and explore additional optimizations.


Why the Results Are Not as Expected

  1. Snapshot Overhead:

    • Memory snapshots reduce cold start times by caching the state of the container after initialization. However, creating a snapshot adds some overhead during the first run (when the snapshot is being created).
    • If your initialization time is dominated by GPU operations (e.g., moving models to GPU), snapshots won't help much because GPU memory cannot be snapshotted.
  2. Concurrent Loading Limitations:

    • Concurrent loading works best when the bottleneck is I/O-bound (e.g., reading files from disk or downloading models). If the bottleneck is CPU-bound or GPU-bound (e.g., model initialization or moving data to GPU), concurrent loading won't provide significant benefits.
  3. GPU Initialization:

    • Moving models to the GPU (model.to("cuda")) is a blocking operation and cannot be parallelized. This step often dominates the initialization time, especially for large models.
  4. Snapshot Compatibility:

    • If your initialization includes GPU-specific operations (e.g., checking CUDA availability or initializing GPU memory), these operations cannot be snapshotted. This limits the effectiveness of memory snapshots.

Additional Optimizations

Here are some additional techniques to further reduce cold start times:


1. Pre-Warm Containers

Use the keep_warm parameter to maintain a pool of warm containers. This ensures that some containers are always ready to handle requests, reducing the need for cold starts.

@app.cls(
    gpu="L4",
    enable_memory_snapshot=True,
    keep_warm=1,  # Keep 1 container warm at all times
    ...
)
class EmbeddingsInference:
    ...

2. Optimize GPU Initialization

If moving models to the GPU is the bottleneck, consider lazy initialization: load the models on CPU in the snapshotted step and defer the move to GPU until the first request.

Example of lazy initialization:

import modal
import numpy as np
import torch
import torch.nn.functional as F
from transformers import AutoProcessor, AutoTokenizer, SiglipTextModel

MODEL_ID = "google/siglip-base-patch16-224"  # example checkpoint used earlier

# Decorated with @app.cls(gpu="L4", enable_memory_snapshot=True, ...) as above.
class EmbeddingsInference:
    @modal.enter(snap=True)
    def load_models(self):
        # Load models into CPU memory so this step can be memory-snapshotted.
        self.text_model = SiglipTextModel.from_pretrained(MODEL_ID)
        self.tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
        self.processor = AutoProcessor.from_pretrained(MODEL_ID)

    @modal.enter(snap=False)
    def setup(self):
        # GPU state cannot be snapshotted, so device setup runs after restore.
        self.device = torch.device("cuda")
        torch.set_default_device(self.device)

    @modal.method()
    def get_text_embedding(self, text: str) -> np.ndarray:
        # Lazy move to GPU: only pay the transfer cost on the first request.
        if not hasattr(self, "text_model_gpu"):
            self.text_model_gpu = self.text_model.to(self.device)

        inputs = self.tokenizer(text, return_tensors="pt", padding=True).to(self.device)
        with torch.no_grad():
            outputs = self.text_model_gpu(**inputs)
        embedding = F.normalize(outputs.pooler_output, dim=-1).squeeze(0)
        return embedding.cpu().numpy()