Modal Cost for Alex

1. GPU Running Time

GPU task costs:
Nvidia H100 $0.001267 / sec
Nvidia A100, 80 GB $0.000944 / sec
Nvidia A100, 40 GB $0.000772 / sec
Nvidia L40S $0.000542 / sec
Nvidia A10G $0.000306 / sec
Nvidia L4 $0.000222 / sec
Nvidia T4 $0.000164 / sec

2. CPU Running Time

CPU Physical core (2 vCPU equivalent) $0.000038 / core / sec

3. Memory Allocation

Memory $0.00000667 / GiB / sec
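
These rates compose additively: a container bills each reserved resource for every second it is up. A minimal sketch of the formula, using the list prices above (the helper name and the container specs in the example are hypothetical):

# Per-second list prices from Sections 1-3 above.
L4_GPU = 0.000222   # $/sec for an Nvidia L4
CPU = 0.000038      # $/physical core/sec
MEM = 0.00000667    # $/GiB/sec

def container_cost(seconds, gpu_rate, cores, mem_gib):
    # Each resource bills for every second the container is alive.
    return seconds * (gpu_rate + CPU * cores + MEM * mem_gib)

# Hypothetical example: an L4 container with 2 cores and 4 GiB, up for 10 minutes.
print(f"${container_cost(600, L4_GPU, 2.0, 4.0):.4f}")  # ≈ $0.19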

4. Cloud Storage & Data Transfer

5. Idle Container Time (container_idle_timeout)

Testing Data

get_image:

Costs $0.40; takes 347.9s + 17.06s.

get_image_urns:

Reserved 16GB memory; used: 1.06GB
Reserved 4 CPU cores; used: 0.11 cores
GPU memory used: 16.40GB

Runtime: 23min 17s

In total: $1.42 = $0.92 (L40S) + $0.27 (CPU) + $0.19 (Memory)

From D: "cost me like 4.50 and 19min for the 2.7k images in H100"

save_image_embedding

takes 4.51s

copy_to_volume

takes 3.92s

get_text_embedding

takes 17s (spinning up embed)

kNN

search takes 0.469s

$0 cost

First API call

curl "https://tu-zhenzhao--stylemi-app-v2-api-service.modal.run/search?query=pink%20coffin%20nails&amount=6"

takes 32s (cold start)

Second (warm) call

takes 0.35s (kNN) + 0.40s (text embedding)

Reserved 1GB memory; used: 832MB
Reserved 2 CPU cores; used: 0.35 cores
GPU memory used: 4.53GB

From start to the 720s idle timeout: total $0.24 = $0.18 (L4) + $0.06 (CPU) + $0.01 (Memory)
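
As a sanity check against the Section 1 rates: 720s of pure L4 time is 720 × $0.000222 ≈ $0.16, so the $0.18 GPU line presumably also covers the active request handling before the idle window began.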

The balance is now $8.30; let's see if it changes when we let the container go cold.

Frequent Request Test

40 requests per minute

Total requests: 160

Lookup memory: 2.54/2.61GB

Cost in total: $0.88 = $0.64 (L40S) + $0.22 (Memory) + $0.02 (CPU)

Takes 720s = 12min

This keeps api_service at 22 live containers and Lookup at 3 live containers.

About $0.0055 per call ($0.88 / 160)
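
For reference, a driver for this kind of test can be as small as the sketch below (hypothetical script, standard library only; the endpoint is the search URL from the first-call test, and the rate and request count match this run):

import time
import urllib.request

# Hypothetical stress-test driver: 40 requests/min for 4 minutes = 160 requests.
URL = ("https://tu-zhenzhao--stylemi-app-v2-api-service.modal.run"
       "/search?query=pink%20coffin%20nails&amount=6")

def run_test(requests_per_min=40, minutes=4):
    interval = 60 / requests_per_min  # 1.5s between request starts
    for i in range(requests_per_min * minutes):
        start = time.time()
        with urllib.request.urlopen(URL) as resp:
            resp.read()
        elapsed = time.time() - start
        print(f"request {i + 1}: {elapsed:.2f}s")
        time.sleep(max(0.0, interval - elapsed))

if __name__ == "__main__":
    run_test()

Note this sketch fires requests one at a time; reaching 22 simultaneous api_service containers, as observed above, implies overlapping requests, so the real test would need a thread pool or an async client.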

Takeaway

With container_idle_timeout=720s on both functions, Modal keeps creating containers. Here is how it plays out:

First call -> activates 1 container for Lookup and 1 for api_service.

That call takes roughly 900ms to process; if a new request arrives during that window, another container is created for both Lookup and api_service, so each grows to 2 containers.

Note: Lookup is a class-based service (@app.cls), so it only spins up new containers when the existing ones are busy. Modal tried to reuse the existing Lookup container, but when multiple queries needed Lookup at the same time, it created extra instances (3 total).

So api_service creates a container for every concurrent request, while Lookup packs as many requests as possible into existing containers before creating more.

As long as those warmed-up containers don't cool down, they keep handling new requests without adding more.
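
In code terms, the two behaviors look roughly like this (a hedged sketch with a hypothetical demo app; allow_concurrent_inputs is the Modal parameter, as of early 2025, that lets one container absorb several requests at once):

import modal

app = modal.App("scaling-demo")  # hypothetical demo app

# Scales out: by default, each concurrent request gets its own container.
@app.function()
def per_request_endpoint():
    ...

# Packs in: one warm container absorbs up to 8 concurrent requests
# before Modal creates a second instance.
@app.cls(allow_concurrent_inputs=8)
class PackedService:
    @modal.method()
    def query(self):
        ...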

Optimization Plan

✅ Strategy Breakdown

  1. Make Lookup (GPU model) handle more requests per container
    • Increase memory & concurrency so each container can process multiple API requests before needing a new one.
    • Keep container_idle_timeout long enough that the same container is efficiently reused for new queries.
      container startup time avg: 25s
      Example Config:
@app.cls(
    gpu="L4",  # Keep L4 GPU for inference
    cpu=2.0,  # Enough CPU for request handling
    memory=4096,  # 🟢 Increase memory from 1024 → 4096 MB (4 GB)
    container_idle_timeout=300,  # 🟢 Keep warm for 5 min (was 720)
    allow_concurrent_inputs=4,  # 🟢 Let one container handle up to 4 requests at once
)
class Lookup:
    ...

✅ More memory → each container can handle larger requests.
✅ 300s idle timeout → avoids cold starts & GPU reloading between bursts.
✅ allow_concurrent_inputs=4 → one container serves 4 Lookup requests at once, meaning fewer total containers.

  2. Make api_service (FastAPI web handler) lightweight & disposable
    • Shorten container_idle_timeout so unused API containers shut down quickly (reducing cost).
    • Keep memory low & no warm instances, so new API containers spawn only when necessary.
      container startup time avg: 3s
      Example Config:
@app.function(
    image=endpoints_image,
    enable_memory_snapshot=True,
    container_idle_timeout=30,  # 🔴 Reduce to 30s (was 720s)
    concurrency_limit=2,  # 🔴 Keep only 2 API containers max
    keep_warm=0,  # 🔴 No pre-warmed containers (save cost)
)
@modal.asgi_app()
def api_service():
    ...

Short timeout (30s) → API containers shut down almost immediately if idle.
Concurrency limit (2) → At most 2 containers alive, even under heavy load.
No keep_warm → new requests start cold, but startup (~3s) is quick.
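
Assuming both definitions live in the same app.py, redeploying with modal deploy app.py applies the new limits to the running deployment.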

2025-2-15

I set up the configuration above.

Then I tested by running API calls for 5 minutes, sending 200 requests to the server in total.

The report:

Cost in total: $0.31 = $0.21 (L4) + $0.07 (Memory) + $0.03 (CPU)

About $0.0015 per call ($0.31 / 200).

2025-2-16

Started a container that ran for 605s; the first spin-up took 14s.

Total cost: $0.23, so the cost per second is $0.23 / 605 ≈ $0.00038.

Extrapolating at that rate: 5min ≈ $0.11, 10min ≈ $0.23, 1h ≈ $1.37.

v1: Price Report

Function Test

API calls for 5 minutes at 40 calls per minute: 200 requests to the server in total.

Lookup Container: 591s (9m51s); 1 container

api_service: average 1min30s; 6 containers

Cost: $0.16 = $0.13 (L4) + $0.01 (Memory) + $0.02 (CPU)

Prediction function output:

Lookup GPU Cost (L4): $0.133200
Lookup CPU Cost: $0.011400
Lookup Memory Cost: $0.008004
------------------------------------
API GPU Cost (None): $0.000000
API CPU Cost: $0.010640
API Memory Cost: $0.007470
------------------------------------
Estimated Lookup Service Cost: $0.152604
Estimated API Service Cost: $0.018110
Estimated Total Cost for Stress Test: $0.170714
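
The prediction function itself isn't included in these notes; a plausible sketch that produces a report of this shape from the Section 1-3 rates looks like the following (the durations are the measured ones from above, but the core counts and memory sizes are placeholders, so the printed figures won't exactly match the report):

# Hypothetical cost estimator built on Modal's list prices from Sections 1-3.
GPU_RATES = {"L4": 0.000222, None: 0.0}  # $/sec
CPU_RATE = 0.000038                      # $/physical core/sec
MEM_RATE = 0.00000667                    # $/GiB/sec

def service_report(name, seconds, gpu=None, cores=2.0, mem_gib=1.0):
    gpu_cost = GPU_RATES[gpu] * seconds
    cpu_cost = CPU_RATE * cores * seconds
    mem_cost = MEM_RATE * mem_gib * seconds
    print(f"{name} GPU Cost ({gpu}): ${gpu_cost:.6f}")
    print(f"{name} CPU Cost: ${cpu_cost:.6f}")
    print(f"{name} Memory Cost: ${mem_cost:.6f}")
    print("-" * 36)
    return gpu_cost + cpu_cost + mem_cost

total = service_report("Lookup", 591, gpu="L4") + service_report("API", 335)
print(f"Estimated Total Cost for Stress Test: ${total:.6f}")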

Another test:

API calls for 5 minutes at 40 calls per minute: 200 requests to the server in total.

Lookup Container: 410s (6m50s); 1 container

api_service: average 5min35s; 6 containers

Total cost: $0.11

Prediction function output:

💰 **Estimated Costs:**
- Lookup Service (L4, 20 concurrent inputs, 100s idle)
  ⮕ Execution Time: 410s ⮕ Cost: $0.104279
- API Service (5 concurrent inputs, 30s idle)
  ⮕ Execution Time: 335s ⮕ Cost: $0.010834

🚀 **Total Estimated Cost: $0.115113**