Modal Cost for Alex

1. GPU Running Time

GPU task costs:
Nvidia H100 $0.001267 / sec
Nvidia A100, 80 GB $0.000944 / sec
Nvidia A100, 40 GB $0.000772 / sec
Nvidia L40S $0.000542 / sec
Nvidia A10G $0.000306 / sec
Nvidia L4 $0.000222 / sec
Nvidia T4 $0.000164 / sec

2. CPU Running Time

CPU Physical core (2 vCPU equivalent) $0.000038 / core / sec

3. Memory Allocation

Memory $0.00000667 / GiB / sec
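
These rates compose additively: a container bills each reserved resource for every second it is up. A minimal sketch of the formula, using the list prices above (the helper name and the container specs in the example are hypothetical):

# Per-second list prices from Sections 1-3 above.
L4_GPU = 0.000222   # $/sec for an Nvidia L4
CPU = 0.000038      # $/physical core/sec
MEM = 0.00000667    # $/GiB/sec

def container_cost(seconds, gpu_rate, cores, mem_gib):
    # Each resource bills for every second the container is alive.
    return seconds * (gpu_rate + CPU * cores + MEM * mem_gib)

# Hypothetical example: an L4 container with 2 cores and 4 GiB, up for 10 minutes.
print(f"${container_cost(600, L4_GPU, 2.0, 4.0):.4f}")  # ≈ $0.19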

4. Cloud Storage & Data Transfer

5. Idle Container Time (container_idle_timeout)

Testing Data

get_image:

Costs $0.40; takes 347.9s + 17.06s.

get_image_urns:

Reserved 16GB memory; used: 1.06GB
Reserved 4 CPU cores; used: 0.11 cores
GPU memory used: 16.40GB

Runtime: 23min 17s

In total: $1.42 = $0.92 (L40S) + $0.27 (CPU) + $0.19 (Memory)

From D: "cost me like 4.50 and 19min for the 2.7k images in H100"

save_image_embedding

takes 4.51s

copy_to_volume

takes 3.92s

get_text_embedding

takes 17s (spinning up embed)

kNN

search takes 0.469s

$0 cost

First API call

curl "https://tu-zhenzhao--stylemi-app-v2-api-service.modal.run/search?query=pink%20coffin%20nails&amount=6"

takes 32s (cold start)

Second (warm) call

takes 0.35s (kNN) + 0.40s (text embedding)

Reserved 1GB memory; used: 832MB
Reserved 2 CPU cores; used: 0.35 cores
GPU memory used: 4.53GB

From start to the 720s idle timeout: total $0.24 = $0.18 (L4) + $0.06 (CPU) + $0.01 (Memory)
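
As a sanity check against the Section 1 rates: 720s of pure L4 time is 720 × $0.000222 ≈ $0.16, so the $0.18 GPU line presumably also covers the active request handling before the idle window began.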

The balance is now $8.30; let's see if it changes when we let the container go cold.

Frequent Request Test

40 requests per minute

Total requests: 160

Lookup memory: 2.54/2.61GB

Cost in total: $0.88 = $0.64 (L40S) + $0.22 (Memory) + $0.02 (CPU)

Takes 720s = 12min

This keeps api_service at 22 live containers and Lookup at 3 live containers.

About $0.0055 per call ($0.88 / 160)
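
For reference, a driver for this kind of test can be as small as the sketch below (hypothetical script, standard library only; the endpoint is the search URL from the first-call test, and the rate and request count match this run):

import time
import urllib.request

# Hypothetical stress-test driver: 40 requests/min for 4 minutes = 160 requests.
URL = ("https://tu-zhenzhao--stylemi-app-v2-api-service.modal.run"
       "/search?query=pink%20coffin%20nails&amount=6")

def run_test(requests_per_min=40, minutes=4):
    interval = 60 / requests_per_min  # 1.5s between request starts
    for i in range(requests_per_min * minutes):
        start = time.time()
        with urllib.request.urlopen(URL) as resp:
            resp.read()
        elapsed = time.time() - start
        print(f"request {i + 1}: {elapsed:.2f}s")
        time.sleep(max(0.0, interval - elapsed))

if __name__ == "__main__":
    run_test()

Note this sketch fires requests one at a time; reaching 22 simultaneous api_service containers, as observed above, implies overlapping requests, so the real test would need a thread pool or an async client.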

Takeaway

With container_idle_timeout=720s on both functions, Modal keeps creating containers. Here is how it plays out:

First call -> activates 1 container for Lookup and 1 for api_service.

That call takes roughly 900ms to process; if a new request arrives during that window, another container is created for both Lookup and api_service, so each grows to 2 containers.

Note: Lookup is a class-based service (@app.cls), so it only spins up new containers when the existing ones are busy. Modal tried to reuse the existing Lookup container, but when multiple queries needed Lookup at the same time, it created extra instances (3 total).

So api_service creates a container for every concurrent request, while Lookup packs as many requests as possible into existing containers before creating more.

As long as those warmed-up containers don't cool down, they keep handling new requests without adding more.
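
In code terms, the two behaviors look roughly like this (a hedged sketch with a hypothetical demo app; allow_concurrent_inputs is the Modal parameter, as of early 2025, that lets one container absorb several requests at once):

import modal

app = modal.App("scaling-demo")  # hypothetical demo app

# Scales out: by default, each concurrent request gets its own container.
@app.function()
def per_request_endpoint():
    ...

# Packs in: one warm container absorbs up to 8 concurrent requests
# before Modal creates a second instance.
@app.cls(allow_concurrent_inputs=8)
class PackedService:
    @modal.method()
    def query(self):
        ...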

Optimization Plan

✅ Strategy Breakdown

  1. Make Lookup (GPU model) handle more requests per container
    • Increase memory & concurrency so each container can process multiple API requests before needing a new one.
    • Keep container_idle_timeout long enough that the same container is efficiently reused for new queries.
      container startup time avg: 25s
      Example Config:
@app.cls(
    gpu="L4",  # Keep L4 GPU for inference
    cpu=2.0,  # Enough CPU for request handling
    memory=4096,  # 🟢 Increase memory from 1024 → 4096 MB (4 GB)
    container_idle_timeout=300,  # 🟢 Keep warm for 5 min (was 720)
    allow_concurrent_inputs=4,  # 🟢 Let one container handle up to 4 requests at once
)
class Lookup:
    ...

✅ More memory → each container can handle larger requests.
✅ 300s idle timeout → avoids cold starts & GPU reloading between bursts.
✅ allow_concurrent_inputs=4 → one container serves 4 Lookup requests at once, meaning fewer total containers.

  2. Make api_service (FastAPI web handler) lightweight & disposable
    • Shorten container_idle_timeout so unused API containers shut down quickly (reducing cost).
    • Keep memory low & no warm instances, so new API containers spawn only when necessary.
      container startup time avg: 3s
      Example Config:
@app.function(
    image=endpoints_image,
    enable_memory_snapshot=True,
    container_idle_timeout=30,  # 🔴 Reduce to 30s (was 720s)
    concurrency_limit=2,  # 🔴 Keep only 2 API containers max
    keep_warm=0,  # 🔴 No pre-warmed containers (save cost)
)
@modal.asgi_app()
def api_service():
    ...

Short timeout (30s) → API containers shut down almost immediately if idle.
Concurrency limit (2) → At most 2 containers alive, even under heavy load.
No keep_warm → new requests start cold, but startup (~3s) is quick.
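
Assuming both definitions live in the same app.py, redeploying with modal deploy app.py applies the new limits to the running deployment.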

2025-2-15

I set up the configuration above.

Then I tested by running API calls for 5 minutes, sending 200 requests to the server in total.

The report:

Cost in total: $0.31 = $0.21 (L4) + $0.07 (Memory) + $0.03 (CPU)

About $0.0015 per call ($0.31 / 200).

2025-2-16

Started a container that ran for 605s; the first spin-up took 14s.

Total cost: $0.23, so the cost per second is $0.23 / 605 ≈ $0.00038.

Extrapolating at that rate: 5min ≈ $0.11, 10min ≈ $0.23, 1h ≈ $1.37.

v1: Price Report

Function Test

API calls for 5 minutes at 40 calls per minute: 200 requests to the server in total.

Lookup Container: 591s (9m51s); 1 container

api_service: average 1min30s; 6 containers

Cost: $0.16 = $0.13 (L4) + $0.01 (Memory) + $0.02 (CPU)

Prediction function output:

Lookup GPU Cost (L4): $0.133200
Lookup CPU Cost: $0.011400
Lookup Memory Cost: $0.008004
------------------------------------
API GPU Cost (None): $0.000000
API CPU Cost: $0.010640
API Memory Cost: $0.007470
------------------------------------
Estimated Lookup Service Cost: $0.152604
Estimated API Service Cost: $0.018110
Estimated Total Cost for Stress Test: $0.170714
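
The prediction function itself isn't included in these notes; a plausible sketch that produces a report of this shape from the Section 1-3 rates looks like the following (the durations are the measured ones from above, but the core counts and memory sizes are placeholders, so the printed figures won't exactly match the report):

# Hypothetical cost estimator built on Modal's list prices from Sections 1-3.
GPU_RATES = {"L4": 0.000222, None: 0.0}  # $/sec
CPU_RATE = 0.000038                      # $/physical core/sec
MEM_RATE = 0.00000667                    # $/GiB/sec

def service_report(name, seconds, gpu=None, cores=2.0, mem_gib=1.0):
    gpu_cost = GPU_RATES[gpu] * seconds
    cpu_cost = CPU_RATE * cores * seconds
    mem_cost = MEM_RATE * mem_gib * seconds
    print(f"{name} GPU Cost ({gpu}): ${gpu_cost:.6f}")
    print(f"{name} CPU Cost: ${cpu_cost:.6f}")
    print(f"{name} Memory Cost: ${mem_cost:.6f}")
    print("-" * 36)
    return gpu_cost + cpu_cost + mem_cost

total = service_report("Lookup", 591, gpu="L4") + service_report("API", 335)
print(f"Estimated Total Cost for Stress Test: ${total:.6f}")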

Another test:

API calls for 5 minutes at 40 calls per minute: 200 requests to the server in total.

Lookup Container: 410s (6m50s); 1 container

api_service: average 5min35s; 6 containers

Total cost: $0.11

Prediction function output:

💰 **Estimated Costs:**
- Lookup Service (L4, 20 concurrent inputs, 100s idle)
  ⮕ Execution Time: 410s ⮕ Cost: $0.104279
- API Service (5 concurrent inputs, 30s idle)
  ⮕ Execution Time: 335s ⮕ Cost: $0.010834

🚀 **Total Estimated Cost: $0.115113**