TLDR:

10 QPS costs about $13k/mo for GPUs and licensing (excluding a $226/mo overhead server). You can accept higher latency to get more throughput for the same cost. Cost scales linearly with QPS, so 20 QPS is about $26k/mo.

Intro

The base latency you’re seeing right now is with 4 L4 GPUs allocated per request. To add throughput, we allocate another 4 GPUs for each additional concurrent request.

When we say QPS, we mean requests sent in the exact same millisecond, so the system will work significantly better if requests are staggered.

Cost saving

We can sacrifice single-query speed to improve throughput by decreasing the number of GPUs per request. Speed scales linearly with GPU count: 2 GPUs is half as fast as 4, and 3 GPUs is 75% the speed of 4.
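A minimal sketch of this tradeoff, assuming the linear scaling described above and a fixed pool of GPUs. The function names and the 40-GPU pool are illustrative only, not part of Pongo's system:

```python
BASELINE_GPUS_PER_REQUEST = 4  # current allocation described in the intro


def relative_speed(gpus_per_request: int) -> float:
    """Single-query speed relative to the 4-GPU baseline (assumes linear scaling)."""
    return gpus_per_request / BASELINE_GPUS_PER_REQUEST


def concurrent_requests(total_gpus: int, gpus_per_request: int) -> int:
    """How many requests a fixed GPU pool can serve at the same instant."""
    return total_gpus // gpus_per_request


if __name__ == "__main__":
    pool = 40  # hypothetical pool size, chosen only for illustration
    for n in (4, 3, 2):
        print(f"{n} GPUs/request: {relative_speed(n):.2f}x speed, "
              f"{concurrent_requests(pool, n)} concurrent requests")
```

With the same pool, halving the GPUs per request doubles the number of requests that can run at once, and each one runs at half speed.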

If you dynamically allocate

Cost calculations

We run with a combination of dedicated and spot instances. With L4s, Amazon can occasionally reclaim all of your spot GPUs temporarily. This leaves you with only the throughput from your dedicated instances.

Pongo puts a 5% service fee on top of allocated compute cost.

Overhead server cost: $0.3/hr → $226/mo per instance (one is enough)

L4 dedicated cost: $0.8048/hr → $579.46/mo per GPU

L4 spot cost: ~$0.4/hr → $288/mo per GPU

So for 10 queries per second at the current latency, we’d be looking at 40 GPUs (10 concurrent requests × 4 GPUs each).
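The arithmetic can be sketched as follows, using the rates listed above. The 720-hour month is inferred from the L4 monthly figures, and the dedicated/spot split is a hypothetical input, since the actual mix is not stated here. The overhead server is left out.

```python
HOURS_PER_MONTH = 720          # 30 days x 24 h; consistent with the L4 monthly figures
L4_DEDICATED_PER_HOUR = 0.8048
L4_SPOT_PER_HOUR = 0.40        # approximate spot rate
SERVICE_FEE = 0.05             # 5% on top of allocated compute
GPUS_PER_REQUEST = 4


def gpus_needed(qps: int, gpus_per_request: int = GPUS_PER_REQUEST) -> int:
    """GPUs required to serve `qps` simultaneous requests."""
    return qps * gpus_per_request


def monthly_gpu_cost(dedicated_gpus: int, spot_gpus: int) -> float:
    """Monthly GPU cost in dollars, service fee included, overhead server excluded."""
    hourly = (dedicated_gpus * L4_DEDICATED_PER_HOUR
              + spot_gpus * L4_SPOT_PER_HOUR)
    return hourly * HOURS_PER_MONTH * (1 + SERVICE_FEE)


if __name__ == "__main__":
    total = gpus_needed(10)
    dedicated = 4                # hypothetical split: one request's worth of dedicated GPUs
    spot = total - dedicated
    print(f"{total} GPUs ({dedicated} dedicated, {spot} spot): "
          f"${monthly_gpu_cost(dedicated, spot):,.0f}/mo")
    # Capacity left if every spot GPU is reclaimed at once
    print(f"Fallback concurrent requests: {dedicated // GPUS_PER_REQUEST}")
```

Under these assumptions, 40 GPUs cost roughly $12.1k/mo if all are spot and roughly $24.3k/mo if all are dedicated, so a figure near $13k/mo corresponds to a mostly-spot mix.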