Hacker News

omneity | last Wednesday at 11:28 AM | 5 replies

> Time-to-First-Token of approximately 19 seconds for a gemma3:4b model (this includes startup time, model loading time, and running the inference)

This is my biggest pet peeve with serverless GPU. 19 seconds is horrible latency from the user's perspective, and that's a best-case scenario.

If this is the best one of the most experienced teams in the world can do, with a small 4B model, then it feels like serverless is really restricted to non-interactive use cases.
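For context, the quoted time-to-first-token is just the wall clock from sending the request until the first streamed output arrives, so on a cold start the container spin-up, model load, and prefill all land inside that one number. A minimal sketch of measuring it in Python; the endpoint is a placeholder, not anything named here:

    import time
    import urllib.request

    # Placeholder streaming endpoint; the thread doesn't name a real one.
    ENDPOINT = "https://example.internal/llm/generate?stream=true"

    def time_to_first_token(url: str) -> float:
        """Seconds from sending the request until the first byte of the
        streamed response arrives (a rough stand-in for the first token)."""
        start = time.monotonic()
        with urllib.request.urlopen(url, timeout=120) as resp:
            resp.read(1)  # blocks until the server emits its first output byte
        return time.monotonic() - start

    if __name__ == "__main__":
        print(f"TTFT: {time_to_first_token(ENDPOINT):.1f}s")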


Replies

diggan | last Wednesday at 11:49 AM

That has to be the cold start, and the next N requests would surely reuse the already-running instance? It sounds bananas that they'd even mention something with 19 seconds of latency on every request, in any context.

happyopossum | last Wednesday at 7:43 PM

Sure, but how often is an enterprise-deployed LLM application really cold-starting? While you could run this for one-off and personal use, this is probably more geared towards bursty ‘here’s an agent for my company sales reps’ kinds of workloads, so you can have an instance warmed, then autoscale up at 8:03am when everyone gets online (or into the office, or whatever).

At that point, 19 seconds looks great, as lower-latency startup times allow for much more efficient autoscaling.
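A hedged sketch of that idea in Python, with made-up names and numbers since the comment doesn't specify a stack: fire a throwaway request a few minutes before the expected morning burst, so the ~19-second cold start is paid once, off the critical path, rather than by the first real user.

    import datetime
    import urllib.request

    # All names and numbers below are illustrative, not from the thread.
    WARMUP_URL = "https://example.internal/llm/healthz"  # placeholder warm-up endpoint
    BURST_HOUR = 8                                       # staff log on around 8am
    LEAD_MINUTES = 5                                     # start warming a few minutes early

    def in_warmup_window(now: datetime.datetime) -> bool:
        """True during the few minutes before the expected burst."""
        burst = now.replace(hour=BURST_HOUR, minute=0, second=0, microsecond=0)
        return burst - datetime.timedelta(minutes=LEAD_MINUTES) <= now < burst

    def send_warmup() -> int:
        """Cheap request whose only job is to make the backend load the model."""
        with urllib.request.urlopen(WARMUP_URL, timeout=60) as resp:
            return resp.status

    if __name__ == "__main__":
        if in_warmup_window(datetime.datetime.now()):
            print("warm-up status:", send_warmup())
        else:
            print("outside warm-up window, nothing to do")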

wut42 | last Wednesday at 11:56 AM

Definitely -- and yet it's kind of a feat compared to other solutions: when I tried Runpod Serverless I could wait up to five minutes for a cold start on an even smaller model than a 4B.

infecto | last Wednesday at 12:18 PM

If you were running a real business with these, would the aim not be to overprovision and to set up autoscaling in such a way that you always have excess capacity?
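One way to picture that: size the fleet off observed demand plus a fixed headroom factor, so bursts land on already-warm capacity. A rough sketch, with illustrative numbers rather than anything from the article:

    import math

    def target_replicas(requests_per_sec: float,
                        per_replica_rps: float,
                        headroom: float = 0.3,
                        min_replicas: int = 1) -> int:
        """Replicas needed for current load plus `headroom` spare capacity."""
        needed = requests_per_sec / per_replica_rps
        return max(min_replicas, math.ceil(needed * (1.0 + headroom)))

    # Example: 12 req/s, ~4 req/s per replica, keep 30% spare.
    print(target_replicas(12, 4))  # -> 4 replicas instead of the bare-minimum 3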

bravesoul2 | last Wednesday at 11:36 AM

Looks like GPU instances, not "lambda", so presumably you would over-provision to compensate.