Hacker News

lexandstuff, last Wednesday at 10:13 AM

Not to mention, if it's an ML workload, you'll also have to factor in downloading the weights and loading them into memory, which can double that time or more.
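As a rough back-of-the-envelope sketch of that point (every number below is a hypothetical placeholder, not a measurement from any real platform, and `cold_start_seconds` is just an illustrative helper), the extra time is roughly the weight size divided by network bandwidth, plus the weight size divided by load throughput:

```python
def cold_start_seconds(base_start_s, weights_gb, download_gbps, load_gb_per_s):
    """Estimate total cold-start time in seconds (illustrative model only).

    base_start_s  -- baseline startup time before any weights are touched
    weights_gb    -- size of the model weights in gigabytes
    download_gbps -- network bandwidth in gigabits per second
    load_gb_per_s -- throughput for loading weights into memory, in GB/s
    """
    if download_gbps <= 0 or load_gb_per_s <= 0:
        raise ValueError("bandwidth and load throughput must be positive")
    download_s = weights_gb * 8 / download_gbps  # gigabytes -> gigabits
    load_s = weights_gb / load_gb_per_s
    return base_start_s + download_s + load_s


# Hypothetical inputs: 10 s baseline, 4 GB of weights, 8 Gbit/s network, 2 GB/s load.
total = cold_start_seconds(base_start_s=10.0, weights_gb=4.0,
                           download_gbps=8.0, load_gb_per_s=2.0)
print(f"estimated cold start: {total:.1f} s")
```

With slower links or larger weights, the download and load terms quickly dominate the baseline, which is how the total can end up at double the baseline or more.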


Replies

rvnx, last Wednesday at 10:15 AM

According to the press release, "we achieved an impressive Time-to-First-Token of approximately 19 seconds for a gemma3:4b model."

Imagine: you have a very small, weak model, and you still have to wait about 20 seconds for the first token of your request.
