lexandstuff • last Wednesday at 10:13 AM

Not to mention, if it's an ML workload, you'll also have to factor in downloading the weights and loading them into memory, which can double that time or more.

Replies

rvnx • last Wednesday at 10:15 AM

According to the press release, "we achieved an impressive Time-to-First-Token of approximately 19 seconds for a gemma3:4b model".

Imagine: you have a very small, weak model, and you have to wait 20 seconds for your request.
