Hacker News

exitb yesterday at 4:08 PM

An operator at load capacity can either refuse requests, or move the knobs (quantization, thinking time) so requests process faster. Both of those things make customers unhappy, but only one is obvious.


Replies

codeflo yesterday at 4:21 PM

This is intentional? I think delivering lower quality than what was advertised and benchmarked is borderline fraud, but YMMV.

sh3rl0ck yesterday at 5:10 PM

I'd wager that lower tok/s vs lower quality of output would be two very different knobs to turn.