Hacker News

megabless123 yesterday at 3:54 PM

noob question: why would increased demand result in decreased intelligence?


Replies

exitb yesterday at 4:08 PM

An operator at load capacity can either refuse requests, or move the knobs (quantization, thinking time) so requests process faster. Both of those things make customers unhappy, but only one is obvious.
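
A minimal sketch of that trade-off, under stated assumptions: the `Knobs` fields, the two presets, and the `degrade_silently` flag are all hypothetical names made up for illustration, not anything a real provider is known to run.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Knobs:
    weight_bits: int          # hypothetical quantization level
    max_thinking_tokens: int  # hypothetical reasoning budget


# Hypothetical presets: full quality vs. cheaper and faster.
FULL = Knobs(weight_bits=16, max_thinking_tokens=8192)
DEGRADED = Knobs(weight_bits=4, max_thinking_tokens=1024)


def handle(load: float, capacity: float, degrade_silently: bool) -> Optional[Knobs]:
    """Return the knobs to serve a request with, or None to refuse it."""
    if load < capacity:
        return FULL
    if degrade_silently:
        # The customer still gets an answer, just a worse one.
        return DEGRADED
    # The customer sees an explicit failure, e.g. an HTTP 429 or 503.
    return None
```

Only the `None` branch is visible to the customer as a failure; the `DEGRADED` branch looks like a normal response.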

vidarh yesterday at 4:01 PM

It would happen if they quietly decide to serve up more aggressively distilled / quantised / smaller models when under load.

awestroke yesterday at 4:06 PM

I've seen some issues with garbage tokens during high load: they seemed to come from a completely different session, mentioned code I've never seen before, and repeated lines over and over. I suspect Anthropic has some threading bugs or race conditions in their caching/inference code that only show up under very high load.

Wheaties466 yesterday at 4:03 PM

From what I understand, this can come from the batching of requests.
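
One mechanism that might be behind this, offered as an assumption rather than something the thread confirms: floating-point addition is not associative, so if a different batch size changes how a kernel groups its partial sums, the same inputs can produce slightly different numbers. The helpers below are hypothetical and only show the arithmetic effect, not any real inference code.

```python
def sequential_sum(values):
    """Add values strictly left to right."""
    total = 0.0
    for v in values:
        total += v
    return total


def split_sum(values, k):
    """Add the first k values and the rest separately, then combine.

    Stands in for an accumulation whose grouping depends on how the
    work was partitioned (an assumption made for illustration).
    """
    return sequential_sum(values[:k]) + sequential_sum(values[k:])


values = [0.1, 0.2, 0.3]
a = sequential_sum(values)  # (0.1 + 0.2) + 0.3
b = split_sum(values, 1)    # 0.1 + (0.2 + 0.3)
print(a == b, abs(a - b))
```

The two results differ only in the last bit, but a difference that small can be enough to flip a near-tie between two candidate tokens.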
