AI data centers are basically a whole lot of pipelines pumping data around via queues. They want those expensive, power-hungry cards near 100% utilized at all times. So each system keeps a queue of jobs ready to run, feeding into GPU memory as fast as completed jobs are read out (and passed to the next stage), and they aim to keep enough backlog in these queues to keep the pipeline full. You see responses in seconds, but at the data center your request was broken into jobs, passed around into queues, processed in an orderly manner and pieced back together.
With fast mode you're literally skipping the queue. An outcome of all this is that responses for the rest of us will get slower the more people use the 'fast' option.
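For intuition, here's a toy sketch of how that kind of priority queue behaves (plain Python; the names and the scheduler itself are made up for illustration, not how any provider actually does it):

    import heapq
    import itertools
    from dataclasses import dataclass, field

    FAST, STANDARD = 0, 1  # lower number is dequeued first

    @dataclass(order=True)
    class Job:
        priority: int
        seq: int                          # tie-breaker: FIFO within a priority level
        prompt: str = field(compare=False)

    class ToyScheduler:
        """Hypothetical single-node scheduler: fast jobs always drain before standard ones."""
        def __init__(self):
            self._heap = []
            self._seq = itertools.count()

        def submit(self, prompt, fast=False):
            heapq.heappush(self._heap,
                           Job(FAST if fast else STANDARD, next(self._seq), prompt))

        def next_batch(self, slots):
            # Fill the next batch of GPU slots from the front of the queue.
            n = min(slots, len(self._heap))
            return [heapq.heappop(self._heap) for _ in range(n)]

    if __name__ == "__main__":
        s = ToyScheduler()
        for i in range(4):
            s.submit(f"standard-{i}")
        s.submit("fast-0", fast=True)
        s.submit("fast-1", fast=True)
        # Both fast jobs jump ahead of all four standard jobs already waiting:
        print([j.prompt for j in s.next_batch(3)])  # ['fast-0', 'fast-1', 'standard-0']

The point of the toy example: the GPUs stay just as busy either way, but every fast job that jumps ahead pushes the standard backlog further back, which is exactly the slowdown the rest of us would see.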
I do suspect they'll also soon have a slow option for those who have Claude doing things overnight and don't really care about response latency. The ultimate goal is pipelines of data keeping the hardware at 100% utilization at all times.