Yeah, and totally missed the RAI part, billing, model deployment, security patches, rate limiting, caching, dead GPUs, metrics, multiple regions, gov clouds, GDPR (or data locality issues), monitoring, alerting, and god knows what else, all while at extreme loads.
GDPR doesn’t affect load, dead GPUs are no different from any software freeze, a model is a file update, metrics pipelines already scale very well at far bigger volumes and scale roughly linearly, and security updates are hedged with gradual rollouts, canaries, feature flags, etc.
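To be concrete, that gradual-rollout gate is a tiny piece of code. This is just a sketch to show the shape of it; the function names and the 5% are made up, not anyone's actual setup:

    import hashlib

    ROLLOUT_PERCENT = 5  # start the canary at 5% of users

    def bucket(user_id: str) -> int:
        # Stable hash: a given user always lands in the same bucket,
        # so they don't flip between builds from request to request.
        return int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100

    def use_patched_build(user_id: str) -> bool:
        # Feature-flag style gate: ramp ROLLOUT_PERCENT toward 100 while
        # error rates stay flat; set it to 0 to roll back instantly.
        return bucket(user_id) < ROLLOUT_PERCENT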
From an ops perspective, all of these are already well-solved problems that scale, because plenty of companies have had to solve them before.
It’s even better here because you can throw millions in salaries to “steal” the insider knowledge of how their production actually works.
No doubt it is fast-paced, but the complexity of going from 100k GPUs to 1M is much lower than that of going from 1k to 10k GPUs.
All three big AI companies had the luxury of being able to do everything directly on production servers during the scaling phase, because customers were extremely tolerant, and are still quite tolerant.
You can even set request limits for large users and shape the traffic.
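And that kind of shaping is a solved primitive too, e.g. a plain token bucket per customer. Again just a sketch with invented rates, not anyone's actual config:

    import time

    class TokenBucket:
        # Classic token bucket: refill at 'rate' tokens/sec, burst up to 'capacity'.
        def __init__(self, rate: float, capacity: float):
            self.rate, self.capacity = rate, capacity
            self.tokens, self.last = capacity, time.monotonic()

        def allow(self, cost: float = 1.0) -> bool:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= cost:
                self.tokens -= cost
                return True
            return False  # throttle: shed, queue, or 429 the request

    # One bucket per customer tier; numbers here are purely illustrative.
    limits = {"big_customer": TokenBucket(rate=50, capacity=200),
              "free_tier": TokenBucket(rate=1, capacity=5)}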
Cloudflare, in comparison: high scale, low latency, end users not at all tolerant of downtime, customers even less tolerant, clearly hostile actors actively trying to take your systems down, a limited budget, lots of different workloads, etc.
So LLM companies are very lucky: you scale a single workload, largely for free users; most paid customers can be throttled and nobody will complain, because nobody knows what the limits are; and there is a lot of tolerance for high latency and even downtime.