I hated my time as an SRE. But … can’t it be done with some combination of canaries and blue green deployments and extensive testing? Where when things look good you just swap all the traffic to the good stuff keeping the rollback hot etc etc?
That's how we got 99.99% at Netflix. And it cost a lot of money. But a canary implies that something may go wrong and you have to roll back. The canary is still production traffic, so some transactions would fail, which isn't allowed for this kind of workload.
I image you'd have to use shadow execution, where you roll out a full second copy, run every transaction through both, and compare the results. And then, only after a certain time, switch traffic to the new infra and tear down the old.
But you would need a ton of extra hardware (more than double) and a lot of ways to keep data in sync. And of course if you put an LLM or other non-deterministic system in there, that's a whole other can of worms.
Like I said, a fun problem to solve. :)
There are different kinds of updates that influence options and feasibility. Keeping in mind that deep in the heart of an exchange is a single threaded process, the sequencer. Therefore, you have three layers, external facing protocols, sequencer/matching engines, and internal interfaces. Internal interfaces are the easiest for b/g. External protocols, any change worth its weight changes the protocol and therefore requires participants to change their codebases too. Versioning protocols is an option, but still the integration with consumers is much more transparent and usually you have them test on pre-prod environments, occasionally also requiring attestation and conformance testing (regulated markets). Sequencer and matching engine are at the core. You could do parallel runs but not b/g. Theoretically you could abstract the matching engine and keep a barebones sequencer immutable, but this will have performance implications. So yes, you can do things, but not in a completely transparent way, unless if you introduce an “upgrade jitter” to give you a window for transparent upgrades. It’s an interesting domain, I think people will just accept occasional downtimes as a better option than constant jitter cost.