My favorite tool for trying scary complicated things in an unknown space is the feature flag. This works even if you have zero tests and no documentation. The only thing you need is the live production system and a way to toggle the flag at runtime.
If you can ship your hypothesis alongside an effectively unaltered version of prod, testing things without breaking other things becomes much more feasible. I've never been in a real business scenario where I wasn't able to negotiate a brief experimental window during live business hours for at least one client.
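A minimal sketch of the idea, assuming a hypothetical in-process flag store (real systems read the flag from a shared config service so it can be flipped at runtime without a redeploy; all names here are illustrative):

```python
# The flag is re-read on every call, so flipping it takes effect
# immediately. Both paths ship together; prod behavior is unchanged
# until the flag is turned on.
flags = {"new_pricing_engine": False}  # in real systems: a shared store

def old_price(order):
    # effectively unaltered prod path
    return order["qty"] * order["unit_price"]

def new_price(order):
    # hypothetical new logic under test: volume discount over 10 units
    subtotal = order["qty"] * order["unit_price"]
    return subtotal * 0.9 if order["qty"] > 10 else subtotal

def compute_price(order):
    if flags["new_pricing_engine"]:
        return new_price(order)
    return old_price(order)
```

Flipping `flags["new_pricing_engine"]` to `True` for the experimental window, and back afterwards, is the whole rollback story.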
Feature flags are like Bloom filters: they make 98 out of 100 situations better and the other 2 worse. When performance is the issue, that's usually fine. When reliability is the issue, it's not sufficient.
If you work on fifty feature toggles a year, one of them is going to go wrong. If your team is doing a few hundred, you’re gonna have oopsies.
Most of the problematic cases are ones where the code is set up so that the old path and the new one can't bypass each other cleanly. They get tangled up, and sometimes the toggle ends up implemented inverted, so it's difficult to remove the old path without breaking the new one.
You can go even further with something like the Scientist gem at the application level, or tee-testing at the data store level. Compare A and A', record the result, and return A. Eventually, you reach 100% compatibility between the two (or only deviations that are desirable) and can remove A, leaving only A'.
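A minimal sketch of that pattern (the real Scientist library is a Ruby gem; this Python version and its names are purely illustrative):

```python
def experiment(name, control, candidate, report):
    # Scientist-style experiment: run both paths, record whether they
    # agree, but ALWAYS return the control (A) result so observable
    # behavior is unchanged.
    a = control()
    try:
        b = candidate()
        report(name, a == b, None)
    except Exception as e:
        report(name, False, e)  # candidate failures never escape
    return a

results = []

def report(name, match, error):
    results.append((name, match, error))

def old_lookup():
    # A: the current production path
    return sorted([3, 1, 2])

def new_lookup():
    # A': the replacement under test
    return [1, 2, 3]

value = experiment("lookup-rewrite", old_lookup, new_lookup, report)
```

Once the recorded mismatch rate hits zero (or only acceptable deviations remain), the control path can be deleted.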
I also like recording and replaying production traffic, so that you can do your tee-testing in an environment that doesn't affect latency for production, but that's not quite the same thing.
You’ve just resolved a problem I had. I hit this on a search engine, but I shipped the change as a “v2” and told customers to switch to it. And you know the v2 problem: discrepancies that customers like. So both versions have fans, but we really need to pull the plug on v1. You’ve just solved it: I should have indexed even records with v1 and odd records with v2. Then only I would know which engine was used.
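The even/odd split described above can be sketched in a few lines (hypothetical function name, for illustration only):

```python
def engine_for(record_id):
    # Deterministic split: even record ids are served by v1, odd ids
    # by v2. The routing is invisible to customers, so neither version
    # can accumulate its own fan base.
    return "v1" if record_id % 2 == 0 else "v2"
```

The same trick generalizes: hashing the id and comparing against a threshold gives you any rollout percentage, not just 50/50.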
While very powerful, I think it's worth calling out some pitfalls. A few things we've run into:

- long-lived feature flags that are never cleaned up (which usually leave zombie or partially dead code)
- rollout drift, where different environments or customers have different flags set and it's difficult to know who actually has the feature
- not flagging all connected functionality (i.e. one API is missing the flag it should have had)
A good decommissioning/cleanup strategy definitely helps