> searching for the cause of a bug lets you discover multiple "forgotten" servers, ETL jobs, crons all interacting together. And no one knows why they do [..]
And then comes the "beginner's" mistake. They don't seem to be doing anything. Let's remove them, what could possibly go wrong?
If it’s done in a controlled manner with the ability to revert quickly, you’ve just instituted a “scream test[0].”
____
[0] https://open.substack.com/pub/lunduke/p/the-scream-test
(Obviously not the first description of the technique, as you’ll read, but I like it as a clear example of how it works.)
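For what "controlled, with the ability to revert quickly" might look like in practice, here is a minimal sketch in Python. Everything in it is hypothetical and only for illustration: the `ScreamTest` class, the component names, and the "billing team" are made up. The idea it assumes is that you disable a component instead of deleting it, restore it as soon as someone screams, and treat only what stays quiet as a removal candidate.

```python
from dataclasses import dataclass, field


@dataclass
class ScreamTest:
    """Hypothetical tracker: components are disabled, never deleted outright."""
    enabled: dict[str, bool] = field(default_factory=dict)
    log: list[str] = field(default_factory=list)

    def disable(self, name: str) -> None:
        # Turn the component off but keep it around, so reverting is trivial.
        self.enabled[name] = False
        self.log.append(f"disabled {name}")

    def scream(self, name: str, who: str) -> None:
        # Someone complained: restore immediately and record who depends on it.
        self.enabled[name] = True
        self.log.append(f"restored {name}: needed by {who}")

    def safe_to_remove(self) -> list[str]:
        # Components nobody screamed about during the observation window.
        return sorted(n for n, on in self.enabled.items() if not on)


st = ScreamTest({"legacy-etl": True, "nightly-cron": True})
st.disable("legacy-etl")
st.disable("nightly-cron")
st.scream("nightly-cron", who="billing team")
print(st.safe_to_remove())  # ['legacy-etl']
```

The log doubles as the documentation nobody wrote the first time: after the test you know who needs what.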
I've fixed more than enough bugs by just removing the code and doing it the right way.
Of course you can get lost along the way, but the worst case is that you learn the architecture.
The next mistake is thinking that completely re-writing the system will clean out the cruft.
That's a management/cultural problem. If no one knows why it's there, the right answer is to remove it and see what breaks. If you're too afraid to do anything, for nebulous cultural reasons, you're paralyzed by fear and no one is operating with any efficiency. It hits different depending on who does it: the senior expert everyone reveres, who invented everything the company depends on, vs a summer intern, vs Elon Musk after he bought your company (Twitter). Hate the man for doing it messily and ungraciously, but you can't argue with the fact that it gets results.
If you follow the prescribed procedure and involve all required management, it stops being a beginner's mistake. And given reasonable rollback provisions, it stops being a mistake at all: if nobody knows what the thing is, it cannot be very important, and a removal attempt is the most effective and cost-efficient way to find out whether the thing can be removed.