I'm in fintech right now, experiencing that pain. We do a sort of "reverse waterfall".
We respond to spikes in exceptions, then read logs to try to figure out why something "went wrong". If we can change some code to fix it, we do so. But since we don't have confidence in our fixes, we have to move further back to the design and guess how the system is supposed to work. I'm hoping next month we start a dialogue with upstream and ask how we're supposed to use their API.