This read was a blast from the past. I'm not going to comment on much from OP and instead give a little of my experience there.
Straight out of college in 2017 I joined the Compute Fabric Controller (FC) org as a SWE on an absolutely wonderful team that dealt with mostly container management, VM and Host fault handling & repair policies, and Fabric to Host communication with most of our code in the FC. I drove our team's efforts on the never ending "Unhealthy" node workstream, the final catch-all bucket in the Host fault handler mentioned in OP. I also did heavy work in optimizing repair policies, reactive diagnostics for improved repairs and offline analysis, OS and HW telemetry ingestion from the Host like SEL events into the repair manager in real time, wrote the core repair manager state machine in the new AZ level service that we decoupled from the Fabric, drove Kernel Soft Reboot (KSR)/Tardigrade as a repair action for minimal VM impact for some host repairs, and helped stand up into eventually owning a new warm path RCA attribution service to help drive the root underlying causes of reliability issues and feed some offline analysis back into the live repair manager.
The work was difficult but also really really interesting. For example, Balancing repair policies around reliability is tricky. There's a constant fight in repair policies in grey situations between minimizing total VM downtime vs any VM interruptions/reboots/heals at all, because the repair controller doesn't have perfect information. If telemetry is pointing to VMs being degraded or down on the host, yet in reality they're not, we are the ones inducing the VM downtime by performing an impactful repair. If we wait a little while before taking an impactful repair action, it may be a transient issue that will resolve itself in the moment, at which point we can do much less impactful repairs after like Live Migration if the host is healthy enough. On the flipside, if some telemetry is saying the VMs are up yet they're down in reality and we just don't know it yet, taking time to collect diagnostics and then take a repair action(s) leads to only more overall total downtime.
When I joined in 2017 our team was 7 or 8 people including myself, yet had enough work for at least double that amount of people. On-call was a nightmare the first 2 years. Building Azure back then was like trying to build a car while already sitting behind the steering wheel of that car as it was already barreling down the highway. Everyone on my immediate team the first couple years were a joy to work with, highly competent, hard working, and all of us working absurd hours. For me 60hrs/wk was avg, with many weeks ~80 and a few weeks ~100. Other than the hours though, it was a splendid team environment and I'd like to think we had good engineering culture within our team, though maybe I'm biased. Engineering culture and quality did however vary substantially between orgs and teams. We were heavily under resourced and always needed more headcount, as did nearly every other team in Azure Compute. That never changed during my tenure even though my team's size ballooned to ~20 by 2020, and eventually big enough to where we had to split the team. There was high turnover from the lack of headcount and overwork which was somewhat alleviated by lowering the hiring bar... which obviously opened up another can of worms. This resourcing issue might explain, in part, why Azure is the way it is. We were always playing catchup as a result of the woes of chronic understaffing for years. I eventually burnt out which turned into spiraling mental health, physical health issues, constant panic attacks, and then a full blown mental health crisis after which I took LOA and eventually left the company. I came back briefly for a bit during LOA, and learned that the RCA service I'd built with the help of a coworker (who also left Azure) and was only a small part of our overall workload, had turned into a full fledged team of 9 people dedicated to working on that service in my absence. I know that stating some of this might affect my employment in the future but I don't really care. I know I'm not alone in experiencing burnout working in Azure. It wasn't my manager's fault either, he was amazing. He'd often ask and I would incorrectly yet confidently reassure him that I wasn't burning out but I simply didn't notice the signs. Things are better now though and I'm just happy to be here.
Kudos to the many brilliant people I worked alongside there, I hope you're all doing great.
2 years of 60+ hours weeks is not good engineering culture, or any kind of culture.
The first and most important lesson, that I try to each every young developer starting in the industry: Go home after clocking in your hours negotiated in your contract. Drop your pen. Go home. Sleep well.
And I hope, that every sensible senior developer in here does the same. Lead by example. Maybe it would prevent a few burnouts in this industry.
And if you are a manager, then send your people home after they have clocked in their negotiated hours. For their own well-being. It’s your responsibility. And if it’s not working, then force them to go home.
I hope you are better by now and got through the tough time. All the best for you!