It would be nice to have a lot more detail. The WTF sections are the best part. Sounds like your gear needs "this side towards enemy" sign and/or the right affordances so it only goes in one way.
Did you standardize on layout at the rack level? What poke-yoke processes did you put into place to prevent mistakes?
What does your metal->boot stack look like?
Having worked for two different cloud providers and built my own internal clouds with PXE booted hosts, I too find this stuff fascinating.
Also take utmost advantage of a new DC when you are booting it to try out all the failure scenarios you can think of and the ones you can't through randomized fault injection.
> It would be nice to have a lot more detail
I'm going to save this for when I'm asked to cut the three paras on power circuit types.
Re: standardising layout at the rack level; we do now! we only figured this out after site #2. It makes everything so much easier to verify. And yeah, validation is hard - manually doing it thus far; want to play around with scraping LLDP data but our switch software stack has a bug :/. It's an evolving process, the more we work with different contractors, the more edge cases we unearth and account for. The biggest improvement is that we have built a internal DCIM that templates a rack design and exports a interactive "cabling explorer" for the site techs - including detailed annotated diagrams of equipment showing port names, etc... The screenshot of the elevation is a screenshot of part of that tool.
> What does your metal->boot stack look like?
We've hacked together something on top of https://github.com/danderson/netboot/tree/main/pixiecore that serves a debian netboot + preseed file. We have some custom temporal workers to connect to Redfish APIs on the BMCs to puppeteer the contraption. Then a custom host agent to provision QEMU VMs and advertise assigned IPs via BGP (using FRR) from the host.
Re: new DCs for failure scenarios, yeah we've already blown breakers etc... testing stuff (that's how we figured out our phase balancing was off). Went in with a thermal camera on another. A site in AMS is coming up next week and the goal for that is to see how far we can push a fully loaded switch fabric.