Hacker News

sitkack, yesterday at 10:40 PM

It would be nice to have a lot more detail. The WTF sections are the best part. Sounds like your gear needs a "this side towards enemy" sign and/or the right affordances so it only goes in one way.

Did you standardize on layout at the rack level? What poka-yoke processes did you put into place to prevent mistakes?

What does your metal->boot stack look like?

Having worked for two different cloud providers and built my own internal clouds with PXE booted hosts, I too find this stuff fascinating.

Also, take utmost advantage of a new DC while you are bringing it up: try out all the failure scenarios you can think of, and the ones you can't, through randomized fault injection.


Replies

ca508, yesterday at 10:58 PM

> It would be nice to have a lot more detail

I'm going to save this for when I'm asked to cut the three paras on power circuit types.

Re: standardising layout at the rack level: we do now! We only figured this out after site #2. It makes everything so much easier to verify. And yeah, validation is hard. We're doing it manually thus far; I want to play around with scraping LLDP data, but our switch software stack has a bug :/. It's an evolving process: the more we work with different contractors, the more edge cases we unearth and account for. The biggest improvement is that we have built an internal DCIM that templates a rack design and exports an interactive "cabling explorer" for the site techs, including detailed annotated diagrams of equipment showing port names, etc. The screenshot of the elevation is a screenshot of part of that tool.
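For anyone curious what LLDP-based validation could look like once the data is scrapeable, here is a minimal sketch. It is not our tooling: the `Link` shape, the device and port names, and the idea that both the DCIM export and the LLDP scrape can be reduced to sets of links are all assumptions for illustration.

```python
from typing import Dict, List, NamedTuple, Set


class Link(NamedTuple):
    """One cable end-to-end. Hypothetical shape, not a real DCIM or LLDP schema."""
    local_device: str
    local_port: str
    remote_device: str
    remote_port: str


def diff_cabling(expected: Set[Link], observed: Set[Link]) -> Dict[str, List[Link]]:
    """Compare templated links against LLDP-observed links, keyed by local port."""
    exp_by_port = {(l.local_device, l.local_port): l for l in expected}
    obs_by_port = {(l.local_device, l.local_port): l for l in observed}

    # Port is in the template but no neighbour was seen on it.
    missing = [l for k, l in exp_by_port.items() if k not in obs_by_port]
    # Port has a neighbour, but not the one the template asks for.
    miswired = [
        obs_by_port[k]
        for k, l in exp_by_port.items()
        if k in obs_by_port and obs_by_port[k] != l
    ]
    # Neighbour seen on a port the template leaves empty.
    unexpected = [l for k, l in obs_by_port.items() if k not in exp_by_port]

    return {
        "missing": sorted(missing),
        "miswired": sorted(miswired),
        "unexpected": sorted(unexpected),
    }


if __name__ == "__main__":
    expected = {
        Link("leaf1", "swp1", "host1", "eth0"),
        Link("leaf1", "swp2", "host2", "eth0"),
    }
    observed = {
        Link("leaf1", "swp1", "host1", "eth0"),
        Link("leaf1", "swp2", "host3", "eth0"),  # cable landed on the wrong host
    }
    for kind, links in diff_cabling(expected, observed).items():
        print(kind, links)
```

Keying on the local port rather than on the whole link is what lets a swapped cable show up as "miswired" instead of as one missing link plus one unexpected link.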

> What does your metal->boot stack look like?

We've hacked together something on top of https://github.com/danderson/netboot/tree/main/pixiecore that serves a Debian netboot + preseed file. We have some custom Temporal workers that connect to the Redfish APIs on the BMCs to puppeteer the contraption. Then a custom host agent provisions QEMU VMs and advertises assigned IPs via BGP (using FRR) from the host.
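To give a flavour of the Redfish part, here is a rough sketch of the two calls that usually get a machine to netboot: set a one-time PXE boot override, then reset. The property names come from the standard Redfish ComputerSystem schema; everything else is an assumption, including the `pxe_once_requests` helper, the `bmc.example.net` hostname, and the system ID `1` (IDs vary by vendor). Authentication and TLS handling are left out, and the requests are only built, not sent.

```python
import json
import urllib.request
from typing import List


def pxe_once_requests(bmc_host: str, system_id: str = "1") -> List[urllib.request.Request]:
    """Build (but do not send) the Redfish requests for a one-shot PXE boot.

    Hypothetical helper. A real client would add auth headers and would read
    the reset action target from the system resource instead of assuming it.
    """
    base = f"https://{bmc_host}/redfish/v1/Systems/{system_id}"
    headers = {"Content-Type": "application/json"}

    # 1. Ask the BMC to boot from the network on the next boot only.
    set_boot = urllib.request.Request(
        base,
        data=json.dumps(
            {"Boot": {"BootSourceOverrideEnabled": "Once",
                      "BootSourceOverrideTarget": "Pxe"}}
        ).encode(),
        headers=headers,
        method="PATCH",
    )

    # 2. Power-cycle the host so the override takes effect.
    reset = urllib.request.Request(
        f"{base}/Actions/ComputerSystem.Reset",
        data=json.dumps({"ResetType": "ForceRestart"}).encode(),
        headers=headers,
        method="POST",
    )
    return [set_boot, reset]


if __name__ == "__main__":
    for req in pxe_once_requests("bmc.example.net"):
        print(req.get_method(), req.full_url, req.data.decode())
```

From there the netboot server answers the PXE request and the preseed file takes over the install.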

Re: new DCs for failure scenarios, yeah we've already blown breakers etc... testing stuff (that's how we figured out our phase balancing was off). Went in with a thermal camera on another. A site in AMS is coming up next week and the goal for that is to see how far we can push a fully loaded switch fabric.
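On phase balancing, the check itself is simple arithmetic, so here is a sketch of the kind of thing you can run against a planned rack layout before finding out the hard way. The circuit list, the phase labels, and the "maximum deviation from the mean, divided by the mean" metric are assumptions for illustration, not a description of what we run.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def load_per_phase(circuits: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Sum the current draw (amps) of each circuit onto its phase."""
    totals: Dict[str, float] = defaultdict(float)
    for phase, amps in circuits:
        totals[phase] += amps
    return dict(totals)


def phase_imbalance(amps_by_phase: Dict[str, float]) -> float:
    """Largest deviation from the mean phase load, as a fraction of the mean."""
    loads = list(amps_by_phase.values())
    mean = sum(loads) / len(loads)
    if mean == 0:
        return 0.0
    return max(abs(a - mean) for a in loads) / mean


if __name__ == "__main__":
    # Hypothetical planned draw per circuit: (phase, amps).
    circuits = [("L1", 12.0), ("L1", 10.0), ("L2", 9.0), ("L3", 5.0)]
    totals = load_per_phase(circuits)
    print(totals, f"imbalance={phase_imbalance(totals):.0%}")
```

A phase that never appears in the circuit list is absent from the totals rather than counted as zero, so seed the list with every phase if you want an empty one to show up as imbalance.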
