
Youden · yesterday at 1:21 PM

There's a lot of focus in this thread on the atomic clocks but in most datacenters, they're not actually that important and I'm dubious that the hyperscalers actually maintain a "fleet" of them, in the sense that there are hundreds or thousands of these clocks in their datacenters.

The ultimate goal is usually to have a bunch of computers all around the world run synchronised to one clock, within some very small error bound. This enables fancy things like [0].

Usually, this is achieved by having some master clock(s) for each datacenter, which distribute time to other servers using something like NTP or PTP. These clocks, like any other clock, need two things to be useful: an oscillator, to provide ticks, and something by which to set the clock.
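
For a sense of what the NTP side of that distribution actually computes, here's a minimal sketch of the classic offset/delay math from one client/server exchange (a toy, not chrony or ntpd code):

```python
def ntp_offset_delay(t1: float, t2: float, t3: float, t4: float) -> tuple[float, float]:
    """Classic NTP math for one request/response exchange.

    t1: client transmit time (client clock)
    t2: server receive time  (server clock)
    t3: server transmit time (server clock)
    t4: client receive time  (client clock)
    """
    # Estimated offset of the client clock relative to the server,
    # assuming the network path is symmetric (path asymmetry is the
    # dominant error term in practice).
    offset = ((t2 - t1) + (t3 - t4)) / 2
    # Round-trip delay, with the server's processing time removed.
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay
```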

In standard off-the-shelf hardware, like the Intel E810 network card, you'll have an OCXO, like [1], with a GPS module. The OCXO provides the ticks; the GPS module provides a timestamp to set the clock with and a pulse for when to set it.

As long as you have GPS reception, even this hardware is extremely accurate. The GPS module provides a new timestamp, potentially accurate to within single-digit nanoseconds ([2] datasheet), every second. These timestamps can be used to adjust the oscillator and/or how its ticks are interpreted, such that you maintain accuracy between the timestamps from GPS.
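
To make "adjust the oscillator and/or how its ticks are interpreted" concrete, here's a toy disciplining loop: each PPS edge yields a phase error between the local clock and GPS, and a PI servo turns that into a small frequency correction. The names and gains are illustrative, not vendor code:

```python
class PiServo:
    """Toy PI servo disciplining a local oscillator against a 1PPS reference.

    A sketch only: real servos (linuxptp, chrony) add outlier filtering,
    gain scheduling, and holdover handling.
    """

    def __init__(self, kp: float = 0.7, ki: float = 0.3):
        self.kp = kp           # proportional gain
        self.ki = ki           # integral gain
        self.integral = 0.0    # accumulated frequency correction

    def step(self, phase_error_s: float) -> float:
        """phase_error_s: local clock minus GPS at the PPS edge, in seconds.

        Returns a fractional frequency adjustment for the next second;
        negative means "run the clock slower".
        """
        self.integral += self.ki * phase_error_s
        return -(self.kp * phase_error_s + self.integral)

# e.g. local clock 25 ns ahead at this PPS edge:
# PiServo().step(25e-9)  -> -2.5e-8 (slow down by ~25 parts per billion)
```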

The problem comes when you lose GPS. Once this happens, you become dependent on the accuracy of the oscillator. An OCXO like [1] can hold to within 1µs accuracy over 4 hours without any corrections, but if you need better than that (either more time below 1µs, or better than 1µs over the same time), you need a better oscillator.
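
Those holdover figures fall straight out of the oscillator's fractional frequency error: a clock off by a constant fraction y accumulates roughly y·t of time error. Checking the 1µs / 4 hours figure:

```python
# Back-of-envelope holdover arithmetic (a constant-frequency-offset
# model; real holdover specs also include aging and temperature terms).

def holdover_error(y_frac: float, seconds: float) -> float:
    """Time error accumulated by a free-running clock with a constant
    fractional frequency offset of y_frac."""
    return y_frac * seconds

# 1 us over 4 hours implies an average fractional frequency error of
# roughly 7e-11:
y = 1e-6 / (4 * 3600)
print(f"implied fractional frequency error: {y:.1e}")             # ~6.9e-11

# The same oscillator free-running for 24 hours would drift roughly:
print(f"24h drift: {holdover_error(y, 24 * 3600) * 1e6:.1f} us")  # ~6.0 us
```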

The best oscillators are atomic oscillators. [3], for example, can maintain better than 200ns accuracy over 24h.

So for a datacenter application, I think the main reason for an atomic clock is simply retaining extreme accuracy in the event of a GPS outage. For quite reasonable accuracy, a more affordable OCXO works perfectly well.

[0]: https://docs.cloud.google.com/spanner/docs/true-time-externa...

[1]: https://www.microchip.com/en-us/product/OX-221

[2]: https://www.u-blox.com/en/product/zed-f9t-module

[3]: https://www.microchip.com/en-us/products/clock-and-timing/co...


Replies

dave_universetf · yesterday at 6:02 PM

I don't know about all hyperscalers, but I have knowledge of one of them that has a large enough fleet of atomic frequency standards to warrant dedicated engineering. Several dozen frequency standards at least, possibly low hundreds. Definitely not one per machine, but also not just one per datacenter.

As you say, the goal is to keep the system clocks on the server fleet tightly aligned, to enable things like TrueTime. But also to have sufficient redundancy and long enough holdover in the absence of GNSS (usually due to hardware or firmware failure on the GNSS receivers) that the likelihood of violating the SLA on global time uncertainty is vanishingly small.

The "global" part is what pushes towards having higher end frequency standards, they want to be able to freewheel for O(days) while maintaining low global uncertainty. Drifting a little from external timescales in that scenario is fine, as long as all their machines drift together as an ensemble.

The deployment I know of was originally rubidium frequency standards disciplined by GNSS, but that was later upgraded to cesium standards to improve accuracy and holdover performance. Likely an "industrial grade" cesium standard that's fairly readily available: very good, but not in the same league as the stuff NIST operates.

Animats · yesterday at 7:04 PM

GPS satellites have their own atomic clocks. They're synchronized to clocks at the GPS control center at Schriever Space Force Base, Colorado, formerly Falcon AFB. Those in turn are steered to UTC as maintained by the US Naval Observatory. GPS has a lot of ground infrastructure checking on the satellites, and backup control centers. GPS should continue to work fine, even if there's some absolute error vs. UTC. Unless there have been layoffs.

toast0 · yesterday at 6:17 PM

> There's a lot of focus in this thread on the atomic clocks but in most datacenters, they're not actually that important and I'm dubious that the hyperscalers actually maintain a "fleet" of them, in the sense that there are hundreds or thousands of these clocks in their datacenters.

I mean, fleets come in all sizes; but if you put one atomic reference in each AZ of each region, there's a fleet. Maybe the references aren't great at distributing time, so you add a few NTP distributors per datacenter too and your fleet is a little bigger. Google's got 42 regions in GCP, so they've got a case for hundreds of machines for time (plus they've invested in Spanner, which has some pretty strict needs); other clouds are likely similar.
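
The back-of-envelope version of that count, with the per-region numbers as pure assumptions:

```python
# Rough fleet sizing (per-region counts are illustrative assumptions,
# not published numbers).
regions = 42
atomic_refs_per_region = 3        # say, one per AZ
ntp_distributors_per_region = 4   # a few per datacenter

print(regions * (atomic_refs_per_region + ntp_distributors_per_region))
# -> 294: "hundreds of machines for time"
```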