Clock synchronization is a nightmare

204 points • by grep_it • last Tuesday at 6:59 PM • 138 comments

Comments

I highly recommend that anyone look up how PTP works and how it compares to NTP. Clock sync is very interesting. When I joined an HFT company, the first thing I did was understand this stuff. We care about it a lot[1].

If you want a specific question to answer, answer this: why does PTP need hardware timestamping to achieve high precision (where the network card itself assigns timestamps to packets, rather than having the kernel do it as part of TCP/IP processing)? If we use software timestamps, why can we do microsecond precision at best? If you understand this, it goes a very long way to understanding the core ideas behind precise clock sync.

Once you have a solid understanding of PTP, look into White Rabbit. They’re able to sync two clocks with sub-ns precision. In case that isn’t obvious, that is absolutely insane.

[1] So do a lot of people. For example, audio engineers. Once, an audio engineer absolutely talked my ear off about PTP. I had no idea that audio people understood clock sync so well, but they do!
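
For anyone working through the hardware-timestamping question above, here is a minimal sketch of the arithmetic behind PTP's delay request-response exchange. The function name and all the numbers are made up for illustration, and it assumes the path delay is symmetric. It shows why any latency between a packet hitting the wire and the timestamp being taken lands directly in the offset estimate.

```python
def ptp_offset_and_delay(t1, t2, t3, t4):
    """Estimate slave-minus-master offset and one-way delay, in nanoseconds.

    t1: master sends Sync (master clock)
    t2: slave receives Sync (slave clock)
    t3: slave sends Delay_Req (slave clock)
    t4: master receives Delay_Req (master clock)
    Assumes the path delay is the same in both directions.
    """
    forward = t2 - t1   # wire delay + offset
    reverse = t4 - t3   # wire delay - offset
    return (forward - reverse) / 2, (forward + reverse) / 2


# Made-up scenario: slave runs 500 ns ahead, wire delay is 1000 ns each way.
hw = ptp_offset_and_delay(t1=0, t2=1_500, t3=50_000, t4=50_500)

# Same exchange, but t2 is stamped 20 us late because the packet waited
# for an interrupt and the kernel before software read the clock.
sw = ptp_offset_and_delay(t1=0, t2=21_500, t3=50_000, t4=50_500)

print(hw)  # (500.0, 1000.0)
print(sw)  # (10500.0, 11000.0)
```

In this toy case half of the 20 µs stamping latency shows up as offset error. Since that latency varies from packet to packet in software, it cannot simply be subtracted out, which is roughly why stamping in the NIC matters.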

josephg • yesterday at 10:57 PM

> When two transactions happen at nearly the same time on different nodes, the database must determine which happened first. If clocks are out of sync, the database might order them incorrectly, violating consistency guarantees.

This is only true if you use wall clock time as part of your database’s consistency algorithm. Generally I think this is a huge mistake. It’s almost always much easier to swap to a logical clock, which doesn’t care about wall time. And then you don’t have to worry about NTP.

The basic idea is this: event A happened before event B iff A (or something that happened after A) was observed by the node that generated B before B was generated. As a result, you end up with a DAG of events, kind of like git. Some events aren’t ordered relative to one another. (We say they happened concurrently.) If you ever need a global order for all events, you can deterministically pick an arbitrary order for concurrent events by comparing ids or something. And this will give you a total order that will be the same on all peers.

If you make database events work like this, time is a little more complex. (It’s a graph traversal rather than simple numbers). But as a result the system clock doesn’t matter. No need to worry about atomic clocks, skew, drift, monotonicity, and all of that junk. It massively simplifies your system design.
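
A minimal sketch of the scheme described above, under some assumptions: each event carries the ids of the events its author had already seen (its parents), and ties between concurrent events are broken by comparing ids. The `Event` class and function names are hypothetical, not taken from any particular database.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    id: str
    parents: tuple = ()  # ids of events the author had observed


def ancestors(events, eid):
    """All events reachable by walking parent links from eid."""
    seen, stack = set(), list(events[eid].parents)
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(events[p].parents)
    return seen


def happened_before(events, a, b):
    return a in ancestors(events, b)


def concurrent(events, a, b):
    return (a != b and not happened_before(events, a, b)
            and not happened_before(events, b, a))


def total_order(events):
    """Topological sort that breaks ties by id, so every peer agrees."""
    pending = {eid: len(set(e.parents)) for eid, e in events.items()}
    children = {eid: [] for eid in events}
    for eid, e in events.items():
        for p in set(e.parents):
            children[p].append(eid)
    ready = [eid for eid, n in pending.items() if n == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        eid = heapq.heappop(ready)
        order.append(eid)
        for c in children[eid]:
            pending[c] -= 1
            if pending[c] == 0:
                heapq.heappush(ready, c)
    return order


events = {e.id: e for e in [
    Event("a1"), Event("b1"),
    Event("a2", ("a1", "b1")),  # node A had seen b1 before writing a2
    Event("b2", ("b1",)),       # node B had not seen anything from A
]}
print(happened_before(events, "b1", "a2"))  # True
print(concurrent(events, "a2", "b2"))       # True
print(total_order(events))                  # ['a1', 'b1', 'a2', 'b2']
```

No wall clock appears anywhere. The cost is the graph traversal in `ancestors`, which a real system would likely cache or replace with version vectors.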

NelsonMinar • yesterday at 10:17 PM

On the flipside, clock sync for civilians has never been easier. Thanks to NTP any device with an Internet connection can pretty easily get time accurate to 1 second, often as little as 10 ms. All major consumer computers are preconfigured to sync time to one of several reliable NTP pools.

This post is about more complicated synchronization for more demanding applications. And it's very good. I'm just marveling at how in my lifetime I went from "no clock is ever set right" to assuming almost anything is within a second of true time.

simonebrunozzi • today at 7:30 AM

Back when I was studying computer science, I was taking the OS exam and the part about Lamport timestamps [0] was optional, but I had studied it because I loved it. When I mentioned it to my professor, he was so happy to hear something new that day that he asked me to describe it in detail. This was the year 2001.

Many years later, in 2020, I ended up living in San Francisco, and I had the fortune to meet Leslie Lamport after I sent him a cold email. Lovely and smart guy. This is the text of the first part of that email, just for your curiosity:

Hey Leslie!

You have accompanied me for more than 20 years. I first met your name when studying Lamport timestamps.

And then on, and on, and on, up to a few minutes ago, when I realized that you are also behind the paper and the title of the "Byzantine Generals problem", renamed from the "Albanian" generals at the suggestion of Jack Goldberg. Who is he? [1]

[0]: https://en.wikipedia.org/wiki/Lamport_timestamp

[1]: Jack Goldberg (now retired) was a computer scientist and Lamport's manager at SRI.

j_seigh • yesterday at 8:21 PM

OK, so people use NTP to "synchronize" their clocks and then write applications that assume the clocks are in exact sync and can use timestamps for synchronization, even though NTP can see the clocks aren't always in sync. Do I have that right?

b112 • today at 6:23 AM

The article doesn't cover the inane stupid that is:

* NTP pool server usage requires using DNS

* people have DNSSEC set up, which requires accurate time or it fails

So if your clock is off, you cannot look up NTP pool servers via DNS, and therefore cannot set your clock.

This sheer stupid has been discussed with package maintainers of major distros, with ntpsec, and the result is a mere shrug. Often, the answer is "but doesn't your device have a battery-backed clock?", which is quite unhelpful. Many devices (routers, IoT devices, small boards, older machines, etc.) don't have a battery-backed clock, or alternatively the battery may just have died.

Beyond that, the ntpsec codebase has a horrible bug where if DNS is not available when ntpsec starts, pool server addresses are never, ever retried. So if you have a complete power-fail in a datacentre rack, and your firewalls take a little longer to boot than your machines, you'll have to manually restart ntpsec to even get it to ever sync.

When discussing this bug the ntpsec lads were confused that DNS might not exist at times.

Long story short, make sure you aren't using DNS in any capacity, in NTP configs, and most especially in ntpsec configs.

One good source is just using the IPs provided by NIST. Pool servers may seem fine, but I'd trust IPs assigned to NIST to exist longer than any DNS anyhow. E.g., for decades.

georgelyon • yesterday at 7:42 PM

Unfortunate that the author doesn’t bring up FoundationDB version stamps, which to me feel like the right solution to the problem. Essentially, you can write a value you can’t read until after the transaction is committed and the synchronization infrastructure guarantees that value ends up being monotonically increasing per transaction. They use similar “write only” operations for atomic operations like increment.

eatsome • today at 7:40 AM

For an article written about time, I would have thought there'd be a timestamp on the blog post. Just something to think about if someone stumbles upon this in a few years.

kobieps • yesterday at 7:38 PM

Even just a single accurate clock is a nightmare... https://www.npr.org/2025/12/21/nx-s1-5651317/colorado-us-off...

pdeva1 • today at 2:27 AM

AWS has a precision clock equivalent to Google TrueTime available for public use[1], which makes this problem much easier to solve now. Aurora DSQL uses it. Even third-party DBs like YugabyteDB make use of it.

[1] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/set-time...

amiune • today at 7:55 AM

As a teacher I love the way Judah Levine explains

gdcohen • today at 2:33 AM

Take a look at Exploiting a Natural Network Effect for Scalable, Fine-grained Clock Synchronization - https://www.usenix.org/conference/nsdi18/presentation/geng (commercial version by the same authors - clockwork.io).

mgaunard • yesterday at 7:37 PM

Another protocol that's not mentioned is PPS and its variants, such as White Rabbit.

A regular pulse is emitted from a specialized high-precision device, possibly over a specialized high-precision network.

Enables picosecond accuracy (or at least sub-nanosecond).

maximinus_thrax • yesterday at 7:37 PM

I wouldn't say it's a 'nightmare'. It's just more complicated than how regular folk think computers handle time sync. There's nothing nightmarish or scary about this; it's just using the best solution for your scenario, understanding limitations and adjusting expectations/requirements accordingly, perhaps relaxing consistency requirements.

I worked on the NTP infra for a very large organization some time ago and the scariest thing I found was just how bad some of the clocks were on 'commodity hardware', but this just added a new parameter for triaging hardware for manufacturer replacement.

This is an ok article but it's just so very superficial. It goes too wide for such a deep subject matter.

Asmod4n • yesterday at 9:45 PM

Does wall clock time matter for anything but logging? For everything else one could just create any form of "time" to keep stuff in sync, no?

a_t48 • yesterday at 9:08 PM

Clock sync is such a nightmare in robotics. Most OSes will happily slew/jump to get the time correct. Time jumps (especially backwards) will crash most robotics stacks. You might decide to ensure that you have synced time before starting the stack. Great, now your timestamps are mostly accurate, except what happens when you've used GPS as your time source, and you start indoors? Robot hangs forever.

Hot take: I've seen this and enough other badly configured time sync settings that I want to ban system time from robotics systems - time from startup only! If you want to know what the real-world time was for a piece of data afterwards, record what your epoch is once you have a time sync, and compute epoch + time since startup.
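
A minimal sketch of the "time from startup only" idea above. The `UptimeClock` class and its method names are hypothetical. It assumes Python's `time.monotonic()` as the startup-relative source, which is documented never to go backwards, and that some external code calls `record_sync` once a trusted wall-clock time is available.

```python
import time


class UptimeClock:
    """Stamps data with seconds since startup; wall time is derived later."""

    def __init__(self, monotonic=time.monotonic):
        self._monotonic = monotonic
        self._start = monotonic()
        self.epoch_at_start = None  # wall-clock time at uptime 0, unknown until synced

    def now(self):
        """Seconds since this clock was created. Never jumps backwards."""
        return self._monotonic() - self._start

    def record_sync(self, wall_time):
        """Call once a trusted source (GPS, NTP, ...) reports wall_time."""
        self.epoch_at_start = wall_time - self.now()

    def to_wall(self, uptime_stamp):
        """Convert an old stamp to wall time, or None if never synced."""
        if self.epoch_at_start is None:
            return None
        return self.epoch_at_start + uptime_stamp


clock = UptimeClock()
stamp = clock.now()              # usable immediately, even indoors with no GPS
print(clock.to_wall(stamp))      # None: no sync yet, but nothing hangs
clock.record_sync(time.time())   # stand-in for a real time source
print(clock.to_wall(stamp))      # wall-clock time for the earlier stamp
```

The design choice is that the stack never blocks on time sync: data recorded before the sync is converted after the fact.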

didgetmaster • today at 2:03 AM

Reminds me of the old saying: 'If you have just one watch/clock, then you always know what time it is; but if you have two of them, then you are never sure!'

koudelka • yesterday at 6:38 PM

the Huygens algorithm is also worth a look

https://www.usenix.org/system/files/conference/nsdi18/nsdi18...

emptybits • yesterday at 7:06 PM

Normally I would nod at the title. Having lived it.

But I just watched/listened to a Richard Feynman talk on the nature of time and clocks and the futility of "synchronizing" clocks. So I'm chuckling a bit. In the general sense, I mean. Yes yes, for practical purposes in the same reference frame on earth, it's difficult but there's hope. Now, in general ... synchronizing two clocks is ... meaningless?

https://www.youtube.com/watch?v=zUHtlXA1f-w

hinkley • yesterday at 7:51 PM

Vector clocks are one of the other things Barbara Liskov is known for.

user3939382 • today at 1:33 PM

That’s because neither discrete time nor synchronous network comms exist.

shomp • yesterday at 10:33 PM

Absolute synchronization impossible?? Challenge accepted.

sreekanth850 • today at 6:08 AM

In physics, time is local and relative; independent events don’t need a global ordering. Distributed databases shouldn’t require one either. The idea of a single global time comes from 1980s single-node database semantics, where serializability implied one universal execution order. When that model was lifted into distributed systems, researchers introduced global clocks and timestamp coordination to preserve those guarantees, not because distributed systems fundamentally need them. It’s time we rethink this. Only operations that touch the same piece of data require ordering. Everything else should follow causality, like the physical universe: independent events don’t need to agree on sequence, only dependent ones do. Global clocks exist because some databases forced serializable cross-object transactions onto distributed systems, not because nature requires it. Edit: I welcome discussion with people who disagree and downvote.

forrestthewoods • yesterday at 8:40 PM

Timesync isn’t a nightmare at all. But it is a deep rabbit hole.

The best approach, imho, is to abandon the concept of a global time. All timestamps are wrt a specific clock. That clock will skew at a rate that varies with time. You can, hopefully, rely on any particular clock being monotonic!

My mental model is that you form a connected graph of clocks and this allows you to convert arbitrary timestamps from any clock to any clock. This is a lossy conversion that has jitter and can change with time. The fewer stops the better.
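
A rough sketch of that mental model, with heavy assumptions: each link between two clocks is modelled as a fixed affine map (`dst = scale * src + offset`), and conversion follows the path with the fewest hops. The `ClockGraph` class and the clock names are hypothetical. A real system would re-estimate the links continuously and carry an uncertainty along each hop, which this ignores.

```python
from collections import deque


class ClockGraph:
    def __init__(self):
        self._edges = {}  # src -> {dst: (scale, offset)}

    def link(self, src, dst, scale, offset):
        """Record dst_time = scale * src_time + offset, plus the inverse."""
        self._edges.setdefault(src, {})[dst] = (scale, offset)
        self._edges.setdefault(dst, {})[src] = (1.0 / scale, -offset / scale)

    def convert(self, t, src, dst):
        """Convert timestamp t from src's clock to dst's, fewest hops first."""
        queue = deque([(src, t)])
        visited = {src}
        while queue:
            node, value = queue.popleft()
            if node == dst:
                return value
            for nxt, (scale, offset) in self._edges.get(node, {}).items():
                if nxt not in visited:
                    visited.add(nxt)
                    queue.append((nxt, scale * value + offset))
        raise KeyError(f"no path from {src!r} to {dst!r}")


g = ClockGraph()
g.link("imu", "cpu", scale=1.0, offset=5.0)     # made-up calibration values
g.link("cpu", "gps", scale=1.0, offset=1000.0)
print(g.convert(10.0, "imu", "gps"))    # 1015.0, via the cpu clock
print(g.convert(1015.0, "gps", "imu"))  # 10.0
```

Breadth-first search matches the "fewer stops the better" rule: every extra hop would add its own jitter in practice.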

I kinda don’t like PTP. Too complicated and requires specialized hardware.

This article only touches on one class of timesync. An entirely separate class is timesync within a device. Your phone is a highly distributed compute system with many chips, each of which has its own independent clock source. It’s a pain in the ass.

You also have local timesync across devices such as wearables or robotics. Connecting to a PTP system with GPS and atomic clocks is not ideal (or necessary).

TicSync is cool and useful. https://sci-hub.se/10.1109/icra.2011.5980112

yapyap • yesterday at 7:25 PM

Love learning new things. This also explains why my casio clock sync starts skewing over time

jeffbee • yesterday at 6:50 PM

PTP requires support not only on your network, but also on your peripheral bus and inside your CPU. It can't achieve better-than-NTP results without disabling PCI power saving features and deep CPU sleep states.

hinkley • yesterday at 8:12 PM

> Google faced the clock synchronization problem at an unprecedented scale with Spanner, its globally distributed database. They needed strong consistency guarantees across data centers spanning continents, which requires knowing the order of transactions.

> Here’s a video of me explaining this.

Do you need a video? Do we need a 42 minute video to explain this?

I generally agree with Feynman on this stuff. We let explanations be far more complex than they need to be for most things, and it makes the hunt for accidental complexity harder because everything looks almost as complex as the problems that need more study to divine what is actually going on there.

For Spanner to be useful they needed a high transaction rate, and in a distributed system that requires very tight grace periods for First Writer Wins. Tighter than you can achieve with NTP or system clocks. That’s it. That’s why they invented a new clock.

Google puts it this way:

> Under external consistency, the system behaves as if all transactions run sequentially, even though Spanner actually runs them across multiple servers (and possibly in multiple datacenters) for higher performance and availability.

But that’s a bit thick for people who don’t spend weeks or years thinking about distributed systems.
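
To make the "grace period" point concrete, here is a toy sketch of a TrueTime-style commit wait. It is not Spanner's code: the function names are hypothetical, the clock is just the local system clock, and `epsilon` is an assumed bound on how far that clock can be from true time. The idea is that a commit is held until its timestamp is guaranteed to be in the past on every correct clock, so the wait is about twice the clock uncertainty.

```python
import time


def now_interval(epsilon, clock=time.time):
    """Return (earliest, latest) bounds on true time, given +/- epsilon error."""
    t = clock()
    return t - epsilon, t + epsilon


def commit_with_wait(epsilon, clock=time.time, sleep=time.sleep):
    # Pick a timestamp no earlier than any possible current true time.
    commit_ts = now_interval(epsilon, clock)[1]
    # Hold the commit until commit_ts has definitely passed everywhere.
    while now_interval(epsilon, clock)[0] <= commit_ts:
        sleep(epsilon / 4)
    return commit_ts


started = time.time()
commit_with_wait(epsilon=0.004)     # 4 ms of assumed uncertainty
waited = time.time() - started      # at least roughly 2 * epsilon
```

With NTP-grade uncertainty in the tens of milliseconds, every commit would stall for tens of milliseconds, which is why a tighter clock translates directly into a higher transaction rate.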