The times I've used TLA+ in anger have been generally related to lower-level "communicatin...

tonyarkles • yesterday at 5:58 PM • 0 replies • view on HN

The times I've used TLA+ in anger have been generally related to lower-level "communicating state machines". One was a LoRaWAN handshake with the possibility of lost radio packets, another was a slightly-higher-level Bluetooth message exchange between a phone and a device.

In the first situation I wanted to ensure that the state machine I'd put together couldn't ever deadlock. There's a fairness constraint you can turn in on TLA+ that basically says "at some point in every timeline it will choose every path instead of just looping on a specific failure path". The model I put together ensured that, assuming at some point there was a successful transmission and reception of packets, the state machines on both sides could handle arbitrary packet loss without getting into a bad state indefinitely.

On the Bluetooth side it was the converse: a vendor had built the protocol and I believed that it was possible to undetectably lose messages based on their protocol. The model I put together was quite simple and I used it to give me the sequence of events that would cause a message to be lost; I then wrapped that sequence of events up in an email to the vendor to explain a minimal test case to prove that the protocol was broken.

For your specific example, I'd suggest you think about invariants instead of actions. I'd probably split your example into two distinct models though: one for the connection retry behaviour and one for the queued edits.

The invariant for the first model would be pretty straightforward: starting from the disconnected state, ensure that all (fair) paths eventually lead to the connected state. The fairness part is necessary because without it you can end up in an obvious loop of AttemptConnect -> Fail -> AttemptConnect -> Fail... that never ends up connected. A riff on that that gets a little more interesting was a model I put together for an iOS app that was talking to an Elixir Phoenix backend over Phoenix Channels. Not only did it need to end up with a connected websocket but it also needed to successfully join a channel. There were a few more steps to it and the model ended up being somewhat interesting.

The second model in your example is the edits model. What are the invariants here? We completely ignore the websocket part of it and just model it as two (or three!) state machines with a queue in between them. The websocket model handles the reconnect logic. The key invariant here is that edits aren't lost. In the two-state-machine model you're modelling a client that's providing edits and a server that's maintaining the overall state. An obvious invariant is that the server's state eventually reflects the client's queued edits.

The three-state-machine model is where you're going to get more value out of it. Now you've got TWO clients making edits, potentially on stale copies of the document. How do you handle conflicts? There's going to be a whole sequence of events like "client 1 receives document v123", "client 2 receives document v123", "client 2 makes an edit to v123 and sends to server", "server accepts edit and returns v124", "client 1 makes an edit to v123", "server does...?!"

I haven't reviewed this model at all but a quick-ish google search found it. This is likely somewhat close to how Google Docs works under the hood: https://github.com/JYwellin/CRDT-TLA

alt Hacker News