https://gist.github.com/daxfohl/9bedd7cad47591dd66271bb0ce2e...
It simulates storing a tree in a database, where users can move a node from one parent to another, and those requests can happen in parallel, with the requirement that the tree is never more than N deep, and does not contain cycles.
The algorithm maintains a "height" field on each node in the tree (so leaf nodes are all height=0, other nodes have height=max(child height) + 1, and the root node always has the height of the full tree).
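For concreteness, here's a tiny Python sketch of that invariant; the dict-based representation and names are mine, not from the gist:

```python
# Leaves have height 0; internal nodes have max(child height) + 1,
# so the root's height is the height of the whole tree.
def compute_height(children, node_id):
    kids = children.get(node_id, [])
    return 0 if not kids else 1 + max(compute_height(children, k) for k in kids)

children = {"root": ["a", "b"], "a": ["c"]}   # root -> {a, b}, a -> {c}
assert compute_height(children, "c") == 0     # leaf
assert compute_height(children, "a") == 1
assert compute_height(children, "root") == 2  # height of the full tree
```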
The no-cycles and max-height invariants must be maintained always, but the "height" field itself can be eventually consistent in the following way.
When moving a node, it first queries the new parent's ancestor chain to check whether the move would exceed the max-height bound. It also does a cycle check, and rejects the move if either check fails. Then it performs an atomic update on all the height fields that change in the new parent's ancestor chain: newParent.height = max(newParent.height, movedNode.height+1), propagating up the ancestor chain until an ancestor's height is unchanged. This update uses optimistic locking via version checks/CAS (the read is one transaction and the write is a separate one), and if any of those nodes has changed between the read and the write, the write transaction is aborted and the request is rejected. (In production we retry 3x, but that would've been unnecessary complexity in the model.) (Note: every time I say "lock" below, I mean optimistic version-CAS concurrency control; there are no hard locks anywhere.)
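To make the read-then-CAS shape concrete, here's a rough in-memory Python sketch. The names (try_move, ancestor_chain), the dict-based store, and MAX_HEIGHT = 3 are all mine, not from the gist, and the two "transactions" are just the two phases of one function:

```python
MAX_HEIGHT = 3  # illustrative bound, not from the gist

def ancestor_chain(db, node_id):
    # node_id first, then its ancestors up to the root
    chain, cur = [], node_id
    while cur is not None:
        chain.append(cur)
        cur = db[cur]["parent"]
    return chain

def try_move(db, node_id, new_parent_id):
    chain = ancestor_chain(db, new_parent_id)
    if node_id in chain:
        return False  # cycle check failed
    # "read transaction": propose new heights up the chain, snapshot versions
    new_heights, snapshot, h = {}, {}, db[node_id]["height"]
    for anc in chain:
        snapshot[anc] = db[anc]["version"]
        proposed = max(db[anc]["height"], h + 1)
        if proposed == db[anc]["height"]:
            break  # first unchanged ancestor: version-checked but not written
        new_heights[anc] = proposed
        h = proposed
    if h > MAX_HEIGHT:
        return False  # max-height check failed
    # "write transaction": abort if any checked node changed since the read
    if any(db[a]["version"] != v for a, v in snapshot.items()):
        return False
    for anc, nh in new_heights.items():
        db[anc]["height"] = nh
        db[anc]["version"] += 1
    db[node_id]["parent"] = new_parent_id
    return True
```

In a real database each phase would be its own transaction, with the version comparison done as a conditional write; here the version snapshot just shows which records participate in the CAS.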
If the transaction succeeds, it enqueues a job on the old parent to update its own height as "max(children.height) + 1". In that job, if the old parent's height turns out to be unaffected by the move, it stops there. Otherwise it updates its height (optimistic lock again) and enqueues another job to update its own parent, and so on up the ancestry chain, until either an ancestor is unchanged or it reaches the root. These jobs also use optimistic locks (checking the node's version field), but unlike the logic for the new parent's updates, they retry until they succeed (eventually they will).
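A minimal sketch of that cleanup job, with a local queue standing in for the real job queue (names are illustrative, not from the gist):

```python
from collections import deque

def children_of(db, node_id):
    return [k for k, v in db.items() if v["parent"] == node_id]

def recompute_height(db, start_id):
    queue = deque([start_id])          # stands in for the job queue
    while queue:
        nid = queue.popleft()
        kids = children_of(db, nid)
        new_h = 1 + max(db[k]["height"] for k in kids) if kids else 0
        if new_h == db[nid]["height"]:
            continue                   # unaffected by the move; stop here
        db[nid]["height"] = new_h      # in production: optimistic CAS,
        db[nid]["version"] += 1        # retried until it succeeds
        if db[nid]["parent"] is not None:
            queue.append(db[nid]["parent"])  # propagate up the chain

# "a" just lost its only child, so the decrease propagates to the root:
db = {
    "root": {"parent": None, "height": 2, "version": 0},
    "a":    {"parent": "root", "height": 1, "version": 0},
}
recompute_height(db, "a")
assert db["a"]["height"] == 0      # leaf again
assert db["root"]["height"] == 1   # decrease propagated up
```

Because heights on this path only ever decrease, running it late (or twice) can't violate the max-height invariant, which is what makes eventual consistency safe here.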
The key to the algorithm is that height increases are all done transactionally up front: only the new parent's ancestor chain can ever increase in height; the old parent's height can only decrease. That ensures the max-height invariant is never broken. The old parent is always a decrease, so it's fine to be eventually consistent. It was also important for us to allow eventual consistency here because a node could have a ton of children, so computing "max(children.height)+1" for the whole old parent's ancestor chain was too slow to do atomically. Yes, it could lead to some move requests getting rejected unnecessarily because the requested new parent's height was stale, but that was deemed okay (and should be very rare in the real world), whereas too-high trees were absolutely forbidden and required atomic checking (and fortunately height increases, when done transactionally, only require querying ancestors, so never more than maxHeight records).
Line 74 is probably the most interesting one. In the first step, where we calculate the new heights of the new parent and its ancestors: first of all, we don't want to lock all the ancestors if we can avoid it, because that set includes the root, so locking all of them would effectively lock the whole tree. Thus we only lock the ancestors whose heights will change. But it turns out that's insufficient. You have to lock the ancestors whose heights will change, plus one more: the first one that won't change, to assert its invariance. Otherwise a parallel background resync could lower it and create a conflict. That's what line 74 does: it adds that node to the concurrency check. Change {parent} to {} and TLC will find a failure case.
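In code terms, the lock-set rule that line 74 models looks roughly like this (function name and inputs are hypothetical; "chain" is the new parent followed by its ancestors toward the root):

```python
# Version-check every ancestor whose height will change, PLUS the first
# one that won't, so a concurrent decrease of that ancestor aborts the CAS.
def lock_set(heights, chain, moved_height):
    locks, h = [], moved_height
    for anc in chain:
        locks.append(anc)
        proposed = max(heights[anc], h + 1)
        if proposed == heights[anc]:
            break  # first unchanged ancestor: lock it and stop
        h = proposed
    return locks

heights = {"p": 1, "g": 3, "root": 4}
# Moving a height-1 node under p: p becomes 2, g stays 3.
# Lock p (changed) and g (the "one more"), but not root.
assert lock_set(heights, ["p", "g", "root"], 1) == ["p", "g"]
```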
AFAIK there's no need to lock/check higher ancestors. I can't think of a race condition that would affect them without also affecting the one in line 74, which would still abort the request. But it'd be nice to have some stronger evidence. Currently the simulation only scales up to ~6 nodes and 6 moves before simulation time explodes from seconds to years. That's enough to catch the line 74 requirement, but 6 moves is pretty short and I've had to work out a few cases by hand that required more than 6 moves.
The main perf improvement I did was the symmetry declaration on the node ids. I also tried hand-rolling some symmetry, where instead of using TLC's symmetry sets I just had the "move" function select from the first "moveCount" nodes, since it doesn't matter which node you move first, so it may as well be node[0], and the second move could either be node[0] or node[1], and so on (see revision 3 of the gist). Both of these had about the same effect on performance.
Anyway feel free to look. If you have ideas how to speed it up, LMK.