A CRDT library working at the UTF-16 code unit level? Ouch. Of course that’s going to go wrong; it was inevitable.
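To make the failure concrete, here’s a minimal JavaScript sketch (nothing here is from the library in question) of what code-unit indexing does to a string containing an astral-plane character:

```javascript
// "💩" (U+1F4A9) is one code point but two UTF-16 code units.
const s = "a💩b";

console.log(s.length); // 4 — .length counts code units, not characters

// An edit position chosen by code-unit index can land between the two
// surrogates, leaving an unpaired (invalid) surrogate behind:
const broken = s.slice(0, 2);
console.log(broken.charCodeAt(1).toString(16)); // "d83d" — a lone high surrogate
```

Two replicas independently picking code-unit offsets like this is exactly how a merge produces strings that are no longer valid Unicode.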
As for using extended grapheme clusters, it sounds a little iffy: maybe possible to use correctly, maybe not, because cluster boundaries aren’t stable over time (they can shift between Unicode versions). That sort of instability has produced some fascinating bugs, like the index corruption in PostgreSQL a few years ago due to collation changes.
Unicode scalar values are technically safe: you can’t introduce invalid Unicode by splitting on scalar boundaries. But you can definitely still end up with nonsense.
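A sketch of that “valid but nonsense” failure mode: spreading a JavaScript string iterates by scalar value, so no surrogate is ever split, yet a cut inside a ZWJ emoji sequence still shreds the user-visible character.

```javascript
// One user-visible "family" glyph, built from five scalar values:
// MAN (U+1F468), ZWJ (U+200D), WOMAN (U+1F469), ZWJ, GIRL (U+1F467).
const family = "👨‍👩‍👧";

// Spread iterates by scalar value (code point), never splitting surrogates…
const scalars = [...family];
console.log(scalars.length); // 5

// …so every cut between scalars is well-formed Unicode, but an edit landing
// mid-sequence still breaks the family apart into separate glyphs:
const left = scalars.slice(0, 2).join(""); // MAN + dangling ZWJ — valid, wrong
```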
> We made emoji an atomic node type.
That avoids the problem for emoji, but leaves the underlying hazard untouched. I imagine it could still occur with other multi-scalar text, probably CJK. But probably only in theory.
> This splits by grapheme clusters rather than code units. No orphaned surrogates, no split emoji. It's what .slice() should have been doing all along, but of course UTF-16 predates emoji by decades.
I do not agree that slice() should operate on extended grapheme clusters. Don’t lump the grapheme cluster/scalar value split in with the sins of UTF-16 and its unreliable code point/code unit split.
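For reference, the grapheme-cluster slice the quoted comment describes is straightforward to sketch in modern JavaScript with Intl.Segmenter (graphemeSlice is a made-up helper name, not their API):

```javascript
// Segment into extended grapheme clusters per UAX #29. Intl.Segmenter is
// available in Node 16+ and all current browsers.
const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

function graphemeSlice(str, start, end) {
  const clusters = [...segmenter.segment(str)].map(c => c.segment);
  return clusters.slice(start, end).join("");
}

console.log(graphemeSlice("a👨‍👩‍👧b", 0, 2)); // "a👨‍👩‍👧" — ZWJ sequence stays whole
console.log("a👨‍👩‍👧b".slice(0, 2));          // "a" plus a lone surrogate
```

Note that segmenting is O(n) per call and depends on the Unicode version the engine ships, which seems like a decent argument for keeping this a library function rather than the default semantics of .slice().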
UTF-16 was an unforced error (and I still can’t work out why it wasn’t obvious from the start that UCS-2 would never be enough). But the concept of multiple scalars combining into one logical unit was always inevitable.