To implement grapheme cluster segmentation, you have to start from a sequence of Unicode scalars. In practice, that means a sequence of 32-bit integers, which is UTF-32 in all but name. It's not a good interchange format, but it is a necessary intermediate/internal format.
There's also the problem that grapheme cluster boundaries change from one Unicode version to the next, since the rules in UAX #29 keep being revised. Unicode has become a true mess.
Yeah, you need some kind of sequence of Unicode scalars. But there's no reason for that sequence to be "a contiguous chunk of memory filled with 32-bit ints" (i.e., a UTF-32 string); it can just as well be an iterator that decodes an in-memory UTF-8 string and yields scalar values on the fly.
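As a minimal sketch of that idea in Rust, where `&str` is guaranteed well-formed UTF-8 and `str::char_indices()` is exactly such an iterator, yielding Unicode scalars (`char`) without ever building a UTF-32 buffer (the sample string is just for illustration):

```rust
fn main() {
    // "e" + U+0301 (combining acute): one grapheme cluster, two scalars.
    // Then a ZWJ emoji sequence (woman + ZWJ + laptop): one cluster, three scalars.
    let s = "e\u{0301} \u{1F469}\u{200D}\u{1F4BB}";

    // char_indices() decodes the UTF-8 bytes lazily and yields
    // (byte offset, Unicode scalar value) pairs; no 32-bit array is allocated.
    for (byte_offset, scalar) in s.char_indices() {
        println!("byte {:>2}: U+{:04X}", byte_offset, scalar as u32);
    }
}
```

A segmentation pass composes the same way: e.g. the `unicode-segmentation` crate's `graphemes()` method consumes such a scalar stream and yields `&str` slices of the original string, so the UTF-8 bytes remain the only backing storage.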