In summary, Unicode code points (characters) are 32 bit. JavaScript manipulates Unicode in UTF-16 for historical reasons: at some point before Unicode, 16 bit was deemed enough (UCS-2). UTF-16 run length encodes 32-bit Unicode codepoints into one or two 16-bit code units. Splitting in the middle of a codepoint produces one invalid half string and one semantically different half string.
Emoji are sequences of Unicode codepoints that produce a single grapheme. Splitting in the middle of a grapheme produces two valid strings, but with some funky half-baked emoji. So for a text editor it makes sense to split only at grapheme boundaries.
> Unicode code points are 32 bit
21-bit, actually. It was supposed to be 32-bit, but UTF-16 caps out at 21-bit, so they lopped eleven bits of potential from Unicode (and UTF-8, so no more six-byte encoding).
> at some point before Unicode
No, in the early days of Unicode.
> run length encodes
Um… what? RLE is a data compression thing; UTF-16 has nothing to do with it. UTF-16 is a variable-width encoding: one or two code units per codepoint, with no run compression involved.