Hacker News

hgs3 12/09/2024 3 replies

I'm looking at the JSON5 spec and it appears it does not introduce a capital \U escape sequence for Unicode characters outside the Basic Multilingual Plane (BMP). It's not brought up often, but in JSON you do need UTF-16 surrogates to write an escape sequence for Unicode characters outside the BMP. Consider the Hamburger Emoji (U+1F354). Instead of escaping it as "\U0001F354", you need to escape it with UTF-16 surrogates "\uD83C\uDF54". This is both cumbersome for humans and not in accordance with the Unicode Standard [1]. It's ironic, but many (most?) of the "JSON for Humans" flavors of JSON tend to overlook this.

[1] See Section 3.8, "Surrogates", in Chapter 3 of the Unicode Standard.
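
To make the surrogate arithmetic concrete, here is a minimal Python sketch of how a code point outside the BMP turns into the two `\u` escapes JSON requires. The helper names `to_surrogate_pair` and `json_escape` are made up for illustration; only `json.dumps` is a real library call.

```python
import json
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def json_escape(code_point: int) -> str:
    """Hypothetical helper: spell one code point as JSON \\uXXXX escape(s)."""
    if code_point <= 0xFFFF:
        return f"\\u{code_point:04X}"
    high, low = to_surrogate_pair(code_point)
    return f"\\u{high:04X}\\u{low:04X}"


hamburger = "\U0001F354"                   # Python does have a capital \U escape
print(json_escape(ord(hamburger)))         # \uD83C\uDF54
print(json.dumps(hamburger))               # "\ud83c\udf54" (ensure_ascii default)
```

Python's own `json.dumps` produces the same pair (in lowercase hex) when `ensure_ascii` is left at its default, so the manual arithmetic is only needed when writing escapes by hand.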


Replies

maxloh 12/09/2024

When you export Instagram data as JSON, the resulting JSON files include encoded strings like "\uD83C\uDF54".

Parsing and converting these strings can be cumbersome because a single Unicode character is often represented by a single escape sequence, but sometimes it requires two.
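
A rough sketch of the "one escape or two" problem, assuming you are decoding `\uXXXX` escapes by hand rather than with a JSON parser. The function name `decode_u_escapes` is hypothetical, and the sketch deliberately ignores escaped backslashes (`\\u`) and passes lone surrogates through unchanged.

```python
import re

# Try a high+low surrogate pair first, then fall back to a single escape.
_PAIR_OR_SINGLE = re.compile(
    r"\\u([dD][89abAB][0-9a-fA-F]{2})\\u([dD][c-fC-F][0-9a-fA-F]{2})"
    r"|\\u([0-9a-fA-F]{4})"
)


def _replace(match: "re.Match[str]") -> str:
    high, low, single = match.groups()
    if single is not None:
        # One escape maps directly to one BMP code point.
        return chr(int(single, 16))
    # Two escapes combine into one code point above U+FFFF.
    offset = ((int(high, 16) - 0xD800) << 10) + (int(low, 16) - 0xDC00)
    return chr(0x10000 + offset)


def decode_u_escapes(text: str) -> str:
    """Hypothetical helper: decode \\uXXXX escapes, merging surrogate pairs."""
    return _PAIR_OR_SINGLE.sub(_replace, text)


print(decode_u_escapes(r"\u0041 \uD83C\uDF54"))  # A followed by the hamburger emoji
```

In practice a conforming JSON parser such as Python's `json.loads` already merges the pair into a single character, so hand-rolled decoding like this is mostly needed when the escapes show up outside a JSON context.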

Dylan16807 12/09/2024

How often are humans going to be using Unicode escape sequences?

fanf2 12/09/2024

There’s no \U in JavaScript: it is spelled \u{10ffff}