agus4nas • today at 3:26 PM • 2 replies

Great write-up. Do most modern languages handle invalid surrogates gracefully, or is it still a "good luck" situation depending on the runtime?

Replies

amluto • today at 3:38 PM

Modern string libraries largely use UTF-8 [0], and surrogates, regardless of whether they’re paired, are invalid in UTF-8. So, in a modern string library, as built in to most modern languages, you will not encounter surrogates except when translating between encodings.

[0] But everyone disagrees as to what indexing a string means, so you need to make an actual choice if you want anything involving indexing to match across languages.
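As a rough illustration of both points (this sketch is not from the comment above, and the variable names are made up for the example): JavaScript strings are indexed by UTF-16 code unit, so the same emoji has three different "lengths" depending on what you count. A lone surrogate also cannot be converted to UTF-8 as-is, so `TextEncoder` substitutes U+FFFD.

```javascript
// One emoji, U+1F920, written as its UTF-16 surrogate pair.
const cowboy = "\uD83E\uDD20";

// Three plausible answers to "how long is this string?"
const utf16Units = cowboy.length;                          // UTF-16 code units
const codePoints = [...cowboy].length;                     // Unicode code points
const utf8Bytes = new TextEncoder().encode(cowboy).length; // UTF-8 bytes

console.log(utf16Units, codePoints, utf8Bytes); // 2 1 4

// A lone surrogate has no UTF-8 encoding, so TextEncoder emits
// the bytes for U+FFFD (EF BF BD) in its place.
const loneBytes = Array.from(new TextEncoder().encode("\uDD20"));
console.log(loneBytes); // [ 239, 191, 189 ]
```

Other languages count by bytes, code points, or grapheme clusters by default, which is why an index computed in one runtime may not line up in another.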


georgemandis • today at 3:39 PM

The language handled it fine. It will generally just show replacement characters (�) for combos that don't map to anything.

It was really `encodeURIComponent` that didn't handle it gracefully.

If you just type this into the console (surrogate pair for cowboy smiley face emoji), you see it encodes it ("%F0%9F%A4%A0"):

encodeURIComponent("\uD83E\uDD20")

If you give it an invalid surrogate pair (here the two halves are swapped, leaving two lone surrogates), it throws a `URIError`:

encodeURIComponent("\uDD20\uD83E")
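If you want to avoid the exception, one option is to catch the `URIError` and replace the unpaired surrogates before encoding. The following is only a minimal sketch: `safeEncodeURIComponent` is a hypothetical helper name, and swapping in U+FFFD is an assumption about what you want to happen to bad input (you might prefer to reject it instead).

```javascript
// Hypothetical helper, not a built-in: encodes like encodeURIComponent,
// but swaps lone surrogates for U+FFFD instead of throwing.
function safeEncodeURIComponent(input) {
  try {
    return encodeURIComponent(input);
  } catch (err) {
    // Only handle the malformed-string case; rethrow anything else.
    if (!(err instanceof URIError)) throw err;

    // Match a high surrogate not followed by a low one,
    // or a low surrogate not preceded by a high one.
    const loneSurrogate =
      /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g;

    return encodeURIComponent(input.replace(loneSurrogate, "\uFFFD"));
  }
}

console.log(safeEncodeURIComponent("\uD83E\uDD20")); // %F0%9F%A4%A0
console.log(safeEncodeURIComponent("\uDD20\uD83E")); // %EF%BF%BD%EF%BF%BD
```

Newer runtimes also provide `String.prototype.isWellFormed()` and `toWellFormed()`, which cover the same check and repair without a regex, if your target environments support them.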
