logoalt Hacker News

amlutotoday at 3:38 PM1 replyview on HN

Modern string libraries largely use UTF-8 [0], and surrogates, regardless of whether they’re paired, are invalid in UTF-8. So, in a modern string library, as built in to most modern languages, you will not encounter surrogates except when translating between encodings.

[0] But everyone disagrees as to what indexing a string means, so you need to make an actual choice if you want anything involving indexing to match across languages.


Replies

chuckadamstoday at 4:05 PM

> surrogates, regardless of whether they’re paired, are invalid in UTF-8

Java did not get the memo. Since the char type is fixed at 16 bits, it uses surrogates to encode everything outside the BMP, regardless of the encoding.