Hacker News

chasil yesterday at 6:10 PM

Windows is also a rare bird in UTF-16.

"UTF-16 is used by the Windows API, and by many programming environments such as Java and Qt. The variable-length character of UTF-16, combined with the fact that most characters are not variable-length (so variable length is rarely tested), has led to many bugs in software, including in Windows itself.

"UTF-16 is the only encoding (still) allowed on the web that is incompatible with 8-bit ASCII. It has never gained popularity on the web, where it is declared by under 0.004% of public web pages (and even then, the web pages are most likely also using UTF-8). UTF-8, by comparison, gained dominance years ago and accounted for 99% of all web pages by 2025."

https://en.wikipedia.org/wiki/UTF-16


Replies

electroly today at 12:41 AM

UTF-16 is the internal format of the ICU library (International Components for Unicode, the support library from the Unicode standards people) which is a common way to add "full fat" Unicode support to a programming language. This has knock-on effects everywhere. If you're using ICU, you either use UTF-16, too, or you constantly convert back and forth every time you interact with ICU. You're often best off using UTF-16 in memory and only converting to UTF-8 when you write files or transmit over the network.

0x1d7 yesterday at 6:56 PM

NT shipped with UCS-2, as UTF-8 (and UTF-16) did not yet exist. UCS-2 naturally translated to UTF-16, hence the choice. NT/Win32 is also designed for fixed-width code units, something UTF-8 doesn't support.

You can use UTF-8 on a per-application basis, within limits.

https://learn.microsoft.com/en-us/windows/apps/design/global...

Conversely, UEFI is UTF-16 only, thanks to Windows.

UTF-8 only would be an ABI breaking change, so that's not going to happen. We don't want the NT kernel to end up like Linux, after all :-)

zahllos yesterday at 6:20 PM

Additional detail: it is specifically UTF-16 little-endian when a byte order mark is not used, which is the opposite of the big-endian default recommended in the RFC (RFC 2781).

Worse are the byte order marks that end up in files in order to support both byte orders.
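
To make the byte-order point concrete, here is a minimal Python sketch using only the standard `codecs` module. The helper name `decode_utf16` and its `default` parameter are made up for illustration; it is one plausible way to handle the "BOM or no BOM" question, not how Windows or any particular library does it.

```python
import codecs

text = "hi"

# Explicit byte orders: no byte order mark is written.
le = text.encode("utf-16-le")   # bytes: 68 00 69 00
be = text.encode("utf-16-be")   # bytes: 00 68 00 69

# The BOM is U+FEFF serialized in the chosen byte order.
bom_le = codecs.BOM_UTF16_LE    # bytes: FF FE
bom_be = codecs.BOM_UTF16_BE    # bytes: FE FF


def decode_utf16(data: bytes, default: str = "utf-16-be") -> str:
    """Decode UTF-16 bytes, honouring a BOM if one is present.

    `default` is the byte order assumed when there is no BOM
    (hypothetical parameter). Big-endian is the default here;
    a Windows-style caller would pass "utf-16-le" instead.
    """
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    return data.decode(default)
```

Without a BOM, the same bytes decode to different text depending on which default the reader picks, which is exactly why the mismatch between the two conventions hurts.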

Dwedit yesterday at 6:16 PM

UTF-16 is also used by C#, Java, and JavaScript. Since JavaScript is so widely adopted, I wouldn't call it a rare bird. It's not necessarily used when reading or writing files, but it's what's used internally for the strings. As a result, your strings use UTF-16 surrogate pairs to represent characters outside of the Basic Multilingual Plane (such as emoji).
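
A rough sketch of what a surrogate pair looks like, in Python rather than any of the languages above (Python strings count code points, so the UTF-16 view has to be computed explicitly). The helper names `utf16_units` and `to_surrogates` are invented for this example.

```python
def utf16_units(s: str) -> list[int]:
    """Return the UTF-16 code units of s as integers."""
    raw = s.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    v = cp - 0x10000                      # 20 bits remain
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


emoji = "\U0001F600"  # U+1F600 GRINNING FACE

print(len(emoji))                            # 1 code point
print([hex(u) for u in utf16_units(emoji)])  # ['0xd83d', '0xde00']
print(len(utf16_units(emoji)))               # 2 code units
```

The second number is what a length property based on UTF-16 code units would report for this string, which is where the "variable length is rarely tested" bugs tend to come from.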

wvenable yesterday at 6:55 PM

> Windows is also a rare bird in UTF-16.

Web browsers use UTF-16 internally. So Windows isn't even the largest "platform" that uses UTF-16.

bvanheu yesterday at 8:45 PM

> Windows is also a rare bird in UTF-16.

An interesting tidbit: some Windows kernel developer realized that most registry key names are ASCII anyway, so they could save up to 50% of the space simply by storing the name as ASCII. The flag is called "compressed name", and the stored bytes are padded with 0x00 when the name is read, to make a proper UTF-16 string.
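
The padding trick can be sketched in a few lines of Python. This only illustrates the idea; `compress_name` and `expand_name` are hypothetical helpers, and nothing here reflects the actual on-disk registry layout.

```python
def compress_name(name: str) -> tuple[bytes, bool]:
    """Store a name as one byte per character when it is pure ASCII.

    Returns (stored_bytes, compressed_flag). Hypothetical helper,
    not the real registry format.
    """
    try:
        return name.encode("ascii"), True
    except UnicodeEncodeError:
        return name.encode("utf-16-le"), False


def expand_name(stored: bytes, compressed: bool) -> str:
    """Recover the name, padding compressed bytes back out to UTF-16LE."""
    if compressed:
        # Each ASCII byte becomes a 16-bit little-endian code unit: byte, 0x00.
        padded = b"".join(bytes((b, 0)) for b in stored)
        return padded.decode("utf-16-le")
    return stored.decode("utf-16-le")
```

For an ASCII name the stored form is half the size of its UTF-16 form, which matches the "up to 50%" figure; a name with any non-ASCII character falls back to plain UTF-16 and saves nothing.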