Dwedit • 01/17/2025 • 2 replies

Wouldn't branchless UTF-8 encoding always write 3 bytes to RAM for every character (possibly to the same address)?

Replies

ack_complete • 01/18/2025

CPUs are surprisingly good at dealing with this in their store queues. I see this write-all-and-increment-some technique used a lot in optimized code, like branchless left-pack routines or overcopying in the copy handler of an LZ/Deflate decompressor.
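To make the pattern concrete, here is a minimal sketch of a left-pack written in that style. The function name `left_pack` and its signature are made up for illustration and are not taken from any particular library. Every element is stored unconditionally, and only the output index depends on the predicate. Note that the slice bounds check is still a (well-predicted) branch; real optimized code would typically use SIMD or unchecked stores.

```rust
/// Hypothetical helper: copies the elements of `src` for which `keep`
/// returns true to the front of `dst` and returns how many were kept.
fn left_pack(src: &[u32], dst: &mut [u32], keep: impl Fn(u32) -> bool) -> usize {
    // dst must be at least as long as src so the unconditional store
    // below can never run past the end.
    assert!(dst.len() >= src.len());
    let mut n = 0;
    for &x in src {
        dst[n] = x;            // always write, possibly over a rejected element
        n += keep(x) as usize; // advance the cursor only when x is kept
    }
    n
}
```

Rejected elements are simply overwritten by the next store to the same slot, so the data-dependent branch on `keep(x)` turns into an add.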

ngoldbaum • 01/17/2025

You could do two passes over the string: first compute the total length in bytes, then fill in the buffer codepoint by codepoint.

You could also pessimistically over-allocate assuming four bytes per character and then resize afterwards.

With the API in the linked blog post it's up to the user to decide how they want to use the output [u8;4] array.
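A rough sketch of both options, for illustration only. The function names `encode_two_pass` and `encode_overallocate` are invented here, and the standard library's `char::encode_utf8` stands in for the blog post's encoder (it is not branchless itself), producing a 4-byte scratch array in the second function.

```rust
/// Option 1: measure first, then fill an exactly sized buffer.
fn encode_two_pass(chars: &[char]) -> Vec<u8> {
    // Pass 1: total encoded length in bytes.
    let total: usize = chars.iter().map(|c| c.len_utf8()).sum();
    let mut out = vec![0u8; total];
    // Pass 2: encode each code point directly into its final position.
    let mut pos = 0;
    for &c in chars {
        pos += c.encode_utf8(&mut out[pos..]).len();
    }
    out
}

/// Option 2: assume 4 bytes per character, then shrink afterwards.
fn encode_overallocate(chars: &[char]) -> Vec<u8> {
    let mut out = vec![0u8; chars.len() * 4];
    let mut pos = 0;
    for &c in chars {
        let mut tmp = [0u8; 4];
        let len = c.encode_utf8(&mut tmp).len();
        // Always store all 4 bytes; the unused tail is overwritten by the
        // next character or cut off by the truncate below.
        out[pos..pos + 4].copy_from_slice(&tmp);
        pos += len;
    }
    out.truncate(pos);
    out.shrink_to_fit();
    out
}
```

The second version always does a fixed-size 4-byte store and only the cursor advance depends on the character, at the cost of a temporarily larger allocation.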
