logoalt Hacker News

AlienRobotyesterday at 10:29 PM1 replyview on HN

>The problem now doubles with the introduction of UTF-8. Your string size is in bytes and you need to track characters separately.

That isn't really a problem.

The problem with null-terminated strings is specifically what happens when you reach the end of the allocated array and there ISN'T a NULL character.

Every string function is designed to keep going until it finds the NULL character, so if a hacker gets rid of the NULL character, he can exploit pretty much any standard string manipulation function being used elsewhere in the program to manipulate whatever memory comes AFTER the string data structure.

No other data structure works like this. You can't mess this up in an array, because no function that manipulates arrays is just going to keep going until there is a null. That would be stupid because it would require users of the function to add a NULL to the end of their arrays before passing it to the function, so instead we just pass the size of the array to everything. Strings are the only data structure that assume there will be a NULL at end.

By the way, I read once that if you use UTF-32 every code point will be 4 bytes, constantly, but even then a single code point isn't necessarily a single character. Text is just complicated.


Replies

tredre3yesterday at 11:10 PM

> No other data structure works like this.

In C most data structures work like this, you keep going until you find NUL (character) or NULL (pointer). E.g. Strings, array of pointers, linked lists, etc. Of course you can add length to most of those, but it isn't the canonical/traditional way of doing things.

show 1 reply