Hacker News

delta_p_delta_x · today at 3:17 AM · 3 replies

The zero-terminated string is by far C's worst design decision. It is single-handedly the cause of most performance, correctness, and security bugs in C code, including many high-profile CVEs. I really do wish Pascal strings had caught on earlier and platform/kernel APIs had used them, instead of an unqualified pointer-to-char that hides an O(n) string traversal (by the platform) to find the null byte.

There are then questions about the width of the length prefix, with a simple solution: make it a platform-specific detail and use the machine word. 16-bit platforms get strings of up to ~2^16 bytes, 32-bit platforms get 2^32 (a 4 GB string, more than 1000× as long as the entire Lord of the Rings trilogy), and 64-bit platforms get 2^64 (~10^19).

Edit: I think a lot of commenters are focusing on the 'Pascalness' of Pascal strings, which I was using as an umbrella term for length-prefixed strings.


Replies

david2ndaccount · today at 3:34 AM

Pascal strings might be the only string design worse than C strings. C strings at least let you take a zero-copy substring of the tail; Pascal strings require a copy for any substring! Strings should be two machine words, length + pointer (what is commonly called a string view). This is no different from any other array view. Strings are not a special case.

show 6 replies
theamk · today at 3:47 AM

The first common 32-bit system was Windows 95, which required 4 MB of RAM (not GB!). A 4-byte prefix would have been considered extremely wasteful in those days: maybe not for a single string, but anywhere a list of strings is involved, such as a table of constants. (As a point of reference, Turbo Pascal's default strings still had a 1-byte length field.)

Plus, C-style strings allow a lot of optimizations: if you have a mutable buffer of data, you can make strings out of it with zero copies and zero allocations. strtok(3) is an example of this approach, but I've implemented plenty of similar parsers back in the day. INI, CSV, JSON, XML: query the file size, allocate a buffer once, read the file into it, drop some NULs into strategic positions, maybe shuffle some bytes around for the rare escape case, and you have a whole bunch of C strings, ready to use, with no length limits.

Compared to this, Pascal strings would be incredibly painful to use... So you query the file size, allocate, read it, and then what? A 1-byte length is too short, and for a 2+ byte length, you need a secondary buffer to copy each string into. And how big should that buffer be? Are you going to dynamically resize it or waste some space?

And sure, _today_ I no longer write code like that; I don't mind dropping std::string into my code. It's just a meg or so of libraries and 3× overhead for short strings, and that's nothing these days. But back when those conventions were established, it was really, really important.

show 4 replies
jmyeet · today at 4:40 AM

The C string, and C++'s backwards compatibility in supporting it, is why I think both C and C++ are irredeemable. Beyond the bounds-overflow issue, there's no concept of ownership. If you pass a string to a C function, who is responsible for freeing it? You? The function you called? What if freeing it is conditional somehow? How would you know? What if an error prevents that free?

C++ strings had no choice but to copy the underlying string because of this unknown ownership, and then added more ownership issues by letting you extract the naked pointer within and pass it to C functions. In fact, that's an issue with pretty much every C++ container, including the smart pointers: you can just call get() and break out of the lifecycle management in unpredictable ways.

string_view came much later onto the scene and doesn't have ownership, so you avoid a sometimes-unnecessary copy, but honestly it just makes things more complex.

I honestly think that as long as we continue to use C/C++ for crucial software and operating systems, we'll be dealing with buffer overflow CVEs until the end of time.

show 1 reply