HelloNurse, yesterday at 3:20 PM

While C is at a serious disadvantage compared to higher-level languages when it comes to avoiding gratuitous mistakes, your discussion of UB pitfalls in C is aimed at a strawman.

First of all, traffic rules are good, and similar to good C programming rules: check number value ranges when there is a chance of casting or overflow, check Inf and NaN floating point values, declare alignment strategically (e.g. in all memory allocations) to avoid misaligned pointers and variables, and so on. Such rules have alternatives and exceptions and must not be part of the language.
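A minimal sketch of what such rules can look like in practice, assuming three hypothetical helpers (`narrow_to_i16`, `checked_add`, `is_usable`) that are not from any library; alignment is left out here:

```c
#include <limits.h>
#include <math.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: narrow a long to int16_t only if the value fits. */
static bool narrow_to_i16(long value, int16_t *out)
{
    if (value < INT16_MIN || value > INT16_MAX) {
        return false;
    }
    *out = (int16_t)value;
    return true;
}

/* Hypothetical helper: add two ints, refusing instead of overflowing.
 * The test runs before the addition because signed overflow is UB. */
static bool checked_add(int a, int b, int *out)
{
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b)) {
        return false;
    }
    *out = a + b;
    return true;
}

/* Hypothetical helper: reject Inf and NaN before using a double. */
static bool is_usable(double x)
{
    return isfinite(x) != 0;
}
```

Each helper reports failure through its return value, so the caller decides what to do with an out-of-range input.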

Second, nobody needs perfection and "UB-freeness": it is reasonable to assume that many cases of UB won't be a problem, either because a library will be used correctly and they won't happen, or because the C implementation is neither weird nor hostile and they will be as benign as defined or implementation defined behaviour, or simply because we avoid doing something known to be inexact or hard to write correctly.

Practical programming requires knowing the relevant rules for what one is doing and learning new ones by making, diagnosing and overcoming mistakes; not omniscience, and definitely not the unfounded feeling of omniscience and unlimited resources that LLMs can give.

EDIT: I insist on the signed char example because it would be terribly wrong (processing who-knows-what as if it were a sequence of characters) even without undefined behaviour, even in different languages.


Replies

thomashabets2, today at 7:17 AM

> Second, nobody needs perfection and "UB-freeness"

Sure. You only care about the ones that manifest security issues, stability issues, or other corruption. But of course those change over time as compilers change.

So while far from every instance of UB will manifest as a problem, every single one has some small chance of doing so. They're all tiny liabilities that add up.

But which ones will? Reminds me of https://www.lesswrong.com/posts/ooypcn7qFzsMcy53R/infinite-c...

> because the C implementation is neither weird nor hostile

Some people definitely were screaming at GCC for being hostile when it removed the NULL check in the kernel:

    int foo = bar->baz;
    if (!bar) {
      return -EINVAL;
    }
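Dereferencing a null pointer is UB, so once `bar->baz` has been read the compiler may assume `bar` is non-null and drop the later check. A minimal sketch of the safe ordering, assuming a hypothetical `struct bar` and wrapper `read_baz` standing in for the kernel code:

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical type standing in for the kernel structure in the snippet. */
struct bar {
    int baz;
};

/* Check the pointer before the first dereference, so there is no earlier
 * access from which the compiler could infer that it is non-null. */
static int read_baz(const struct bar *bar, int *out)
{
    if (!bar) {
        return -EINVAL;
    }
    *out = bar->baz;
    return 0;
}
```

GCC also has `-fno-delete-null-pointer-checks`, which turns this optimization off.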

> the unfounded feeling of omniscience and unlimited resources that LLMs can give.

I definitely don't have that. I'm not saying LLMs find all bugs (now or in the future), nor that they are an unlimited resource.

I'm just saying that for finding UB and subtle bugs, they find orders of magnitude more, especially in C and C++.

I am not saying they find a strict superset of the bugs a human would find. But take me running this against cosmopolitan libc: https://news.ycombinator.com/item?id=48206377. It took me basically zero human time to spin off, it took a couple of minutes (5.5 in xhigh effort) to run, and it found 5-10 cases of UB, one of which I think is a user-visible parsing error of SSH keys. Another is a set of double-frees, which is definitely a thing that gets exploited over and over.

Would I have found these, in an unknown-to-me codebase no less, by reading source code manually all day? Of course not. Would I have found them with the likes of UBSAN? jart claims to have used it (https://news.ycombinator.com/item?id=48205545), and apparently it didn't find them.

LLMs are just one of the tools to use. A tool that does better than any tool or human has done in the last half century.

> I insist on the signed char example because it would be terribly wrong

The char situation is terrible in C. It's perfectly safe to hold bytes in a char, signed char, and unsigned char, and convert between them. But then integer promotion rules combine with the historical choice of having isdigit take an int to break things.

If isdigit took a char, of any signedness, then there wouldn't be a problem. But that EOF ruins it.
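The usual workaround is to cast to `unsigned char` before the call, since `isdigit` requires an argument that is either representable as `unsigned char` or equal to `EOF`. A minimal sketch, assuming a hypothetical helper `count_digits`:

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: count decimal digits in a byte buffer. */
static size_t count_digits(const char *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        /* Without the cast, a byte >= 0x80 becomes a negative int where
         * char is signed, and passing that to isdigit is UB. */
        if (isdigit((unsigned char)s[i])) {
            n++;
        }
    }
    return n;
}
```

With the cast, bytes such as 0xE9 or 0xFF are classified as non-digits instead of triggering UB.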

> processing who-knows-what as if it were a sequence of characters

A "char" hasn't been "a character" in any meaningful sense in a long long time. Or rather, "a character" is not a code point or grapheme cluster. For byte processing, since they cast perfectly fine, it's fine. Or do you have some interesting example?
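To illustrate the gap between a `char` and a character, here is a minimal sketch that counts UTF-8 code points by skipping continuation bytes. The helper name `utf8_code_points` is hypothetical, and it assumes the input is valid UTF-8:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count code points in a NUL-terminated UTF-8 string.
 * Continuation bytes look like 10xxxxxx, so every other byte starts a
 * new code point. No validation is done. */
static size_t utf8_code_points(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            n++;
        }
    }
    return n;
}
```

For "naïve" encoded as UTF-8, `strlen` reports 6 bytes while this helper reports 5 code points, and grapheme clusters would be a third count again.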