> Second, nobody needs perfection and "UB-freeness"
Sure. You only care about the ones that manifest as security issues, stability issues, or other corruption. But of course which ones those are changes over time as compilers change.
So while far from every instance of UB will turn into a problem, every single one has the potential to, with some small probability. They're all tiny liabilities that add up.
But which ones will? Reminds me of https://www.lesswrong.com/posts/ooypcn7qFzsMcy53R/infinite-c...
> because the C implementation is neither weird nor hostile
Some people definitely were screaming at GCC for being hostile when it removed the NULL check in the kernel:
    int foo = bar->baz;
    if (!bar) {
        return -EINVAL;
    }
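To spell out why that order matters, here is a minimal sketch, not the actual kernel code. `struct bar`, `read_baz_unchecked`, and `read_baz_checked` are names I made up for illustration. Because the dereference comes first, the compiler is allowed to assume the pointer is non-NULL and drop the later check as dead code. Doing the check first avoids the UB.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical struct, standing in for whatever the real code used. */
struct bar {
    int baz;
};

/* Problematic order: the dereference happens before the check, so the
 * compiler may assume bar != NULL and remove the check entirely. */
int read_baz_unchecked(const struct bar *bar)
{
    int foo = bar->baz; /* UB if bar is NULL */
    if (!bar) {
        return -EINVAL;
    }
    return foo;
}

/* Fixed order: check first, dereference only afterwards. */
int read_baz_checked(const struct bar *bar)
{
    if (!bar) {
        return -EINVAL;
    }
    return bar->baz;
}
```

GCC also has -fno-delete-null-pointer-checks to turn that optimization off, but reordering the code is the fix that doesn't depend on compiler flags.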
> the unfounded feeling of omniscience and unlimited resources that LLMs can give.
I definitely don't have that. I'm not saying LLMs find all bugs (now or in the future), nor that they are an unlimited resource.
I'm just saying that for finding UB and subtle bugs, they find orders of magnitude more than other tools or manual review do, especially in C and C++.
I am not saying they find a strict superset of the bugs a human would find. But take me running this against cosmopolitan libc: https://news.ycombinator.com/item?id=48206377. It took basically zero human time to kick off, ran for a couple of minutes (5.5 in xhigh effort), and found 5-10 cases of UB, one of which I think is a user-visible parsing error of SSH keys. Another is a set of double-frees, which is definitely a thing that gets exploited over and over.
Would I have found these, in an unknown-to-me codebase no less, by reading source code manually all day? Of course not. Would I have found them with the likes of UBSAN? jart claims to have used it (https://news.ycombinator.com/item?id=48205545), and apparently it didn't catch them.
LLMs are just one of the tools to use. A tool that does better than any tool or human has done in the last half century.
> I insist on the signed char example because it would be terribly wrong
The char situation is terrible in C. It's perfectly safe to hold bytes in a char, signed char, and unsigned char, and convert between them. But then integer promotion rules combine with the historical choice of having isdigit take an int to break things.
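If isdigit took a char, of any signedness, then there wouldn't be a problem. But it also has to accept EOF, so it takes an int whose value must be either EOF or representable as unsigned char. A negative char promoted to int is out of that range (or collides with EOF), and that's what ruins it.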
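The usual workaround, as a rough sketch: cast to unsigned char before calling isdigit, so the promoted value is always in range. `is_ascii_digit` and `count_digits` are hypothetical helper names of mine, not from any library.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: safe for any byte value, whether plain char is
 * signed or unsigned on this platform. */
int is_ascii_digit(char c)
{
    /* The cast keeps the argument within 0..UCHAR_MAX, which is what
     * isdigit requires (apart from EOF). */
    return isdigit((unsigned char)c) != 0;
}

/* Count decimal digits in a buffer that may contain arbitrary bytes. */
size_t count_digits(const char *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (is_ascii_digit(s[i])) {
            n++;
        }
    }
    return n;
}
```

Without the cast, a byte like 0xE9 in a signed char becomes a negative int, and passing that to isdigit is UB.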
> processing who-knows-what as if it were a sequence of characters
A "char" hasn't been "a character" in any meaningful sense in a long long time. Or rather, "a character" is not a code point or grapheme cluster. For byte processing, since they cast perfectly fine, it's fine. Or do you have some interesting example?