The UB in unaligned pointers is even worse: an unaligned pointer in itself is UB, not only an access to it. So even implicit casting a void*v to an int*i (like 'i=v' in C or 'f(v)' when f() accepts an int*) is UB if the cast pointer is not aligned to int.
It is important to understand that this is a C level problem: if you have UB in your C program, then your C program is broken, i.e., it is formally invalid and wrong, because it is against the C language spec. UB is not on the HW, it has nothing to do with crashes or faults. That cast from void* to int* most likely corresponds to no code on the HW at all -- types are in C only, not on the HW, so a cast is a reinterpretation at C level -- and no HW will crash on that cast (because there is not even code for it). You may think that an integer value in a register must be fine, right? No, because it's not about pointers actually being integers in registers on your HW, but your C program is broken by definition if the cast pointer is unaligned.
The 5 stages of learning about UB in C:
-Denial: "I know what signed overflow does on my machine."
-Anger: "This compiler is trash! why doesn't it just do what I say!?"
-Bargaining: "I'm submitting this proposal to wg14 to fix C..."
-Depression: "Can you rely on C code for anything?"
-Acceptance: "Just dont write UB."
The examples aren't really undefined behavior. They are examples that could become UB based on input/circumstances. Which if you are going to be that generous, every function call is UB because it could exceed stack space. Which is basically true in any language (up to the equivalent def of UB in that language). I feel like c has enough actual rough edges that deserve attention that sensationalism like this muddies folks attention (particularly novices) and can end up doing more harm than good.
The problem of UB is not really that it may crash in some architecture. The real problem is that the compiler expects UB code to NOT happen, so if you write UB code anyway the compiler (and especially the optimizer) is allowed to translate that to anything that's convenient for its happy path. And sometimes that "anything" can be really unexpected (like removing big chunks of code).
I have never in my 20 years of writing C heard so much about undefined behavior as I have in the past 6 months on Hacker News. It has never entered the conversation. You write the code. If it doesn't work, you debug it and apply a fix or a workaround. Why does the idea of undefined behavior in C get to the front page so consistently?
Some of the C++ code in this article has not been idiomatic in over a decade, and would be considered a code smell today. The language has evolved into quite a different language than when it was first created. As soon as I saw all of those raw pointers and direct pointer access, it was clear that at least part of this article should be taken with a grain of salt.
The other obvious issue with the overall perspective is that C and C++ are being thrown together directly as if somehow they’re nearly the same language, but they are really very far apart nowadays.
As much as I agree with the intro, these examples aren't good and the overall article is just a veil for pushing LLM coding.
Is this a correct understanding of UB in C? A program P has a set of inputs A that do not trigger UB, and a complementary set of inputs B that do trigger UB. A correct compiler compiles P into an executable P'. For all inputs in A, P' should behave the same as P. However, for any input in B, the is absolutely no requirements on the behavior of P'.
A concrete example of undefined behavior caused by an unaligned pointer: https://pzemtsov.github.io/2016/11/06/bug-story-alignment-on...
> A problem with this is that in order to confirm the findings, you’ll need an expert human. But generally expert humans are busy doing other things.
The article suggests using LLMs to identify and fix UB. However as per the above, I think the issue is that we need more expert humans.
LLM generated code will eventually contain UB.
EDIT: added "eventually"
Very bad advice. Of course good new LLM's know about UB, but you still need to use ubsan (ie - fsanitize=undefined), and not your LLM.
Can anyone explain why this is undefined behaviour? UBSan calls it "indirect call of a function through a function pointer of the wrong type"
struct foo {int i;};
int func(struct foo *x) {return x->i;}
int main() {
int (*funcptr)(void*) = (int (*)(void*)) &func;
struct foo foo = { 42 };
return funcptr(&foo);
}
While this is all kosher per the language lawyers: struct foo {int i;};
int func(void *x) {return ((struct foo *)x)->i;}
int main() {
int (*funcptr)(void*) = &func;
struct foo foo = { 42 };
return funcptr(&foo);
}Well, you can't write malloc in conforming C, which hurts rather more than remembering to write bitcast as memcpy on char pointers.
Doesn't matter though because you aren't writing standards conforming C. You're writing whatever dialect your compilers support, and that's probably (module bugs) much better behaved than the spec suggests.
Or you're writing C++ and way more exposed to the adversarial-and-benevolent compiler experience.
The type aliasing rules are the only ones that routinely cause me much annoyance in C and there's always a workaround, whether if it's the launder intrinsic used to implement C++, the may_alias attribute or in extremis dropping into asm. So they're a nuisance not a blocker.
For a deep dive on UB with printf, see https://srs.fyi/see-conversions/
> When programming in C, to avoid unexpected pitfalls, one must be acutely aware of a whole slew of implicit behaviors (some of which are implementation-defined or even undefined).
"My point is that ALL nontrivial C and C++ code has UB."
Is "nontrivial" defined
How would one identify "nontrivial" C code
Is there an objective measure (defined)
Or is it a matter of personal opinion that could vary from person to person (undefined)
> The compiler, and really the underlying hardware too, is playing a game of telephone with your UB intentions.
The part about hardware is wrong BTW. In all the cases about null pointers and out-of-bounds access and integer overflow and whatnot, the hardware semantics are clearly defined, and the assembler code does exactly what is written. The way modern compilers act on your code makes C less safe than assembler in that sense.
Integer promotion seems to be the source of many signed integer overflow UB. Why does C have it? Does integer promotion ever have a good part?
I read through this in detail... Is it just me, or are these things that are invoked by intentionally bypassing the typing?
I mean, you have to go out of your way and use a cast to get the UB in the first example.
For the `isxdigit` implementation, using a parameter to index into an array without a length check is pretty suspect already. I don't think any of my code actually indexes an array without checking the length in some way.
For the float -> int conversion, converting a float to an int without picking a conversion does not make sense in the first place - math.h has rounding and ceiling functions.
> For all you know the compiler has no internal way to even express your intention here.
I'm human, not a compiler, and even I cannot tell what the intention is behind trying to call NULL as a function. What exactly is expected to happen?
> Because the argument needs to be a pointer, and the NULL macro may be misinterpreted as an integer zero.
I don't think this is true for C. The NULL macro is defined to be a pointer in the C standard, AFAIK. Just because comparisons with zero are allowed, does not imply that the standard implicitly promotes NULL to `int`.
I think only the final one is of note (the 24-bit shift assigned to a uint64_t).
I want a language that is a group of bit (0,1) and the xor operator. Everything else is built on top of that.
Excellent post. But it's addressed to the wrong people.
The problem lies with compilers, not with the language and its specification, or with the creators of the C programming language.
Anyone can write a compiler that transforms all undefined behaviors (UB) into defined behaviors (DB). And your compiler will be used by people, including me.
C is still, by far, the simplest language that we have.
Although many newer languages are safer (with the exclusion of Rust, primarily by being slower) the same kinds of issues that are there in C are there in these languages, their effects are just harder to see.
People complain about C as though they know how to fix it.
The scariest part is how many production systems rely on undefined behavior without anyone knowing until a compiler update breaks everything.
When talking UB, putting C and C++ in the same basket is basically like comparing drunk driving a car and riding a bicycle sober... Both means of transport, very different experience.
C does not abstract differences in underlying hardware well. Systems programmers know if they have an architecture that can't handle unaligned accesses or that the address they are doing load/stores from is a mmio register. Systems programmers know the difference between a virtual address and a physical address and have debugged MPU faults or MMU table walks and page faults more times than they want to think about.
C is horrible for trying to write a portable user-mode program in 2026. There are lots of better options.
C is great for writing low-level system code where you need to optimize performance down to the last cycle. It not abstracting away the hardware is super important for some use cases. A classic example is all of the platform-specific flavors of memcpy in the Linux kernel that are C/assembly hybrids hand-optimized for the SIMD pipelines of some CPUs.
C is a tool, Rust is a tool, Java is a tool, Python is a tool. Use the right tool for the job ¯\_(ツ)_/¯.
A fun one that'd fit list be sequence point violations like
i = i++Maybe we should criminalize writing articles about Undefined Behavior that have a "So what do we do now?" subheader but omit any mention of UBSan.
Very interesting article. I'm in love with C++, and I cannot say that I'm a good developer, but interesting to discover where UB can be. (Sorry I'm not a good english speaker)
Is there a way to avoid undefined behavior Im C then? Could we write a new C compiler that adds some checks and fixes (e.g. raise documented exceptions) to each undefined behavior?
I really like Zig's approach to UB. Especially alignment is a part of type. And all this wordy builtins for conversions. Starring to it makes you think what you doing wrong with data model it requires now 3 lines of casting expression.
Is comparing a signed integer with an unsigned integer UB? I resently wrote some code and compiled it with gcc to x86_64 (without optimization) that returned an incorrect answer.
U just need to read the title and 5 lines to know this must be a rust guy.
shameless plug, it's part of the Nerd Encyclopedia: it's also called "nasal demons".
https://nickyreinert.de/2023/2023-05-16-nerd-enzyklop%C3%A4d...
In C / C++ there are two kinds of undefined behaviour. One is where there is written in standard what UB is. Another one is everthing else that is not in standard.
most languages don't even HAVE a specification so in most languages literally EVERYTHING everything is undefined behavior
The art is actually making sure it all stays defined behavior
Isn't the article mostly saying that SPARC sucks?
if c is more ub unsafe than it seems,what is the solution here
It's also worth highlighting that C is perhaps the most officially standardized programming language in history.
What a contradiction. Strong evidence that standard-driven programming language development is much worse than implementation-driven development. Standards should be used for data types and external interfaces/protocols, not programming languages.
The issue for me with posts like this is that it misses the issue.
Unaligned pointer accesses are UB because different systems handle it differently. This 'should' be to allow the program to be portable by doing what the system normally does.
Instead it's been highjacked by compiler writers, with the logic that "X is UB, therefore can't happen, therefore can be optimised away."
Int c = abs(a) + abs(b); If (a > c) //overflow
Is UB because some system might do overflow differently. In practice every system wraps around.
That should be a valid check, instead it gets optimised away because it 'can't' happen.
C gives you enough rope to hang yourself. The compiler writers don't trust you to use the rope properly.
How can it be valid implementation of isxdigit?
``` int isxdigit(int c) { if (c == EOF) { return false; } return some_array[c]; } ```
If you write code like this, then everything in programming is UB.
I stoped reading about here:
> bool parse_packet(const uint8_t* bytes) {
> const int* magic_intp = (const int*)bytes; // UB!
Author, if you are reading this, please cite the spec section explaining that this is UB. Dereferencing the produced pointer may be UB, but casting itself is not, since uint8_t is ~ char and char* can be cast to and from any type.you might try to argue that uint8_t is not necessarily char, and while it is true that implementations of C can exist where CHAR_BIT > 8, but those do not have uint8_t defined (as per spec), so if you have uint8_t, then it is "unsigned char", which makes this cast perfectly safe and defined as far as i can tell. Of course CHAR_BIT is required to be >= 8, so if it is not >8, it is exactly 8. (In any case, whether uint8_t is literally a typedef of unsigned char is implementation-defined and not actually relevant to whether the cast itself is valid -- it is)
maybe rewrite this in go?)
From the ANSI C standard:
3.16 undefined behavior: Behavior, upon use of a nonportable or erroneous program construct, of erroneous data, or of indeterminately valued objects, for which this International Standard imposes no requirements. Permissible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message).
Is it just me or did compiler writers apply overly legalistic interpretation to the "no requirements" part in this paragraph? The intent here is extremely clear, that undefined behavior means you're doing something not intended or specified by the language, but that the consequence of this should be somewhat bounded or as expected for the target machine. This is closer to our old school understanding of UB.By 'bounded', this obviously ignores the security consequences of e.g. buffer overflows, but just because UB can be exploited doesn't mean it's appropriate for e.g. the compiler to exploit it too, that clearly violates the intent of this paragraph.
feels like https://xkcd.com/1499/
the only people complaining about being able to do awful things are people that do awful things
UB can also have impact in logical cohesion of codebase.
a good case can be made that use of C++ is a SOX violation
So Linus was right? But for a second reason too:
C++ is a horrible language. It’s made more horrible by the fact that a lot of substandard programmers use it, to the point where it’s much, much easier to generate total and utter crap with it. Quite frankly, even if the choice of C were to do _nothing_ but keep the C++ programmers out, that in itself would be a huge reason to use C.
That is, accepting C++ code from programmers who use C++ could be a SOX violation ;-)
Yes there is tons of surprising and weird UB in C, but this article doesn't do a great job of showcasing it. It barely scratches the surface.
Here's a way weirder example:
This is totally fine if x is just an int, but the volatile makes it UB. Why? 5.1.2.4.1 says any volatile access - including just reading it - is a side effect. 6.5.1.2 says that unsequenced side effects on the same scalar object (in this case, x) are UB. 6.5.3.3.8 tells us that the evaluations of function arguments are indeterminately sequenced w.r.t. each other.So in common parlance, a "data race" is any concurrent accesses to the same object from different threads, at least one of which is a write. In C, we can have a data race on a single thread and without any writes!