It's common for compilers to generate mildly unusual code because they translate high-level code into an abstract intermediate notation, run a variety of optimization steps on that notation, and then emit machine-specific code to perform whatever the optimizations yielded. There's no constraint along the lines of "but select the most logical opcode for this task".
The claim that the code is inefficient is really not substantiated well in this blog post. Sometimes, long-winded assembly actually runs faster because of pipelining, register aliasing, and other quirks. Other times, a "weird" way of zeroing a register may actually take up less space in memory, etc.
In my experience, C++ abstractions give the optimizer a harder job, and thus it generates worse code. In this case, clang emits different code if you write a C version[0] versus the C++ original[1].
Usually an abstraction like this means the compiler has to emit generic code first. It is then harder for the optimizer to propagate constraints through that code and arrive at the same final assembly, because it's less similar to the "canonical" version of the code that wouldn't use a magic `==` (in this case), std::vector methods, or something else like that.
Compilers also like to unnecessarily copy data to the stack: https://github.com/llvm/llvm-project/issues/53348 This can be particularly annoying in cryptographic code, where you want to minimize the number of copies of sensitive data.
With `u32` as the element type, rustc 1.93 (with `-O`) does the correct thing for size=1, checks both elements separately (i.e. worse than in the article) for size=2, checks all three elements separately (i.e. not being crazy like in the article) for size=3, and starts doing SIMD at size=4.
The OP should try compiling with `-march=native` so the compiler can use vector instructions.
Slightly off-topic, but I like this way of testing whether memory is all zeroes: https://rusty.ozlabs.org/2015/10/20/ccanmems-memeqzero-itera... (see "epiphany #2" at the bottom of the page). I really wish there were a standard libc function for it.
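For anyone who doesn't want to click through, here is a rough sketch of the trick as I understand it: check a short prefix by hand, then let `memcmp` compare the buffer against itself shifted by that prefix. If the first 16 bytes are zero and every byte equals the byte 16 positions before it, the whole buffer is zero. The function name `memeqzero_sketch` and the 16-byte prefix length are my own choices for illustration, not necessarily what the linked post uses.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical name; a sketch of the overlapping-memcmp idea, not the
 * exact code from the linked post. Returns true if all `length` bytes
 * starting at `data` are zero. */
bool memeqzero_sketch(const void *data, size_t length)
{
    const unsigned char *p = data;
    size_t i;

    /* Check up to the first 16 bytes one at a time. */
    for (i = 0; i < 16; i++) {
        if (length == 0)
            return true;   /* ran out of bytes, all were zero */
        if (*p != 0)
            return false;
        p++;
        length--;
    }

    /* The first 16 bytes are zero. Comparing the buffer with itself
     * shifted by 16 bytes checks data[i] == data[i + 16] for the rest,
     * which by induction means every remaining byte is zero too.
     * memcmp only reads, so the overlapping ranges are fine. */
    return memcmp(data, p, length) == 0;
}
```

The appeal is that the bulk of the work lands in `memcmp`, which libc implementations usually optimize heavily, so you get the fast path without writing any SIMD yourself.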