Hacker News

We found a bug in Go's ARM64 compiler

652 points by jgrahamc yesterday at 1:33 PM | 99 comments

Comments

Neywiny yesterday at 3:38 PM

That's an incredible find, and once I saw the assembly I was right along with them on the debug path. Interestingly, it doesn't need to be assembly for this to work; that's just where the split happened. The IR could've done it, it just doesn't, for very good reasons. So another win for being able to read ARM assembly.

I'm unsure whether this would be another way to do it, but to save an instruction at the cost of a memory access, could you push and then pop the stack size? Presumably you're doing that pair of moves on function entry and exit anyway. I'm not really sure what the garbage collector is looking for, so maybe that doesn't work, but I'd be interested to hear some takes on it.

pengaru yesterday at 4:30 PM

For the impatient, here's the fix: https://github.com/golang/go/commit/f7cc61e7d7f77521e073137c...

renewiltord yesterday at 4:53 PM

Great technical blog. Good narrative path, tight examples, and a description so clear it makes me feel smarter than I am: it was easy to follow even though the last time I read assembly seriously was x86, years ago.

Also, fulfills the marketing objective because I cannot help but think that this team is a bunch of hotshots who have the skill to do this on demand and the quality discipline to chase down rare issues.

I assume these are Ampere Altra? I was considering some of those for web servers to fill out my rack (more space than power) but ended up just going higher on power and using Epyc.

riobard yesterday at 4:39 PM

What ARM64 machines are you using and what are they used for? Last year you were announcing Gen 12 servers on AMD EPYC (https://blog.cloudflare.com/gen-12-servers/), but IIRC there weren’t any mentions of ARM64. But now it seems you’re running ARM64 in full production.

MarkSweep yesterday at 11:26 PM

I wonder if Go had a mode where you make it single step every instruction and trigger a GC interrupt on every opcode. That would make it easier to find these kinds of bugs.
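
A minimal sketch of that idea on a toy model, in Go. Nothing here is a real Go runtime mode: `machine`, `instr`, `violations` and `addSP` are hypothetical names, and the addresses are made up. The harness "interrupts" after every instruction and checks an invariant a stack walker might rely on, namely that the frame is either fully present or fully released.

```go
package main

import "fmt"

// machine is a toy CPU state: just a stack pointer.
type machine struct{ sp uint64 }

// instr is one indivisible instruction in the toy model.
type instr func(m *machine)

// violations runs prog one instruction at a time and evaluates the
// invariant at every instruction boundary, as if an interrupt arrived
// there. It returns the indices of the instructions after which the
// invariant did not hold.
func violations(m *machine, prog []instr, invariant func(*machine) bool) []int {
	var bad []int
	for i, in := range prog {
		in(m)
		if !invariant(m) {
			bad = append(bad, i)
		}
	}
	return bad
}

// addSP returns an instruction that adds n to the stack pointer.
func addSP(n uint64) instr { return func(m *machine) { m.sp += n } }

func main() {
	const entry, frame = 0x10000, 0x10050 // hypothetical values

	// Invariant assumed for this model: the stack pointer is either at
	// its value before the adjustment or at its value after it.
	ok := func(m *machine) bool { return m.sp == entry || m.sp == entry+frame }

	split := []instr{addSP(0x50), addSP(0x10000)} // adjustment in two steps
	single := []instr{addSP(frame)}               // adjustment in one step

	fmt.Println("split: ", violations(&machine{sp: entry}, split, ok))
	fmt.Println("single:", violations(&machine{sp: entry}, single, ok))
}
```

In this model the two-step adjustment fails the check after its first instruction, while the one-step adjustment never does. A real version of this would need the runtime to deliver a preemption at every program counter, which is an assumption about tooling rather than something the thread confirms exists.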

alberth yesterday at 11:40 PM

I thought Cloudflare was 100% Rust, and x86 (EPYC) these days.

Interesting to hear Go & ARM in use.

dreamcompiler yesterday at 4:12 PM

Always adjust your stack pointer atomically, kids.

Agingcoder yesterday at 3:51 PM

Excellent article, as always from the Cloudflare blog: engineering without magic infrastructure and ML. One day I will apply!

Compiler bugs are actually quite common (I used to find several a year in GCC), but as the author says, some of them only appear when you work at a very large scale, and most people never dive that far.

brcmthrowaway yesterday at 6:41 PM

I don't get it. How were the machine threads being stopped in the middle of two instructions? This is bare metal, right?

javierhonduco yesterday at 3:06 PM

Really enjoyed reading this. Thanks for writing it!

wat10000 yesterday at 5:44 PM

I would have thought that unwinding would use the frame pointer and this wouldn't be a problem.

quotemstr yesterday at 10:03 PM

This problem strikes me more as a debuginfo generation bug than a "compiler" bug.

> After this change, stacks larger than 1<<12 will build the offset in a temporary register and then add that to rsp in a single, indivisible opcode. A goroutine can be preempted before or after the stack pointer modification, but never during. This means that the stack pointer is always valid and there is no race condition.

Seems silly to pessimize the runtime, even slightly, to account for the partial register construction. DWARF bytecode ought to be powerful enough to express the calculations needed for restoring the true stack pointer if we're between immediate adjustments.

lordnacho yesterday at 7:28 PM

> This was a very fun problem to debug.

I'm sure it was a relief to find a thorough solution that addressed the root cause. But it doesn't seem plausible that it was fun while it was unexplained. When I have this kind of bug it eats my whole attention.

Something this deep is especially frustrating. Nobody suspects the standard library or the compiler. Devs have been taught from a young age that it's always you, not the tools you were given, and that's generally true.

One time, I actually did find a standard library bug. I ended up taking apart absolutely everything on my side, because of course the last hypothesis you test is that the pieces you have from the SDK are broken. So a huge amount of time is spent chasing the wrong lead when it actually is a fundamental problem.

On top of this, the thing is a race condition, so you can't even reliably reproduce it. You think it's gone like they did initially, and then it's back. Like cancer.

mperham yesterday at 6:30 PM

Did they ever explain why netlink was involved? Or was that a red herring?

yalok yesterday at 8:02 PM

Classic problem of non-atomic stack pointer modification.

Used to have a lot of fun with those 3 decades ago.

pfdietz yesterday at 10:45 PM

I see something like this and I wonder "what testing methodology would have found this?" It has to be general, not something that would involve knowing what the bug was ahead of time.

gok yesterday at 4:46 PM

The real lesson here should be that doing crazy shit like swizzling the program counter in a signal handler and writing your own assembler is not a good idea.

neuroelectron today at 12:33 AM

I've seen only one race condition in my career, and it always surprises me how they are even found.

berz01 yesterday at 6:28 PM

[flagged]