Hacker News

How much do amd64 microarchitecture levels help in Go?

66 points by zdw last Sunday at 9:10 PM | 41 comments

Comments

seddonm1 today at 4:06 AM

I would be interested to know if there is a method similar to this one in Rust [0] that allows a single binary to support multiple optimization levels depending on the executing CPU. It feels wasteful not to enable these optimizations, but I don't really want to force a user to choose from a complex feature matrix.

[0] https://github.com/ronnychevalier/cargo-multivers
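
A minimal sketch of the single-binary, runtime-dispatch idea in Go, not a description of how cargo-multivers works. The `features` struct, `selectPopcount`, and both implementations are hypothetical names for illustration. Real detection would come from something like `golang.org/x/sys/cpu`, which is left out here so the example stays self-contained.

```go
package main

import (
	"fmt"
	"math/bits"
)

// features is a hypothetical stand-in for real CPU detection.
// In practice it would be filled in once at startup.
type features struct {
	hasPOPCNT bool
}

// popcountPortable counts set bits using only baseline integer operations.
func popcountPortable(x uint64) int {
	n := 0
	for x != 0 {
		x &= x - 1 // clear the lowest set bit
		n++
	}
	return n
}

// popcountFast delegates to math/bits, which the compiler may lower
// to a hardware instruction when the target allows it.
func popcountFast(x uint64) int {
	return bits.OnesCount64(x)
}

// selectPopcount picks one implementation based on the detected features.
func selectPopcount(f features) func(uint64) int {
	if f.hasPOPCNT {
		return popcountFast
	}
	return popcountPortable
}

func main() {
	// Assumption: detection reported POPCNT as available.
	popcount := selectPopcount(features{hasPOPCNT: true})
	fmt.Println(popcount(0xFF)) // prints 8
}
```

The choice is made once and stored in a function value, so callers pay for an indirect call rather than a feature check on every invocation.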

cperciva today at 6:11 AM

> arguably the whole scheme should be replaced by finer-grained feature detection.

This seems like a strange thing to say. Fine-grained feature detection was around long before "microarchitecture levels" and never went away. The microarchitecture levels were introduced because they were easier to use.
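
The relationship between the two can be sketched as follows: a level is just a named bundle of individual features. This is a minimal illustration in Go, assuming the feature lists from the x86-64 psABI level definitions. The lowercase feature labels, `levelFeatures`, and `level` are hypothetical names, and the feature map would normally come from CPUID-based detection rather than a literal.

```go
package main

import "fmt"

// levelFeatures lists the features each level adds on top of the previous one.
// Index 0 is what v2 adds to the v1 baseline, index 1 is v3, index 2 is v4.
var levelFeatures = [][]string{
	{"cmpxchg16b", "lahf_sahf", "popcnt", "sse3", "sse4_1", "sse4_2", "ssse3"},
	{"avx", "avx2", "bmi1", "bmi2", "f16c", "fma", "lzcnt", "movbe", "osxsave"},
	{"avx512f", "avx512bw", "avx512cd", "avx512dq", "avx512vl"},
}

// level returns the highest level (1 to 4) whose required features are all
// present, assuming the CPU already meets the v1 baseline.
func level(has map[string]bool) int {
	lvl := 1
	for _, required := range levelFeatures {
		for _, f := range required {
			if !has[f] {
				return lvl
			}
		}
		lvl++
	}
	return lvl
}

func main() {
	// Hypothetical CPU: everything in v2, plus AVX but not AVX2.
	cpu := map[string]bool{"avx": true}
	for _, f := range levelFeatures[0] {
		cpu[f] = true
	}
	fmt.Println(level(cpu)) // prints 2
}
```

The hypothetical CPU above has AVX but still reports level 2, which is the coarseness that fine-grained detection avoids.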

vintagedave today at 8:23 AM

This measurement is very focused on bit-related instructions. A few months ago we did some work on our (RemObjects) toolchain, and it showed very similar results, where our published benchmarks were for floating point.[0] I did some rough internal measurements showing the same for integer-related instructions too.

The same conclusion: v2 as baseline, v3 where possible.

I'm really surprised it's not standard in every toolchain to support arch levels like this today.

Some compilers like Clang allow multiple arch versions in one binary, runtime dispatched. I would love to implement this in our toolchain too.

[0] Please forgive the SEO-style title, it's, well, to get search engines to recognise what's in the article: https://blogs.remobjects.com/2026/01/26/fast-math-in-six-lan...

deathanatos today at 6:39 AM

> That is a 43% reduction, and it is free: no source change, just a compiler flag.

It's not entirely free; the cost is that the resulting binary will no longer run on processors that lack the instruction. Which, admittedly, is ≈2007 or older. But still! I have a 2012 CPU still in service, and as much as I'd love to obsolete it, gestures at the price tag of RAM these days.

… a 2012 CPU is surprisingly competitive relative to today's tech, too, I'd add. The gap between 2012 and 2026 is nothing compared to the equivalent gap between 1998 and 2012: 1998 is like 500MHz single-core, 32-bit. 2012 is 4 core, 8 hyper threads, 64-bit, 3.5 GHz. (… perhaps more remarkably, my next-oldest machine, a 2017 laptop, is only 2.8 GHz, with the same 4(/8) cores. It also uses like half the power, too. That's mostly the "laptop" bit, though.)

(That same CPU is also incapable of "v3".)

GianFabien today at 6:43 AM

I think the more critical question is how well compiler writers can update the heuristics which identify the instruction sequences that benefit from the architectural features. Last I looked, Intel has several thousand intrinsics which must be explicitly invoked to make use of specific features.

I suspect that heavily optimised code either uses intrinsics or carefully written assembler code.
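
For the Go case discussed in the article, there is a middle ground worth illustrating: ordinary code written against `math/bits` contains no explicit intrinsics or assembler. The sketch below enumerates the set bits of a word. Depending on the Go version and the `GOAMD64` setting, the compiler may lower these operations to single instructions such as TZCNT; that is an assumption about the toolchain, not something the code guarantees. `setBits` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"math/bits"
)

// setBits returns the positions of the set bits in x, lowest first.
func setBits(x uint64) []int {
	var out []int
	for x != 0 {
		// Index of the lowest set bit.
		out = append(out, bits.TrailingZeros64(x))
		// Clear the lowest set bit.
		x &= x - 1
	}
	return out
}

func main() {
	fmt.Println(setBits(0b101001)) // prints [0 3 5]
}
```

The source is identical at every level; only the generated machine code can differ, which is why rebuilding with a different flag can change performance without a source change.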

nevi-me today at 6:40 AM

Does Docker have uarch level support? I think, similar to arch-level support, it could be beneficial to be able to pull a v4 image.

Ubuntu started allowing users to default to v3 packages, and I opted in. I already use -C target-cpu=native to enable AVX512 when compiling binaries for local use. This matters a lot for compute/analytics workloads in my experience.

kristianp today at 6:03 AM

I'm surprised that Go doesn't default to AVX2 support by now, considering that Haswell started shipping in mid 2013.

Speaking of Dr Lemire's suggestion of a V5 architecture level, would that make any sense given the fragmentation of AVX512? It is absent from Intel consumer devices, but present on the last few generations of AMD.

jeffrallen today at 5:25 AM

This is one of the clearest examples of diminishing returns I've ever seen. It comes up everywhere.

I wonder if this is a natural law, or emergent behavior of complex systems?

pixelpoet today at 10:36 AM

These slop images he uses for his articles are so bad; here it says "accelation"...

andrewstuart today at 5:39 AM

I would have thought you’d need to explicitly code to match the cpu capabilities to your application, for maximum benefit.

pjmlp today at 5:41 AM

Nothing, because this is a compiler question, not a language one.