Hacker News

How many branches can your CPU predict?

85 points by chmaynard, last Wednesday at 11:39 PM, 49 comments

Comments

OskarS, yesterday at 2:42 PM

Hmm, that's interesting. The code as written only has one branch, the if statement (well, two, the while loop exit clause as well). My mental model of the branch predictor was that for each branch, the CPU maintained some internal state like "probably taken/not taken" or "indeterminate", and it "learned" by executing the branch many times.

But that's clearly not right, because apparently the specific data it's branching off matters too? Like, "test memory location X, and branch at location Y", and it remembers both the specific memory location and which specific branch branches off of it? That's really impressive, I didn't think branch predictors worked like that.

Or does it learn the exact pattern? "After the pattern ...0101101011000 (each 0/1 representing the branch not taken/taken), it's probably 1 next time"?
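
A toy simulation can illustrate the difference between these two mental models. This is a hypothetical sketch (made-up function names, not any real CPU's design): a lone per-branch "probably taken/not taken" counter gets stuck at 50% on an alternating branch, while 2-bit counters indexed by the recent outcome pattern, the textbook two-level scheme, learn it almost perfectly.

```python
# Hypothetical sketch of the two mental models above; not any real CPU.

def counter_only(outcomes):
    """One 2-bit saturating counter per branch: 'probably taken/not taken'."""
    c, correct = 1, 0                      # start weakly not-taken
    for taken in outcomes:
        correct += ((c >= 2) == taken)     # predict taken when c is 2 or 3
        c = min(3, c + 1) if taken else max(0, c - 1)
    return correct / len(outcomes)

def history_indexed(outcomes, bits=4):
    """2-bit counters indexed by the last `bits` outcomes: learns patterns."""
    table = [1] * (1 << bits)
    hist, correct = 0, 0
    for taken in outcomes:
        correct += ((table[hist] >= 2) == taken)
        table[hist] = min(3, table[hist] + 1) if taken else max(0, table[hist] - 1)
        hist = ((hist << 1) | taken) & ((1 << bits) - 1)
    return correct / len(outcomes)

alternating = [0, 1] * 5000  # worst case for a lone counter
```

On `alternating`, the lone counter sits at exactly 50% while the history-indexed version approaches 100% after a brief warm-up, i.e. it really has learned "after pattern X, predict Y".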

intrasight, today at 3:11 AM

I was self-taught in high school on computer architecture by reading books. I didn't own a computer, you understand, but these books served the same purpose in terms of learning CPU architectures and machine-language programming. The 6502 was the CPU I studied.

In 1985, as an EE student, I took a course in modern CPU architectures. I still recall having my mind blown when learning about branch prediction and speculative execution. It was a humbling moment, as was pretty much all of my studies at CMU.

stephencanon, yesterday at 2:10 PM

Enlarging a branch predictor requires area and timing tradeoffs. CPU designers have to balance branch predictor improvements against other improvements they could make with the same area and timing resources. What this tells you is that either Intel is more constrained for one reason or another, or Intel's designers think that they net larger wins by deploying those resources elsewhere in the CPU (which might be because they have identified larger opportunities for improvement, or because they are basing their decision making on a different sample of software, or both).

bee_rider, yesterday at 2:17 PM

I guess the generate_random_value function uses the same seed every time, so the expectation is that the branch predictor should be able to memorize it with perfect accuracy.

But the memorization capacity of the branch predictor must be a trade-off, right? This generate_random_value function is impossible to predict with heuristics, so the question is how often we actually encounter 30k-long branch patterns like that.

Which isn’t to say I have evidence to the contrary. I just have no idea how useful this capacity actually is, haha.
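
The capacity question can be sketched with a toy history-indexed predictor replaying a fixed-seed pseudo-random pattern, in the spirit of the benchmark (all names and sizes here are invented for illustration, not taken from any real core): short patterns are tracked near-perfectly, and accuracy collapses toward a coin flip once the pattern outgrows the tables.

```python
import random

def accuracy(pattern_len, history_bits=12, reps=50, seed=1234):
    """Accuracy of a toy history-indexed predictor on a repeating
    pseudo-random branch pattern (fixed seed, like the benchmark)."""
    rng = random.Random(seed)
    pattern = [rng.randint(0, 1) for _ in range(pattern_len)]
    table = [1] * (1 << history_bits)          # 2-bit saturating counters
    hist, mask = 0, (1 << history_bits) - 1
    correct = total = 0
    for _ in range(reps):
        for taken in pattern:
            correct += ((table[hist] >= 2) == taken)
            table[hist] = min(3, table[hist] + 1) if taken else max(0, table[hist] - 1)
            hist = ((hist << 1) | taken) & mask
            total += 1
    return correct / total
```

A short pattern like `accuracy(16)` comes out close to 1.0, while a pattern far beyond the table size, e.g. `accuracy(30000, reps=4)`, hovers near 0.5: the same S-curve shape the article measures, at toy scale.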

stevefan1999, today at 5:28 AM

I still remember learning about TAGE and perceptron predictors, and how machine learning and neural networks have long been used, in some form, in CPU architecture design.

The simplest two-bit saturating counter, a la the bimodal predictor, already achieves a success rate above 90%. Everything that came next is an extension of that, but the core idea, treating branch prediction as a Bayesian problem, never fades.

It is a combined effort between hardware design and compiler software, though.
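
For a concrete feel of that 90% figure, here is the bimodal idea in miniature, a single 2-bit saturating counter (a toy sketch, not any shipped design). On a loop-like branch it mispredicts essentially only the loop exit:

```python
def bimodal(outcomes):
    """One 2-bit saturating counter; predicts taken when the counter >= 2."""
    c, correct = 1, 0                      # start weakly not-taken
    for taken in outcomes:
        correct += ((c >= 2) == taken)
        c = min(3, c + 1) if taken else max(0, c - 1)
    return correct / len(outcomes)

# A 20-iteration inner loop, exited and re-entered 500 times:
loop_branch = ([1] * 19 + [0]) * 500
```

Here `bimodal(loop_branch)` lands at about 95%: one miss per loop exit, plus a brief warm-up.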

Night_Thastus, yesterday at 3:19 PM

AMD CPUs have been killing it lately, but this benchmark feels quite artificial.

It's a tiny, trivial example with 1 branch that behaves in a pseudo-random way (random, but fixed seed). I'm not sure that's a really good example of real world branching.

How would the various branch predictors perform when the branch taken varies from 0% likely to 100% likely, in say, 5% increments?

How would they perform when the contents of both paths are very heavy, which involves a lot of pipeline/SE flushing?

How would they perform when many different branches all occur in sequence?

How costly are their branch mispredictions, relative to one another?

Without info like that, this feels a little pointless.
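
The first of those sweeps is easy to prototype in software against a toy predictor (here a single 2-bit saturating counter, a stand-in, not a model of any real core): accuracy is near-perfect at the 0% and 100% ends and worst near 50%.

```python
import random

def bimodal_accuracy(p_taken, n=20000, seed=7):
    """Accuracy of a single 2-bit saturating counter when the branch is
    taken with probability p_taken (i.i.d. random outcomes)."""
    rng = random.Random(seed)
    c, correct = 1, 0
    for _ in range(n):
        taken = rng.random() < p_taken
        correct += ((c >= 2) == taken)
        c = min(3, c + 1) if taken else max(0, c - 1)
    return correct / n

# sweep the taken-probability from 0% to 100% in 5% increments
sweep = {p / 20: bimodal_accuracy(p / 20) for p in range(21)}
```

Doing the same sweep against real hardware counters, per predictor, would answer the first question directly.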

Paul_Clayton, today at 1:18 AM

By only testing one static branch, it is possible that the performance of the Intel Emerald Rapids predictor is not representative of a more realistic workload. If path information is used to index the predictor in addition to global (taken/not-taken) branch history, without being XORed with the global history (or the different data being fully mingled), or if the branch address is similarly not fully scrambled with the global history, using only one branch might leave some predictor storage unused (never indexed). Either mechanism might be useful for reducing tag overhead while limiting aliasing. Another possibility is that the associativity of the tables does not allow tags for the same static branch to differ.

(Tags could be made to differ by, e.g., XORing a limited amount of global history with the hash of the address.)

It is also possible that the AMD Zen 5 and Apple M4 have similar unused predictor capacity and simply have much larger predictors.

I did not think even TAGE predictors used 5k branch history, so there may be some compression of the data (which is only pseudorandom).

It might be interesting to unroll the loop (with sufficient spacing between branches to ensure different indexing) to see whether that measurably affects the results.

Of course, since "write to buffer" is just a store and an increment, and the compiler should be able to guarantee no buffer overflow (buffer size allocated for the worst case) and that the memory store has no side effects, the branch could be predicated: select either the new value or the old value to store, and always store. This would be a little extra work and might have store-queue issues (if not all store-queue entries can have the same address but different version numbers), so it might not be a safe optimization.
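
The always-store transformation described above looks roughly like this (a Python sketch of the idea with made-up names; in native code the compiler would emit a conditional move plus an unconditional store):

```python
def filter_branchy(values, threshold):
    out = []
    for v in values:
        if v < threshold:          # data-dependent branch the predictor sees
            out.append(v)
    return out

def filter_predicated(values, threshold):
    out = [0] * len(values)        # worst-case allocation, as noted above
    n = 0
    for v in values:
        out[n] = v                 # always store; overwritten if not kept
        n += (v < threshold)       # advance the index without branching
    return out[:n]
```

Both return the same result; the predicated version replaces the unpredictable branch with an unconditional store and an arithmetic index update.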

rsmtjohn, today at 7:33 AM

The Rust borrow checker has indirectly made me more aware of branch patterns -- it sometimes forces code restructuring that changes what the predictor actually sees.

The clearest wins I've found: replacing conditional returns in hot loops with branchless arithmetic. The predictor loves it when you stop giving it choices. Lookup tables for small bounded ranges are another one that consistently surprises me with how much headroom there still is.
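
For illustration, here are toy Python stand-ins for those two tricks (the commenter's code is Rust; these are hypothetical examples, not theirs):

```python
def clamp_branchy(x, lo, hi):
    if x < lo:                     # two conditional branches per call
        return lo
    if x > hi:
        return hi
    return x

def clamp_branchless(x, lo, hi):
    # min/max typically lower to select/conditional-move style code
    # in native backends, so the predictor never sees a choice
    return min(max(x, lo), hi)

# lookup table for a small bounded range: bit-count of a byte
POPCOUNT = [bin(i).count("1") for i in range(256)]
```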

barbegal, today at 1:59 AM

This is good work. I wish branch predictors were better reverse-engineered so CPU simulation could be improved. It would be much better to be able to accurately predict how software will perform on other processors in simulation, rather than having to go out and buy hardware to test on (which is the way we still have to do things in 2026).

infinitewars, today at 3:48 AM

By the no-free-lunch theorem, and given that this 30k random branch pattern is so atypical of the real world, the loser here (Intel) may well have the best branch predictor in actual benchmarks.

At least that's my prediction.

withinboredom, yesterday at 1:59 PM

Before switching to a hot and branchless code path, I was seeing strangely lower performance on Intel vs. AMD under load. Realizing the branch predictor was the most likely cause was a little surprising.

ww520, today at 4:51 AM

Branch prediction works really well on loops. The looping condition is mostly true, except for the very last iteration, so the loop body is always predicted to run. If you structure the loop body to have no data dependences between iterations, multiple iterations of the loop can run in parallel, greatly improving performance.
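
The last point can be sketched like this (toy Python; the speedup itself only materializes in native code, where the independent accumulator chains can execute in parallel):

```python
def sum_serial(xs):
    total = 0
    for x in xs:
        total += x                 # each iteration waits on the previous one
    return total

def sum_unrolled(xs):
    a = b = 0                      # two independent dependency chains
    for i in range(0, len(xs) - 1, 2):
        a += xs[i]
        b += xs[i + 1]
    if len(xs) % 2:                # odd leftover element
        a += xs[-1]
    return a + b
```

Both compute the same sum; the unrolled form breaks the single serial add chain in two, so the hardware can keep two additions in flight per cycle.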

atq2119, today at 5:24 AM

I find it interesting that the S-curve is much steeper for AMD than it is for the others. AMD maintains perfect prediction for much larger sizes than the others, but it also reaches essentially random behaviour earlier.

Are they really keeping a branch history that's 30k deep? Or is there some kind of hashing going on, and AMD's hash just happens to be more attuned to the PRNG used here?

Would be interesting to see how robust these results are against the choice of PRNG and seed.

piinbinary, yesterday at 7:51 PM

How does the benchmark tell how many branches were mispredicted? Is that something the processor exposes?

user070223, yesterday at 4:00 PM

Do any JIT/AOT compilers, runtimes, or hot-code optimization techniques take into account whether the branch predictor is saturated and try to recompile the code to go branchless?

rayiner, yesterday at 2:20 PM

Using random values defeats the purpose of the branch predictor. The best branch predictor for this test would be one that always predicts the branch taken or not taken.

tonetegeatinst, today at 3:53 AM

Intel is currently looking into replacing their branch prediction with a system based on astrology, tarot cards and crystal balls.

themafia, today at 2:56 AM

The testing function seems a little simple since branch density is a factor. I still love this reference:

https://blog.cloudflare.com/branch-predictor/

Should be titled: How I Learned to Stop Worrying and Love the Branch Predictor
