
Paul_Clayton, today at 1:18 AM (1 reply)

Because only one static branch is tested, the measured performance of the Intel Emerald Rapids predictor may not be representative of a more realistic workload. If path information is used to index the predictor in addition to global (taken/not taken) branch history without being XORed with the global history (or otherwise fully mingled with it), or if the branch address is similarly not fully scrambled with the global history, using only one branch might leave some predictor storage unused (never indexed). Either mechanism might be useful for reducing tag overhead while maintaining fewer aliases. Another possibility is that the associativity of the tables does not allow tags for the same static branch to differ.

(Tags could be made to differ by, e.g., XORing a limited amount of global history with the hash of the address.)
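To make the indexing and tagging idea concrete, here is a minimal sketch in C. It does not describe any real predictor: the table size, tag width, fold function, and the names `fold`, `table_index`, and `table_tag` are all assumptions chosen for illustration. It only shows how a single static branch (one fixed `pc`) can still produce different indices and tags once global history is XORed in.

```c
#include <stdint.h>

#define INDEX_BITS 10u  /* hypothetical: 1024-entry table */
#define TAG_BITS    8u  /* hypothetical tag width */

/* Fold a 64-bit value down to `bits` bits by XORing successive chunks. */
static uint32_t fold(uint64_t x, unsigned bits)
{
    uint32_t out = 0;
    while (x != 0) {
        out ^= (uint32_t)(x & ((1u << bits) - 1u));
        x >>= bits;
    }
    return out;
}

/* Index: hash of the branch address scrambled with folded global history. */
static uint32_t table_index(uint64_t pc, uint64_t ghist)
{
    return fold(pc >> 2, INDEX_BITS) ^ fold(ghist, INDEX_BITS);
}

/* Tag: hash of the address XORed with a limited amount of global history
 * (here, an assumed 8 newest history bits). */
static uint32_t table_tag(uint64_t pc, uint64_t ghist)
{
    uint32_t limited = (uint32_t)(ghist & ((1u << TAG_BITS) - 1u));
    return fold(pc >> 2, TAG_BITS) ^ limited;
}
```

With a fixed `pc`, all variation in `table_index` and `table_tag` comes from `ghist`. If the history were left out of either hash, every lookup for that branch would land on the same index or carry the same tag.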

It is also possible that the AMD Zen 5 and Apple M4 have similar unused predictor capacity and simply have much larger predictors.

I did not think even TAGE predictors used 5k branch history, so there may be some compression of the data (which is only pseudorandom).

It might be interesting to unroll the loop (with sufficient spacing between branches to ensure different indexing) to see whether doing so measurably affected the results.

Of course, "write to buffer" is just a store and an increment. Since the compiler should be able to guarantee that no buffer overflow occurs (the buffer is allocated for the worst case) and that the store has no side effects, the branch could be predicated by selecting either the new value or the old value and always storing. This would be a little extra work and might have store queue issues (if not all store queue entries can have the same address but different version numbers), so it might not be a safe optimization.
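A minimal sketch of that predication in C, next to the branchy form it replaces. The function names, the `keep` flag, and the `uint32_t` element type are assumptions for illustration and are not taken from the benchmark. It also assumes `buf[n]` is always a valid, initialized slot, which is the "allocated for worst case" condition above. Whether a compiler actually emits a conditional move for the select is not guaranteed.

```c
#include <stddef.h>
#include <stdint.h>

/* Branchy baseline: store and increment only when `keep` is nonzero. */
static size_t append_branchy(uint32_t *buf, size_t n, uint32_t value, int keep)
{
    if (keep) {
        buf[n] = value;
        n++;
    }
    return n;
}

/* Predicated form: always store, selecting the new or the old value. */
static size_t append_select(uint32_t *buf, size_t n, uint32_t value, int keep)
{
    uint32_t old = buf[n];          /* extra load: the "little extra work" */
    buf[n] = keep ? value : old;    /* unconditional store */
    return n + (size_t)(keep != 0); /* increment without a branch */
}
```

Both functions leave the buffer contents and the returned count identical. The second simply trades the branch for a load, a select, and a store on every iteration.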


Replies

ralferoo, today at 9:36 AM

I use a similar conditional write paradigm on the GPU, and it's usually easiest to do an unconditional write and update the address using a branchless conditional, assuming you are using a system with strict write ordering. Usually the unnecessary writes won't make it out of L1 cache.
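The unconditional-write pattern described above can be sketched as follows. This is plain single-threaded C rather than GPU code, and the threshold filter, the function name `compact_unconditional`, and the element type are assumptions made up for the example. The output buffer must hold `count` elements (the worst case), because every iteration stores to `out[n]`.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy the elements of `in` that exceed `threshold` into `out`.
 * Every element is stored; a rejected element is simply overwritten
 * by the next store because the write index does not advance. */
static size_t compact_unconditional(const uint32_t *in, size_t count,
                                    uint32_t *out, uint32_t threshold)
{
    size_t n = 0;
    for (size_t i = 0; i < count; i++) {
        out[n] = in[i];               /* unconditional write */
        n += (in[i] > threshold);     /* branchless address update */
    }
    return n;                         /* number of elements kept */
}
```

This relies on stores to the same address becoming visible in program order, which is the strict write ordering assumption in the comment. Slots at `out[n]` and beyond may hold leftover rejected values, so callers should only read the first `n` entries.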