logoalt Hacker News

jdw64today at 2:40 PM2 repliesview on HN

I understood up to the comparison operations in this article, but I'm having trouble understanding branch prediction even after reading it. It seems like they used branch prediction optimization... Sometimes when I read articles like this, I start to question whether I can really call myself a programmer


Replies

MobiusHorizonstoday at 4:38 PM

In short modern CPUs can’t actually execute control flow very fast, so they usually cheat and memorize where the program is likely to go next. This works great with things like the loop branch, since most loops are pretty long the cpu can start on the next iteration of the loop while it waits for the condition to finish being calculated. You incur one branch misprediction at the end, but the rest of the loop is fast. But with a bytecode vm, the action of the code is entirely dependent on the bytecode instructions, so it’s not predictable unless those instructions themselves are predictable. That’s why the switch statement is slow. Explaining the optimization from computed gotos is more complicated.

As I understand it, the branch predictor stores its predictions using the address of the branch instruction as a key. In the switch implementation the unpredictable branch is the one that jumps to the specific case. The compiler emits an indirect branch for this just like the computed goto instruction. But it only emits one indirect branch instruction, so that branch is always unpredictable. In the computed goto version, there is no loop, each case statement ends in its own indirect branch, so each bytecode implementation is followed by a branch with a unique address. In practice this makes each of them slightly more predictable, because you only have to predict what comes after an inc instruction instead of what comes next after any instruction. Basically there are more branches to hang predictions on.

adrian_btoday at 3:24 PM

In a CPU, there are 2 kinds of branch predictors, predictors for conditional branches, which must predict only whether the branch is taken or not taken, and predictors for indirect branch instructions, which must predict the address of the next instruction that will be executed after the indirect branch instruction.

The indirect branch predictor stores information about indirect branches in a structure that is similar to a cache memory.

When the CPU loads some data from the main memory, the loaded data together with the address from where it was loaded is stored in the cache memory.

When another load instruction is executed later, the load address is used to query the cache memory and if the query is successful the data value is returned by the query and it is used instead of accessing again the main memory.

Similarly, (simplifying a lot) when an indirect branch instruction is executed the address where the execution continues after the indirect branch is stored together with the address of the branch instruction in an associative memory similar to a cache memory.

When an indirect branch instruction is executed later, the address of the branch instruction is used to query the associative memory and if the query is successful it will return the address of the next instruction to be executed and execution will continue there speculatively, until it is confirmed by reading from the main memory that this is the correct jump address. If the jump address is wrong then all the work done is discarded and execution continues at the right address.

This description is simplified, because modern branch descriptors may store for a given branch instruction address not only the last jump address, but a sequence of past jump addresses, together with a pattern about how they have been used (e.g. alternating between jumping twice to the first address and jumping once to the second address). Thus successive queries for the branch instruction address will retrieve different jump addresses, based on their activation pattern.

The point of TFA is that if the dispatch loop contains a single indexed branch instruction, then the associative memory of the indirect branch predictor contains a single record, keyed by the address of that instruction. Depending on the CPU, the record will contain one or more past addresses, but it can predict correctly only a short periodic pattern of jump addresses, i.e. of opcodes used in dispatching.

This could work to predict correctly a short loop in the interpreted program, but typically most of the predictions will be wrong and each branch misprediction takes much more time than the execution of any instruction. In smaller and cheaper CPUs, which might store a single jump address per record, the correct predictions would be extremely rare, as they would happen only if the interpreted program contains repeated instructions.

In the optimized program, the indexed branch instruction is replicated into each "case" so now the associative memory will store separate records, one for each "case", containing the address or addresses of the next cases where execution has continued in the past.

Because after each interpreted instruction the probabilities of the next instructions are not equal, but some instructions are more likely to follow, that greatly increases the chances that the indirect branch predictor will make correct predictions from time to time.

show 3 replies