logoalt Hacker News

Tuna-Fishtoday at 4:43 PM1 replyview on HN

> Things like 64 bit immediates are almost certainly a bad idea (as opposed to just having a load from memory)

Strongly disagree. Throughput is cheap, latency is expensive. Any time you can fit a constant in the instruction fetch stream is a win. This is especially true for jump targets, because getting them resolved faster both saves power and improves performance.

> Most 64 bit constants in use can be sign extended from much smaller values

You should obviously also have smaller load instructions.

> will necessarily bloat instruction cache, stall your instruction decoder (or limit parallelism)

No, just have more fetch throughput.

> and will only be 2 cycles faster than a L1 cache load

Only on tiny machines will L1 cache load be 2 cycles. On a reasonable high-end machine it will be 4-5 cycles, and more critically (because the latency would usually be masked well by OoO), the energy required to engage the load path is orders of magnitude more than just getting it from the fetch.

And that's when it's not a jump target, when it's a jump target suddenly loading it using a load instruction adds 12+ cycles of latency.

> TLDR is that RISC-V makes reasonable choices for a fairly "boring" ISA.

No. Not even talking about constants, RISC-V makes insane choices for essentially religious reasons. Can you explain to me why, exactly, would you ever make jal take a register operand, instead of using a fixed link register and putting the spare bits into the address immediate?


Replies

adgjlsfhk1today at 6:20 PM

> No, just have more fetch throughput.

Fetch throughput isn't unlimited. Modern x86 CPUs only have ~16-32B/cycle (from L2 once you're out of the uop cache). If you decode a single 10 byte instruction you're already using up a huge amount of the available decode bandwidth.

There absolutely are cases where a 64 bit load instruction would be an advantage, but ISA design is always a case of tradeoffs. Allowing 10 byte instructions has real cost in decode complexity, instruction bandwidth requirements, ensuring cacheline/pageline alignment etc. You have to weigh against that how frequent the instruction would be as well as what your alternative options are. Most imediates are small, and many others can be efficiently synthesized via 2 other instructions (e.g. shifts/xors/nots) and any synthesis that is 2 instructions or fewer will be cheaper than doing a load anyway. As a result you would end up massively complicating your architecture/decoders to benefit a fairly rare instruction which probably isn't worthwhile. It's notable that aarch64 makes the same tradeoff here and Apple's M series processors have an IPC advantage over the best x86.

> Can you explain to me why, exactly, would you ever make jal take a register operand, instead of using a fixed link register and putting the spare bits into the address immediate?

This mostly seems like a mistake to me. The rational probably is that you need the other instructions anyway (not all jumps are returns), so adding a jal that doesn't take a register would take a decent percentage of the opspace, but the extra 5 bits would be very nice.