It is no so much the parts of the code that run infrequently that contribute to poor performance, but the very tiny <1% of all code that does run frequently, and should be running completely in cache. So code size doesn't have an enormous impact on speed of execution.
The overhead of threading seems pretty obvious: call and return instructions are expensive compared to the cost of the one equivalent instruction that would have been executed in a compiled implementation. And placing arguments on a stack means that all operands have to to be read from and written to memory, incurring additional ferocious overhead, whereas a compiler would enregister values, particularly in performance-critical code. Not unreasonable to expect that Forth code is going to run at least an order of magnitude slower than compiled code.
Some of those could be partially fixed in hardware. Examples:
- returns can run in as good as zero time
- when calling a word, the CPU could prefetch the cache line containing the next word to be called, on the assumption that it would be called soon (an assumption that would be correct for many “almost leaf” calls)
- the top of the stack could be kept in register-speed memory.
For an example, see https://users.ece.cmu.edu/~koopman/stack_computers/sec4_4.ht...:
“The internal structure of the NC4016 is designed for single clock cycle instruction execution. All primitive operations except memory fetch, memory store, and long literal fetch execute in a single clock cycle. This requires many more on-chip interconnection paths than are present on the Canonical Stack Machine, but provides much better performance.
[…]
The NC4016 subroutine return bit allows combining a subroutine return with other instructions in a similar manner. This results in most subroutine exit instructions executing "for free" in combination with other instructions. An optimization that is performed by NC4016 compilers is tail-end recursion elimination. Tail-end recursion elimination involves replacing a subroutine call/subroutine exit instruction pair by an unconditional branch to the subroutine that would have been called.”
(¿Almost?) all modern hardware is designed for other language paradigms, though, so these won’t help with hardware you can buy.