LDIR sounds great on paper but is implemented terribly making it slower than manual unrolled loop
https://retrocomputing.stackexchange.com/questions/4744/how-...
Repeat is done by decrementing PC by 2 and re-loading whole instruction in a loop. 21 cycles per byte copied :o
To be fair Intel did same fail implementation of REP MOVSB/MOVSW in 8088/8086 reloading whole instruction per iteration, REP MOVSW is ~14 cycles/byte 8088 (9+27/rep) and ~9 cycles/byte 8086 (9+17/rep), ~same cost as non REP versions (28 and 18). NEC V20/V30 improved by almost 2x to 8 cycles/byte V20 or unaligned V30 (11+16/rep) and 4 cycles/byte on fully aligned access V30 (11+8/rep) with non REP cost being 19 and 11 respectively. V30 pretty much matched Intel 80186 4 cycles/byte (8+8/rep, 9 non rep). 286 was another jump to 2 cycles/byte (5+4/rep). 386 same speed, 486 much slower for small rep counts, under a cycle for big rep movsd. Pentium up to 0.31 cycles per byte, MMX 0.27 cycle/byte (http://www.pennelynn.com/Documents/CUJ/HTML/14.12/DURHAM1/DU...), then 2009 AVX doing block moves at full L2 cache speed and so on.
In 6502 corner there was nothing until 1986 WDC W65C816 Move Memory Negative (MVN), Move Memory Positive (MVP) 7 cycles/byte. Slower than unrolled code, 2x slower than unrolled code using 0 page. Similar bad implementation (no loop buffer) re-fetching whole instruction every iteration.
1987 NEC TurboGrafx-16/PC Engine 6502 clone by HudsonSoft HuC6280 Transfer Alternate Increment (TAI), Transfer Increment Alternate (TIA), Transfer Decrement Decrement (TDD), Transfer Increment Increment (TII) theoretical 6 cycles/byte (17+6rep). I saw one post long time ago claiming block transfer throughput of ~160KB/s on a 7.16 MHz NEC manufactured TurboGrafx-16 (hilarious 43 cycles/byte) so dont know what to think of it considering NEC V20 inside OG 4.77MHz IBM XT does >300KB/s.
CPU / Instruction Cycles per Byte
Z80 LDIR 8-bit 21
8088 MOVSW 8bit ~14
6502 LDA/STA 8bit ~14
8086 MOVSW ~9
NEC V20 MOVBKW 8bit ~8
W65C816 MVN/MVP 8bit ~7 block move
HuC6280 T[DIAX]/TIN 8bit ~6 block transfer instructions
80186 MOVSW 16bit ~4
NEC V30 MOVSW ~4
80286 MOVSW ~2
486 MOVSD <1
Pentium MOVSD ~0.31
Pentium MMX MOVSD ~0.27 http://www.pennelynn.com/Documents/CUJ/HTML/14.12/DURHAM1/DURT1.HTM