Hacker News

ack_complete 01/18/2025

CPUs are surprisingly good at dealing with this in their store queues. I see this write-all-and-increment-some technique used a lot in optimized code, like branchless left-pack routines or overcopying in the copy handler of an LZ/Deflate decompressor.
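For readers who have not seen the pattern, here is a minimal sketch of a scalar left-pack in C. The function name `left_pack_ge` and the threshold predicate are made up for illustration; they do not come from any particular library. Every element is stored unconditionally, and only the output index depends on the predicate. Whether the compiler really emits branch-free code depends on the compiler and flags.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: keep the elements of src that are >= threshold,
   in their original order. dst must have room for n elements, because
   rejected elements are written too and then overwritten by the next
   store. Returns the number of elements kept. */
size_t left_pack_ge(const int32_t *src, size_t n, int32_t threshold,
                    int32_t *dst)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        dst[out] = src[i];                    /* write all: unconditional store */
        out += (size_t)(src[i] >= threshold); /* increment some: 0 or 1 */
    }
    return out;
}
```

Only the first `out` entries of `dst` are meaningful afterwards; the slot at index `out` may hold a rejected value left over from the last store.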


Replies

atq2119 01/18/2025

Yep, same with overlapping unaligned loads. It's just fairly cheap to make that stuff pipelined and run fast. It's only when you mix loads and stores in the same memory region that there are conflicts that can slow you down (and then quite horribly actually, depending on the exact processor).
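One common place overlapping unaligned loads show up is small fixed-range copies. The sketch below is illustrative only (the name `copy_8_to_16` is an assumption, not an existing API): it copies 8 to 16 bytes with two 8-byte moves whose ranges overlap whenever the length is under 16, instead of branching on the exact length. The fixed-size `memcpy` calls are the portable way to express unaligned access in C; compilers typically lower them to single load and store instructions, though that is not guaranteed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical example: copy n bytes, where the caller guarantees
   8 <= n <= 16. The head covers bytes [0, 8) and the tail covers
   [n - 8, n); the two ranges overlap when n < 16. */
void copy_8_to_16(void *dst, const void *src, size_t n)
{
    uint64_t head, tail;

    /* Both loads are issued before either store. */
    memcpy(&head, src, 8);
    memcpy(&tail, (const unsigned char *)src + n - 8, 8);

    memcpy(dst, &head, 8);
    memcpy((unsigned char *)dst + n - 8, &tail, 8);
}
```

For n = 13, for instance, bytes 5 through 7 are read twice and written twice, both times with the same values, so the result is still correct.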
