> The operation GP is thinking of is a full scan, and that will always take n(n^(1/3)) lower...

Dylan16807 • yesterday at 10:15 PM • 1 reply • view on HN

> The operation GP is thinking of is a full scan, and that will always take n(n^(1/3)) lower bound time.

It doesn't. Full scans are faster than accessing each memory address in an unordered way.

Let's look at a Ryzen 2600X. You can sustain 32 bytes per second from L1, 32 bytes per second from L2, and 20 bytes per cycle from L3. That's 64KB, 512KB, and 16MB caches all having almost the same bandwidth despite very different latencies.

You can also imagine an infiniband network that fills 2 racks, and another one that fills 50000 racks. The bandwidth of a single node is the same in both situations, so even though latency gets worse as you add more nodes and hops, it's going to take exactly O(n) time for a single thread to scan the entire memory.

You can find correlations between memory size and bandwidth, but they're significantly weaker and less consistent than the correlations between memory size and latency.

Replies

hinkley • today at 12:29 AM

I don’t think I can agree to ignore the latency problem. Intermachine Latency is the source of a cube root of N term.

➕ show 1 reply

alt Hacker News

Replies