Hacker News

amelius 10/11/2024

In the 90s we had segmented memory programming with near and far pointers, and you had to be very careful about when you used what type of pointer and how you'd organize your memory accesses. Then we got processors like the 286 that finally relieved us from this constrained way of programming.

I can't help but feel that with CUDA we're having new constraints (32 threads in a warp, what?), which are begging to be unified at some point.


Replies

dahart 10/11/2024

While reading I thought you were going to suggest unified memory between RAM and VRAM, since that’s somewhat analogous, though that does exist with various caveats depending on how it’s set up and used.

SIMD/SIMT probably isn’t ever going away, and vector computers have been around since before segmented memory. The 32-thread CUDA warp is the source of its performance superpower, and the reason we can even fit the transistors for ~20k simultaneous adds and multiplies, among other things, on the die. It’s also conceptually different from your analogy: segmented memory was a constraint designed to get around pointer size limits, but 32 threads/warp isn’t getting us around any limit; it’s a design that provides high performance if you can organize your threads to all do the same thing at the same time.

talldayo 10/11/2024

You can blame ARM for the popularity of CUDA. At least x86 has a few passable vector ISA extensions like SSE and AVX; the ARM spec only supports the piss-slow NEON in its stead. Since you're not going to unify vectors and mobile hardware anytime soon, most people are happy to pay for CUDA hardware where GPGPU compute is taken seriously.

There were also attempts like OpenCL, which the industry rejected early on because they thought they'd never need a CUDA alternative. Nvidia's success is mostly built on the complacency of its competition: if Nvidia had been allowed to buy ARM, it could have guaranteed the two specs never overlapped.

krapht 10/11/2024

I'll believe it when autovectorization is actually useful in day-to-day high-performance coding work.

It's just a hard problem. You can code naively with high-level libraries, but you're leaving 2x to 10x of performance on the table.

dboreham 10/11/2024

386?