Hacker News

jmward01 today at 1:32 AM

The point is not how fast it is now. The point is that this opens new possibilities that can be built on. Potentially models get trained with slightly different architectures to optimize for this use case. Possibly others come along to improve this path. Possibly HW manufacturers make a few small adjustments that remove bottlenecks. Who knows, the next person may combine CPU compute with this memory sharing to get another token a second. Then the next person does predictive loading into memory to keep that bandwidth 100% maxed and usable. Then the next person adds something, and the next. Before you know it there is a real thing there that never existed.

This is a great project. I love the possibilities it hints at. Thanks for building it!


Replies

smallnamespace today at 2:43 AM

It’s architecturally not a good approach. System RAM is much slower than VRAM, so it should only hold data that doesn’t need to be used often. Knowledge of which data that is lives at the application layer. Adding a CUDA shim makes system RAM appear like VRAM, which gets things to run, but it will never run very well.

The benchmarks at the bottom mention memory tiering and manually controlling where things go, but if your application already does that, then you probably don’t also need a CUDA shim. The application should control the VRAM to system memory transfers with boring normal code.
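
To make "boring normal code" concrete, here is a minimal sketch of application-level placement, assuming the application can estimate how often each tensor is read. The `Tensor` type, the `reads_per_token` estimate, and `place_tensors` are hypothetical names made up for illustration, not part of the project or of any CUDA API. The sketch only decides where each tensor should live; the actual copies would be ordinary host-to-device transfers issued by the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tensor:
    name: str
    size_bytes: int
    reads_per_token: float  # hypothetical estimate of how often this tensor is read


def place_tensors(tensors, vram_budget_bytes):
    """Greedy placement: the most frequently read tensors go to VRAM
    until the budget is used up; everything else stays in system RAM."""
    placement = {}
    free = vram_budget_bytes
    # Visit the hottest tensors first so they get first claim on VRAM.
    for t in sorted(tensors, key=lambda t: t.reads_per_token, reverse=True):
        if t.size_bytes <= free:
            placement[t.name] = "vram"
            free -= t.size_bytes
        else:
            placement[t.name] = "system_ram"
    return placement


GIB = 1024 ** 3
example = [
    Tensor("attention", 4 * GIB, 1.0),      # read on every token
    Tensor("cold_experts", 8 * GIB, 0.1),   # rarely read
    Tensor("embeddings", 1 * GIB, 1.0),     # read on every token
]
print(place_tensors(example, vram_budget_bytes=6 * GIB))
```

With a 6 GiB budget, the two hot tensors fit in VRAM and the rarely read one stays in system RAM. A greedy pass like this is not optimal packing, but it shows the kind of decision a transparent shim cannot make because it does not know the access pattern.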
