yu3zhou4 • yesterday at 5:35 AM

I’m recreating a tiny version of vLLM (a high-throughput LLM inference server) from scratch in C++ and CUDA.