Hacker News

Nano-vLLM: How a vLLM-style inference engine works

185 points | by yz-yu | today at 12:52 PM | 24 comments

Comments

jbarrow | today at 2:18 PM

The whole thing feels AI-written, generated from the codebase.*

*this is incorrect per the author’s response, my apologies.

For instance, it goes into (nano)vLLM internals and doesn’t mention PagedAttention once (one of the core ideas that vLLM is based on)[1].

Also mentions that Part 2 will cover dense vs. MoEs, which is weird because nanovllm hardcodes a dense Qwen3 into the source.

Here are better (imo) explainers about how vLLM works:

- https://hamzaelshafie.bearblog.dev/paged-attention-from-firs...

- https://www.aleksagordic.com/blog/vllm

- https://huggingface.co/blog/continuous_batching

Aleksa’s blog is a bit in the weeds for my taste, but it’s really worth working through.

A lot of the magic of vLLM happens in the PagedAttention kernels, which are really succinctly implemented in nanovllm (rough sketch of the idea below). And the codebase is great and readable by itself!

1. https://arxiv.org/abs/2309.06180
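
For anyone who hasn’t read the paper: the core trick is storing the KV cache in fixed-size physical blocks and indirecting every token lookup through a per-sequence block table, exactly like virtual memory pages. A minimal sketch of that idea in plain PyTorch (illustrative only; the names here are mine, and the real thing is a fused CUDA kernel, not a Python loop):

    # Sketch of the block-table idea behind PagedAttention, not
    # nanovllm's actual kernel; all names are illustrative.
    import torch

    BLOCK_SIZE = 16   # tokens per KV block
    NUM_BLOCKS = 64   # physical blocks in the shared cache pool
    HEAD_DIM = 8

    # Physical KV cache: a pool of fixed-size blocks, shared across sequences.
    k_cache = torch.zeros(NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM)
    v_cache = torch.zeros(NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM)

    def gather_kv(block_table: list[int], seq_len: int):
        """Map a sequence's logical token positions to physical cache slots."""
        ks, vs = [], []
        for pos in range(seq_len):
            block = block_table[pos // BLOCK_SIZE]   # which physical block
            offset = pos % BLOCK_SIZE                # slot within that block
            ks.append(k_cache[block, offset])
            vs.append(v_cache[block, offset])
        return torch.stack(ks), torch.stack(vs)

    def paged_attention(q: torch.Tensor, block_table: list[int], seq_len: int):
        """Single-query attention over a paged KV cache (one head, no batching)."""
        k, v = gather_kv(block_table, seq_len)       # each (seq_len, HEAD_DIM)
        scores = (q @ k.T) / HEAD_DIM ** 0.5         # (seq_len,)
        return torch.softmax(scores, dim=-1) @ v     # (HEAD_DIM,)

    # A 40-token sequence needs ceil(40/16) = 3 blocks, allocated anywhere
    # in the pool -- no contiguous reservation needed.
    out = paged_attention(torch.randn(HEAD_DIM), block_table=[5, 12, 3], seq_len=40)

Because blocks are fixed-size and allocated from a shared pool, sequences can grow without reserving contiguous memory up front, which is where the throughput win comes from.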

yz-yu | today at 3:43 PM

Since HN only allows one link per submission, dropping Part 2 here.

https://www.neutree.ai/blog/nano-vllm-part-2

vitaelabitur | today at 7:36 PM

Shameless plug for my structured LLM outputs handbook, which is written in a similar spirit: https://nanonets.com/cookbooks/structured-llm-outputs/

OsamaJaber | today at 6:12 PM

Great job! This is the kind of project that should exist for every complex system. Codebases like vLLM's are massive and hard to follow. Would love to see the same approach for other infra (a nano-Kubernetes, a nano-Postgres...).