How is the time sharing handled? I assume if I submit a unit of work it will load to VRAM and then run (sharing time? how many work units can run in parallel?)
How large is a full context window in MiB and how long does it take to load the buffer? I.e. how many seconds should I expect my worst case wait time to take until I get my first token?
> how many work units can run in parallel
not original author but batching is one very important trick to make inference efficient, you can reasonably do tens to low hundreds in parallel (depending on model size and gpu size) with very little performance overhead
vLLM handles GPU scheduling, not sllm. The model weights stay resident in VRAM permanently so there's no loading/unloading per request. vLLM uses continuous batching, so incoming requests are dynamically added to the running batch every decode step and the GPU is always working on multiple requests simultaneously. There is no "load to VRAM and run" per request; it's more like joining an already-running batch.
TTFT is under 2 seconds average. Worst case is 10-30s.