I have my own agent harness, and the inference backend is vLLM.
Curious how you handle sharding and KV cache pressure for a 120B model. Are you doing tensor parallelism across consumer cards, or is it a unified memory setup?
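For reference, here's roughly the kind of setup I'm imagining with vLLM's Python API; the model name, GPU count, context length, and KV cache dtype below are just placeholders for illustration, not a guess at your actual config:

```python
from vllm import LLM, SamplingParams

# Hypothetical sketch: sharding a ~120B model across 4 GPUs with tensor parallelism.
# All values here are assumptions, not the poster's real setup.
llm = LLM(
    model="some-120b-model",       # placeholder model name
    tensor_parallel_size=4,        # split weight shards across 4 cards
    gpu_memory_utilization=0.90,   # fraction of VRAM vLLM may claim (weights + KV cache)
    max_model_len=8192,            # cap context length to bound KV cache growth
    kv_cache_dtype="fp8",          # optionally quantize the KV cache to ease memory pressure
)

params = SamplingParams(temperature=0.7, max_tokens=256)
print(llm.generate(["Hello"], params)[0].outputs[0].text)
```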