Hacker News

embedding-shape · today at 8:36 AM

I have my own agent harness, and the inference backend is vLLM.


Replies

storystarling · today at 10:32 AM

Curious how you handle sharding and KV cache pressure for a 120b model. I guess you are doing tensor parallelism across consumer cards, or is it a unified memory setup?
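For rough intuition on the KV cache pressure in question, here is a back-of-envelope sizing sketch. All of the layer/head/dimension numbers below are hypothetical placeholders for illustration, not the actual 120b model's config:

```python
# Back-of-envelope KV cache sizing for one sequence.
# Formula: 2 (K and V) x layers x kv_heads x head_dim x seq_len x bytes/elem.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

# Hypothetical example: 36 layers, 8 KV heads (GQA), head_dim 64,
# a 131072-token context, fp16 (2 bytes per element).
per_seq = kv_cache_bytes(36, 8, 64, 131072, 2)
print(f"{per_seq / 2**30:.1f} GiB per sequence")  # → 9.0 GiB per sequence
```

Numbers like these are why long contexts on a large model can exhaust VRAM even when the weights themselves fit; grouped-query attention (fewer KV heads) and KV cache quantization both shrink this figure proportionally.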
