tommy_axle • today at 6:36 PM • 0 replies • view on HN

I'm guessing this is also calculating based on the full context size the model supports, which can be misleading depending on your use case. Even on a small consumer card with Qwen 3 30B-A3B, you probably don't need the full 128K context, so a smaller context plus some tensor overrides will help. llama.cpp's llama-fit-params is helpful in those cases.
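
To put rough numbers on the context-size point, here is a back-of-the-envelope sketch of how the KV cache grows with context length. The function name `kv_cache_bytes` is made up for illustration, and the defaults (48 layers, 4 KV heads, head dimension 128, fp16 cache) are my assumptions about Qwen 3 30B-A3B, so check them against the model's config before relying on the output. It also ignores weights, compute buffers, and any KV cache quantization.

```python
def kv_cache_bytes(n_ctx, n_layers=48, n_kv_heads=4, head_dim=128, bytes_per_elem=2):
    """Approximate KV cache size in bytes for a GQA transformer.

    Defaults are ASSUMED values for Qwen 3 30B-A3B with an fp16 cache;
    verify them against the model's config.json.
    """
    # Factor of 2: one K tensor and one V tensor per layer per token.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


if __name__ == "__main__":
    for n_ctx in (8192, 32768, 131072):
        gib = kv_cache_bytes(n_ctx) / 2**30
        print(f"{n_ctx:>7} tokens: {gib:.2f} GiB")
```

Under those assumptions the cache is about 0.75 GiB at 8K context versus about 12 GiB at 128K, which is why an estimate based on the full context can look much worse than what you'd actually run.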