
maartenh yesterday at 11:09 PM

How much VRAM would this require, if I would want to run this locally?

I bought a 12GB Nvidia card a year ago. In general I'm having a hard time finding the actual required hardware specs for any self-hosted AI model. Any tips/suggestions/recommended resources for that?


Replies

nsingh2 yesterday at 11:13 PM

One quick way to estimate a lower bound is to take the number of parameters and multiply it by the bits per parameter (then divide by eight to get bytes). So a model with 7 billion parameters running with float8 types, i.e. one byte per parameter, would need ~7 GB at a minimum just to load the weights. The attention mechanism requires more on top of that, and depends on the size of the context window.

You'll also need to load inputs (images in this case) onto the GPU memory, and that depends on the image resolution and batch size.
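As a back-of-the-envelope sketch of that calculation (illustrative Python; the 7B figure is just the example above, not an exact count for any particular checkpoint):

```python
# Rough lower bound: memory needed just to hold the weights, ignoring the
# KV cache, activations, and image inputs mentioned above.

def weights_lower_bound_gb(params: float, bytes_per_param: float) -> float:
    """GB required to load the weights alone."""
    return params * bytes_per_param / 1e9

# 7B parameters in float8 (1 byte each) -> ~7 GB before any context or images.
print(weights_lower_bound_gb(7e9, 1.0))   # ~7.0
# The same model in bf16 (2 bytes each) -> ~14 GB.
print(weights_lower_bound_gb(7e9, 2.0))   # ~14.0
```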

rahimnathwani today at 3:33 PM

The model is 17GB, so you'd need 24GB VRAM:

https://huggingface.co/microsoft/Fara-7B/tree/main

If you want to find models which fit on your GPU, the easiest way is probably to browse ollama.com/library

For a general purpose model, try this one, which should fit on your card:

https://ollama.com/library/gemma3:12b

If that doesn't work, the 4b version will definitely work.
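If you'd rather script it than use the CLI, a minimal sketch with the ollama Python client (pip install ollama) might look like this; it assumes the ollama server is already running locally:

```python
# Sketch using the ollama Python client; assumes the library's pull() and
# chat() helpers and a running local ollama server.
import ollama

# Download a model that should comfortably fit a 12GB card.
ollama.pull("gemma3:12b")

# Ask it something to confirm it actually runs on your hardware.
response = ollama.chat(
    model="gemma3:12b",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response["message"]["content"])
```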

selcuka yesterday at 11:54 PM

I use LM Studio for running models locally (macOS), and it tries to estimate whether a model will fit in my GPU memory (which is the same as main memory on Macs).

The Q4_K_S quantized version of Microsoft Fara 7B is a 5.8GB download. I'm pretty sure it would work on a 12GB Nvidia card. Even the Q8 one (9.5GB) could work.
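As a rough fits-or-not check along those lines (the 2 GB of headroom for KV cache and runtime overhead is a guess, not a measured number):

```python
# Does a downloaded (quantized) model plausibly fit in VRAM?
# File size plus some headroom for KV cache and runtime overhead
# should stay under the card's memory.

def fits_in_vram(model_file_gb: float, vram_gb: float, headroom_gb: float = 2.0) -> bool:
    return model_file_gb + headroom_gb <= vram_gb

print(fits_in_vram(5.8, 12.0))  # Q4_K_S download -> True, comfortable
print(fits_in_vram(9.5, 12.0))  # Q8 download     -> True, but tight
```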

daemonologist yesterday at 11:55 PM

12GB will be sufficient to run a quantized version, provided you're not running anything else memory-hungry on the GPU.

You're not finding hardware specs because there are a lot of variables at play - the degree to which the weights are quantized, how much space you want to set aside for the KV cache, extra memory needed for multimodal features, etc.

My rule of thumb is 1 byte per parameter to be comfortable (running a quantization with somewhere between 4.5 and 6 bits per parameter and leaving some room for the cache and extras), so 7 GB for 7 billion parameters. If you need a really large context you'll need more; if you want to push it you can get away with a little less.
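A sketch of that rule of thumb plus an explicit KV-cache term; the architecture numbers (layers, KV heads, head dim) are assumed values loosely modeled on a typical 7B transformer, not taken from Fara-7B's actual config:

```python
# Rule of thumb: ~1 byte per parameter for the weights, plus a KV cache
# that grows linearly with context length.

def weights_gb(params: float, bits_per_weight: float = 8.0) -> float:
    return params * bits_per_weight / 8 / 1e9

def kv_cache_gb(ctx_len: int, n_layers: int = 28, n_kv_heads: int = 4,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    # 2x for keys and values, per layer, per KV head, per position.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

params = 7e9
print(weights_gb(params))                        # ~7 GB at ~8 bits/weight
print(weights_gb(params) + kv_cache_gb(8192))    # + ~0.5 GB of KV cache at 8k context
print(weights_gb(params) + kv_cache_gb(131072))  # at 128k context the cache alone is ~7.5 GB
```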

jillesvangurp today at 9:43 AM

It's a good reason to use Macs, since they have unified RAM. I have a 48GB MacBook Pro, which is plenty of memory to run these models, and the M4 Max should be plenty fast. You kind of want enough RAM that there's plenty left to run your normal stuff after the model has loaded.

I wish I had more time to play with this stuff. It's so hard to keep up with all this.

baq today at 6:24 AM

If you have enough combined (system + GPU) RAM it'll work even if the model doesn't fit into VRAM, just slower. A 7B model like this one might actually be fast enough.