I'm using an M2 64GB MacBook Pro. For the Llama 8B one I would expect 16GB to be enough.
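Back-of-envelope math behind that estimate (the quantization levels and bytes-per-weight figures here are rough assumptions on my part, not exact numbers):

```python
# Rough weight-memory estimate for an 8B-parameter model:
# weights = parameter count * bytes per parameter.
PARAMS = 8e9

# Approximate bytes per weight at common precision/quantization levels.
bytes_per_param = {
    "fp16": 2.0,   # unquantized half precision
    "q8_0": 1.0,   # ~8-bit quantization
    "q4_0": 0.5,   # ~4-bit quantization
}

for name, bpp in bytes_per_param.items():
    gb = PARAMS * bpp / 1024**3
    print(f"{name}: ~{gb:.1f} GB for weights alone")

# fp16: ~14.9 GB, q8_0: ~7.5 GB, q4_0: ~3.7 GB.
# So a 4-bit 8B model fits comfortably in 16 GB, with room
# left over for the KV cache and the rest of the system.
```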
I don't have any experience running models on Windows or Linux, where your GPU VRAM becomes the most important factor.
Why isn't GPU VRAM a factor on an Apple Silicon Mac?
Because Apple Silicon uses unified memory: the CPU and GPU share one pool of system RAM, so there is no separate VRAM ceiling to run into. And the same point applies more broadly: on Windows or Linux you can run from RAM or split layers between RAM and VRAM; running fully on GPU is faster than either of those, but even there the limit on what you can run at all isn't VRAM.
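As a concrete sketch of that split, here's how it looks with llama-cpp-python; the model path and layer count below are placeholders I picked, not anything from this thread:

```python
from llama_cpp import Llama

llm = Llama(
    # Hypothetical path to a 4-bit quantized GGUF model file.
    model_path="./models/llama-3-8b-instruct.Q4_0.gguf",
    # Offload 20 transformer layers to the GPU's VRAM; the rest
    # run from system RAM. n_gpu_layers=-1 offloads every layer
    # (fastest, needs the most VRAM); n_gpu_layers=0 runs entirely
    # on the CPU from RAM (slowest, but works without a GPU).
    n_gpu_layers=20,
)

out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
print(out["choices"][0]["text"])
```

The trade-off is purely speed: if a model doesn't fit in VRAM you lower `n_gpu_layers` and it still runs, just slower.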