8GB is not enough for complex reasoning, but it can handle small, simple tasks. Models like Whisper, SmolVLM, Qwen2.5-0.5B, Phi-3-mini, Granite-4.0-micro, Mistral-7B, Gemma 3, and Llama-3.2 all run in very little memory. Tiny models can do a lot if you tune/train them, but they also need to be used differently: a system prompt preloaded with information, few-shot examples, reasoning guidance, a single-task purpose, and strict output guidelines. See https://github.com/acon96/home-llm for an example. For each small model, check whether Unsloth has a tuned version of it; those cut the memory footprint and speed up inference.
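A rough sketch of what "used differently" looks like in practice, assuming Ollama is running locally on its default port with a small Qwen pulled (the model name, device names, and JSON schema here are just illustrative):

    # Tiny-model recipe: preloaded system prompt, few-shot examples,
    # one narrow task, strict output format. Assumes a local Ollama
    # server and that qwen2.5:0.5b has been pulled.
    import json
    import requests

    SYSTEM = (
        "You control smart-home lights. Reply ONLY with JSON: "
        '{"device": <name>, "action": "on"|"off"}. No prose.'
    )

    FEW_SHOT = [
        {"role": "user", "content": "kill the kitchen lights"},
        {"role": "assistant", "content": '{"device": "kitchen", "action": "off"}'},
        {"role": "user", "content": "it's dark in the office"},
        {"role": "assistant", "content": '{"device": "office", "action": "on"}'},
    ]

    def route_command(text: str) -> dict:
        resp = requests.post(
            "http://localhost:11434/api/chat",
            json={
                "model": "qwen2.5:0.5b",
                "messages": [{"role": "system", "content": SYSTEM},
                             *FEW_SHOT,
                             {"role": "user", "content": text}],
                "format": "json",               # constrain output to valid JSON
                "options": {"temperature": 0},  # deterministic, single-task behavior
                "stream": False,
            },
            timeout=60,
        )
        return json.loads(resp.json()["message"]["content"])

    print(route_command("turn off the bedroom lamp"))

The point is that the model never has to reason from scratch: the preloaded prompt, the examples, and the JSON constraint do most of the work.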
For your Mac, you can use Ollama, or MLX (Apple Silicon specific; it needs a different engine and a different on-disk model format, but it's faster). Ramalama may help work around bugs or ease the MLX setup. Use either Docker Desktop or Colima for the VM + Docker.
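If you go the MLX route, the flow is roughly the sketch below, assuming the mlx-lm package (pip install mlx-lm) and a model already converted to the MLX on-disk format; the mlx-community repo name is illustrative and the exact generate() options vary a bit between mlx-lm versions:

    # Minimal MLX path on Apple Silicon: load an MLX-converted model
    # from Hugging Face and generate a short completion.
    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/Qwen2.5-0.5B-Instruct-4bit")

    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": "Summarize: the garage door is open."}],
        add_generation_prompt=True,
        tokenize=False,
    )
    print(generate(model, tokenizer, prompt=prompt, max_tokens=64))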
For today's coding & reasoning models, you need a minimum of 32GB of memory combined (GPU + system), and the more of it on the GPU side the better. Copying weights between CPU and GPU memory is too slow, so the model needs to "live" in GPU-addressable memory. If it can't all fit there, your CPU has to pick up the slack and you get a space heater. That Mac M1 will do 5-10 tokens/s with 8GB (CPU on full blast), or ~50 tokens/s with 32GB RAM (CPU idling). And now you know why there's a RAM shortage.
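Back-of-envelope, if you want to sanity-check those numbers: decode is memory-bandwidth bound, since roughly all the weights get streamed once per token, so tokens/s tops out near bandwidth divided by weight size. The bandwidth figures below are approximate published specs, not measurements:

    # Rough upper bound on decode speed: tokens/s ~ bandwidth / weight bytes.
    # Model size and bandwidths are approximate, illustrative numbers.
    def rough_tokens_per_sec(model_gb: float, bandwidth_gbps: float) -> float:
        return bandwidth_gbps / model_gb

    model_gb = 4.7  # e.g. a 7-8B model at 4-bit quantization
    for name, bw in [("M1 (~68 GB/s)", 68),
                     ("M1 Pro (~200 GB/s)", 200),
                     ("M1 Max (~400 GB/s)", 400),
                     ("PCIe 4.0 x16 fallback (~32 GB/s)", 32)]:
        print(f"{name}: ~{rough_tokens_per_sec(model_gb, bw):.0f} tok/s upper bound")

Which is roughly why the 8GB M1 lands in the single digits once part of the model spills out of fast unified memory and has to take a slower path.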