Hacker News

abhikul0 · yesterday at 3:09 PM

I'll try that, but llama-server has mmap on by default and the model still takes up its full size in RAM, so I'm not sure what's going on.


Replies

zozbot234 · yesterday at 3:14 PM

Try running CPU-only inference to troubleshoot that. GPU layers will likely just ignore mmap.
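A minimal sketch of that troubleshooting step, assuming a llama.cpp build of llama-server and a placeholder model path (`model.gguf` is hypothetical):

```shell
# Keep all layers on the CPU (-ngl 0 = --n-gpu-layers 0), so offloaded
# GPU layers can't bypass mmap; mmap itself stays on by default.
llama-server -m model.gguf -ngl 0

# For comparison, disable mmap explicitly; resident memory behavior
# should differ noticeably between the two runs.
llama-server -m model.gguf -ngl 0 --no-mmap
```

Note that with mmap, file-backed model pages still count toward resident memory once they are touched, but the kernel can evict them under memory pressure, so the process "using" the model's size in RAM is not necessarily a problem.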