Could you elaborate on what you did to get it working? I built it from source, but couldn't get it (the 4B model) to produce coherent English.
Sample output below (the model's response to "hi" in the forked llama-cli):
X ( Altern as the from (.. Each. ( the or,./, and, can the Altern for few the as ( (. . ( the You theb,’s, Switch, You entire as other, You can the similar is the, can the You other on, and. Altern. . That, on, and similar, and, similar,, and, or in
I did this: https://image.non.io/2093de83-97f6-43e1-a95e-3667b6d89b3f.we...
Literally just downloaded the model into a folder, opened Cursor in that folder, and told it to get it running.
Prompt: The gguf for bonsai 8b are in this local project. Get it up and running so I can chat with it. I don't care through what interface. Just get things going quickly. Run it locally - I have plenty of vram. https://huggingface.co/prism-ml/Bonsai-8B-gguf/tree/main
I had to ask it to increase the context window size to 64k, but other than that it got everything running just fine. After that I just pointed ngrok at the port I was serving on, and voilà.
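For reference, exposing a local llama.cpp server through ngrok looks roughly like this. This is a sketch, not the exact commands from above: the model filename and port are placeholders.

```shell
# Serve the model with llama.cpp's built-in OpenAI-compatible server,
# raising the context window to 64k tokens (-c 65536).
# The gguf filename here is a placeholder - use whichever quant you downloaded.
./build/bin/llama-server -m ./Bonsai-8B-Q4_K_M.gguf -c 65536 --port 8080

# In a second terminal, tunnel that local port to a public URL.
ngrok http 8080
```

ngrok then prints a public forwarding URL that proxies to the local server.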
I have an older M1 Air with 8GB, but I'm still getting over 23 t/s on the 4B model, and the quality of the outputs is on par with top models of similar size.
1. Clone their forked repo: `git clone https://github.com/PrismML-Eng/llama.cpp.git`
2. Build it (assuming you already have the Xcode build tools installed).
3. Finally, run it (you can adjust the arguments). The model itself was downloaded from: https://huggingface.co/prism-ml/Bonsai-8B-gguf/tree/main
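The build and run commands for steps 2 and 3 would typically look like the following on macOS. This is a hedged sketch: the fork may have its own build notes, and the quant filename is a guess — substitute whichever .gguf you downloaded.

```shell
# Step 2: build with CMake (Metal acceleration is enabled by default
# on Apple Silicon in upstream llama.cpp).
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

# Step 3: chat with the model. -ngl 99 offloads all layers to the GPU,
# -c sets the context window size. Filename is a placeholder.
./build/bin/llama-cli -m ./Bonsai-8B-Q4_K_M.gguf -ngl 99 -c 8192
```

If the fork diverges from upstream's build system, its own README takes precedence over these commands.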