Correct me if I'm wrong, but these implementations are all CPU-bound? I.e., if I have a good GPU, I should look for alternatives.
It's all implemented on the CPU, yes; there's no GPU acceleration whatsoever (at the moment, at least).
> if I have a good GPU, I should look for alternatives.
If you actually want to run it, even just on the CPU, you should look for an alternative (and that alternative is called llama.cpp). This is more of an educational resource about how things work when you remove all the layers of complexity in the ecosystem.
LLMs are somewhat magic in how effective they can be, but in terms of code they're really simple.
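To give a rough idea of what "really simple" means, here's a minimal sketch (made up for illustration, not taken from this project) of single-head causal self-attention in plain Rust. The real thing adds multiple heads, a KV cache, MLP blocks, and normalization, but the core operation is about this much code:

```rust
// Minimal sketch: single-head causal self-attention over a toy sequence,
// in plain Rust with no dependencies. Shapes and values are made up.

fn softmax(row: &mut [f32]) {
    let max = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let mut sum = 0.0;
    for x in row.iter_mut() {
        *x = (*x - max).exp();
        sum += *x;
    }
    for x in row.iter_mut() {
        *x /= sum;
    }
}

/// q, k, v: seq_len rows of head_dim floats, flattened row-major.
/// Returns the attention output with the same shape.
fn attention(q: &[f32], k: &[f32], v: &[f32], seq_len: usize, head_dim: usize) -> Vec<f32> {
    let scale = 1.0 / (head_dim as f32).sqrt();
    let mut out = vec![0.0; seq_len * head_dim];
    for i in 0..seq_len {
        // Scores of token i against tokens 0..=i (causal mask hides the future).
        let mut scores = vec![f32::NEG_INFINITY; seq_len];
        for j in 0..=i {
            let mut dot = 0.0;
            for d in 0..head_dim {
                dot += q[i * head_dim + d] * k[j * head_dim + d];
            }
            scores[j] = dot * scale;
        }
        softmax(&mut scores);
        // Weighted sum of value vectors.
        for j in 0..=i {
            for d in 0..head_dim {
                out[i * head_dim + d] += scores[j] * v[j * head_dim + d];
            }
        }
    }
    out
}

fn main() {
    let (seq_len, head_dim) = (3, 4);
    let q: Vec<f32> = (0..seq_len * head_dim).map(|x| x as f32 * 0.1).collect();
    let (k, v) = (q.clone(), q.clone());
    println!("{:?}", attention(&q, &k, &v, seq_len, head_dim));
}
```

The rest of a decoder-only transformer is mostly matrix multiplications wrapped around this, repeated per layer.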
Yes. Depending on the GPU, expect a 10-20x difference.
For Rust you have llama.cpp wrappers like llm_client (mine), and the Candle-based projects mistral.rs and Kalosm.
Although my project does try to provide an implementation via mistral.rs, I haven't fully migrated from llama.cpp. A full Rust implementation would be nice for quick install times (among other reasons). Right now my crate has to clone and build llama.cpp. It's automated for Mac, PC, and Linux, but it adds about a minute of build time.
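For the curious, the "clone and build" step looks roughly like the build.rs below. This is a hypothetical sketch, not the crate's actual build script; the repo URL, paths, and flags are illustrative only:

```rust
// build.rs — hypothetical sketch of a "clone and build llama.cpp" step.
// Not the real script; URL, paths, and flags are for illustration.
use std::path::Path;
use std::process::Command;

fn run(cmd: &mut Command) {
    let status = cmd.status().expect("failed to spawn command");
    assert!(status.success(), "command failed: {cmd:?}");
}

fn main() {
    let out_dir = std::env::var("OUT_DIR").expect("OUT_DIR is set by cargo");
    let repo = format!("{out_dir}/llama.cpp");
    let build_dir = format!("{repo}/build");

    // Clone once; later builds reuse the checkout.
    if !Path::new(&repo).exists() {
        run(Command::new("git").args([
            "clone", "--depth", "1",
            "https://github.com/ggerganov/llama.cpp", repo.as_str(),
        ]));
    }

    // Configure and compile: this is where the ~1 minute of extra build time goes.
    run(Command::new("cmake").args([
        "-S", repo.as_str(), "-B", build_dir.as_str(), "-DCMAKE_BUILD_TYPE=Release",
    ]));
    run(Command::new("cmake").args([
        "--build", build_dir.as_str(), "--config", "Release",
    ]));
}
```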
CPU-bound, yes, but more importantly memory-bandwidth-bound.
An RTX 3090 (as one example) has nearly 1TB/s of memory bandwidth. You'd need at least 12 channels of the fastest proof-of-concept DDR5 on the planet to equal that.
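Back-of-the-envelope, with rough assumed numbers (7B model, 4-bit quantization, ~936 GB/s for the 3090, ~80 GB/s for dual-channel DDR5): each generated token requires streaming essentially all of the weights once, so bandwidth sets a hard ceiling on tokens per second.

```rust
// Rough upper bounds on generation speed from memory bandwidth alone.
// All numbers are approximate assumptions, not benchmarks.
fn main() {
    let model_bytes = 7e9 * 0.5; // ~7B params at 4-bit quantization ≈ 3.5 GB
    let gpu_bw = 936e9;          // RTX 3090: ~936 GB/s
    let cpu_bw = 80e9;           // typical dual-channel DDR5: ~80 GB/s
    println!("GPU ceiling: ~{:.0} tok/s", gpu_bw / model_bytes);
    println!("CPU ceiling: ~{:.0} tok/s", cpu_bw / model_bytes);
}
```

That ratio is roughly the 10-20x mentioned above.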
If you have a discrete GPU, use an implementation that utilizes it because it's a completely different story.
Apple Silicon boasts impressive numbers on LLM inference because it has a unified CPU-GPU high-bandwidth (400GB/s IIRC) memory architecture.
Depends. Good models are big and require a lot of memory. Even the 4090 doesn't have that much memory in an LLM context. So your GPU will be faster, but it likely can't fit the big models.
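Rough sizing to make that concrete, assuming ~2 bytes per weight at fp16 and ~0.5 bytes at 4-bit quantization (and ignoring KV cache and runtime overhead):

```rust
// Approximate weight-only memory footprints for a few common model sizes.
fn main() {
    for (name, params) in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)] {
        let fp16_gb = params * 2.0 / 1e9; // ~2 bytes per weight
        let q4_gb = params * 0.5 / 1e9;   // ~0.5 bytes per weight at 4-bit
        println!("{name}: ~{fp16_gb:.0} GB fp16, ~{q4_gb:.1} GB 4-bit (a 4090 has 24 GB)");
    }
}
```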
You are correct. This project is "on the CPU", so it will not utilize your GPU for computation. If you would like to try out a Rust framework that does support GPUs, Candle (https://github.com/huggingface/candle/tree/main) may be worth exploring.
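A minimal sketch of what device selection looks like in Candle (assuming candle-core built with the cuda feature; treat the details as illustrative rather than canonical):

```rust
use candle_core::{Device, Tensor};

fn main() -> candle_core::Result<()> {
    // Uses the first CUDA device if one is available, otherwise falls back to the CPU.
    let device = Device::cuda_if_available(0)?;

    // A toy matmul just to show tensors living on whichever device was picked.
    let a = Tensor::randn(0f32, 1f32, (2, 3), &device)?;
    let b = Tensor::randn(0f32, 1f32, (3, 2), &device)?;
    println!("{}", a.matmul(&b)?);
    Ok(())
}
```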