This is beautifully written, thanks for sharing.
I could see myself using some of the source code in the classroom to explain how transformers "really" work; code is more concrete/detailed than all those pictures of attention heads etc.
Two points of minor criticism/suggestions for improvement:
- Libraries should not print to stdout, as that output may destroy the application's own output (imagine I want to use the library in a text editor to offer style checking). So it's best to write to a string buffer owned by a logger instance associated with the lm.rs object; see the first sketch below this list.
- Is it possible to do all this without "unsafe", without bending over backwards? I see there are uses of "unsafe", e.g. to force data alignment in the model reader; the second sketch below shows one safe alternative.
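On the logging point, something like this is what I have in mind; just a sketch with made-up names, not lm.rs's actual API:

```rust
use std::fmt::Write;

/// Hypothetical logger owned by the model object instead of printing to stdout.
/// The embedding application decides if and where the text ever gets shown.
pub struct LogBuffer {
    buf: String,
}

impl LogBuffer {
    pub fn new() -> Self {
        Self { buf: String::new() }
    }

    /// Append one line to the internal buffer (never touches stdout).
    pub fn log(&mut self, msg: &str) {
        let _ = writeln!(self.buf, "{msg}");
    }

    /// Let the host application drain whatever has accumulated.
    pub fn take(&mut self) -> String {
        std::mem::take(&mut self.buf)
    }
}
```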
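And on the alignment question, the safe escape hatch I'd reach for is to copy the weights out of the raw bytes instead of reinterpreting them in place: slower to load, but no alignment assumptions and no "unsafe" (crates like bytemuck or zerocopy are the other option, though that works against the no-dependency goal). A sketch:

```rust
/// Safe alternative to an unsafe aligned cast: decode f32 weights from a raw
/// byte slice with an explicit copy instead of a pointer reinterpretation.
fn read_f32s(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}
```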
Again, thanks and very impressive!
Neat.
FYI I have a whole bunch of Rust tools[0] for loading models and other LLM tasks, for example auto-selecting the largest quant that fits in available memory, extracting a tokenizer from a GGUF, prompting, etc. You could use them to remove some of the Python dependencies you have.
They currently target llama.cpp, but this is pretty neat too. Any plans to support grammars?
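The quant auto-selection is roughly this idea, I assume (made-up function, not the actual API of those tools): take the largest candidate file that still fits in the memory budget.

```rust
use std::fs;

/// Pick the largest quantized model file that fits within `budget_bytes`.
/// (A real implementation would also budget for the KV cache and runtime overhead.)
fn pick_quant(paths: &[&str], budget_bytes: u64) -> Option<String> {
    paths
        .iter()
        .copied()
        .filter_map(|p| fs::metadata(p).ok().map(|m| (p.to_string(), m.len())))
        .filter(|(_, size)| *size <= budget_bytes)
        .max_by_key(|(_, size)| *size)
        .map(|(path, _)| path)
}
```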
The title is less clear than it could be IMO.
When I saw "no dependency" I thought maybe it could be `no_std` (llama.c is relatively lightweight in this regard). But it's definitely not `no_std`, and in fact it seems to have several dependencies. Perhaps the claim is just that all of them are Rust dependencies?
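For reference, `no_std` is a property of the crate itself, not of how many dependencies it has; a sketch of what that would look like (nothing in lm.rs claims this, as far as I can tell):

```rust
// Opt out of libstd and build against `core` only (plus `alloc` if heap
// allocation is needed). Dependencies are still allowed, as long as each of
// them opts in to no_std as well.
#![no_std]

/// Greedy sampling over logits, written against `core` only.
pub fn argmax(logits: &[f32]) -> usize {
    let mut best = 0;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best
}
```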
Great! I did something similar some time ago [0], but the performance was underwhelming compared to C/C++ code running on the CPU (which points to my lack of understanding of how to make Rust fast). It would be nice to have some benchmarks of the different Rust implementations.
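For what it's worth, when Rust loses to C/C++ on this kind of workload it's usually something mundane: building without --release, or bounds checks in the hot loops getting in the optimizer's way. A toy illustration of the second point (hypothetical helpers, not code from either repo):

```rust
/// Indexed form: `b[i]` still carries a bounds check on every iteration,
/// since the compiler cannot prove the two slices are the same length.
fn dot_indexed(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = 0.0;
    for i in 0..a.len() {
        acc += a[i] * b[i];
    }
    acc
}

/// Iterator form: no bounds checks inside the loop, which removes one
/// obstacle the optimizer has to work around.
fn dot_zip(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```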
Implementing LLM inference should/could really become the new "hello world!" for serious programmers out there :)
This is really cool.
It's already using Dioxus (neat). I wonder if WASM could be put on the roadmap.
If this could run a lightweight LLM like RWKV in the browser, then the browser unlocks a whole class of new capabilities without calling any SaaS APIs.
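A WASM build would presumably boil down to compiling the core to a wasm32 target and exposing a thin wasm-bindgen wrapper; purely a sketch of the shape, nothing like this exists in lm.rs today:

```rust
use wasm_bindgen::prelude::*;

/// Hypothetical browser entry point: JS fetches the weights, hands them over
/// as a Uint8Array, and gets the generated text back as a plain string.
#[wasm_bindgen]
pub fn generate(weights: &[u8], prompt: &str, max_tokens: usize) -> String {
    // Load the model from `weights`, run the forward pass token by token,
    // decode, and return the text.
    let _ = (weights, prompt, max_tokens);
    String::from("(generated text would go here)")
}
```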
Correct me if I am wrong, but these implementations are all CPU-bound, i.e. if I have a good GPU I should look for alternatives?
This is cool (and congrats on writing your first Rust lib!), but Metal/CUDA support is a must for serious local usage.
Interesting, I appreciate the Rust community's enthusiasm for rewriting most of this stuff.
Nice work, it would be great to see some benchmarks comparing it to llm.c.
how does this compare to https://github.com/EricLBuehler/mistral.rs ?
Such a talented guy!
Another llama.cpp and mistral.rs? If it supports vision models, then fine, I will try it.
EDIT: Looks like no Llama 3.2 11B yet.
This is impressive. I just ran the 1.2G llama3.2-1b-it-q80.lmrs on a M2 64GB MacBook and it felt speedy and used 1000% of CPU across 13 threads (according to Activity Monitor).