This is beautifully written, thanks for sharing.
I could see myself using some of the source code in the classroom to explain how transformers "really" work; code is more concrete/detailed than all those pictures of attention heads etc.
Two points of minor criticism/suggestions for improvement:
- Libraries should not print to stdout, as that output may destroy the application's own output (imagine I want to use the library in a text editor to offer style checking). So it's best to write to a string buffer owned by a logger instance associated with the lm.rs object; see the first sketch below this list.
- Is it possible to do all this without "unsafe", without bending over backwards? I see there are uses of "unsafe", e.g. to force data alignment in the model reader; the second sketch below shows one safe alternative.
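On the logging point, something like this is what I have in mind; just a sketch with made-up names, not lm.rs's actual API:

```rust
use std::fmt::Write;

/// Hypothetical logger owned by the model object instead of printing to stdout.
/// The embedding application decides if and where the text ever gets shown.
pub struct LogBuffer {
    buf: String,
}

impl LogBuffer {
    pub fn new() -> Self {
        Self { buf: String::new() }
    }

    /// Append one line to the internal buffer (never touches stdout).
    pub fn log(&mut self, msg: &str) {
        let _ = writeln!(self.buf, "{msg}");
    }

    /// Let the host application drain whatever has accumulated.
    pub fn take(&mut self) -> String {
        std::mem::take(&mut self.buf)
    }
}
```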
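And on the alignment question, the safe escape hatch I'd reach for is to copy the weights out of the raw bytes instead of reinterpreting them in place: slower to load, but no alignment assumptions and no "unsafe" (crates like bytemuck or zerocopy are the other option, though that works against the no-dependency goal). A sketch:

```rust
/// Safe alternative to an unsafe aligned cast: decode f32 weights from a raw
/// byte slice with an explicit copy instead of a pointer reinterpretation.
fn read_f32s(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}
```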
Again, thanks and very impressive!
Neat.
FYI I have a whole bunch of Rust tools[0] for loading models and other LLM tasks, for example auto-selecting the largest quant that fits in available memory, extracting a tokenizer from a GGUF, prompting, etc. You could use them to remove some of the Python dependencies you have.
They currently target llama.cpp, but this is pretty neat too. Any plans to support grammars?
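The quant auto-selection is roughly this idea, I assume (made-up function, not the actual API of those tools): take the largest candidate file that still fits in the memory budget.

```rust
use std::fs;

/// Pick the largest quantized model file that fits within `budget_bytes`.
/// (A real implementation would also budget for the KV cache and runtime overhead.)
fn pick_quant(paths: &[&str], budget_bytes: u64) -> Option<String> {
    paths
        .iter()
        .copied()
        .filter_map(|p| fs::metadata(p).ok().map(|m| (p.to_string(), m.len())))
        .filter(|(_, size)| *size <= budget_bytes)
        .max_by_key(|(_, size)| *size)
        .map(|(path, _)| path)
}
```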
The title is less clear than it could be IMO.
When I saw "no dependency" I thought maybe it could be `no_std` (llama.c is relatively lightweight in this regard). But it's definitely not `no_std`, and in fact it seems to have several dependencies. Perhaps the claim is just that all of them are Rust dependencies?
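For reference, `no_std` is a property of the crate itself, not of how many dependencies it has; a sketch of what that would look like (nothing in lm.rs claims this, as far as I can tell):

```rust
// Opt out of libstd and build against `core` only (plus `alloc` if heap
// allocation is needed). Dependencies are still allowed, as long as each of
// them opts in to no_std as well.
#![no_std]

/// Greedy sampling over logits, written against `core` only.
pub fn argmax(logits: &[f32]) -> usize {
    let mut best = 0;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best
}
```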
Great! I did something similar some time ago [0], but the performance was underwhelming compared to C/C++ code running on the CPU (which points to my lack of understanding of how to make Rust fast). It would be nice to have some benchmarks of the different Rust implementations.
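For what it's worth, when Rust loses to C/C++ on this kind of workload it's usually something mundane: building without --release, or bounds checks in the hot loops getting in the optimizer's way. A toy illustration of the second point (hypothetical helpers, not code from either repo):

```rust
/// Indexed form: `b[i]` still carries a bounds check on every iteration,
/// since the compiler cannot prove the two slices are the same length.
fn dot_indexed(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = 0.0;
    for i in 0..a.len() {
        acc += a[i] * b[i];
    }
    acc
}

/// Iterator form: no bounds checks inside the loop, which removes one
/// obstacle the optimizer has to work around.
fn dot_zip(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```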
Implementing LLM inference should/could really become the new "hello world!" for serious programmers out there :)
This is really cool.
It's already using Dioxus (neat). I wonder if WASM could be put on the roadmap.
If this could run a lightweight LLM like RWKV in the browser, then the browser unlocks a whole class of new capabilities without calling any SaaS APIs.
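A WASM build would presumably boil down to compiling the core to a wasm32 target and exposing a thin wasm-bindgen wrapper; purely a sketch of the shape, nothing like this exists in lm.rs today:

```rust
use wasm_bindgen::prelude::*;

/// Hypothetical browser entry point: JS fetches the weights, hands them over
/// as a Uint8Array, and gets the generated text back as a plain string.
#[wasm_bindgen]
pub fn generate(weights: &[u8], prompt: &str, max_tokens: usize) -> String {
    // Load the model from `weights`, run the forward pass token by token,
    // decode, and return the text.
    let _ = (weights, prompt, max_tokens);
    String::from("(generated text would go here)")
}
```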
Correct me if I am wrong, but these implementations are all CPU-bound, i.e. if I have a good GPU I should look for alternatives?
This is cool (and congrats on writing your first Rust lib!), but Metal/CUDA support is a must for serious local usage.
Interesting, I appreciate the Rust community's enthusiasm for rewriting most of this stuff.
Nice work, it would be great to see some benchmarks comparing it to llm.c.
how does this compare to https://github.com/EricLBuehler/mistral.rs ?
Such a talented guy!
Another llama.cpp and mistral.rs? If it supports vision models, then fine, I will try it.
EDIT: Looks like no Llama 3.2 11B yet.
This is impressive. I just ran the 1.2G llama3.2-1b-it-q80.lmrs on a M2 64GB MacBook and it felt speedy and used 1000% of CPU across 13 threads (according to Activity Monitor).