Great! I did something similar a while ago [0], but the performance was underwhelming compared to C/C++ code running on the CPU (which points to my lack of understanding of how to make Rust fast). It would be nice to have benchmarks of the different Rust implementations.
Implementing LLM inference should/could really become the new "hello world!" for serious programmers out there :)
I also had a similar "hello world" experience a while ago with [0] :). I manually used some SIMD instructions, and it seems the performance can come close to llama.cpp's. It appears that the key to performance is:
1. using SIMD on the quantized matrix multiplication (rough sketch below)
2. using a busy loop instead of condition variables when splitting work among threads (sketch further down)
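To illustrate point 1, here is a rough sketch of a SIMD dot product over quantized blocks, not crabml's actual kernel: it assumes a Q8_0-like layout (blocks of 32 int8 values sharing one f32 scale) and uses AVX2 intrinsics with a scalar fallback; the `BlockQ8` type and function names are just illustrative.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// One quantization block: 32 signed bytes sharing an f32 scale (Q8_0-like).
#[repr(C)]
pub struct BlockQ8 {
    pub scale: f32,
    pub qs: [i8; 32],
}

/// Scalar reference: integer multiply-accumulate per block, then scale.
pub fn dot_scalar(a: &[BlockQ8], b: &[BlockQ8]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(x, y)| {
            let sum: i32 = x
                .qs
                .iter()
                .zip(&y.qs)
                .map(|(&p, &q)| p as i32 * q as i32)
                .sum();
            x.scale * y.scale * sum as f32
        })
        .sum()
}

/// AVX2 version: the whole 32-byte block is multiplied and reduced with
/// integer SIMD; the float scales are applied once per block.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn dot_avx2(a: &[BlockQ8], b: &[BlockQ8]) -> f32 {
    let mut total = 0.0f32;
    for (x, y) in a.iter().zip(b) {
        let va = _mm256_loadu_si256(x.qs.as_ptr() as *const __m256i);
        let vb = _mm256_loadu_si256(y.qs.as_ptr() as *const __m256i);
        // maddubs needs one unsigned operand: take |a| and move a's sign onto b.
        let abs_a = _mm256_sign_epi8(va, va);
        let sgn_b = _mm256_sign_epi8(vb, va);
        // u8 * i8 products summed pairwise into i16 (no overflow for -127..=127).
        let prod16 = _mm256_maddubs_epi16(abs_a, sgn_b);
        // Widen pairs of i16 into eight i32 partial sums.
        let prod32 = _mm256_madd_epi16(prod16, _mm256_set1_epi16(1));
        // Horizontal sum of the eight i32 lanes.
        let mut lanes = [0i32; 8];
        _mm256_storeu_si256(lanes.as_mut_ptr() as *mut __m256i, prod32);
        let sum: i32 = lanes.iter().sum();
        total += x.scale * y.scale * sum as f32;
    }
    total
}

/// Runtime dispatch: use AVX2 when available, otherwise the scalar path.
pub fn dot(a: &[BlockQ8], b: &[BlockQ8]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            return unsafe { dot_avx2(a, b) };
        }
    }
    dot_scalar(a, b)
}
```

The win comes from keeping the inner loop in integer SIMD and dequantizing only once per block instead of once per weight.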
(However, I haven't had the free time to keep working on running quantized models on the GPU (with Vulkan), so the project hasn't been updated in a while.)
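And a minimal sketch of point 2: worker threads spin on an atomic "generation" counter instead of parking on a Condvar, which avoids wake-up latency when fanning per-layer work out to threads. This is not crabml's actual scheduler; `SpinGate` and the job structure here are made up for illustration.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

struct SpinGate {
    generation: AtomicUsize, // bumped by the dispatcher to release workers
    done: AtomicUsize,       // incremented by each worker when its slice is finished
}

fn main() {
    let n_threads = 4;
    let n_jobs = 8; // e.g. one "job" per layer's matmul
    let gate = Arc::new(SpinGate {
        generation: AtomicUsize::new(0),
        done: AtomicUsize::new(0),
    });

    let workers: Vec<_> = (0..n_threads)
        .map(|worker_id| {
            let gate = Arc::clone(&gate);
            thread::spawn(move || {
                let mut seen = 0;
                while seen < n_jobs {
                    // Busy-wait for the next job instead of a condvar wait:
                    // cheap when jobs arrive microseconds apart.
                    while gate.generation.load(Ordering::Acquire) == seen {
                        std::hint::spin_loop();
                    }
                    seen += 1;
                    // Real code would compute this worker's row slice here,
                    // e.g. rows worker_id, worker_id + n_threads, ...
                    let _ = worker_id;
                    gate.done.fetch_add(1, Ordering::Release);
                }
            })
        })
        .collect();

    for job in 0..n_jobs {
        gate.done.store(0, Ordering::Relaxed);
        // Publish the next job; workers notice on their next spin iteration.
        gate.generation.store(job + 1, Ordering::Release);
        // Spin until every worker has reported its slice done.
        while gate.done.load(Ordering::Acquire) != n_threads {
            std::hint::spin_loop();
        }
    }

    for w in workers {
        w.join().unwrap();
    }
}
```

The trade-off is burning CPU while waiting, but for inference the gaps between dispatches are so short that spinning beats the syscall and wake-up cost of condition variables.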
[0] https://github.com/crabml/crabml