flaneur2020 · 10/12/2024

I also had a similar 'hello world' experience some time ago with [0] :). I manually used some SIMD instructions, and the performance seems to be on par with llama.cpp. It appears that the key to performance is:

1. Using SIMD for the quantized matrix multiplication (see the first sketch below).

2. Using a busy loop instead of condition variables when splitting work among threads (see the second sketch below).
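For point 1, here is a minimal sketch of what a SIMD dot product over quantized blocks can look like, assuming a Q8_0-style layout (32 signed bytes plus one f32 scale per block) and AVX2 on x86_64; the layout and the function name are illustrative assumptions, not crabml's actual kernel:

```rust
// x86_64-only sketch; run on an AVX2-capable machine.
use std::arch::x86_64::*;

/// Dot product of two Q8_0-style blocks: 32 signed bytes each, with one f32
/// scale per block. The layout is an assumption modeled on GGML's Q8_0.
#[target_feature(enable = "avx2")]
unsafe fn dot_q8_block(a: &[i8; 32], a_scale: f32, b: &[i8; 32], b_scale: f32) -> f32 {
    // Load 32 quantized values from each block.
    let va = _mm256_loadu_si256(a.as_ptr() as *const __m256i);
    let vb = _mm256_loadu_si256(b.as_ptr() as *const __m256i);

    // Sign-extend each 16-byte half to i16 lanes, then multiply adjacent pairs
    // and accumulate into i32 lanes with one vpmaddwd per half.
    let a_lo = _mm256_cvtepi8_epi16(_mm256_extracti128_si256::<0>(va));
    let a_hi = _mm256_cvtepi8_epi16(_mm256_extracti128_si256::<1>(va));
    let b_lo = _mm256_cvtepi8_epi16(_mm256_extracti128_si256::<0>(vb));
    let b_hi = _mm256_cvtepi8_epi16(_mm256_extracti128_si256::<1>(vb));
    let acc = _mm256_add_epi32(
        _mm256_madd_epi16(a_lo, b_lo),
        _mm256_madd_epi16(a_hi, b_hi),
    );

    // Horizontal sum of the 8 i32 lanes, then de-quantize with both scales.
    let mut lanes = [0i32; 8];
    _mm256_storeu_si256(lanes.as_mut_ptr() as *mut __m256i, acc);
    let sum: i32 = lanes.iter().sum();
    a_scale * b_scale * sum as f32
}

fn main() {
    let a = [1i8; 32];
    let b = [2i8; 32];
    if is_x86_feature_detected!("avx2") {
        // 32 * (1 * 2) * 0.5 * 0.25 = 8.0
        let dot = unsafe { dot_q8_block(&a, 0.5, &b, 0.25) };
        println!("dot = {dot}");
    }
}
```

A full matmul would run this per block along each row, accumulating the partial sums in f32.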
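For point 2, a minimal sketch of the busy-loop idea: workers spin on an atomic generation counter instead of blocking on a Condvar, so they start the next chunk of work without waiting for an OS wake-up. The `SpinGate` type and its fields are hypothetical names for illustration, not crabml's API:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical spin-based work gate.
struct SpinGate {
    generation: AtomicUsize, // the main thread bumps this to release the workers
    done: AtomicUsize,       // workers bump this when their chunk is finished
}

fn main() {
    let n_workers = 4;
    let gate = Arc::new(SpinGate {
        generation: AtomicUsize::new(0),
        done: AtomicUsize::new(0),
    });

    let mut handles = Vec::new();
    for worker_id in 0..n_workers {
        let gate = Arc::clone(&gate);
        handles.push(thread::spawn(move || {
            let mut seen = 0usize;
            for _round in 0..3 {
                // Busy-wait for the next generation instead of Condvar::wait.
                while gate.generation.load(Ordering::Acquire) == seen {
                    std::hint::spin_loop();
                }
                seen += 1;

                // ... process this worker's slice of the rows here ...
                println!("worker {worker_id} running round {seen}");

                gate.done.fetch_add(1, Ordering::Release);
            }
        }));
    }

    for round in 1..=3 {
        gate.done.store(0, Ordering::Relaxed);
        gate.generation.store(round, Ordering::Release); // release all workers

        // The main thread also spins until every worker has reported back.
        while gate.done.load(Ordering::Acquire) != n_workers {
            std::hint::spin_loop();
        }
    }

    for h in handles {
        h.join().unwrap();
    }
}
```

The trade-off is that spinning burns CPU while idle, which is usually acceptable during a forward pass where the same threads are dispatched again a moment later.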

(However, I haven't had more free time to continue working on inference of quantized models on the GPU (with Vulkan), and the project hasn't been updated for a long time since then.)

[0] https://github.com/crabml/crabml