i also had a similar 'hello world' experience some time ago with [0] :). i hand-wrote some SIMD instructions, and the performance came close to matching llama.cpp. in my experience, the keys to performance are:
1. using SIMD for the quantized matrix multiplications (see the dot-product sketch below)
2. using a busy loop instead of condition variables when splitting work among threads (see the spin-pool sketch after that)
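for the first point, here's a minimal sketch of the kind of kernel i mean, using AVX2 on a Q8_0-style format (the BlockQ8 layout, the dot_q8 name, and the 32-value block size are illustrative assumptions, not the actual code from [0]; compile with -mavx2 -mfma):

```cpp
// minimal sketch: dot product of two Q8_0-style quantized rows with AVX2
#include <immintrin.h>
#include <cstdint>

struct BlockQ8 {         // hypothetical block layout, not the one from [0]
    float  scale;        // per-block dequantization scale
    int8_t q[32];        // 32 quantized values
};

float dot_q8(const BlockQ8 *x, const BlockQ8 *y, int nblocks) {
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < nblocks; ++i) {
        // load 32 int8 quants from each block
        __m256i qx = _mm256_loadu_si256((const __m256i *)x[i].q);
        __m256i qy = _mm256_loadu_si256((const __m256i *)y[i].q);
        // sign-extend to int16, multiply, and add adjacent pairs into int32
        __m256i lo = _mm256_madd_epi16(
            _mm256_cvtepi8_epi16(_mm256_castsi256_si128(qx)),
            _mm256_cvtepi8_epi16(_mm256_castsi256_si128(qy)));
        __m256i hi = _mm256_madd_epi16(
            _mm256_cvtepi8_epi16(_mm256_extracti128_si256(qx, 1)),
            _mm256_cvtepi8_epi16(_mm256_extracti128_si256(qy, 1)));
        __m256i sum = _mm256_add_epi32(lo, hi);
        // apply both block scales once per block, accumulate in float
        __m256 d = _mm256_set1_ps(x[i].scale * y[i].scale);
        acc = _mm256_fmadd_ps(d, _mm256_cvtepi32_ps(sum), acc);
    }
    // horizontal sum of the 8 float lanes
    __m128 v = _mm_add_ps(_mm256_castps256_ps128(acc),
                          _mm256_extractf128_ps(acc, 1));
    v = _mm_add_ps(v, _mm_movehl_ps(v, v));
    v = _mm_add_ss(v, _mm_shuffle_ps(v, v, 1));
    return _mm_cvtss_f32(v);
}
```

the important part is that the multiply-accumulate stays in integer registers for the whole block and the float scale is applied only once per 32 values.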
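and for the second point, a minimal sketch of a spin-waiting worker pool (the SpinPool type, its fields, and the memory-ordering choices are my own illustrative assumptions, not the code from [0]):

```cpp
// minimal sketch: workers spin on an atomic generation counter instead of
// blocking on a condition variable, trading CPU time for dispatch latency
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

struct SpinPool {
    std::atomic<int>  generation{0};  // bumped once per dispatched job
    std::atomic<int>  done{0};        // workers finished with current job
    std::atomic<bool> stop{false};
    std::function<void(int)> job;     // job(worker_index)
    std::vector<std::thread> workers;

    explicit SpinPool(int n) {
        for (int t = 0; t < n; ++t) {
            workers.emplace_back([this, t] {
                int seen = 0;
                while (!stop.load(std::memory_order_relaxed)) {
                    // busy loop: poll for a new generation, never sleep
                    if (generation.load(std::memory_order_acquire) == seen)
                        continue;
                    ++seen;
                    job(t);  // e.g. compute this worker's slice of a matmul
                    done.fetch_add(1, std::memory_order_release);
                }
            });
        }
    }

    void run(std::function<void(int)> f) {
        job = std::move(f);
        done.store(0, std::memory_order_relaxed);
        generation.fetch_add(1, std::memory_order_release); // kick workers
        // the main thread also spins until every worker reports done
        while (done.load(std::memory_order_acquire) < (int)workers.size()) {}
    }

    ~SpinPool() {
        stop.store(true, std::memory_order_relaxed);
        for (auto &w : workers) w.join();
    }
};
```

the point is that each per-layer dispatch costs roughly a cache-line ping-pong on the atomics rather than a futex wake per worker, which matters when you hand out thousands of small matmul jobs per token.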
(however, i haven't had the free time to continue with the next step, inferencing quantized models on the GPU with Vulkan, so the project hasn't been updated in a long time.)