Given that this relies at its core on the `rayon` and `wide` libraries, which are reasonably well optimized as general-purpose tools but a long way from what llama.cpp achieves by specializing for this exact use case, I think the speed is about what I would expect.
So yeah, I think there is a lot of room for optimization, and the only reason to use this today is if you want a "simple" implementation with no C/C++ dependencies, for build-tooling reasons.
Your point about `rayon` being inherently slower than hand-tuned code is valid (I don't know much about `wide`), but from what I've seen I suspect `rayon` isn't even the bottleneck here: there's a decent margin for improvement (I'd expect at least double the throughput) without resorting to anything arcane.