When you say tok/s here, are you describing prefill (prompt eval) tok/s or output generation tok/s?
(Btw, I believe the `--jinja` flag has been on by default since sometime in late 2025, so it's not needed anymore.)
If someone doesn't specifically say prefill, they always mean decode speed. I have never seen an exception. Most people just ignore prefill.
Here is llama-bench on the same M4:
So ~60 tok/s for prefill and ~5 tok/s for output on the 27B, and about 5x those numbers on the 35B-A3B.
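To make the two numbers concrete, here is a rough back-of-the-envelope sketch of how prefill and decode speed combine into wall-clock time. The function name, the 3000-token prompt, and the 500-token reply are made up for illustration, and it assumes both rates stay constant, which real runs won't quite do as context grows.

```python
def estimate_seconds(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Rough estimate: time to process the prompt plus time to generate the reply.

    Assumes constant prefill and decode rates (a simplification).
    """
    prefill_s = prompt_tokens / prefill_tps
    decode_s = output_tokens / decode_tps
    return prefill_s, decode_s, prefill_s + decode_s


# Hypothetical request using the approximate 27B rates quoted above:
# ~60 tok/s prefill, ~5 tok/s decode.
prefill_s, decode_s, total_s = estimate_seconds(
    prompt_tokens=3000, output_tokens=500, prefill_tps=60.0, decode_tps=5.0
)
print(f"prefill: {prefill_s:.0f} s, decode: {decode_s:.0f} s, total: {total_s:.0f} s")
# prints "prefill: 50 s, decode: 100 s, total: 150 s"
```

With a short prompt the decode term dominates, which is probably why most people only quote that number; with a long prompt the prefill term stops being negligible.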