May I ask what you used for DS4F inference? In my tests it is a model with a very low hallucination rate.
Per AA's Omniscience Index benchmark, the hallucination rate (1 minus the "non-hallucination rate" subcomponent) is 4% for DS4F vs 66% for M2.7.
https://artificialanalysis.ai/leaderboards/models?weights=op...
Btw, a few data points:
1. DS4F can run on a 128GB MacBook. M2.7 is larger (8-bit weights for the routed experts). It remains to be seen how it holds up at 4 bits; at 2 bits it may not work well at all.
2. Just the KV cache of M2.7 would take ~50GB at 200k tokens AFAIK; it lacks the compressed KV cache that DS4F features.
3. Despite all that, the two models perform very similarly. And DS4F is likely getting an update soon.
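For a sense of where a ~50GB figure like the one in point 2 comes from, here's a back-of-envelope sketch. The layer count, KV-head count, head dimension, and latent dimension below are illustrative placeholders, not the actual configs of either model:

```python
# Rough KV-cache sizing. All architecture numbers are hypothetical
# placeholders, not the real M2.7 / DS4F configurations.

def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    """Standard KV cache: one K and one V vector per head, per layer, per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens

def latent_cache_bytes(tokens, layers, latent_dim, bytes_per_elem=2):
    """Compressed (latent) cache: one low-rank latent per layer per token."""
    return layers * latent_dim * bytes_per_elem * tokens

tokens = 200_000
full = kv_cache_bytes(tokens, layers=60, kv_heads=8, head_dim=128)      # fp16
compressed = latent_cache_bytes(tokens, layers=60, latent_dim=576)      # fp16

print(f"full KV cache:     {full / 1e9:.1f} GB")
print(f"compressed latent: {compressed / 1e9:.1f} GB")
```

With these made-up but plausible numbers, the uncompressed cache lands around 49GB at 200k tokens, while a compressed latent cache is several times smaller, which is the whole point of that design.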
So it is basically a quasi-frontier model that can run at large context windows on a 96/128GB MacBook. That's non-trivial. A coding-focused version will likely be released in the future.