I've been running Whisper large-v3 on an M2 Max through a self-hosted endpoint, and honestly the accuracy is good enough that I stopped bothering with cleanup models. The bigger annoyance for me is latency on longer chunks: anything over 30 seconds starts to feel sluggish even with Metal acceleration. I haven't tried WhisperKit specifically, but I'm curious how it handles longer audio compared to the full model.
Ah yeah, longform is interesting.
I'm not sure which app you're running it through, but on resource-limited machines a "continuous recording" mode helps: it emits a transcript segment whenever silence is detected, based on a configurable energy threshold.
That way output arrives as you speak, in smaller chunks; in aggregate it's the same transcript, just chunked more efficiently.
Maybe you can try hackin' that up?
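If you did want to hack on it, here's a minimal sketch of energy-threshold silence splitting. This isn't any app's actual implementation; the frame size, threshold, and minimum durations are made-up values you'd tune for your mic and environment:

```python
import numpy as np

def split_on_silence(audio, sample_rate=16000, threshold=0.01,
                     min_silence_s=0.5, min_chunk_s=1.0):
    """Split mono float audio into chunks at silent gaps.

    threshold: RMS energy below which a 30 ms frame counts as silence.
    min_silence_s: how long silence must last before we cut.
    min_chunk_s: don't emit chunks shorter than this.
    All values are illustrative, not tuned.
    """
    frame_len = int(0.03 * sample_rate)  # 30 ms frames
    n_frames = len(audio) // frame_len
    frames = audio[:n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    silent = rms < threshold

    min_silence_frames = int(min_silence_s * sample_rate / frame_len)
    chunks, start, silence_run = [], 0, 0
    for i, is_silent in enumerate(silent):
        silence_run = silence_run + 1 if is_silent else 0
        if silence_run >= min_silence_frames:
            end = (i + 1) * frame_len
            # only cut if the chunk is long enough to be worth transcribing
            if end - start >= int(min_chunk_s * sample_rate):
                chunks.append(audio[start:end])
                start = end
            silence_run = 0
    if start < len(audio):
        chunks.append(audio[start:])  # trailing audio after the last cut
    return chunks
```

Each chunk can then be fed to the model independently, so the first segments come back while you're still talking instead of waiting on one 30+ second pass.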