Context is your limitation on the M5. The larger your model, the longer you'll wait on token prefill. TTFT (time to first token) measured with 0 tokens of context isn't a real-world benchmark.
That's why most professional inference solutions reach for GPU-heavy hardware like the Jetson. Apple Silicon seems like a strange and overly expensive fit for this use case.
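The prefill point can be sketched with back-of-the-envelope arithmetic. The throughput figure below is purely hypothetical, not a measured M5 number; real prefill rates depend on model size, quantization, and hardware:

```python
# Hypothetical rates for illustration only; real prefill throughput varies
# widely by model, quantization, and hardware.
def estimate_ttft(prompt_tokens: int, prefill_tps: float) -> float:
    """Rough TTFT: the whole prompt must be prefilled before decoding starts."""
    return prompt_tokens / prefill_tps

# An empty-context benchmark hides the prefill cost entirely:
print(estimate_ttft(0, 500.0))       # 0.0 — looks instant
# A realistic 32k-token prompt at an assumed 500 tok/s prefill rate:
print(estimate_ttft(32_000, 500.0))  # 64.0 — over a minute before token one
```

Same model, same hardware; only the context length changed, which is why a 0-context TTFT figure tells you almost nothing about long-context use.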