Maybe we shouldn't be running these models on laptops, with their thermally constrained form factor, and we shouldn't expect inference speed on par with a large cloud platform either, at least not at near-SOTA model quality. It's still worth doing to avoid becoming massively reliant on centralized services.
I have a 5070 12 GB laptop GPU and can hit 72 tokens per second for the first couple thousand tokens, before dropping to the mid-to-high 50s after about 15k tokens of context.
This setup is extremely optimized down to the last flag. Changing any param above the temp flag craters performance.
I don't have enough system RAM to properly handle large context windows, so I don't use local models.
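For a rough sense of why long contexts eat memory, here's a back-of-the-envelope KV cache estimate. It's only a sketch: it assumes a standard transformer with grouped-query attention and an fp16 cache, and the model shape below (32 layers, 8 KV heads, head dimension 128) is a made-up example, not any specific model. It also ignores the weights, activations, and runtime overhead.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Estimate KV cache size in bytes for one sequence.

    Assumes one key and one value vector per layer, per KV head, per token.
    bytes_per_elem=2 corresponds to an fp16 cache (an assumption).
    """
    keys_and_values = 2  # one K tensor and one V tensor per layer
    return keys_and_values * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


# Hypothetical model shape, chosen only for illustration.
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_len=32768)
print(f"{size / 2**30:.1f} GiB")  # prints "4.0 GiB"
```

The estimate grows linearly with context length, so doubling the context doubles the cache. A quantized cache (fewer bytes per element) shrinks it proportionally.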