Why doesn’t Apple?
Apple has Metal, which is already pretty well integrated into llama.cpp, various Python libs, and mistral-rs & candle. Unpopular opinion, but Vulkan is hot garbage and the definition of "design by committee." There's a reason people still prefer CUDA, even though most of that code could likely be ported programmatically anyway.
Vulkan is not Apple.
Metal is Apple's API.
After the steep increase in sales of Mac Studios specifically for LLMs, I'm waiting for Apple to release a frontier-level model optimized for the highest end of Apple hardware. It would probably be hardware-locked to a specific Neural Engine, which would in turn lock the memory configuration.
The built-in Apple Intelligence model right now is very small, but even just having a small LLM you know is always there, always available, fast, and ready makes you think about building apps differently. I would love the context window to expand from the meager ~4K tokens.
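For context on what "always there" looks like for app developers today: Apple exposes that small on-device model to third-party apps through the Foundation Models framework (macOS 26 / iOS 26). A minimal Swift sketch, assuming that framework is available on the device; the API surface shown here is from memory of Apple's announced framework, so treat it as approximate rather than definitive:

```swift
import FoundationModels

// The system-provided on-device model; no network, no API key.
let model = SystemLanguageModel.default

switch model.availability {
case .available:
    // A session holds conversation state; prompts count against the
    // small on-device context window the parent comment mentions.
    let session = LanguageModelSession(
        instructions: "You are a concise note-taking assistant."
    )
    let response = try await session.respond(
        to: "Summarize this note in one line: buy milk, call Sam, ship v2."
    )
    print(response.content)
case .unavailable(let reason):
    // e.g. Apple Intelligence disabled, or unsupported hardware.
    print("On-device model unavailable: \(reason)")
}
```

The appeal is exactly what the comment describes: because the model is guaranteed present and free to call, you can design features around it by default instead of treating LLM access as an optional cloud dependency.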
Like with all new tech trends, it takes them a hot minute to catch up, but it's highly likely they will (eventually) release some killer platforms for local AI. The shared memory, high bandwidth, and power efficiency of their M chips make for a near-ideal architecture. If/when they finally push out the M5 Ultra, that could be round one (albeit still not at the best price/performance vs. comparable cloud API tokens). A real mass-market killer device for local LLMs is still going to require some remediation of the global DRAM shortage, and maybe the M6/M7 generation.