It works; I've shipped this as a "local inference"/poor person's Ollama for low-end LLM tasks like search. The main win is that it's free and privacy-preserving, and (mostly) transparent to users in that they don't have to do anything, which is great for giving non-technical users local inference without making them do scary native things.
But keep in mind the actual experience for users is not great; the model download is orders of magnitude larger than the browser itself, and it has to finish before you get your first token back. That's unfixable until operating systems start reliably shipping their own prebaked models that an API like this could plug into.
> operating systems start reliably shipping their own prebaked models
Here's hoping that dystopia never happens.
> It works, I've shipped this as a "local inference"/poor person's ollama for low-end llm tasks like search
fantastic!
> the model download is orders of magnitude greater than downloading the browser itself, and something that needs to happen before you get your first token back
sure, but does this mean the model is lazily downloaded? that is, if my site were the first to call the model, would the user be stuck waiting until the download finished at that point?
that sounds like a horrible user experience - maybe Chrome reduces the confusion by showing a download progress dialog or similar?
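For what it's worth, Chrome's built-in Prompt API (still behind flags/origin trials, and subject to change) exposes both of these: a sketch below checks whether the model is already on disk and, if the first call does trigger the download, surfaces progress to the page so you can show your own UI instead of a silent stall. The `LanguageModel` global and event shapes follow the current explainer and are assumptions, not a stable API.

```javascript
// Sketch, assuming Chrome's experimental built-in Prompt API.
// `LanguageModel`, `availability()`, and the `downloadprogress` monitor
// event come from the current explainer and may be renamed.
async function getLocalModel() {
  // Outside Chrome (or with the flag off) the global simply isn't there.
  if (typeof LanguageModel === 'undefined') return null;

  // Expected states: 'unavailable' | 'downloadable' | 'downloading' | 'available'.
  const status = await LanguageModel.availability();
  if (status === 'unavailable') return null;

  // If the model isn't cached yet, create() kicks off the (multi-GB)
  // download; the monitor lets the page render its own progress bar.
  return LanguageModel.create({
    monitor(m) {
      m.addEventListener('downloadprogress', (e) => {
        // `e.loaded` is reported as a 0..1 fraction in the explainer.
        console.log(`model download: ${Math.round(e.loaded * 100)}%`);
      });
    },
  });
}
```

So yes: lazily downloaded on first use, but the page can at least tell the user what's happening rather than leaving them staring at a hung tab.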
also, any idea what the on-disk impact is?
> That's unfixable until operating systems start reliably shipping their own prebaked models that an API like this could plug into.
Maybe the next big thing will be premium software subscriptions that bundle a rack of 5090s as an extra.