"But I can’t communicate directly from client -> LLM Service without leaking the API key."
There is a way you can do that right now: the OpenAI WebRTC API introduced the idea of an "ephemeral key": https://platform.openai.com/docs/guides/realtime-webrtc
This gives your server a way to create a limited-time API key for a user, which their browser can then use to talk to OpenAI's API directly without proxying through your server.
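For anyone who hasn't tried it, the flow is roughly this. A minimal sketch based on the guide linked above; the endpoint, model name, and `client_secret` response field are the ones documented there, so double-check against the current docs before relying on them:

```typescript
// --- Server side: mint a short-lived key; your real key never leaves the server ---
async function mintEphemeralKey(): Promise<string> {
  const resp = await fetch("https://api.openai.com/v1/realtime/sessions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model: "gpt-4o-realtime-preview", voice: "verse" }),
  });
  const session = await resp.json();
  // Short-lived key, safe to hand to the browser
  return session.client_secret.value;
}

// --- Browser side: use the ephemeral key to negotiate WebRTC with OpenAI directly ---
async function connect(ephemeralKey: string): Promise<RTCPeerConnection> {
  const pc = new RTCPeerConnection();
  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
  pc.addTrack(mic.getTracks()[0]);

  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);

  // The SDP exchange goes straight to OpenAI, authenticated with the ephemeral key,
  // so your server is out of the media path entirely
  const answer = await fetch(
    "https://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${ephemeralKey}`,
        "Content-Type": "application/sdp",
      },
      body: offer.sdp,
    }
  );
  await pc.setRemoteDescription({ type: "answer", sdp: await answer.text() });
  return pc;
}
```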
I love this idea, but I want it for way more than just the WebRTC API, and I'd like it for other API providers too.
My ideal version would be a way to create an ephemeral API key that's only allowed to talk to a specific model with a specific pre-baked system prompt (and maybe tool configuration and suchlike), that only works for a limited time, and that has a limited token budget.
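To be concrete about the shape of the feature I want: the minting call might look something like the sketch below. Everything here is hypothetical, as no provider offers this today; the endpoint and field names (`ephemeral_keys`, `system_prompt`, `expires_in`, `max_tokens`) are made up purely to illustrate the idea.

```typescript
// Hypothetical minting call: endpoint and fields are invented for illustration
const resp = await fetch("https://api.example.com/v1/ephemeral_keys", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.PROVIDER_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "some-model",                   // key only works against this model
    system_prompt: "You are a haiku bot.", // pre-baked; the client can't override it
    tools: [],                             // locked-down tool configuration
    expires_in: 300,                       // seconds until the key stops working
    max_tokens: 10_000,                    // hard token budget across all calls
  }),
});
const { key } = await resp.json(); // scoped key, safe to hand to the browser
```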
interesting, will check that out. thanks!