
ilaksh • today at 4:41 PM

From an AI integration perspective, I am hopeful that Cloudflare may be able to improve "performance on the cheap" for Replicate's models a little bit.

Replicate has had multiple ways to deploy with autoscaling, and you can keep a model booted and warm by running it periodically, but that has always seemed too expensive for a broke bootstrapper like me. So I avoided it, and model popularity was a big deciding factor in what I used. For the same reason, and because of the potential boot-up delay, I generally avoided it for latency-sensitive things.

I guess there is a limit to what you can do. At some point someone has to spend the money to have the resources stay ready.

But with Cloudflare, theoretically the pool of potential users goes up, and it becomes more likely for someone to have already booted your model.
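To put a rough shape on that intuition, here is a minimal sketch. It assumes requests arrive as a Poisson process and that the host keeps a model warm for a fixed idle window after each request. Both are my assumptions for illustration, not documented Replicate or Cloudflare behavior, and `idle_timeout_s` is a hypothetical parameter.

```python
import math


def warm_hit_probability(requests_per_hour: float, idle_timeout_s: float) -> float:
    """Chance that a request finds the model already booted.

    Assumes Poisson arrivals and a warm window of idle_timeout_s seconds
    that restarts after every request (illustrative assumptions only).
    """
    rate_per_s = requests_per_hour / 3600.0
    # Time since the previous request is exponential, so the chance it
    # falls inside the warm window is 1 - exp(-rate * window).
    return 1.0 - math.exp(-rate_per_s * idle_timeout_s)


if __name__ == "__main__":
    # More traffic from a bigger pool of users means fewer cold boots.
    for rph in (1, 10, 100):
        p = warm_hit_probability(rph, idle_timeout_s=300)
        print(f"{rph:>4} req/hour -> P(warm) = {p:.3f}")
```

Under those assumptions the curve rises quickly with traffic, which is why a larger shared user pool could help unpopular models the most.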

At the moment I am especially interested in performant and easy ways to run models like "sensefvg/InteractiveOmni-8B" or Qwen 2.5 Omni, or models that are even more all-in-one than that, like OpenAI Realtime or Gemini Live.

Now that Ernie 5 has launched with (Omni) multimodality built in, I think that within six months developers are going to start to expect speech-to-speech capability from major AI lab releases or product lineups. I feel like eventually the spatiotemporal understanding of video models will be merged in too, to make the models understand the world better. But speech in and speech out is closer to being a standard expectation.

Instead of running three models for STT->LLM->TTS, with a bunch of tricks like eager end of turn or speculative decoding that basically mean you run the LLM twice or on two different models, and possibly getting shut down by API rate limits, a speech-to-speech model is a single model that both understands and generates audio as well as text, such as for function calls.
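A minimal sketch of why the cascade hurts latency: the stages run in sequence, so their delays and the per-hop network overhead add up. The stage names match the pipeline above, but every number here is a made-up placeholder, not a measurement, and `hop_overhead_ms` is a hypothetical parameter.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    latency_ms: float  # placeholder value, NOT a benchmark


def time_to_first_audio_ms(stages: list[Stage], hop_overhead_ms: float = 0.0) -> float:
    """Sum stage latencies plus a fixed overhead per hop, assuming strictly sequential stages."""
    return sum(s.latency_ms for s in stages) + hop_overhead_ms * len(stages)


# Three separate models, each its own API call (and its own rate limit).
cascade = [Stage("stt", 200), Stage("llm", 400), Stage("tts", 150)]
# One model that takes audio in and produces audio out.
single = [Stage("speech_to_speech", 500)]

if __name__ == "__main__":
    print("cascade:", time_to_first_audio_ms(cascade, hop_overhead_ms=50))
    print("single: ", time_to_first_audio_ms(single, hop_overhead_ms=50))
```

Real pipelines stream between stages, so the true gap depends on how much overlap you can get, but each extra hop is still another service that can be cold or rate limited.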

This is probably an annoying comment because I am immediately trying to raise the requirements from every model for cheap to every model for cheap in a low-latency, real-time streaming way. I just happen to have a contract now that has shown me that multimodal, like voice to voice, is much more convenient but also much more expensive, with fewer options.

Replicate has been so awesome though. Within like a day of me requesting InteractiveOmni, lucataco had it up. So another annoying comment, I sure hope he got paid.
