Yeah, the question in the title has an answer: "by using gpt-4o, a model two years behind the frontier, to serve audio responses."