These models always seem great, until you actually use them for real tasks. The reliability goes way down, you cant trust the output like you can with even a lower end model like 4o. The benchmarks aren't capturing some kind of common sense usability metric, where you can trust the model to handle random small amounts of ambiguity in every day real world prompts
That seems like both a generalization and hyperbole. How are you envisioning this being deployed?
Fair point. Actually probably the best part about having beaucoup bucks like Open AI is being able to chase down all the manifold little ‘last-mile’ imperfections with an army of many different research teams.