Hacker News

metadat, today at 4:31 PM

Are you saying gpt-5 produces gibberish 15% of the time? Or are you comparing Mistral gibberish production rate to gpt-5.1's complex task failure rate?

Does Mistral even have a Tool Use model? That would be awesome to have a new coder entrant beyond OpenAI, Anthropic, Grok, and Qwen.


Replies

barrell, today at 4:47 PM

Yes. I spent about 3 days trying to optimize the prompt to get gpt-5 to not produce gibberish, to no avail. Completions took several minutes, had a timeout rate above 50% (with a 6-minute timeout, mind you), and even after retrying they would still return gibberish about 15% of the time (12% on one task, 20% on another).

I then tried multiple models, and they all failed in spectacular ways. Only Grok and Mistral had an acceptable success rate, although Grok did not follow the formatting instructions as well as Mistral.

Phrasing is a language learning application, so the formatting is very complicated, with multiple languages and multiple scripts intertwined with markdown formatting. I do include dozens of examples in the prompts, but it's something many models struggle with.

This was a few months ago, so to be fair, it's possible gpt-5.1 or gemini-3 or the new deepseek model may have caught up. I have not had the time or need to compare, as Mistral has been sufficient for my use cases.

I mean, I'd love to get that 0.1% error rate down, but there have always been more pressing issues XD
