> The better question is: what is the acceptable threshold?
Currently we are unable to answer that question. AND THAT'S THE PROBLEM. I'd be fine if we could. Well, at least far less annoyed. I'm not sure what the threshold should be, but we should always try to minimize it, and even rough error bounds would go a long way toward making that possible. Right now we have no clue, and that's why this question keeps coming up. It's not that specific error levels go unreported because they're small and nobody wants you looking at them; they go unreported because nobody has a fucking clue what they are.
And until someone has a clue, you shouldn't trust that the error rate is low. The burden of proof is on the one claiming good performance, not on the one asking for evidence of that claim (i.e. the skeptic).
Btw, I'd be careful with percentages, especially when the absolute numbers are very large. E.g. LLMs are being trained on trillions of tokens, and 10% of 1 trillion is 100 billion. The complete works of Shakespeare are about 1.2M tokens... A 10% error rate would be big enough to spoil any dataset. The bitter truth is that as the absolute numbers grow, the threshold for acceptable spoilage (in percentage terms) needs to shrink.
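To make that concrete, here's a quick back-of-the-envelope sketch in Python (the token counts are the rough figures from above, not exact measurements):

```python
# Back-of-the-envelope: what a "small" error rate means at LLM scale.
# Rough figures from the comment above: ~1 trillion training tokens,
# ~1.2M tokens for the complete works of Shakespeare.

TRAINING_TOKENS = 1_000_000_000_000  # ~1 trillion
SHAKESPEARE_TOKENS = 1_200_000       # complete works, approximate

for error_rate in (0.10, 0.01, 0.001):
    spoiled = TRAINING_TOKENS * error_rate
    shakespeares = spoiled / SHAKESPEARE_TOKENS
    print(f"{error_rate:.1%} error rate -> {spoiled:,.0f} spoiled tokens "
          f"(~{shakespeares:,.0f}x the complete works of Shakespeare)")
```

Even at 0.1%, the spoiled portion is roughly 800 copies of Shakespeare's complete works, which is exactly why the percentage threshold has to drop as the corpus grows.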
There's also the question of "what is it failing at?".
I'm fine with 5% failure if my soup is a bit too salty. Not fine with 0.1% failure if it contains poison.