To be blunt, I can't take this product seriously when they don't even run benchmarks. Your prompts make Claude better? Cool: prove it. Methods to evaluate LLM performance exist; they're called evals/benchmarks, and every company that is serious about AI runs them when it releases a new version. (Of course benchmarks have their own issues, but squabbling over which benchmark is best and what its issues are is step 2 in being a Serious AI Company, and step 1 is running them at all!) The fact that the only proof they have that 6 is better than 5 is a hacky table in a screenshot from Fable is, honestly, concerning.