Hacker News

sigmoid10 today at 10:23 AM

Interesting how well a panel of Fable 5 + GPT 5.5 beats the frontier of either one, but if you add Gemini into the mix the panel of three performs worse, not better. To me that sounds like Gemini is worse at the given tasks but better at convincing judges of its solutions. Oh and a Panel of 2 Opus 4.8 models is almost exactly as good as one Fable 5. That smells suspicious. Do we know if that might simply be what Anthropic is doing behind the curtain?



qsort today at 10:34 AM

Yeah, GPT 5.5 + Fable beating either individually is believable, but 2x Opus > Fable is what makes me a bit dubious about the whole thing. They might be measuring skills that are too specific or that benefit a lot from more tokens being thrown at them. Also, Claude Code (the harness) is not the best at the moment; that might be part of it as well?

andai today at 2:16 PM

> Interesting how well a panel of Fable 5 + GPT 5.5 beats the frontier of either one, but if you add Gemini into the mix the panel of three performs worse

I'm not seeing that? Did you maybe misread the #2 ranked one as Fable + GPT + Gemini? It's actually Opus + GPT + Gemini.

waysa today at 10:45 AM

> Oh and a Panel of 2 Opus 4.8 models is almost exactly as good as one Fable 5. That smells suspicious. Do we know if that might simply be what Anthropic is doing behind the curtain?

I wouldn't be surprised if Fable/Mythos is a model distilled from a Panel/Council of Claude instances. Recursive self improvement is something all AI labs must be working on in some way or another.