Hacker News

minimaxir today at 5:25 PM

That is fair to point out. For those who don't know, ChatGPT Image 2 has an absurd Elo of 1387; the #2 model sits at 1273, so it's over 100 points higher (https://arena.ai/leaderboard/text-to-image). The tradeoff is latency, and ChatGPT Image 2 at High is...slow (~2 minutes at 1024x1024). In both cases it would have skewed the charts here to uselessness.
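For a rough sense of what a gap like 1387 vs. 1273 means, here is a minimal sketch that converts a rating difference into an expected head-to-head win rate. It assumes the standard logistic Elo formula on a 400-point scale; the leaderboard may fit its scores differently, so treat the number as a ballpark. The function name `elo_expected_score` is made up for illustration.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A vs. B under the standard logistic Elo model.

    Assumption: the conventional 400-point scale. The leaderboard's own
    scoring method may differ.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings quoted in the comment above: 1387 (top model) vs. 1273 (#2 model).
gap_win_rate = elo_expected_score(1387, 1273)
print(f"Expected win rate for the higher-rated model: {gap_win_rate:.3f}")
```

Under that assumption, the 114-point gap works out to the top model being preferred in roughly two out of three pairwise votes.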

I want to do a writeup on ChatGPT Image 2 but at this point I don't think people care about nuanced image generation anymore...even though ChatGPT Image 2 crushes all my existing tests.


Replies

revolvingthrow today at 7:58 PM

While I have no experience with it personally (no interest in image gen), my aunt was raving about the current ChatGPT image model for "restoring" / working with old photos: sharpening, changing small details like an ill-fitting background. It takes her a bunch of prompting, but eventually she gets things just right. In comparison, current Gemini output (supposedly) tends to be subtly off: details aren’t quite right, proportions are subtly changed, etc.

This is purely about generating images with people in them. I don’t think she’s doing any logic puzzles with gotchas and specific alignments of differently colored blocks and whatnot.

vunderba today at 5:57 PM

That arena leaderboard has some questionable results. Anyone who's used these models would know that ranking HiDream above Krea2 is a pretty hot take.

Many of these Elo comparative tests (ArtificialAnalysis is guilty as hell of this as well) also have other problems, such as a considerable number of "amateur judges" tending to prioritize aesthetics over actual instruction-following given the prompt.

Also (less a critique of Arena.AI necessarily), the MAI models are so incredibly locked down (i.e. censored) as to be functionally useless. I have a sneaking suspicion it's fallout from Tay.

https://en.wikipedia.org/wiki/Tay_(chatbot)

shmolyneaux today at 5:34 PM

I definitely appreciated your post about Nano Banana Pro. It's also a genuinely useful time-capsule for how these systems evolve and where they fall short. I've preferred the output of ChatGPT Image 2. I think a post would be very helpful for folks to see what they're missing.