Hacker News

Gemini 3 Pro vs. 2.5 Pro in Pokemon Crystal

306 points by alphabetting last Tuesday at 12:48 PM | 90 comments

Comments

orbital-decay yesterday at 6:58 PM

The baked-in assumptions observation is basically the opposite of the impression I get after watching Gemini 3's CoT. With the maximum reasoning effort, it's able to break out of the wrong route by rethinking the strategy. For example, I gave it an onion address without the .onion part and told it to figure out what this string means. All reasoning models, including Gemini 2.5 and 3, assume it's a puzzle or a cipher (because they're trained on those) and start endlessly applying different algorithms to no avail. Gemini 3 Pro is the only model that can break the initial assumption after running out of ideas ("Wait, the user said it's just a string, what if it's NOT obfuscated") and correctly identify the string as an onion address. My guess is they trained it on simulations to enforce the anti-jailbreaking commands injected by Model Armor, as its CoT is incredibly paranoid at times. I could be wrong, of course.

bbondo yesterday at 3:46 PM

1.88 billion tokens * $12 / 1M tokens (output) suggests a total cost of $22,560 to solve the game with Gemini 3 Pro?
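
The arithmetic above can be checked with a minimal sketch. The token count and the $12 per 1M output-token rate are the figures quoted in the comment, not verified pricing, and `token_cost_usd` is a hypothetical helper name. Like the comment, it prices every token at the output rate, which is a simplifying assumption.

```python
def token_cost_usd(tokens: int, usd_per_million: float) -> float:
    """Cost of `tokens` tokens at a flat per-million-token rate."""
    return tokens / 1_000_000 * usd_per_million


# Figures taken from the comment; both are assumptions, not confirmed pricing.
total_tokens = 1_880_000_000      # "1.88 billion tokens"
output_rate_usd = 12.0            # "$12 / 1M tokens (output)"

estimate = token_cost_usd(total_tokens, output_rate_usd)
print(f"${estimate:,.0f}")  # prints "$22,560"
```

If part of the 1.88 billion tokens were billed at a different rate, the total would change accordingly.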

oceansky yesterday at 3:23 PM

"Crucially, it tells the agent not to rely on its internal training data (which might be hallucinated or refer to a different version of the game) but to ground its knowledge in what it observes."

Does this even have any effect?

soulofmischief yesterday at 3:26 PM

Nice writeup! I need to start blogging about my antics. I rigged up several cutting-edge small local models to an emulator, all in-browser, and tried unsuccessfully to get them to play different Pokémon games. They just weren't as sharp as the frontier models.

This was a good while back but I'm sure a lot of people might find the process and code interesting even if it didn't succeed. Might resurrect that project.

krige today at 6:03 AM

As a fun comparison, Gemini 3 Pro took 17 days to beat the game. Twitch Plays Pokemon, which was frequently random, chaotic, even malicious, took 13 days to clear Crystal.

cg5280 yesterday at 4:38 PM

I like the inclusion of the graph at the end to compare progress. It would be cool to compare this directly to competing models (Claude, GPT, etc).

sussmannbaka yesterday at 6:06 PM

So after years of being gleefully told that AI will replace all jobs, an omniscient state-of-the-art model, with heavy assistance, takes more than two weeks and thousands of dollars in tokens to do what child me did in a few days? Huh.

squimmy26 yesterday at 3:55 PM

How certain can we be that these improvements aren't just a result of Gemini 3 Pro pre-training on endless internet writeups of where 2.5 has struggled (and almost certainly what a human would have done instead)?

In other words, how much of this improvement is true generalization vs memorization?

topaz0 today at 3:25 AM

Who do I have to talk to to get somebody to pay me thousands of dollars to beat a game from the 90s?

dash2 today at 2:14 AM

> it often makes early assumptions and fails to validate them, which can waste a lot of time

Is this baked into how the models are built? A model outputs a bunch of tokens, then reads them back and treats them as the existing "state" which has to be built on. So if the model has earlier said (or acted like) a given assumption is true, then it is going to assume "oh, I said that, it must be the case". Presumably one reason that hacks like "Wait..." exist is to work around this problem.
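
The feedback loop described above can be illustrated with a toy sketch. This is not how any real model is implemented: `generate`, `toy_step`, and `toy_step_with_wait` are hypothetical names, and the "model" here is a hand-written rule standing in for a network. The only point is that each output is appended to the context and then conditions every later step, so an early assumption keeps being built on until something explicitly revisits it.

```python
from typing import Callable, List


def generate(step: Callable[[List[str]], str],
             prompt: List[str], n_steps: int) -> List[str]:
    """Autoregressive loop: each new token is appended to the context
    that the next step reads."""
    context = list(prompt)
    for _ in range(n_steps):
        next_token = step(context)    # sees the prompt plus all earlier output
        context.append(next_token)    # the output becomes part of the state
    return context


def toy_step(context: List[str]) -> str:
    """Hypothetical stand-in for a model: once an assumption has been
    written into the context, it keeps building on it."""
    if "assume:cipher" in context:
        return "try-another-cipher"
    return "assume:cipher"


def toy_step_with_wait(context: List[str]) -> str:
    """Same toy rule, plus a 'wait' step that revisits the assumption
    after three failed attempts."""
    if context and context[-1] == "wait":
        return "assume:plain-string"
    if "assume:plain-string" in context:
        return "identify-string"
    if context.count("try-another-cipher") >= 3:
        return "wait"
    return toy_step(context)


stuck = generate(toy_step, ["mystery-string"], 5)
unstuck = generate(toy_step_with_wait, ["mystery-string"], 7)
print(stuck[-1])    # prints "try-another-cipher"
print(unstuck[-1])  # prints "identify-string"
```

In the first run the early "assume:cipher" token is never questioned; in the second, the explicit "wait" step is what lets the loop drop it.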

dpedu today at 2:48 PM

Is the code behind this available?

jwrallie yesterday at 3:14 PM

Having been through the game recently, I am not surprised Goldenrod Underground was a challenge. It is very confusing, and even though I solved it through trial and error, I still don't know what I did. Olivine Lighthouse is the real surprise, as it felt quite obvious to me.

wild_pointer yesterday at 3:18 PM

I wonder how much of it is due to the model being familiar with the game or parts of it, be it from training on the game itself or from reading/watching walkthroughs online.

reilly3000 yesterday at 8:11 PM

I’d love to see how the new flash-3 model would fare.

elif yesterday at 6:25 PM

Give it the GameFAQs guide next time.