> Using the develop web game skill and preselected, generic follow-up prompts like "fix the bug" or "improve the game", GPT‑5.3-Codex iterated on the games autonomously over millions of tokens.
I wish they would share the full conversations, token counts, and more. I'd like a better sense of how they normalize these comparisons across versions. Is this a 3-prompt, 10M-token game or a 30-prompt, 100M-token game? Are both models using similar prompts and token counts?
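For what it's worth, a minimal sketch of what a normalized comparison could look like: both models get the same scripted follow-up prompts and the harness records prompt count and cumulative token usage, so runs are capped by the same budgets. The `run_model` callable and its return shape are hypothetical stand-ins, not any vendor's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable

# Generic follow-ups quoted in the article.
FOLLOW_UPS = ["fix the bug", "improve the game"]

@dataclass
class RunStats:
    prompts: int = 0
    tokens: int = 0
    transcript: list = field(default_factory=list)

def compare_run(run_model: Callable[[str], tuple[str, int]],
                max_prompts: int = 30,
                token_budget: int = 100_000_000) -> RunStats:
    """Drive one model with scripted follow-ups until a prompt or token cap is hit."""
    stats = RunStats()
    while stats.prompts < max_prompts and stats.tokens < token_budget:
        prompt = FOLLOW_UPS[stats.prompts % len(FOLLOW_UPS)]
        # Hypothetical interface: returns the reply text plus tokens consumed.
        reply, used = run_model(prompt)
        stats.prompts += 1
        stats.tokens += used
        stats.transcript.append((prompt, reply))
    return stats
```

Running both models through the same `compare_run` would at least make the "3-prompt vs. 30-prompt" question answerable from the logs.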
I vibe-coded a small Factorio web clone [1] that got pretty far using the models from last summer. I'd love to compare against this.
I just wanted to say that's a pretty cool demo! I hadn't realised people were using it for things like this.