I’ve been working on this for about a year through four major rewrites. Godogen is a pipeline that takes a text prompt, designs the architecture, generates 2D/3D assets, writes the GDScript, and tests it visually. The output is a complete, playable Godot 4 project.
Getting LLMs to reliably generate functional games required solving three specific engineering bottlenecks:
1. Training Data Scarcity: LLMs barely know GDScript. It has ~850 classes and a Python-like syntax that will happily let a model hallucinate Python idioms that fail to compile. To fix this, I built a custom reference system: a hand-written language spec, full API docs converted from Godot's XML source, and a quirks database for engine behaviors you can't learn from the docs alone. Because loading all ~850 classes would blow up the context window, the agent lazy-loads only the specific APIs it needs at runtime.
2. Build-Time vs. Runtime State: Scenes are generated by headless scripts that build the node graph in memory and serialize it to .tscn files. This avoids the fragility of hand-editing Godot's serialization format. But it means certain engine features (like `@onready` or signal connections) aren't available at build time; they only exist when the game actually runs. Teaching the model which APIs are available in which phase, and that every node needs its owner set correctly or it silently vanishes on save, took careful prompting but paid off (a minimal sketch of such a build script follows this list).
3. The Evaluation Loop: A coding agent is inherently biased toward its own output. To stop it from cheating, a separate Gemini Flash agent acts as visual QA. It sees only the rendered screenshots from the running engine (no code) and compares them against a generated reference image. It catches the visual bugs text analysis misses: z-fighting, floating objects, physics explosions, and grid-like placements that should be organic. (A sketch of the kind of screenshot capture involved also follows this list.)
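To make point 2 concrete, here's a minimal sketch of the kind of headless build script described there. The names and paths are illustrative, not taken from the repo:

```gdscript
# Hypothetical example; run headless with something like:
#   godot --headless --script build_level.gd
extends SceneTree

func _init() -> void:
    var root := Node3D.new()
    root.name = "Level"

    var ground := MeshInstance3D.new()
    ground.mesh = BoxMesh.new()
    root.add_child(ground)
    # Without this, the node is silently dropped when the scene is packed and saved.
    ground.owner = root

    var packed := PackedScene.new()
    packed.pack(root)
    ResourceSaver.save(packed, "res://generated_level.tscn")
    quit()
```

Note that `@onready` and signal wiring only matter once the saved scene is instantiated in a running game; none of that exists while this script executes.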
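And for point 3, grabbing the frames the QA agent looks at can be as simple as a small method on some node in the running game. Again an illustrative sketch, not the repo's actual code:

```gdscript
# Hypothetical capture helper attached to a node in the running game.
extends Node

func capture_screenshot(path: String = "user://qa_shot.png") -> void:
    # Wait until the current frame has finished rendering before grabbing it.
    await RenderingServer.frame_post_draw
    var img: Image = get_viewport().get_texture().get_image()
    img.save_png(path)
```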
Architecturally, it runs as two Claude Code skills: an orchestrator that plans the pipeline, and a task executor that implements each piece in a `context: fork` window so mistakes and state don't accumulate.
Everything is open source: https://github.com/htdt/godogen
Demo video (real games, not cherry-picked screenshots): https://youtu.be/eUz19GROIpY
Blog post with the full story (all the wrong turns) coming soon. Happy to answer questions.
Great work but why not use C# instead of GDScript?
LLMs are really good at C# (and at .tscn files, for some reason), so that solves the "LLMs suck at GDScript" problem. Also, C# can be cheaper in terms of token usage, especially when you also account for not having to load the additional API docs: one agent writes the interfaces, another fills in the details.
Saying this because I really enjoyed vibecoding a Godot game in C#, and it was REALLY painful to vibecode with GDScript.
How does this stack up against something like Tesana [1], which is also Godot-based? Would it be accurate to say that it's like "Tesana but local"?
Context: I've been using agents (both Claude Code and Codex) for my daily work and for personal projects, but always in domains where I had some knowledge, and I'm currently happy with them.
I tried using Claude Code to build an RPG game with Godot and GDScript, using free-to-use assets: a total failure :/
The game was supposed to take many implementation steps, but I asked Claude to first produce a one-area demo so I could test the assets and choose the ones I liked. First it produced some garbage, using the assets randomly. Then it tried to copy from an existing demo, but it had no idea where a door or a path was, and at a certain point it even admitted it with something like: "I can't design a usable and nice area: I either make it functional and ugly, or I copy and adapt the existing demo but have no clue about what is what."
I've never even attempted to develop games before so I'm sure I don't even know the basic concepts, but this use case definitely didn't work for me.
Maybe it could generate the code of the game if I provided the full design?
This is an incredible piece of work. I was looking into the .claude folder and skim-reading it. One thing that stood out to me is how large it is.
If I'm not mistaken about how Claude Code or AI agents work, they need everything in 'context', plus a few tricks to reduce the context size. Sure, but given the number of files you have, how much of the context is consumed by all those Claude files vs. the actual user input?
This actually produces more impressive results than I expected. My understanding was that models are quite poor at spatial reasoning/understanding, so I'm surprised it can generate such good assets. Do you use different models for the 3D generation?
There is not much need for this. I already use Claude Code with Godot to build serious projects, and you only need to point the bot at the Godot + source code folder and use C#; then it works like a charm.
Nice set of prompts and skills though, I'm grabbing them for personal use.
What is the development loop like with this? There’s a lot of folks successfully building games with agents already on the AI gamedev Discord server. So I’m wondering if there were some shorter paths to your goal. You might want to exchange notes with folks there.
Interesting. But if you claim "prompt in, Godot game out", how do you deal with assets? I think the asset pipeline is one of the most challenging parts of game dev. Is there anything similar but for Bevy?
Nice work, must have been a pain to get Godot's formats working with Claude. As another commenter suggested, the demo videos don't do this project justice: yes, the magic is that you can generate playable (I wouldn't say complete) games with a single prompt, but the quality of those games is exactly why people are so put off by AI slop. If this were a better harness that acted more like a tool, I think it would be seen as more useful.
Btw: Have you looked at Tripo3D models' topology? Is it still so bad that if you want to make small edits you have to retopologize the whole thing first?
FWIW, as a disclaimer, I'm making my own game without AI since I value learning the skills myself, but I am interested to see how fast AI tools adapt to gamedev. For now they've been more of a false shortcut in anything other than prototyping and semantic search ("I need to achieve this visual effect, what algorithms should I look up").
I think this is a cool tech demo. But the commonality I see in all of these "let the agent run free" harnesses is that the output is never something I would want to use/watch/play.
I think minimizing the amount of human effort in the loop is the wrong optimization, and it's the reason we end up with "slop".
It's the dream of a lot of people to have a magic box that makes you things you can sell, or enjoy for personal leisure. But LLMs are not the magic box. And there may not ever be a magic box. The sooner we accept that the magic box isn't in the room with us, the sooner we can start getting real utility out of LLMs.
TLDR: Human taste is more important than building things for the sake of building them.
“Real games”: the most incomplete bullshit you ever saw passed off as a game.
The starting points of Three.js examples are more of a game than anything here.
Stop saying AI is building games when it can’t even build a standard web page to match a mockup.
Upvoting this! And thanks!
I saw the demo video and, in all honesty, the games felt really lifeless to me. The snowboarding one caught my attention the most, but the mechanics and the character's movement made the physics seem really bad. Do you have a published game I could try rather than these demos? I'm curious.