Hacker News

Get Shit Done: A meta-prompting, context engineering and spec-driven dev system

332 points by stefankuehnel, yesterday at 8:23 PM | 162 comments | view on HN

Comments

coopykins, today at 9:06 AM

There are so many of these "meta" frameworks going around. I have yet to see one that proves in any meaningful way they improve anything. I have a hard time believing they accomplish anything other than burn tokens and poison the context window with too much information. What works best IME is keeping things simple, clear and only providing the essential information for the task at hand, and iterating in manageable slices, rather than trying to one-shot complex tasks. Just Plan, Code and Verify, simple as that.

gtirloni, yesterday at 9:33 PM

I was using this and superpowers, but eventually Plan mode became enough and I prefer to steer Claude Code myself. These frameworks are great for fire-and-forget tasks, especially when there is some research involved, but they burn 10x more tokens, in my experience. I was always hitting the Max plan limits for no discernible benefit in the outcomes I was getting. But this will vary a lot depending on how people prefer to work.

joegaebel, today at 6:06 AM

In my view, Spec-Driven systems are doomed to fail. There's nothing that couples the English-language specs you've written with the actual code and behaviour of the system - unless your agent is being insanely diligent and constantly checking whether the entire system aligns with your specs.

This has been solved already: automated testing. Tests encode the behaviour of the system into executables that actually tell you whether your system aligns or not.

Better to encode the behaviour of your system into real, executable, scalable specs (aka automated tests), otherwise your app's behaviour is going to spiral out of control after the Nth AI generated feature.

The way to ensure this actually scales with the firepower LLMs have for writing implementation is to make the agent follow a workflow where it knows how to test, writes the tests first, and confirms that the tests actually reflect the behaviour of the system with mutation testing.

I've scoped this out here [1] and here [2].

[1] https://www.joegaebel.com/articles/principled-agentic-softwa... [2] https://github.com/JoeGaebel/outside-in-tdd-starter

anentropic, today at 8:33 AM

I have been using this a lot lately and ... it's good.

Sometimes annoying - you can't really fire and forget (I tend to regret skipping discussion on any complex tasks). It asks a lot of questions. But I think that's partly why the results are pretty good.

The new /gsd:list-phase-assumptions command added recently has been a big help there to avoid needing a Q&A discussion on every phase - you can review and clear up any misapprehensions in one go and then tell it to plan -> execute without intervention.

It burns quite a lot of tokens reading and re-reading its own planning files at various times, but it manages context effectively.

Been using the Claude version mostly. Tried it in OpenCode too, but it's a bit buggy there.

They are working on a standalone version built on pi.dev https://github.com/gsd-build/gsd-2 ...the rationale is good I guess, but it's unfortunate that you can't then use your Claude Max credits with it, as it has to use the API.

randomthought12, today at 9:15 AM

I tried this but it creates a lot of content inside the repository and I don't like that. I understand these tools need to organize their context somewhere to be efficient but I feel that it just pollutes my space.

If multiple people work with different AI tools on the same project, they will all add their own stuff in the project and it will become messy real quick.

I'll keep superpowers, claude-mem, context7 for the moment. This combination produces good results for me.

AndyNemmity, today at 12:27 AM

I have an AI system I use. I'd like to release it so others can benefit, but at the same time it's all custom to me and to what I do and work on.

If I fork out a version for others that is public, then I have to maintain that variation as well.

Is anyone in a similar situation? I think most of the ones I see released are not particularly complex compared to my system, but at the same time I don't know how to convey how to use my system, as someone who just uses it alone.

It feels like I don't want anyone to run my system; I just want people to point their AI system at mine and ask it what there is that would be valuable to potentially add to their own system.

I don't want to maintain one for people. I don't want to market it as some magic cure. Just show patterns that others can use.

maccam912, yesterday at 8:55 PM

I've had a good experience with https://github.com/obra/superpowers. At first glance this looks similar. Has anyone tried both who can offer a comparison?

bubblerme, today at 8:08 AM

The spec-driven approach resonates. I've found that the quality of the initial context you feed to AI coding tools determines everything downstream. Vague specs produce vague code that needs constant correction.

One pattern that's worked well for me: instead of writing specs manually, I extract structured architecture docs from existing systems (database schemas, API endpoints, workflow logic) and use those as the spec. The AI gets concrete field names, actual data relationships, and real business logic — not abstractions. The output quality jumps significantly compared to hand-written descriptions.
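That extraction step can be fairly mechanical. A hedged sketch, assuming a SQLite database with invented table names, that turns a live schema into a structured spec an agent can consume:

```python
# Sketch of "extract the spec from the system itself": dump a database
# schema into a structured doc with concrete field names and
# relationships. The tables here are made up for the example.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT NOT NULL);
    CREATE TABLE invoices (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        total_cents INTEGER NOT NULL
    );
""")

def schema_spec(con):
    """Emit a markdown-style spec: tables, columns, and foreign keys."""
    lines = []
    tables = con.execute(
        "SELECT name FROM sqlite_master WHERE type='table'").fetchall()
    for (table,) in tables:
        lines.append(f"## {table}")
        # PRAGMA table_info rows: (cid, name, type, notnull, default, pk)
        for _, name, ctype, notnull, _, pk in con.execute(
                f"PRAGMA table_info({table})"):
            flags = ("" if not pk else " PK") + ("" if not notnull else " NOT NULL")
            lines.append(f"- {name}: {ctype}{flags}")
        # PRAGMA foreign_key_list rows: (id, seq, table, from, to, ...)
        for row in con.execute(f"PRAGMA foreign_key_list({table})"):
            lines.append(f"- FK {row[3]} -> {row[2]}.{row[4]}")
    return "\n".join(lines)

print(schema_spec(con))
```

The same pattern extends to API route tables or workflow configs: anything already machine-readable makes a more concrete spec than hand-written prose.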

The tricky part is getting that structured context in the first place. For greenfield projects it's straightforward. For migrations or rewrites of existing systems, it's the bottleneck that determines whether AI-assisted development actually saves time or just shifts the effort from coding to prompt engineering.

yoaviram, yesterday at 9:15 PM

I've been using GSD extensively over the past 3 months. I previously used speckit, which I found lacking. GSD consistently gets me 95% of the way there on complex tasks. That's amazing. The last 5% is mostly "manual" testing. We've used GSD to build and launch a SaaS product including an agent-first CMS (whiteboar.it).

It's hard to say why GSD worked so much better for us than other similar frameworks, because the underlying models also improved considerably during the same period. What is clear is that it's a huge productivity boost over vanilla Claude Code.

Frannky, yesterday at 11:28 PM

I tried it once; it was incredibly verbose, generating an insane amount of files. I stopped using it because I was worried it would not be possible to rapidly, cheaply, and robustly update things as interaction with users generated new requirements.

The best way I have today is to start with a project requirements document and then ask for a step-by-step implementation plan, and then go do the thing at each step but only after I greenlight the strategy of the current step. I also specify minimal, modular, and functional stateless code.

DamienB, yesterday at 10:55 PM

I've compared this to superpowers and the classic PRD -> task generator, and I came away convinced that less is more, at least at the moment. GSD performed well, but took hours instead of minutes. Having a simple explanation of how to create a PRD, followed by a slightly more technical task list, performed much better. It wasn't that GSD or superpowers couldn't find a solution; it's just that they did it much slower and with a lot more help. For me, the lesson was that the workflow has changed, and that we can't apply old project-dev paradigms to this new/alien technology. There's a new instruction manual and it doesn't build on the old one.

recroad, yesterday at 11:01 PM

I use openspec and love it. I'm doing 5-7x with close to 100% of code AI-generated, and shipping to production multiple times a day. I work on a large SaaS app with hundreds of customers. Wrote something here:

https://zarar.dev/spec-driven-development-from-vibe-coding-t...

gbrindisi, yesterday at 9:04 PM

I like openspec, it lets you tune the workflow to your liking and doesn’t get in the way.

I started with all the standard spec flow and as I got more confident and opinionated I simplified it to my liking.

I think the point of any spec-driven framework is that you want to eventually own the workflow yourself, so that you can constrain code generation on your own terms.

galexyending, yesterday at 11:12 PM

I gave it a shot, but won't be using it going forward. It requires a waterfall process, and I found it difficult, and in some cases impossible, to adjust phases/plans when bugs or feature changes arise. The execution prompts didn't do a good job of steering the code to be verified while coding, and rely on the user to manually test at the end of each phase.

visarga, today at 3:04 AM

I built a similar system myself, then ran evals on it and found that the planning ceremony is mostly useless: Claude can deal with simple prose, item lists, checkbox todos; anything works. The agent won't be a better coder for how you deliver your intent.

But what makes a difference is running a plan-review agent and a work-review agent; they fix issues before and after the work. Both pull their weight, but the most surprising is the plan-review one. The work-review judge reliably finds bugs to fix, but is less surprising in its insights. They should run as separate subagents, not in the main one, because they need a fresh perspective.

Other things that matter are 1. testing enforcement and 2. cross-task project memory. My implementation of memory is a combination of capturing user messages with a hook, an append-only log, and keeping a compressed memory state of the project, which gets read before work and updated after each task.
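The memory loop described above can be sketched in a few lines. File names are invented, and the compression step here is a trivial stand-in for what would in practice be an LLM summarisation call:

```python
# Sketch: append-only log of raw user messages + a small "compressed"
# state file, read before work and rewritten after each task.
import json
from pathlib import Path

LOG = Path("memory.log.jsonl")     # append-only, never rewritten
STATE = Path("memory.state.json")  # small, rewritten after each task
LOG.unlink(missing_ok=True)        # fresh start for the demo
STATE.unlink(missing_ok=True)

def capture(message: str) -> None:
    """Hook: append the raw user message to the log."""
    with LOG.open("a") as f:
        f.write(json.dumps({"msg": message}) + "\n")

def compress() -> dict:
    """Stand-in summariser: keep only the last few messages as state."""
    entries = [json.loads(line) for line in LOG.read_text().splitlines()]
    state = {"recent": [e["msg"] for e in entries[-3:]], "count": len(entries)}
    STATE.write_text(json.dumps(state))
    return state

def recall() -> dict:
    """Read the compressed state before starting work."""
    return json.loads(STATE.read_text()) if STATE.exists() else {}

capture("use Postgres, not SQLite")
capture("all money values are integer cents")
compress()
print(recall()["count"])
```

The key property is the split: the log preserves everything for audits, while only the compressed state is spent from the context budget.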

vinnymac, today at 1:57 AM

I tried this for a week and gave up. Required far too much back and forth. Ate too many tokens, and required too much human in the loop.

For this reason I don't think it's actually a good name. It should be called planning-shit instead, since that's seemingly 80%+ of what I did while interacting with this tool. And when it came to getting things done, I didn't need this at all, and the plans were just alright.

melvinroest, yesterday at 11:04 PM

If you want some context about spec-driven development and how it could be used with LLMs, I recommend [1]. Having some background like this helps me to understand tools like this a bit more.

[1] https://www.riaanzoetmulder.com/articles/ai-assisted-program...

dfltr, yesterday at 8:53 PM

GSD has a reputation for being a token burner compared to something like Superpowers. Has that changed lately? Always open to revisiting things as they improve.

obsidianbases1, yesterday at 8:53 PM

> If you know clearly what you want

This is the real challenge. The people I know who jump around to new tools have a tough time explaining what they want, and thus how the new tool is better than the last one.

hexnuts, today at 7:09 AM

You are missing one important bit. Semantic Gravity Sieves. Important data in the metadata collapses together, allowing grouped indexing. Something like a DAG allows the logic to be addressed consistently.

jankhg, yesterday at 10:35 PM

Apart from GSD and superpowers, there's another system called PAUL [1]. It apparently requires fewer tokens than GSD, as it does not use subagents but keeps everything in one session. A detailed comparison with GSD is part of the repo [2].

[1] https://github.com/ChristopherKahler/paul

[2] https://github.com/ChristopherKahler/paul/blob/main/PAUL-VS-...

theodorewiles, yesterday at 11:07 PM

I think the research / plan / execute idea is good but feels like you would be outsourcing your thinking. Gotta review the plan and spend your own thinking tokens!

jamesvzb, today at 9:15 AM

Old article but still relevant. Some things don't change.

smusamashah, yesterday at 11:35 PM

There should be an "Examples" section in projects like this one to show what has actually been made using it. I scrolled to the end and was really expecting an example the way it's being advertised.

If it was a game engine or a new web framework, for example, there would be demos or example projects linked somewhere.

arjie, yesterday at 9:53 PM

I could not produce useful output from this. It was useful as a rubber duck because it asks good motivating questions during the plan phase, but the actual implementation was lacklustre and not worth the effort. In the end, I just have Claude Opus create plans, and then I have it write them to memory and update it as it goes along and the output is better.

chrisss395, yesterday at 10:58 PM

I'm curious if anyone has used this (or similar) to build a production system?

I'm facing increasing pressure from senior executives who think we can avoid the $$$ B2B SaaS by using AI to vibe code a custom solution. I love the idea of experimenting with this but am horrified by the first-ever-case being a production system that is critical to the annual strategic plan. :-/

yoavsha1, yesterday at 11:34 PM

How come we have all these benchmarks for models, but none whatsoever for harnesses / whatever you'd call this? While I understand assigning "scores" is more nuanced, I'd love to see a website with a catalog of prompts and the outputs produced by different configurations of model+harness in a single attempt.

jessepcc, today at 2:35 AM

With the coding slot machine, I prefer to move fast and start over if anything goes off track. Maybe the number of tokens spent over several iterations is similar to using a more well-planned system like GSD.

davispeck, today at 2:39 AM

This looks like moving context from prompts into files and workflows.

Makes sense for consistency, but also shifts the problem:

how do you keep those artifacts in sync with the actual codebase over time?
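One partial answer is to make drift detectable rather than prevented. A hedged sketch a CI step could run, with an invented spec-to-code pairing convention (the real mapping would depend on the framework's file layout):

```python
# Flag spec files that are older than the code they claim to describe.
# A crude heuristic: mtime comparison catches "code changed, spec didn't".
import os
import pathlib
import tempfile
import time

def stale_specs(pairs):
    """pairs: [(spec_path, code_path)]; return specs older than their code."""
    return [spec for spec, code in pairs
            if os.path.getmtime(spec) < os.path.getmtime(code)]

# Demo with temp files: the code was modified after its spec was written,
# so the spec shows up as stale.
d = pathlib.Path(tempfile.mkdtemp())
spec, code = d / "auth.spec.md", d / "auth.py"
spec.write_text("# auth spec")
code.write_text("def login(): ...")
os.utime(spec, (time.time() - 3600,) * 2)  # pretend spec is an hour old

print(stale_specs([(str(spec), str(code))]))
```

It doesn't prove the spec is wrong, only that nobody has re-confirmed it since the code moved, which is usually the signal you want from CI.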

MeetingsBrowser, yesterday at 8:52 PM

I've tried it, and I'm not convinced I got measurably better results than just prompting claude code directly.

It absolutely tore through tokens though. I don't normally hit my session limits, but hit the 5-hour limits in ~30 minutes and my weekly limits by Tuesday with GSD.

dhorthy, yesterday at 9:44 PM

it is very hard for me to take seriously any system that is not proven for shipping production code in complex codebases that have been around for a while.

I've been down the "don't read the code" path and I can say it leads nowhere good.

I am perhaps talking my own book here, but I'd like to see more tools that brag about "shipped N real features to production" or "solved Y problem in large-10-year-old-codebase"

I'm not saying that coding agents can't do these things and such tools don't exist, I'm just afraid that counting 100k+ LOC that the author didn't read kind of fuels the "this is all hype-slop" argument rather than helping people discover the ways that coding agents can solve real and valuable problems.

DIVx0, today at 12:04 AM

I’ve tried GSD several times. I actually like the verbosity and it’s a simple chore for Claude to refresh project docs from GSD planning docs.

Like most spec driven development tools, GSD works well for greenfield or first few rounds of “compound engineering.” However, like all others, the project gets too big and GSD can’t manage to deliver working code reliably.

Agents working GSD plans will start leaving orphans all over; it won't wire them up properly because verification stages use simple lexical tools to search code for implementation facts. I tried giving GSD some AST-aware tools, but good luck getting Claude to reliably use them.

Ultimately I put GSD back on the shelf and developed my own “property graph” based planner that is closer to Claude “plan mode” but the design SOT is structured properties and not markdown. My system will generate docs from the graph as user docs. Agents only get tasked as my “graph” closes nodes and re-sorts around invariants, then agents are tasked directly.
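For readers curious what "tasking agents as the graph closes nodes" might look like in miniature, here is a generic sketch using Python's stdlib graphlib. The task names are invented; this is not the commenter's actual system:

```python
# Tasks as nodes with dependencies; an agent only gets tasked with a
# node once all of its predecessors are closed ("ready" in graphlib terms).
from graphlib import TopologicalSorter

graph = {
    "schema":      set(),                      # no dependencies
    "handlers":    {"schema"},
    "wire_routes": {"schema", "handlers"},     # avoids the orphan problem:
    "smoke_tests": {"wire_routes"},            # wiring is an explicit node
}

ts = TopologicalSorter(graph)
ts.prepare()
order = []
while ts.is_active():
    ready = sorted(ts.get_ready())  # nodes an agent could be tasked with now
    order.extend(ready)
    ts.done(*ready)                 # "close" the nodes; unlocks successors

print(order)  # -> ['schema', 'handlers', 'wire_routes', 'smoke_tests']
```

Making "wire it up" a first-class node with dependents is one structural way to stop agents from leaving orphaned components behind.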

loveparade, yesterday at 11:33 PM

"I am a super productive person that just wants to get shit done"

Looked at profile, hasn't done or published anything interesting other than promoting products to "get stuff done"

This is like the TODO list book gurus writing about productivity

thr0waway001, yesterday at 9:53 PM

At the risk of sounding stupid what does the author mean by: “I’m not a 50-person software company. I don’t want to play enterprise theatre.” ?

Andrei_dev, yesterday at 9:36 PM

250K lines in a month — okay, but what does review actually look like at that volume?

I've been poking at security issues in AI-generated repos and it's the same thing: more generation means less review. Not just logic — checking what's in your .env, whether API routes have auth middleware, whether debug endpoints made it to prod.

You can move that fast. But "review" means something different now. Humans make human mistakes. AI writes clean-looking code that ships with hardcoded credentials because some template had them and nobody caught it.

All these frameworks are racing to generate faster. Nobody's solving the verification side at that speed.

prakashrj, yesterday at 8:48 PM

With GSD, I was able to write 250K lines of code in less than a month, without prior knowledge of Claude.

ibrahim_h, yesterday at 10:11 PM

The README recommends --dangerously-skip-permissions as the intended workflow. Looking at gsd-executor.md you can see why — subagents run node gsd-tools.cjs, git checkout -b, eslint, test runners, all generated dynamically by the planner. Approving each one kills autonomous mode.

There is a gsd-plan-checker that runs before execution, but it only verifies logical completeness — requirement coverage, dependency graphs, context budget. It never looks at what commands will actually run. So if the planner generates something destructive, the plan-checker won't catch it because that's not what it checks for. The gsd-verifier runs after execution, checking whether the goal was achieved, not whether anything bad happened along the way. In /gsd:autonomous this chains across all remaining phases unattended.

The granular permissions fallback in the README only covers safe reads and git ops — but the executor needs way more than that to actually function. Feels like there should be a permission profile scoped to what GSD actually needs without going full skip.
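For reference, Claude Code does support declarative permission rules in `.claude/settings.json`. A scoped profile might look something like this; the specific allow entries are guesses at what GSD's executor needs, not an official or tested profile:

```json
{
  "permissions": {
    "allow": [
      "Read",
      "Bash(node gsd-tools.cjs:*)",
      "Bash(git checkout:*)",
      "Bash(npx eslint:*)",
      "Bash(npm test:*)"
    ],
    "deny": [
      "Bash(rm:*)",
      "Bash(curl:*)"
    ]
  }
}
```

Since the planner generates commands dynamically, an allowlist like this will occasionally block legitimate steps, but that friction is arguably the point compared to full skip-permissions.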

Relisora, yesterday at 11:53 PM

Did anyone compare it with everything-claude-code (ECC)?

scuff3d, today at 5:59 AM

Oh boy, if anyone thought productivity hacks, ultra optimized workflows, and "personal knowledge management" systems could get ridiculous, they haven't seen anything yet. This is gonna be the new thing people waste time on now instead of their NeoVim config.

seneca, yesterday at 11:19 PM

I've tried several of these sorts of things, and I keep coming away with the feeling that they are a lot of ceremony and complication for not much value. I appreciate that people are experimenting with how to work with AI and get actual value, but I think pretty much all of these approaches are adding complexity without much, or often any, gain.

That's not a reason to stop trying. This is the iterative process of figuring out what works.

jatora, yesterday at 10:48 PM

Another heavily overengineered AND underengineered abomination. I'm convinced anyone who advocates for these types of tools would find just as much success just prompting claude code normally and taking a little bit to plan first. Such a waste of time to bother with these tools that solve a problem that never existed in the first place.

LoganDark, today at 1:52 AM

This seems like something I'd want to try but I am wholly opposed to `npx` being the sole installation mechanism. Let me install it as a plugin in Claude Code. I don't want `npx` to stomp all over my home directory / system configuration for this, or auto-find directories or anything like that.

canadiantim, today at 1:18 AM

I use Oh-My-Opencode (now called Oh-My-OpenAgent); it's effectively the same as GSD, but better imo.

hermanzegerman, yesterday at 11:44 PM

For me it was awesome. I needed a custom pipeline for preprocessing some lab data, including visualization and manipulation, and it got me exactly what I wanted, as opposed to Codex Plan Mode, which just burned my weekly quota and produced garbage.

noduerme, today at 6:11 AM

Question for people who have spent more time than I have wrangling agents to manage other agents:

I've been using a Claude Pro plan just as a code analyzer / autocomplete for a year or so. But I recently decided to try to rewrite a very large older code base I own, and set up an AI management system for it.

I started this last week, after reading about paperclip.ing. But my strategy was to layer the system in a way I felt comfortable with, so I set up something that now feels a bit like a Rube Goldberg machine. What I did was set up a clean box and give my Claude Pro plan root access to it. Then I set up openclaw on that box, but not with root... so just in case it ran wild, I could intervene. Then I had openclaw set up paperclip.ing.

The openclaw is on a separate Claude API account and is already costing what seems like way too many tokens, but it does have a lot of memory now of the project, and in fairness, for the $150 I've spent, it has rewritten an enormous chunk of the code in a satisfactory way (with a lot of oversight). I do like being able to whatsapp with it - that's a huge bonus.

But I feel like maybe this a pretty wasteful way of doing things. I've heard maybe I could just run openclaw through my Claude Pro plan, without paying for API usage. But I've heard that Anthropic might be shutting down that OAuth pathway. I've also heard people saying openclaw just thoroughly sucks, although I've been pretty impressed with its results.

The general strategy I'm taking on this is to have Claude read the old codebase side by side with me in VSCode, then prepare documents for openclaw to act on as editor, then re-evaluate; then have openclaw produce documents for agent roles in Paperclip and evaluate them.

Am I just wasting my money on all these API calls? $150 so far doesn't seem bad for the amount of refactoring I've gotten, across a database and back and front end at the same time, which I'm pretty sure Claude Pro would not have been able to handle without much more file-by-file supervision. I'm slightly afraid now to abandon the memory I've built up with openclaw and switch to a different tool. But hey, maybe I should just be doing this all on the Claude Pro CLI at this point...?

Looking for some advice before I try to switch this project to a different paradigm. But I'm still testing this as a structure, and trying to figure out the costs.

[Edit: I see so many people talking about these lighter-weight frameworks meant for driving an agent through a large, long-running code building task... like superpowers, GSD, etc... which to me as a solo coder sound very appealing if I were building a new project. But for taking 500k LOC and a complicated database and refactoring the whole thing into a headless version that can be run by agents, which is what I'm doing now, I'm not sure those are the right tools; but at the same time, I never heard anyone say openclaw was a great coding assistant -- all I hear about it being used for is, like, spamming Twitter or reading your email or ordering lunch for you. But I've only used it as a code-manager, not for any daily tasks, and I'm pretty impressed with its usefulness at that...]

desireco42, today at 2:55 AM

I honestly tried this a while back; unless this is something else, it was not a very useful thing.

If I remember correctly, it created a lot of changes and spent a lot of time doing something, and in the end it was all smoke and mirrors. If I were ever to use something like this, I would maybe use BMad, which suffers from the same issues, like Speckit and others.

I don't know if they have some sponsorship with a bunch of YouTubers who are raving about how awesome this is... without any supporting evidence.

Anyhow, this is my experience. Superpowers, on the other hand, have been quite useful so far, but I haven't used them enough to claim anything.


greenchair, yesterday at 8:53 PM

terrible name, DOA
