I'm a bit annoyed by the feeling that we're kind of stuck when it comes to using LLMs for programming.
I use Claude Code and Codex, but I haven't been able to enter flow state like I can when I hand write code.
This is kind of ironic to me since AI should be a bicycle for the mind, but right now it feels like a bicycle that just brakes abruptly every couple minutes. I stop, wait, review, prompt again.
Is there anyone exploring something fundamentally different than the prompt response loop we have today?
I actually think the idea of a tab model is directionally better than prompt response.
Would love to hear about any startups, personal experiments, etc.
I, like many nerds of the same stripe, have a dragon's hoard of every PC component I've owned in the last 20 years. I've attached as much of it to my homelab as is practical, but there's still a pile of GPUs from the last decade plus.
So I decided to load up everything with more than 3GB of VRAM into various machines on the network. Anything that could conceivably run an LLM of any utility. I've been experimenting with driving a swarm of heterogeneous LLMs into coding tasks. I have models as small as llama3.2:3b up to Qwen3.6:27b dense. Over 10 unique models in the swarm.
So far, the results are... interesting. Coding isn't great, but what has worked shockingly well is polling the swarm for opinions. Getting ten unique perspectives synthesized into a single summary has been astonishingly useful. When I gave the swarm the ability to debate with itself, the results got even more interesting.
The end goal here is an autonomous routing network that learns which models excel at which tasks, which machines can fit which models, and intelligently routes requests and models to where they're most effective.
I can't afford an RTX 6000, but I can run smaller models on the pile of GPUs I do have. So far it hasn't worked out the way I'd hoped, but it did turn out to be very useful in other ways. Hopefully soon I can get coding worked out and the swarm can drive itself into self-improvement.
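For the "learns which models excel at which tasks" part, here is a minimal sketch of one way it could work, assuming an epsilon-greedy policy over per-task success rates. The SwarmRouter class, the task-type labels, and the success signal are all hypothetical, and it does not cover deciding which machine can fit which model.

```python
import random
from collections import defaultdict


class SwarmRouter:
    """Epsilon-greedy router that tracks success per (task type, model)."""

    def __init__(self, models, epsilon=0.1, seed=None):
        self.models = list(models)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        # stats[task_type][model] = [successes, attempts]
        self.stats = defaultdict(lambda: {m: [0, 0] for m in self.models})

    def score(self, task_type, model):
        wins, tries = self.stats[task_type][model]
        # Laplace smoothing: an untried model starts at 0.5 instead of 0.
        return (wins + 1) / (tries + 2)

    def pick(self, task_type):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.models)  # explore
        # Exploit: the best observed success rate for this task type.
        return max(self.models, key=lambda m: self.score(task_type, m))

    def record(self, task_type, model, success):
        entry = self.stats[task_type][model]
        entry[0] += int(success)
        entry[1] += 1
```

Each answer gets graded somehow (a test run, a vote from the rest of the swarm) and fed back through record(); over time pick() drifts toward whichever model has actually done well on that kind of task.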
"I haven't been able to enter flow state like I can when I hand write code." new flow state is having 10 terminal tabs in diff worktrees and trying to remember what each bit is
I had this exact problem, tried opening multiple agents in different terminals but that just frays your flow state even more. There is one great workaround I’ve found.
Walk coding. Walkoding, if you like.
Use a harness, create a harness if you like, then load it up in telegram and off you go. I’ve been on solo hiking trips and shipped numerous features. It means you can stay concentrated on your task, while not sitting there being bored.
It’s truly liberating, highly recommend.
I'm working on an agentic graph-based workflow execution engine/framework. The concept of the harness is completely abstracted away/generified: a 'node/agent' is a harness (cc, codex, open code, pi, etc) + model (I test different model and harness combinations). I have a set of tasks from trivial to complex. A set of workflows is defined (a workflow is a set of initial nodes and their behaviour), and each one is asked to perform each task (multiplied by each harness/model combination, roughly). The workflow can include agents/nodes which are able to modify the workflow graph and create nodes. Other nodes can break down tasks and send subtasks to other nodes. Mostly experimental stage at this point. I'm exploring/tracking metrics such as total wall clock time to complete a task, total cost in tokens and $, among others. This gives me a decent amount of data/insight into the abilities/performance of different harness/agents/models for different tasks, and gives me a great testing/dogfooding of my own harness (which is one of the harnesses being tested, and as of now the most efficient one).
The main bottleneck at this point is the cost of all of the tokens in the fairly large test matrix of tasks, harnesses, models.
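To see why the token bill grows so quickly, here is a rough sketch that treats the test matrix as a full cross product. That is an assumption (the real matrix is only "roughly" multiplied out), and build_matrix, estimated_cost, and the per-run token figure are made up for illustration.

```python
from itertools import product


def build_matrix(tasks, workflows, harnesses, models):
    """Every (task, workflow, harness, model) combination to be run."""
    return list(product(tasks, workflows, harnesses, models))


def estimated_cost(matrix, avg_tokens_per_run, usd_per_million_tokens):
    """Rough spend in USD for one full pass over the matrix."""
    total_tokens = len(matrix) * avg_tokens_per_run
    return total_tokens / 1_000_000 * usd_per_million_tokens
```

Five tasks, three workflows, four harnesses and two models is already 120 runs per pass, and every added model or harness multiplies that again.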
I hope to release/open source all of this stuff eventually.
Yeah, flow state is not the same, and I miss it.
I've stumbled into a couple of different ways to work with AI, each with their advantages and disadvantages:
* Ask it to solve the issue and trust the results. You're outsourcing your thinking, and lose understanding of the code. The result might work, or it might not. Chances are it will work, but your code slowly grows messier.
* Ask it to solve the issue and review the results. This should help you understand the resulting code, and give you a better chance of setting the AI straight when it messes up. But you're still outsourcing your thinking, and not thinking of the solution yourself still means you lose touch with the code. But more importantly, reviewing is the boring part of software development.
* You write the code, and let the AI review it. In a way, I think this should be the sweet spot. It doesn't make you faster, but your quality should go up. AI is very good at reviewing, and often finds issues that humans skip over. This is the quality over quantity solution. More than code, I think this is particularly important for writing high-stakes non-code documents, like financial reports for customers. Quality is really important there.
* You tell the AI how to solve the issue. The AI still writes the code, but it receives tighter guidance from you. This is what I usually end up doing. I like to think this results in better code. It certainly gives me the impression I understand better what the code does. And I do tackle much larger amounts of code, but I check what the AI does, and often push back on its suggestions and assumptions. I think this is a nice middle ground between speed and quality.
* Full agent mode. Let the AI do everything. Let multiple agents work simultaneously doing everything. You lose control and your mental model. You're going to have to trust whatever the AI is doing. Sometimes it will be correct, sometimes not. Let's hope you never have to personally touch that code anymore. But it sure is fast.
2 to 5 tabs in warp. Kinda want to figure out how to properly use iterm+tmux to have the boris cherny experience. already used both but there were issues with tmux with either scroll back, or copy paste or other similar things that get broken, even after messing with settings.
i use git worktrees in different tabs as needed.
i have git push hooks that audit the code diffs for security issues and code quality using 2+ frontier models, with a fail-closed condition where both have to give the OK.
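A minimal sketch of what a fail-closed gate like that could look like, assuming each model audit is wrapped in a function that returns True only for an explicit OK. gate_push and the Reviewer type are hypothetical names, and the actual model calls are left out.

```python
from typing import Callable, Iterable

# A reviewer takes a diff and returns True only for an explicit OK.
Reviewer = Callable[[str], bool]


def gate_push(diff: str, reviewers: Iterable[Reviewer], minimum: int = 2) -> bool:
    """Fail closed: allow the push only if every reviewer explicitly approves."""
    reviewers = list(reviewers)
    if len(reviewers) < minimum:
        return False  # not enough auditors configured: block
    for review in reviewers:
        try:
            if review(diff) is not True:
                return False  # anything other than an explicit OK blocks
        except Exception:
            return False  # timeouts, API errors and parse failures all block
    return True
```

In a pre-push hook the script would exit non-zero whenever gate_push returns False, which makes git abort the push. The point of fail-closed is that an outage or a garbled model reply counts as a rejection, not a pass.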
i have to do a pass and ask it to shorten the code, remove unnecessary comments and excessive exception gathering, etc. Generally cuts the code by half. The process is repeatable.
i just use claude code or codex with minimal plugins (HUD, frontend design). I would do even more if I had 10x or 100x the tokens and/or token/s available. I spend a lot of time waiting on 5.5 or Fable 5 to work, even when multi-tasking.
I spend the downtime writing detailed follow up or unrelated prompts.
Venkatesh Rao had this idea about matching the type of work you choose at a given time to the mood state you are currently in (or finding cues to shift yourself into the mood state that best matches the kind of work you need to be doing).
Pick a few personas to develop that match different aspects of your personality and map to different mood states you might be in. Not so much defining roles, but different work styles that you work in. Write prompts for each of them.
Keep a task list your agent has access to. Feed it your personas as well.
Have an agent pick one task, frame the work through the lens of each of your personas, and ask you to choose which one you would like to pick.
Schedule it to run each morning and trust your gut with whichever one you pick.
You have now put yourself on the opposite end of the prompt response loop and the agent is prompting you for a response.
You'll be in a flow state in no time.
I've had the same experience and it has really put me off working on personal projects using AI quite a few times, though I keep coming back. My recent experiences with Fable and the latest Sonnet have actually been very positive though. They seem to be capable of working for much longer stretches without constantly stopping and requiring further prompting to finish up large features. The place where I feel flow the most now is when I'm planning large feature sets and I use the Claude web app as a sounding board while doing this and prompt it to not write any code but focus only on asking questions and clarifying aspects that I haven't fully thought out. This results in a very detailed plan for implementation which seems to work very well with Fable or even Sonnet 5. I can then actually leave my computer and go do other things without the constant nagging feeling that I need to check if it's stopped and needs nudging. After it's done I review the changes and do QA and testing and then formulate a plan for the next set of tasks. It feels like this problem is getting solved, which is wonderful.
Try this prompt: While working on the main task, launch a parallel sub-agent with the task context so far. The sub agent should think of high quality questions and put them to the user using a dialogue tool like zenity. Customize the inputs to the question, taking full advantage of the dialogue tool's features to create a progressive interactive user experience. Ask only a few questions per turn so that you can adapt the questions to the answers.
This will keep you busy while the main agent runs. Customize it further to integrate the sub-agent answers to the main thread.
I have been using Claude code and cursor daily for the past 9 months. Here is what I learnt:
1. In my experience, well-articulated prompts are the most important part. You need to tell the model exactly what to do and how to do it to avoid hallucinations. Especially in system design, write what the end result should be and how to get there, then let the model reason and look at the existing infra first before it plans the implementation. In my experience, there is little to no coding that needs to be done after the model is done implementing. Make sure to let it implement in phases, with extensive tests.
2. Model choice. It is obvious, but Claude models are the current SOTA. In my experience, Opus 4.7 extra high is the perfect balance of speed and cost-efficiency. In my experience, OpenAI models were worse in system design, but faster and better at understanding the end result. Mostly used them to verify the bigger picture. Also used composer in Cursor. Was surprised how easy it was to do web design with it.
3. Long horizon tasks. Make models build plans. Very thorough plans - for a feature or a product. The model stays much more aligned when it works from a written plan.
There are more details, but this is what I noticed so far myself.
Not being able to enter flow state is a very interesting observation. I've felt it too to the extent that I went down a whole new rabbit hole of what it means to be in flow state. Let me know if anybody here wants to know more, happy to post some links.
To answer your question - I discuss the approach with Claude Code (e.g., should I implement my own ACT model in JAX or PyTorch, Python or Rust or Julia, etc.). Then write the initial part of the code myself. Opening up a blank vscode is a simple joy of life I refuse to give up :-) I'll ask Claude for advice if I get stuck, it will helpfully offer to write that code for me, I obstinately decline. Eventually, I'll get bored of some minutiae or other, at which point I'll ask Claude to complete just that part of it.
I'm in the same boat and I'm not a fan of the current way of working of agents, but I think tooling is what needs to catch up.
So, I actually decided to try to tackle it myself and worked some months (full time) on it.
https://beolis.com is the result of that, it's a local cli in a kanban board style with a remote server to keep the team on track (I've been using it myself for some time and actually started to ask some friends to use it just yesterday -- feedback very welcome, I still wanted to do some additional things before asking more people to use it, but oh well, I'm a fan of building in public anyways and it's probably better to have feedback sooner rather than later).
The main point there is that you work mostly in the ticket description (your own spec) and the plan (the spec as the agent sees it, generated with a custom workflow) and then having another custom workflow to implement it (you can choose how you want it -- https://beolis.com/blog/post/custom-coding-workflows has some info on what I'm using myself).
As a result, at least for me, I do spend more time immersed in a flow state (although I'm in that state writing the specs and reviewing code -- although in some cases it's more work to write the spec in a way the agent can work when things get more complicated vs just diving into the code, so, going into "code" mode is something I still have to do, agents are definitely not perfect).
I guess I'm lacking in docs on how to effectively use it. I have plans to create a video next week and post it in the blog, so, if you're interested, keep track of it ;)
My absolute favorite modality is one I don't use all that much at the moment: Zed's edit completion.
If you're unfamiliar, it's like tab-completion, but it has a context that includes the edits you've made in the last few seconds, and it can predict around the cursor.
The model isn't advanced enough to understand complex tasks, but it has more the feel of the "crafting gun" in Subnautica or other survival crafting games, if that analogy makes any sense.
Personally I hate working with a chatbot - it's low-bandwidth and rage-inducing. If I could imagine a perfect workflow, it would be something like me whispering my train of thought as I program, and then pointing a very fancy "autocomplete gun" at the code.
I got really annoyed at how slow the LLMs are so I found myself doing more and more prompts like "in file X do Y", if you can piecemeal your task into very limited prompts and periodically reset the context on each go the LLMs do the work MUCH faster. Since I reset the context all the time I often do manual changes in-between prompts.
But at the same time, if you work like this you can't do the insane multitasking I see a lot of devs doing, where they juggle multiple agents doing separate tasks (maybe even completely separate tasks on different git branches). I _really_ hate working like this and only do it if I know one of my prompts will take 10+ min. Usually that's an initial prompt for a large task, after which I move to my usual 1-file-per-prompt style.
Of note: I am working on a ~10 year old codebase with a lot of custom instructions for agents and a lot of code to dig through to get things done. I feel a lot of people are conflating using LLMs to start hobby projects and extrapolating the workflow to real world large codebases.
I have gotten a lot of mileage out of giving LLMs narrow focus over the same plan or code change with different objectives. Write a plan, ask the agent to review the plan and consider where code can be consolidated. Consider downstream effects. Consider security risks, consider optimization, consider architectural concerns, etc. Then I generate the code and go through a similar loop. Then I read code and do more loops to clear out any problems or investigate things I'm unclear about.
In this way I spend most of my time building understanding of existing code and understanding the impact of my changes. My company is heavy into AI use and I find I am pushing out more code and much cleaner code than most. The gaps that appear during review are usually product understanding gaps and not code failures, and my LLM spend is somehow less than most.
I find this iterative process is much more in line with building flow than spending 3 hours writing a spec and waiting a half hour for it to build a monolith PR.
Looks like my approach is as old school as it gets. For a complex piece of work, the only way to be in flow for me is when I'm driving the engagement. I start with a high level requirement and a high level design plan to achieve it. I also provide constraints that need to be satisfied (efficiency, performance, cost, scale etc) and write it all into a markdown document and ask LLMs to review it, find blindspots and refine it until I can get a detailed design for that phase.
Then I pass that to another LLM provider to review and check for any bloat that can be cut or blindspots that need to be addressed.
Finally I get a test plan to help me test different components directly and in debug mode. I then ask LLMs to implement in stages where I can test them in as small components as possible.
I think I end up spending more time (easily 2x) than hand rolling. But the upside is the design is more thought out compared to hand rolled code. It has fewer accidental complexities and I have a clear mental model of the entire design that can also be shared with others through the document.
I've built a bunch of projects with Claude Code, and the flow that works for me is planning as much as I can up front. Once I'm confident we've caught most of the requirements and the likely gotchas, I let CC run and just check back periodically to approve continuations until it's done and ready for me to test. One project at a time keeps me in a flow state.
A more exciting attempt the other day: I fed CC a PRD and ui-spec I'd drafted with Fable 5 (no reason, I just happened to be on my phone), told it to auto-approve commands, and let it fully build and test the project in Chrome overnight. I woke up to a mostly-working app ready for me to QA. Skipping the whole build cycle and going straight to watching it come together during testing was genuinely great.
The one thing I'm still figuring out is the trust side of fully-unattended runs like that. Curious how others are handling auto-approval when nobody's watching.
YES!
It's still very wip, I spent a couple of weekends on it so far, but I'm working on a harness that eschews autonomy and instead aims to work as a pair programming partner. Key to that are distinct "driver" and "navigator" modes, with the capacity to flip between them rapidly.
https://gitlab.com/philbooth/opair
(not really usable yet, but after tomorrow's session I expect to be developing opair in opair, which is mildly exciting)
Yes. I recently built an agent that has a very broad set of objectives and nothing in particular. I don't even know what it does most of the time but hopefully it will do something useful eventually.
You can track its progress here https://github.com/relentlessworks
I'm building "workboxes" to work on my startup. It helps me develop features insanely fast. A workbox is a simple worktree-in-a-sandbox per feature. I have a simple front end where I can launch new workboxes: I input a prompt (a documented grilling session) and it creates a branch, a PR, and starts an opencode coding session on an e2b sandbox based on a custom template with the app's monorepo. Each workbox has a public https endpoint so I can manually test the web app after the coding session is complete. At any point I can either approve the PR, send a follow-up prompt, or connect to the opencode session for more control.
I think my next step is to perform the grilling session inside the front end, currently I perform it in my terminal and then paste in the front end.
I think it's worth trying to iterate on a "spectral decomposition of your intent": slowly working on increasingly refined breakdowns of what the different aspects of your project are, on both the domain and the technical level, aka requirements and architecture. And then don't directly iterate on the code but rather regenerate/update the codebase based on the new intent and the old codebase... And a decomposition of the whole thing in terms of optics (open lenses, etc) where the decomposition respects the "spectral decomposition".
Yes, like many others I've been experimenting a lot. What I've got so far is a harness-of-harnesses - ie, a harness which sits on top of Claude Code, Codex or OpenCode. I still use Claude Code or Codex directly for the initial planning of features, to investigate issues, and for small fixes, but whenever there's something even just a bit complex to do, I use my second-level harness.
Summarizing it a lot, what it does is:
* help you make better plans
* split plans into iterations, in a module-aware way for projects which have strict modularity (for now I'm doing this specifically with TypeScript and dependency cruiser) - this helps a lot when a project becomes complex
* ask an agent to implement an iteration, and then programmatically run a lot of checks after each iteration - not just regression tests, but also checks against project principles and conventions
* when possible, automatically fix deviations; when not possible, raise them to myself for an end-of-plan review
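The last two bullets can be sketched roughly like this, assuming each check is a predicate over some workspace state with an optional automatic fix. Check, PlanRun, and the dict-based workspace are hypothetical stand-ins, not the actual tool.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Check:
    name: str
    passes: Callable[[dict], bool]                # inspects the workspace state
    fix: Optional[Callable[[dict], None]] = None  # optional automatic repair


@dataclass
class PlanRun:
    review_queue: list = field(default_factory=list)

    def run_iteration(self, label, implement, checks, workspace):
        implement(workspace)  # stand-in for the agent implementing one iteration
        for check in checks:
            if check.passes(workspace):
                continue
            if check.fix is not None:
                check.fix(workspace)
                if check.passes(workspace):
                    continue  # deviation repaired automatically
            # Could not repair: defer to the end-of-plan review.
            self.review_queue.append((label, check.name))
```

Nothing stops the run when a check fails. Unfixable deviations just pile up in review_queue, so the human only gets pulled in once, at the end of the plan.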
In this way, instead of having to constantly be engaged with the chat interface, with all the shorter or longer wait times which break my flow, I spend a lot of highly focused time during initial planning and final review. A plan implementation can go on for hours, and the various anchoring mechanisms added to the tool keep drift to a minimum.
At some point I'm planning to release this tool as open source. As this is the result of months of trial and error, dogfooding, and vibecoding on the tool itself, the codebase is chaotic and the UI is still full of experiments I've mostly abandoned, and I'm not used to releasing stuff in this state. But perhaps, in this brave new world, I should just do it and see what happens?
I’m exploring agentic for creative coding / visual effects. Think touch designer but with a prompts graph. It compiles to native code (swift and metal). The first version is available here and I’ll be releasing a v2 soon that will be open source:
We're working on a browser-harness that makes forking, rpcs, and mapreduce first class tool calling primitives. Among other things, this makes it easier to manage your own context, because you can visualize your agents, subagents, and active work and resources as they interact with each other across local and remote environments. And it eliminates all the complexity of mcp and local sandboxing because that is literally the problem browsers were made to solve!
To be clear the browser IS the harness, it's not just a browser-based UI but also the sandbox and orchestration layer. By giving LLMs deep browser access (through CDP and some special hooks) they can verify their own UIs immediately after writing them, navigate the web natively, and run commands that directly manipulate the active DOM. This creates a very tight feedback loop for UI work, but also lets you create or run browser automations, or query a site by running a javascript query on its contents, or a web page without deploying or uploading it anywhere, which is pretty powerful. What I really like is that this makes it easy to dispatch cheap models to generate and verify tons of little visualizations using svg.
Locally it's just a browser, but to manage remote instances you can either access them as tabs on any local browser, or as inline collapsible iframes. I'm trying to be cautious with the security side of it so we're not marketing it as a product yet, but would love to work with anybody who is interested and does a lot of UI or cloud work!
I'm excited about this particular moment in tech because I think work is going to end up looking like playing Starcraft with data and AI, surrounded by rich custom media as you work, which feels really futuristic to me!
I am using claude cli/tmux on a hetzner box and connecting to it via claude remote control. I have connected the box and my phone over telnet which allows me to view any UI work. Sometimes I do have to switch to my laptop for UI desktop layouts.
One gap which I kept running into on both mobile and desktop was refining the initial plan and then later refining the generated artifacts which involved lots of imprecise copy-paste. To scratch my own itch I built a review tool to improve the velocity of planning and refining generated artifacts. It has become my daily driver: https://github.com/livetemplate/prereview
That's how I code nowadays:
1. Start a session.
2. Grill my requirements.
3. Write an ADR, then either start implementing or separate into pieces.
4. Review the code on pyor.review. Compared to GitHub, Pyor allows me to categorize the files and changes, then review the important stuff and skim the noise it identifies.
5. Since I can do local reviews with Pyor, I can do that with Claude and feed back my comments to be addressed without it going to Github first.
6. Create a PR then merge it.
I feel that flow state[1] is possible as long as you don't feel distracted into doing other things and you're needed to guide the LLM along every few seconds/minutes (someone mentioned a pair-programming type tool in this thread). For me that works if you have a good spec + workflow tool (assuming you're doing interactive coding and not kicking off long running coding jobs). I feel that a good test of a workflow tool is that it should offload all bookkeeping from you, leaving you to just read the generated code and think about design/architecture.
I built one such tool for myself: https://www.shipsmooth.net. You can use it to spec/plan out a piece of work, and then easily keep updating the spec/plan as you churn through its implementation. The tool assumes that you will pretty much end up changing the spec/plan during implementation, based on how it's going. In general, I don't see how it's possible to one-shot high quality code for custom use cases.
[1] Going by the definition of flow state here: https://en.wikipedia.org/wiki/Flow_(psychology): "fully immersed in a feeling of energized focus, full involvement, and enjoyment in the process of the activity. In essence, flow is characterized by the complete absorption in what one does, and a resulting transformation in one's sense of time."
I mostly use claude code on mobile and desktop with remote sessions and do most of my coding on the go now.
I have been tempted to distill something like GLM 5.2 into a smaller html + css only model for super fast interactive UI editing, because right now it's really annoying to do with large slow models. I'm sure it would be doable to do the same for individual language / frameworks, including potentially doing a final few steps on your own code base with some LoRAs that could be kept up to date to avoid having the model have to explore the code base each time.
Doing UI work with composer 2.5 and live reload is a way different experience than slogging through it with opus 4.8
IME LLMs are kind of like a projection of your current expertise - your prompting and guidance etc. biases LLM plans kind of 'in the direction' of your thinking. I think this is one reason why it seems like senior engineers get more lift vs. juniors.
What I am exploring is another step to the classic 'research / plan / implement' pattern: 'research / plan / LEARN / implement' where LEARN involves the human doing AI tutoring sessions to ensure a deep understanding of the concepts etc. that the LLM is planning to implement so you can refine / iterate on plans and direct the LLM in ever more effective ways. My idea is that this then compounds your human capital and reduces the occurrence of the 'sounds smart, doesn't work' pattern.
Orchestration works very well for me, but not in the way most people seem to be pushing for, with middlemen scoring and routing every request. For coding, the routing is mostly solved at the config level. The harness lets you pin models per role, and that covers most of what a per request router promises.
On your actual question though, I think the loop you're describing does break the flow and gets very frustrating, but it's been a long time since I've experienced this.
Three things happened to me in the past few months: I've become cost conscious, I wanted to get more done faster, and I wanted to be able to do a lot more at the same time (in parallel). With that I developed my own workflow that works well for me. It's a config-led setup routed by tiers: cheap fast models for mechanical work (lookups, log reads), mid models for implementing against written specs, strong models for judgement and review. It's config on top of a standard harness, nothing exotic.
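As a rough illustration of tier routing done purely in config, here is a sketch assuming roles are plain strings. The model names are placeholders, and sending unknown roles to the strong tier is my assumption, not something stated above.

```python
# Hypothetical model names; substitute whatever your harness exposes.
TIERS = {
    "cheap": "small-fast-model",
    "mid": "mid-model",
    "strong": "frontier-model",
}

# Which tier handles which kind of work.
ROLE_TO_TIER = {
    "lookup": "cheap",
    "log_read": "cheap",
    "implement_from_spec": "mid",
    "judgement": "strong",
    "review": "strong",
}


def model_for(role: str) -> str:
    """Resolve a role to its pinned model; unknown roles use the strong tier."""
    return TIERS[ROLE_TO_TIER.get(role, "strong")]
```

There is no scoring or per-request routing here, just a lookup table, which is the whole argument for doing it at the config level.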
For me...my flow state has moved from tackling code line by line to traversing the layers of the entire system design in my mind, and being able to clearly articulate this to a strong model.
I think the value right now is to focus less on external orchestration, if at all. trust the (current best) model to do it better than anything you bolt on to the harness. focus your energy on providing clearer specs. I think the optimal spec is a disambiguated (through liberal use of the AskUserQuestion tool) statement of 1. intent, 2. input/output contracts, 3. constraints and 4. preconditions. focus on that and get out of the model's way. I think of it like this: imagine a person who was not as smart as you was trying to tell you how to do a task. would you want more verbosity and step by step instructions, or would you want them to just cut to the chase (ie, what are you trying to do, what are the obstacles, I'll let you know if I have questions)?
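One way to picture that four-part spec is as a small data structure. This is only a sketch: the Spec class and its field names are hypothetical, and missing() stands in for "things still worth asking the user about".

```python
from dataclasses import dataclass, field


@dataclass
class Spec:
    intent: str = ""
    io_contracts: list = field(default_factory=list)
    constraints: list = field(default_factory=list)
    preconditions: list = field(default_factory=list)

    def missing(self):
        """Sections that are still empty, i.e. open questions for the user."""
        return [name for name, value in vars(self).items() if not value]

    def render(self):
        """Plain-text spec to hand to the model, with no step-by-step steering."""
        lines = [f"Intent: {self.intent}"]
        for title, items in (
            ("Input/output contracts", self.io_contracts),
            ("Constraints", self.constraints),
            ("Preconditions", self.preconditions),
        ):
            lines.append(f"{title}:")
            lines.extend(f"- {item}" for item in items)
        return "\n".join(lines)
```

Note what is absent: there is no "steps" field. The spec says what has to be true, not how to get there.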
also let the model verify itself. don't give it an objective that is vague, give it clear exit criteria for goals and let it loop until it gets there. so much of the orchestration scaffolding seems like massive technical debt.
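A minimal sketch of the loop-until-exit-criteria idea, assuming each criterion is a named predicate over whatever state the model produces. run_until_done and the step callback are hypothetical, and the max_rounds budget is my addition so the loop cannot run forever.

```python
def run_until_done(step, exit_criteria, max_rounds=10):
    """Call step() until every exit criterion holds or the budget runs out.

    step receives the names of the criteria that are still failing and
    returns the new state. Returns (done, rounds_used, still_failing).
    """
    failing = list(exit_criteria)  # nothing has been verified yet
    for round_no in range(1, max_rounds + 1):
        state = step(failing)  # the model sees exactly what is still failing
        failing = [name for name, check in exit_criteria.items() if not check(state)]
        if not failing:
            return True, round_no, []
    return False, max_rounds, failing
```

The criteria are the only scaffolding. Everything else about how to get there is left to the model.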
oddly, I do the opposite of a lot of conventional advice when it comes to models. I use no memory, I think there is something similar to context rot when everything is stored. I like creating markdown files as memory that the model can grep if needed. I also haven't found a real use for hooks yet, I have tried but they always seem to get in the way. skills on the other hand are very undervalued. they are so much more powerful than many realize. I used to think agents were where the power was. I think it's actually skills. agents are really for context preservation. skills are what increase capabilities
I'm not even talking about quantity of items in memory, I mean dilution of intent. I really love a model with a clean slate and only the items it needs. I fear the memory guides the model in areas that might not be what I want with the current prompt
progressive disclosure is a big one. you can make context available but it is only loaded when needed. like lazy loading for prompt engineering. skills are to be used to instruct the model how to do something specific that is not in its training data. like how to access my proprietary system, how to interface with a custom program. you can embed templates in skills, you can embed code that executes in skills and only the output is loaded into context. skills expand capabilities, agents constrain context
(constraining context is a very good thing btw, don't mean to imply that agents are somehow inferior to skills)
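The "lazy loading for prompt engineering" idea can be sketched like this, assuming a registry where only one-line descriptions are always in context. SkillRegistry is a hypothetical illustration, not how any particular harness implements skills.

```python
class SkillRegistry:
    """Only short descriptions live in context; bodies load on demand."""

    def __init__(self):
        self._loaders = {}  # name -> (description, loader function)
        self._loaded = {}   # name -> body text, filled lazily

    def register(self, name, description, loader):
        self._loaders[name] = (description, loader)

    def index(self):
        """The cheap listing that is always part of the context."""
        return "\n".join(
            f"{name}: {desc}" for name, (desc, _) in sorted(self._loaders.items())
        )

    def load(self, name):
        """Fetch the full skill body the first time it is needed, then cache it."""
        if name not in self._loaded:
            self._loaded[name] = self._loaders[name][1]()
        return self._loaded[name]
```

The loader can be anything: reading a markdown file, or running a script and returning only its output, so the expensive part never touches the context unless the skill is actually used.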
One of the things I've been talking about with my senior developers is how the bottleneck has shifted even more dramatically to human code understanding vs code generation. AI is still not suitable for generating production grade code without a human checking it (yet), but it can produce a huge amount of code for humans to check. We've been experimenting with ai finding better ways of communicating what is in a change at different abstraction levels etc by always generating diagrams showing what it did etc, with the concept being that anything that can speed up human understanding of changes addresses the core bottleneck of the whole process.
I'm still in the prompt response loop of learning how to be more effective with it, but I've found that what works for me is to approach a project the same way I would if I was writing code by hand. I'll decompose the project into small discrete units of work and slowly build my way up, I find it makes fewer mistakes using that approach. I built a systems monitoring platform I had been wanting for a long time over the course of days instead of weeks or months, and I was really impressed with Claude's output.
Then I thought it would be fun to be able to monitor the status of all my workflows as buttons on my Stream Deck XL, and Claude was able to build the plugin with almost no issues at all. It's hilarious how much fun it is.
If you like videos, I saw an interesting one yesterday about systems thinking and software as an ecosystem, particularly with AI. It's more of an overview, but it gives some insight into where we might be able to experiment with different approaches. It's more focused on teams and companies than individual developers, but I think it could be applied to the single dev.
"Software engineering at the tipping point" https://www.youtube.com/watch?v=2n41YjR5QfU
Graph based code generation where code doesn’t reside in files in the typical sense. On insertion, modification, and deletion, constraints are checked / run to see if the change is valid and can be done or not.
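A toy sketch of what that could look like, assuming the simplest possible constraint (call edges must never dangle). The `CodeGraph` class and its rules are made up for illustration; a real system would presumably check types, tests, and more.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set


@dataclass
class CodeGraph:
    """Definitions are nodes and 'calls' relationships are edges. No files involved."""
    bodies: Dict[str, str] = field(default_factory=dict)
    calls: Dict[str, Set[str]] = field(default_factory=dict)

    def insert(self, name: str, body: str, calls: Iterable[str] = ()) -> None:
        """Insert or modify a definition; rejected if it calls something unknown."""
        callees = set(calls)
        missing = sorted(c for c in callees if c not in self.bodies and c != name)
        if missing:
            raise ValueError(f"cannot insert {name}: unknown callees {missing}")
        self.bodies[name] = body
        self.calls[name] = callees

    def delete(self, name: str) -> None:
        """Delete a definition; rejected while other definitions still call it."""
        callers = sorted(n for n, cs in self.calls.items() if name in cs and n != name)
        if callers:
            raise ValueError(f"cannot delete {name}: still called by {callers}")
        del self.bodies[name]
        del self.calls[name]
```

The point is that an invalid change never lands: the agent gets a rejection with a reason instead of a broken tree it has to discover later.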
I'm in offensive security and use it to write exploit code for various projects I'm working on.
Too many people are using LLMs to shortcut knowledge completely. I have more work than ever fixing the security issues on vibe coded apps, and I don't think it will slow down any time soon.
I have a custom harness that runs in a macOS VM. It has e-mail and its own accounts. I assign it tasks in Linear, it does them and spins up PRs for me to review. This works pretty well, generally. I have to spend time writing stories and doing code review, but I don’t have to follow its (their — I have 3 of them) every move.
Something I'm thinking about and doing a bit of experimentation with is using LLMs to write specialist higher level code.
Rather than ask them to write web-apps in webby languages with open source frameworks etc, providing a very fixed, on-rails development process where everything is abstracted away. Accept that it'll be less powerful, but take the trade-off that it'll hopefully be faster and produce much more controllable software.
Concrete example: why do we let the LLM choose a database, schema, migration procedure, library, etc.? We could decide to only support one database, enforce schema design (such as every table containing access control), enforce a migration process, enforce a library, even do schema design in a fixed config file rather than arbitrary DDL. Same for auth, deployments, even UI.
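As a rough sketch of the "fixed config file instead of arbitrary DDL" part: the LLM only ever writes a table description, and the rails validate it and render the SQL. The allowed types, the required `owner_id`/`tenant_id` columns, and the `table_ddl` helper are all assumptions made up for this example.

```python
from typing import Dict

# Hypothetical rails: one database, a fixed set of column types, and
# mandatory access-control columns on every table.
ALLOWED_TYPES = {"text": "TEXT", "int": "INTEGER", "bool": "BOOLEAN", "timestamp": "TIMESTAMPTZ"}
REQUIRED_COLUMNS = {"owner_id": "int", "tenant_id": "int"}


def table_ddl(name: str, columns: Dict[str, str]) -> str:
    """Validate one table from the config and render its CREATE TABLE statement."""
    for col, typ in REQUIRED_COLUMNS.items():
        if columns.get(col) != typ:
            raise ValueError(f"table {name!r} must declare {col}: {typ}")
    for col, typ in columns.items():
        if typ not in ALLOWED_TYPES:
            raise ValueError(f"table {name!r}, column {col!r}: type {typ!r} is not allowed")
    cols = ",\n".join(f"  {col} {ALLOWED_TYPES[typ]}" for col, typ in columns.items())
    return f"CREATE TABLE {name} (\n  id BIGSERIAL PRIMARY KEY,\n{cols}\n);"
```

Calling `table_ddl("notes", {"owner_id": "int", "tenant_id": "int", "body": "text"})` renders a table; leaving out `tenant_id` or using a type like `jsonb` fails before anything reaches the database. Less powerful, but much easier to review.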
I created a small PI extension that always watches relevant directories and answers me in place, without switching context, or using a chat interface. Still experimenting but I like it.
i would investigate how claude code and codex work and suggest to build your own. it is not as hard to do as it seems (its not easy still, the prompting specifically). it can show u how workflows, skills, memory, plans etc. work so you can experiment for yourself to implement the workflow that suits _you_.
its an interesting exercise, for me i started with a simple repl to call models through model adapters, then allow them to list directories and read files within a chroot, build up slowly to also write access to files, then look at whats out there and try to build stuff you like from it.
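a minimal sketch of that starting point, with assumptions called out: the adapter is just any function from prompt to reply (a real one would wrap an API client), and the confinement is done with path checks, not an actual chroot. `Workspace` and `handle` are names made up for this example.

```python
from pathlib import Path
from typing import Callable, Dict, List

# A model adapter is any function from prompt to reply (assumption for this sketch).
ModelAdapter = Callable[[str], str]


class Workspace:
    """Read-only file tools confined to one root directory (path checks, not a real chroot)."""

    def __init__(self, root: str) -> None:
        self.root = Path(root).resolve()

    def _resolve(self, relative: str) -> Path:
        target = (self.root / relative).resolve()
        if target != self.root and self.root not in target.parents:
            raise PermissionError(f"{relative} escapes the workspace")
        return target

    def list_dir(self, relative: str = ".") -> List[str]:
        return sorted(p.name for p in self._resolve(relative).iterdir())

    def read_file(self, relative: str) -> str:
        return self._resolve(relative).read_text()


def handle(line: str, adapters: Dict[str, ModelAdapter], workspace: Workspace) -> str:
    """Dispatch one REPL line: 'ls [dir]', 'cat <file>', or '<model> <prompt>'."""
    command, _, rest = line.strip().partition(" ")
    if command == "ls":
        return "\n".join(workspace.list_dir(rest or "."))
    if command == "cat":
        return workspace.read_file(rest)
    if command in adapters:
        return adapters[command](rest)
    return f"unknown command: {command}"
```

the repl itself is then just `while True: print(handle(input("> "), adapters, workspace))`. write access and letting the model call the tools itself come later, once the read-only version behaves.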
the prompts are hard and there are some weird issues u will hit that will also help u understand certain fundamental limits etc. - understanding those can help also understand why some things dont work as hoped just yet.
for example, i had a real headache trying to make interactive specialized identities within workflows, so each stage is handled by specialized identities which have specific tools and focused context etc. theres a lot of hallucination too so u gotta have a lot more model calls, maybe do consensus between models etc. adversarial identities to review outputs before applying etc. All the stuff you still end up doing yourself again despite having programmed / prompted it all in...
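the consensus part can be as simple as a majority vote. this is only a sketch under big assumptions: models are plain prompt-to-answer functions, answers are compared as exact strings after trimming whitespace (real outputs would need normalizing first), and the `consensus` helper and its `quorum` parameter are made up here.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def consensus(prompt: str, models: Sequence[Callable[[str], str]],
              quorum: float = 0.5) -> Optional[str]:
    """Ask every model; return the most common answer only if more than
    `quorum` of the models agree on it, otherwise None."""
    if not models:
        raise ValueError("need at least one model")
    answers = [model(prompt).strip() for model in models]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer if votes / len(answers) > quorum else None
```

returning None instead of the top answer is the useful bit: no agreement means escalate to a human or an adversarial reviewer rather than apply anything.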
initially it was all one context and identities struggled to remember what part of the process they would do, what tools they had vs what tool outputs to expect from previous stages etc. (it was funny but a big mess)
i use codex now, its closest to what i want, i couldnt get it better myself. claude wants to do too much and 'complete' stuff too much for me..
there are people blogging about loop programming, i did not investigate it thoroughly yet but id expect for myself id have similar results as my previous endeavour.
edit: wanted to add, my motivation was that claude dumps a lot of text back (i was using it back then). i wanted to give my models part of the screen as a 'surface' to pin images, charts, and text etc on, this worked nicely but i could not get them to do it really organically (prompting issues).
i thought it would be cool if the model could be like hey human, this thing we keep on screen while we discuss / design, like an architecture diagram. went to vulkan / glfw3 and rendering a terminal in there to get good enough pixel accurate graphics for presentation, that worked well and claude built it really easily.
My flow state with AI is having 5 different conversations at the same time making good progress on all of them by giving key insight and feedback at the right times.
You can actually go super fast with the right setup and focusing only on the important details like ensuring the shape of the APIs make sense and that test quality is good.
I'm currently rolling out Matt Pocock's Sandcastle project so that I can have those brakes removed. What will be left is just the grilling(/wayfinding).
My current flow heavily relies on Matt Pocock's Skills and Sandcastle project. I find them highly valuable in practice: grilling(/wayfind) into a spec and extract issues. Those live in Linear projects. I'm pointing my Sandcastle set-up at such Linear projects (or loose issues), which results in an MR.
Currently at the point of self-improving the prompts and Sandcastle set-up with a retrospective pass of the logs.
Just want to add:
I'm trying to do the same amount of work faster, not do work in parallel or agent orchestration. I'm not against letting the model go off and do things on its own, that has its time and place.
But if I can do something in 15 minutes instead of 1 hour without the annoying prompt response loop, without the feeling that there could be blind spots, and while keeping all of the context (or at least most) in my head, that's a bigger win than spinning up 5 agents to do different things.
I'm in the middle of it so don't have any conclusions for you, but I started mucking with building my own cli coding app and there are _tons_ of levers available that aren't apparent from claude code or codex.
Including altering the turn concept. I think it is still ultimately call and response, but instead of everything being a quarter note you can get a little closer to a beat you like.
The tab model was a lot of fun, you felt like you're getting a speed boost while coding. I think vibe coding (or agentic engineering) is a different paradigm altogether.
I have tried out some of the popular tools and I'm using opencode on desktop and I use pi via termux on android for when I'm on the go. I think the current direction of PRD -> review -> execute -> debug is in many cases the right mindset.
Having worked with a team of fresh graduates, I see that working with any vibe coding tool is like being a manager, not a developer. I think that's what you miss: you miss being a developer, but the vibe coding tools make you a manager, which you might not enjoy.
Nonetheless, I do think that there are some interesting things to do with pi. I'm just getting started, if anyone has an interesting workflow in pi, I would be interested in trying it out!
I am currently in the process of launching my AI teams platform that I've been working on since at least January. It's https://PersonaStack.ai. I'm doing it without VC money and all by myself. I've used over 110B tokens so far building it.
You get some amazing results with teams of AIs if you do it right. The key is to control behavior with what integrations and responsibilities each agent has. That way they naturally adapt, delegate, fact check each other, and generally act more autonomously.
This is already running the automated news site ainews.personastack.ai complete with social media posts 100% automated.
It also runs the issue triage, coding, reviews, and releases for the Kuberhealthy open source CNCF project, which is another thing of mine.
I don't think the next step is really smarter models. It's how we make the models more effective, and teams, when done right, net the best results I've seen.
Hoping to get noticed here soon, but I'm finding it's extremely hard to do solo.
Just yesterday I tried to find an annoying and persistent bug in the communication between a Lyrion Media Server and my player. I used Opencode's native Big Pickle AI, and at first it was a pain: it gave me new code, I had to start the player, test the control in the server's web GUI, report the errors back, and so forth. It tried a lot, but never found the real cause.
Then I got tired and told it to use Playwright to control the browser and test by itself. After some hangs that I had to stop manually, it did everything by itself and finally fixed the bug. I had to increase the agents' steps setting in the config, but that was it. While it was fixing the bug, I surfed the web and kept an eye on it, but it did everything on its own. Impressive.
I'm using what I call "hermetic agents", where completely sandboxed agents write code and tests from the same specification, where the code writer can't see the test and the test writer can't see the code. The idea is that we can get better quality this way (by avoiding confirmation bias between code and tests). It is more painful to set up however, since you have to distill a spec and guides that the agent would normally hook into using RAG.
It is more like people (agent?) management than coding though. I'm setting up and debugging processes, rather than writing code. I spend a lot of time cursing at and arguing with the agents I'm using to set up hermetic agents (who I can't argue with obviously, but I can have conventional agents go over their logs to figure out how to improve their sandboxed context).