Hacker News

weitendorf yesterday at 9:11 PM

My company tried to build something like this pre-TUI, as a tool-AI-IO DAG dispatcher. The biggest mistake I made was assuming people would have no trouble figuring out how to translate their work into multi-step automations. I focused on orchestration and sandboxing, thinking that was the core, when the real problem was getting the onboarding UX and complexity to not feel daunting or more trouble than it was worth.

Eventually, for my own work, I discovered that context management and the runtime looked more like a stream or an active service mesh than a dispatching / one-off processing problem, and most others' work did too. Then all my prompts would degrade across model versions or providers, and I realized that setting the context for the tasks and keeping track of it all was a ton of work. It was something I had to do every time as an actual user, but never when I was testing or demoing on existing data.

Curious how you're testing your work and if you've managed to avoid the problems I ran into. I need to permute across the same set of workloads/configs you mention (and maybe more) for my next set of work so I'd be very interested in sharing or collaborating on the test infrastructure! At Google I did a lot of permutation testing using https://github.com/cloudprober/cloudprober and was going to start using it sometime in the next couple weeks. It exists basically one layer above the workload content/targets so it's probably compatible with everything except the test client/driver you're using.
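As a rough illustration of what permuting across workloads and configs could look like, here is a minimal sketch that expands a few axes into one test case per combination. The axis names and values (`model`, `provider`, `workload`) are hypothetical placeholders, and this is plain Python, not cloudprober's API or configuration format.

```python
from itertools import product


def expand_matrix(axes):
    """Return one dict per combination of the given axis values."""
    names = list(axes)
    combos = product(*(axes[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


# Hypothetical axes: replace with the real workloads/configs under test.
axes = {
    "model": ["model-a", "model-b"],
    "provider": ["provider-x", "provider-y"],
    "workload": ["short-prompt", "long-prompt"],
}

cases = expand_matrix(axes)
for case in cases:
    print(case)
```

Each resulting dict can then be handed to whatever test client or driver runs the actual workload, which keeps the permutation layer independent of the workload content.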


Replies

aleqs yesterday at 10:08 PM

In my case a workflow is basically an active/living graph of nodes/sub-tasks. One node can process a task (with all relevant context) and create multiple fan-out tasks, or it can add additional context/requirements and pass it along to another node. The message/task passing is all implemented as a queue: nodes subscribe to messages/tasks addressed to them and execute them, producing more tasks (or zero new tasks). For each task there is a context and a parent task/context, as well as a key/value store of all tasks and their context. Each agent/node gets instructions injected into its prompt that tell it how to look up parent tasks/context as well as how to output new tasks.
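The queue-based graph described above can be sketched roughly as follows. This is a minimal, single-process illustration and not the actual implementation: the `Task` and `Workflow` classes, the handler signature, and the `planner`/`worker`/`sink` nodes are all hypothetical, and plain functions stand in for the LLM agents.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    id: int
    target: str                      # name of the node this task is addressed to
    context: dict
    parent_id: Optional[int] = None  # link back to the parent task/context


class Workflow:
    def __init__(self):
        self.queue = deque()   # pending tasks, processed in FIFO order
        self.store = {}        # key/value store: task id -> Task
        self.handlers = {}     # node name -> handler function
        self._next_id = 1

    def subscribe(self, name, handler):
        self.handlers[name] = handler

    def submit(self, target, context, parent_id=None):
        task = Task(self._next_id, target, context, parent_id)
        self._next_id += 1
        self.store[task.id] = task
        self.queue.append(task)
        return task

    def run(self):
        order = []
        while self.queue:
            task = self.queue.popleft()
            order.append(task.target)
            # A handler returns zero or more (target, context) pairs.
            for target, context in self.handlers[task.target](task, self):
                self.submit(target, context, parent_id=task.id)
        return order


def planner(task, wf):
    # Fan out: one new task per item.
    return [("worker", {"item": item}) for item in task.context["items"]]


def worker(task, wf):
    # Look up the parent's context through the shared store.
    parent = wf.store[task.parent_id]
    result = f"{parent.context['goal']}:{task.context['item']}"
    return [("sink", {"result": result})]


def sink(task, wf):
    return []  # terminal node: produces zero new tasks


wf = Workflow()
wf.subscribe("planner", planner)
wf.subscribe("worker", worker)
wf.subscribe("sink", sink)
wf.submit("planner", {"goal": "summarize", "items": ["a", "b"]})
order = wf.run()
results = [t.context["result"] for t in wf.store.values() if t.target == "sink"]
```

In a real system the in-memory `deque` would presumably be a message broker, and the parent lookup would be something the agent does via its injected instructions rather than a direct dictionary access.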

There is also a feedback loop - a node can fail to process a task, and pass the reasoning/context for that back to the parent or another node - this might result in a new adjusted task replacing the failed task, or it might require human intervention.
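A minimal sketch of such a feedback loop, under stated assumptions: the `child` and `parent_adjust` functions, the failure reasons, and the retry budget are all hypothetical stand-ins for agent behavior, chosen only to show the three outcomes (success, adjusted replacement task, escalation to a human).

```python
from dataclasses import dataclass


@dataclass
class Failure:
    task: dict
    reason: str  # reasoning/context passed back to the parent


def child(task):
    """Hypothetical child node: either produces a result or reports a failure."""
    if "input" not in task:
        return Failure(task, "missing input")
    if task["size"] > 10:
        return Failure(task, "too large")
    return f"done:{task['input']}"


def parent_adjust(failure):
    """Hypothetical parent: returns an adjusted replacement task, or None."""
    if failure.reason == "too large":
        return {**failure.task, "size": failure.task["size"] // 2}
    return None  # the parent cannot fix this failure itself


def run_with_feedback(task, max_attempts=3):
    for _ in range(max_attempts):
        outcome = child(task)
        if not isinstance(outcome, Failure):
            return ("ok", outcome)
        replacement = parent_adjust(outcome)
        if replacement is None:
            return ("needs_human", outcome.reason)
        task = replacement  # the adjusted task replaces the failed one
    return ("needs_human", "retry budget exhausted")


print(run_with_feedback({"input": "x", "size": 16}))
print(run_with_feedback({"size": 1}))
```

The retry budget is an added assumption: without some bound, a parent that keeps producing failing replacements would loop forever instead of escalating.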
