Hacker News

aleqs yesterday at 8:04 PM | 3 replies

I'm working on an agentic, graph-based workflow execution engine/framework. The concept of the harness is completely abstracted away/generified: a node/agent is a harness (cc, codex, open code, pi, etc.) + a model (I test different model and harness combinations). I have a set of tasks ranging from trivial to complex. A set of workflows is defined (a workflow is a set of initial nodes and their behaviour), and each one is asked to perform each task (multiplied by each harness/model combination, roughly). A workflow can include agents/nodes that are able to modify the workflow graph and create nodes. Other nodes can break down tasks and send subtasks to other nodes. It's mostly at an experimental stage at this point. I'm exploring/tracking metrics such as total wall-clock time to complete a task and total cost in tokens and $, among others. This gives me a decent amount of data/insight into the abilities/performance of different harnesses/agents/models for different tasks, and it's great testing/dogfooding for my own harness (which is one of the harnesses being tested, and as of now the most efficient one).

The main bottleneck at this point is the cost of all of the tokens in the fairly large test matrix of tasks, harnesses, models.
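To make the shape of that test matrix concrete, here is a rough sketch of how such a sweep could be enumerated. Everything in it is an assumption for illustration: the names `Agent`, `RunResult`, `run_matrix` and `fake_execute`, the workflow and model labels, and the per-token prices are all made up, and a full cross product is only one reading of "multiplied by each harness/model combination, roughly". It is not the actual engine.

```python
from __future__ import annotations

import itertools
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Agent:
    """A node/agent as described above: a harness paired with a model."""
    harness: str
    model: str


@dataclass
class RunResult:
    workflow: str
    task: str
    agent: Agent
    wall_seconds: float
    tokens: int
    cost_usd: float


def run_matrix(
    workflows: list[str],
    tasks: list[str],
    harnesses: list[str],
    models: list[str],
    execute: Callable[[str, str, Agent], int],
    usd_per_token: dict[str, float],
) -> list[RunResult]:
    """Run every workflow x task x harness x model cell and record metrics.

    `execute` is a hypothetical callback that runs one cell and returns
    the number of tokens it consumed.
    """
    results = []
    for wf, task, harness, model in itertools.product(
        workflows, tasks, harnesses, models
    ):
        agent = Agent(harness, model)
        start = time.perf_counter()
        tokens = execute(wf, task, agent)
        elapsed = time.perf_counter() - start
        cost = tokens * usd_per_token[model]
        results.append(RunResult(wf, task, agent, elapsed, tokens, cost))
    return results


def fake_execute(workflow: str, task: str, agent: Agent) -> int:
    # Stand-in for a real run; pretends every cell uses 1000 tokens.
    return 1000


results = run_matrix(
    workflows=["single-agent", "planner-workers"],  # hypothetical workflows
    tasks=["trivial", "complex"],
    harnesses=["cc", "codex"],
    models=["model-a", "model-b"],                  # placeholder model names
    execute=fake_execute,
    usd_per_token={"model-a": 1e-6, "model-b": 2e-6},  # made-up prices
)
print(len(results))  # 16 = 2 workflows * 2 tasks * 2 harnesses * 2 models
```

The run count is the product of all four dimensions, so adding one more model or harness multiplies the total token bill rather than adding to it.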

I hope to release/open source all of this stuff eventually.


Replies

weitendorf yesterday at 9:11 PM

My company tried to build something like this pre-TUI, as a tool-AI-IO DAG dispatcher. The biggest mistake I made was assuming people would have no problem figuring out how to translate their work into it or define multi-step automations. I focused on the orchestration and sandboxing, thinking that was the core, when the real problem was getting the onboarding UX/complexity to not feel daunting or more trouble than it was worth.

Eventually, in my own work, I discovered that the context management and runtime looked more like a stream or an active service mesh than a dispatching / one-off processing problem, and the same was true for most other people's work. Then all my prompts would degrade across model versions or providers, and I realized that actually setting the context for the tasks and keeping track of it all was a ton of work: something I had to do every time as an actual user, but never when I was testing or demoing on existing data.

Curious how you're testing your work and if you've managed to avoid the problems I ran into. I need to permute across the same set of workloads/configs you mention (and maybe more) for my next set of work so I'd be very interested in sharing or collaborating on the test infrastructure! At Google I did a lot of permutation testing using https://github.com/cloudprober/cloudprober and was going to start using it sometime in the next couple weeks. It exists basically one layer above the workload content/targets so it's probably compatible with everything except the test client/driver you're using.

fractorial yesterday at 9:33 PM

I rolled my own simple execution DAG program.

It’s shockingly effective due to rooting sub-DAGs into Planner nodes which are the only mutators of the DAG. The deepest topological leaf nodes become the blockers to the next Planner node.

The only other special node is a Human node: it is structurally impossible for agents to close (I rolled my own harness), and it blocks on my attention.
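One way to read that design, as a minimal sketch only: the names `Kind`, `Node`, `Dag`, `add`, `leaves`, `ready` and `close` are hypothetical, the `actor=None` bootstrap path is an assumption, and this is not the actual program. Planner nodes are the only actors allowed to add nodes, the leaves of a sub-DAG become the dependencies of the next Planner node, and a Human node can only be closed with an explicit human flag.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Kind(Enum):
    PLANNER = "planner"
    WORKER = "worker"
    HUMAN = "human"


@dataclass
class Node:
    name: str
    kind: Kind
    deps: set[str] = field(default_factory=set)  # nodes that block this one
    closed: bool = False


class Dag:
    def __init__(self) -> None:
        self.nodes: dict[str, Node] = {}

    def add(self, actor: Node | None, node: Node) -> None:
        # Only Planner nodes may mutate the graph. actor=None is an assumed
        # bootstrap path for whoever seeds the very first node.
        if actor is not None and actor.kind is not Kind.PLANNER:
            raise PermissionError(f"{actor.name} may not mutate the DAG")
        self.nodes[node.name] = node

    def leaves(self, names: set[str]) -> set[str]:
        # Nodes in `names` that no other node in `names` depends on.
        depended_on = {d for n in names for d in self.nodes[n].deps}
        return names - depended_on

    def ready(self, name: str) -> bool:
        node = self.nodes[name]
        return not node.closed and all(self.nodes[d].closed for d in node.deps)

    def close(self, name: str, by_human: bool = False) -> None:
        node = self.nodes[name]
        if node.kind is Kind.HUMAN and not by_human:
            raise PermissionError("agents cannot close a Human node")
        if not all(self.nodes[d].closed for d in node.deps):
            raise RuntimeError(f"{name} is still blocked")
        node.closed = True


dag = Dag()
plan1 = Node("plan-1", Kind.PLANNER)
dag.add(None, plan1)
dag.add(plan1, Node("code", Kind.WORKER, {"plan-1"}))
dag.add(plan1, Node("review", Kind.HUMAN, {"code"}))

# The leaves of plan-1's sub-DAG block the next Planner node.
sub_dag = {"code", "review"}
dag.add(plan1, Node("plan-2", Kind.PLANNER, dag.leaves(sub_dag)))
```

In this sketch `plan-2` stays blocked until `dag.close("review", by_human=True)` is called, which is the only way the Human node can be closed.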

worik yesterday at 8:45 PM

> The main bottleneck at this point is the cost of all of the token

Are you using Chinese models? Quite a bit cheaper, but maybe still too expensive?
