Hacker News

sachiniyer01 · today at 2:21 PM

> Maybe compare to a more realistic baseline for the DIY side for more compelling benchmarketing?

This is a fair critique. However, I don't really trust myself to write a great code review skill for vLLM or OpenClaw. I also don't think Claude Code is the right harness for this deep and broad scanning work: we find that it struggles to maintain clarity when considering many different bugs at the same time. The coding agents seem really great at single-goal tasks that they can Ralph their way to.

> We started with a DIY code review skill because it's inherent to want to customize to our codebase and infra before trying solutions that add layers which may get in our way here.

Being able to tinker deeply with the tools is pretty central to my love of dev tools in general. Our job is to make use of all of those customizations (our agent will use that 1-page skill when doing its bug finding). I also still think externalizing part of your dev workflow is the right way to get ahead. You really don't want to do the work of eval-ing/maintaining that skill yourself to make sure it still performs well with a mythos or something.

> and more on token efficiency.

I’m really confident in our ability to stretch $20 of tokens ;)


Replies

lmeyerov · today at 4:29 PM

> I also don’t think Claude Code is the right harness for this deep and broad scanning work. We find that it struggles to maintain clarity when considering many different bugs at the same time.

Our skill splits the work into multiple passes to divide and conquer across task dimensions and across files, for exactly that reason. Likewise, it loops (Ralph-ish) until it converges, and it maintains a task queue and a work log to stay on track. We are still growing it over time, but now mostly for per-repo customization; the bones are good cross-repo.
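The multi-pass, task-queue flow described above could be sketched roughly like this. This is purely illustrative (the actual skill is a prompt, not code), and every name here is hypothetical:

```python
from collections import deque

def review(files, passes, run_pass):
    """Multi-pass review loop: divide and conquer across (pass, file)
    pairs, re-queueing any pair that surfaced new findings, and stopping
    once a full sweep converges (no pair yields anything new)."""
    # One queue entry per (review pass, file) pair: each scan stays
    # focused on a single concern in a single file.
    queue = deque((p, f) for p in passes for f in files)
    work_log = []   # running record that keeps the agent on track
    findings = set()
    while queue:
        pass_name, path = queue.popleft()
        new = run_pass(pass_name, path)          # one focused scan
        work_log.append((pass_name, path, len(new - findings)))
        if new - findings:                       # found something new:
            queue.append((pass_name, path))      # revisit until stable
        findings |= new
    return findings, work_log
```

Here `run_pass` stands in for one agent invocation with a narrow brief; the re-queue-until-stable step is the Ralph-ish convergence loop, and `work_log` is what lets later iterations see what earlier ones already covered.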

I would only trust frontier-grade harnesses to run this kind of skill, and I treat the various harness × prompt combos as guilty until proven innocent for exactly that reason.

My point isn't that our 1-page skill eliminates the need for your startup, but that this is a normal flow for more serious AI-augmented coders, so you are picking a blatantly known-bad starting point as your baseline. That makes it unclear what value your tool brings, and it calls into question why you are refusing to measure yourself in a post about measurement.