
AndyNemmity today at 4:01 PM

Define "obviously validation"? What is the signal that tells you one is reasonable vs. another?

I find the only way to do that is to look at it; if it passes some visual tests, try it, and then A/B test whether it's any better than going without it.


Replies

theptip today at 4:30 PM

Some sort of eval. E.g. TermBench, implemented in Harbor.

It’s an insane amount of effort to build shareable, reusable, comprehensive evals, which is why almost all skills are stuck in the “vibes” phase.

That said, I think it’s quite easy to skim/intuit these sorts of skills and do horizontal gene transfer into your own vibes-based system. If you use the skills regularly, you can construct a cheap personal eval that is a lot easier to maintain, and use it to compare a new skill/plugin. Even something like “please write a paper on <my personal unpublished thesis>” is a good starting point here. You get a good feel for whether a skill is better than vanilla by running it a couple of times and watching the failure modes.
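A rough sketch of what such a cheap personal eval could look like in Python, assuming you can wrap your agent in a function that takes a prompt and an optional skill name. Everything here is hypothetical and only for illustration: `compare_skill`, `fake_agent`, `check_paper`, the failure labels, and the skill name `my-skill`. The fake agent is a deterministic stand-in so the sketch runs offline; swap in your own model call.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional

# Hypothetical shapes: an agent maps (prompt, skill or None) to output text,
# and a checker maps output text to a list of failure-mode labels.
Agent = Callable[[str, Optional[str]], str]
Checker = Callable[[str], List[str]]


def compare_skill(agent: Agent, prompts: List[str], skill: str,
                  check: Checker, runs: int = 3) -> Dict[str, dict]:
    """Run every prompt `runs` times with and without the skill, tallying failure modes."""
    results = {arm: {"runs": 0, "clean": 0, "failures": Counter()}
               for arm in ("vanilla", "skill")}
    for prompt in prompts:
        for _ in range(runs):
            for arm, arm_skill in (("vanilla", None), ("skill", skill)):
                failures = check(agent(prompt, arm_skill))
                results[arm]["runs"] += 1
                if not failures:
                    results[arm]["clean"] += 1
                results[arm]["failures"].update(failures)
    return results


def fake_agent(prompt: str, skill: Optional[str]) -> str:
    # Stand-in for a real model call (assumption): the "skill" arm adds references.
    draft = f"Draft for: {prompt}"
    return draft if skill is None else draft + "\nReferences: [1] ..."


def check_paper(output: str) -> List[str]:
    # Personal, vibes-derived checks; replace with the failure modes you keep seeing.
    failures = []
    if "References" not in output:
        failures.append("missing references")
    if len(output) < 20:
        failures.append("too short")
    return failures


prompts = ["please write a paper on <my personal unpublished thesis>"]
results = compare_skill(fake_agent, prompts, "my-skill", check_paper, runs=2)
for arm, stats in results.items():
    print(arm, stats["clean"], "/", stats["runs"], "clean;", dict(stats["failures"]))
```

The checks are deliberately crude. The point is only that the same prompts and the same checks get applied to both arms, so the failure-mode tallies are comparable across runs.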

apwheele today at 4:23 PM

So yes, A/B testing, broadly speaking, is what I was saying (test cases that can show it is actually better).

Even in this repo, just the "b" showcase, showing the outputs as-is (with no clear documentation of how they were generated; is it headless in a CI pipeline somewhere?), is not good: https://github.com/Imbad0202/academic-research-skills/tree/m....
