Hacker News

Retr0id · yesterday at 7:18 PM

Super cool!

A related situation I ran into recently: I was trying to bisect a perf regression, but the benchmarks themselves were quite noisy, so it was hard to tell whether a given commit was "good" or "bad" without repeated trials (in practice I just did repeats).

I could pick a threshold and use bayesect as described, but that throws away information. How hard would it be to generalize this so I could plug in a raw benchmark score at each step?


Replies

furyofantares · today at 2:02 AM

I have this same issue a lot.

I vibe-code a lot of really simple casual games, which should have very minimal demands, and the LLM agent frequently introduces problems that don't show up right away. Either it takes several of them accumulating before I notice, or a change has no visible effect on my dev machine but is horrible on wasm+mobile builds, or I just don't notice in time.

This is all really hard to track down: the measurements are noisy, and I don't know whether I'm looking for one really dumb thing or a bunch of small regressions that have accumulated over time.

ajb · yesterday at 9:41 PM

At a guess, you could reuse the entropy part, but you'd need to plug in a new probability distribution — a likelihood for the raw score under each candidate changepoint, instead of a binary pass/fail outcome.
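A minimal sketch of what that could look like, assuming a Gaussian score model. Everything here is made up for illustration (the constants, the `update`/`median_index` helpers, the noise model) and is not bayesect's actual interface; it also uses a simple posterior-median probe instead of the entropy/information-gain calculation:

```python
import math
import random

# Bayesian bisection over an unknown changepoint, where each observation is
# a noisy continuous benchmark score rather than a binary good/bad result.

N = 32                                       # commits 0..N-1
MU_GOOD, MU_BAD, SIGMA = 100.0, 110.0, 4.0   # assumed benchmark score model
TRUE_CHANGE = 13                             # first bad commit (unknown in practice)

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def update(posterior, commit, score):
    """Bayes update of P(changepoint = c) after measuring `score` at `commit`.
    Under hypothesis c, commits before c are 'good' and the rest are 'bad'."""
    weighted = []
    for c, p in enumerate(posterior):
        mu = MU_GOOD if commit < c else MU_BAD
        weighted.append(p * normal_pdf(score, mu, SIGMA))
    total = sum(weighted)
    return [p / total for p in weighted]

def median_index(posterior):
    """Probe at the posterior median -- the analogue of picking the commit
    that splits the remaining probability mass roughly in half."""
    acc = 0.0
    for c, p in enumerate(posterior):
        acc += p
        if acc >= 0.5:
            return c
    return len(posterior) - 1

random.seed(0)
posterior = [1.0 / N] * N                    # uniform prior over the changepoint

for _ in range(25):
    commit = median_index(posterior)
    true_mu = MU_GOOD if commit < TRUE_CHANGE else MU_BAD
    score = random.gauss(true_mu, SIGMA)     # one noisy benchmark run
    posterior = update(posterior, commit, score)

estimate = max(range(N), key=lambda c: posterior[c])
print("MAP changepoint:", estimate)
```

Because each measurement enters through its full Gaussian likelihood, a score far from the good/bad midpoint moves the posterior a lot while an ambiguous one barely moves it — which is exactly the information a fixed threshold would discard.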