shay_ker • today at 4:27 PM • 1 reply • view on HN

It's a good question - is blackbox hacking as effective as whitebox hacking, for AI agents? I've gotta assume someone at Anthropic is putting together an eval as we speak.

Replies

hansvm • today at 4:59 PM

I don't really know, but I have a story which might prompt some conversation about it.

At $WORK we had a system which, if you traced its logic, could not possibly experience the bug we were seeing in production. This was a userspace control module for an FPGA driver connected to some machinery you really don't want to fuck around with, and the bug had wasted something like three staff+ engineer-years by the time I got there.

Recognizing that the bug was impossible in the userspace code if the system worked as intended end-to-end, the engineers started diving into Verilog and driver code, trying to find the issue. People were suspecting miscompilations and all kinds of fun things.

Eventually, for unrelated reasons, I decided to clean up the userspace code (deleting and refactoring things unlocks additional deletion and refactoring opportunities, and when all was said and done I had deleted 80% of the project, which gave me a better foundation for some features I had to add).

For one of those improvements, my observation was just that if I had to write the driver code to support the concurrency we were abusing, I'd be swearing up a storm and trying to find any way I could to solve a simpler problem instead.

Long story short, I still don't know what the driver bug was, but the actual authors must've felt the same way, since when I opted for userspace code with simpler concurrency demands the bug disappeared.

Tying it back to AI and hacking, the white-box approach here literally didn't work, and the black-box approach easily illuminated that something was probably fucky. Given that AI can de-minify and otherwise spot patterns from fairly limited data, I wouldn't be shocked if black-box hacking were (at least sometimes) more token-efficient than white-box.
