Hacker News

airza yesterday at 12:31 PM

How did you make sure your LLM was not trained on the original source code?


Replies

MrManatee yesterday at 5:13 PM

Exactly - it very likely was trained on it. I tried this with Opus 4.6. I turned off web searches and other tool calls, and asked it to list some filenames it remembers being in the 7-zip repo. It got dozens exactly right and only two incorrect (they were close but not exact matches). I then asked it to give me the source code of a function I picked randomly, and it got the signature spot on, but not the contents.

My understanding of cleanroom is that the person/team programming is supposed to have never seen any of the original code. The agent is more like someone who has read the original code line by line, but doesn't remember all the details - and isn't allowed to check.
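
For anyone who wants to repeat the filename check described above in a less ad hoc way, here is a rough sketch of how the model's answers could be scored against the real file list. Everything in it is an assumption for illustration: the function name `score_recall`, the 0.8 similarity cutoff, and the filenames in the example are hypothetical placeholders, not actual 7-zip files or actual model output.

```python
import difflib


def score_recall(recalled, actual, cutoff=0.8):
    """Sort recalled filenames into exact hits, near misses, and misses.

    recalled: filenames the model listed from memory (tools disabled).
    actual:   filenames that really exist in the repository.
    cutoff:   similarity threshold for a "close but not exact" match;
              0.8 is an arbitrary choice, not a calibrated value.
    """
    actual_set = set(actual)
    exact, near, missing = [], [], []
    for name in recalled:
        if name in actual_set:
            exact.append(name)
            continue
        # Look for the single most similar real filename above the cutoff.
        close = difflib.get_close_matches(name, actual, n=1, cutoff=cutoff)
        if close:
            near.append((name, close[0]))
        else:
            missing.append(name)
    return exact, near, missing


# Hypothetical placeholder data, not real repository contents.
actual_files = ["alpha_decoder.c", "beta_encoder.c", "gamma_utils.c"]
model_answers = ["alpha_decoder.c", "beta_encoder.c", "gama_utils.c", "nothing_like_it.h"]

exact, near, missing = score_recall(model_answers, actual_files)
print(f"exact: {len(exact)}, near: {len(near)}, missing: {len(missing)}")
```

A high exact-match count with tools turned off only suggests the repository was in the training data; it does not measure how much of the code itself the model can reproduce.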

pmarreck yesterday at 9:28 PM

Because it’s written in an entirely different language, which makes this whole point moot.
