How did you make sure your LLM was not trained on the original source code?

airza • yesterday at 12:31 PM • 2 replies

Replies

Exactly - it very likely was trained on it. I tried this with Opus 4.6. I turned off web searches and other tool calls, and asked it to list some filenames it remembers being in the 7-zip repo. It got dozens exactly right and only two incorrect (they were close but not exact matches). I then asked it to give me the source code of a function I picked randomly, and it got the signature spot on, but not the contents.

My understanding of cleanroom is that the person/team programming is supposed to have never seen any of the original code. The agent is more like someone who has read the original code line by line, but doesn't remember all the details - and isn't allowed to check.
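
For anyone who wants to repeat the filename-recall check described above in a less hand-wavy way, here is a minimal sketch, not the method actually used for that test. It assumes you already have two lists: the filenames the model produced with tools disabled, and the real listing of the repo. The function name `score_recall`, the 0.8 similarity threshold, and the sample filenames are all hypothetical, chosen for illustration only.

```python
from difflib import SequenceMatcher


def score_recall(recalled, actual, near_threshold=0.8):
    """Bucket each recalled filename as an exact hit, a near miss, or a miss.

    `near_threshold` is an arbitrary cutoff (assumption), not a calibrated value.
    """
    actual_set = set(actual)
    result = {"exact": [], "near": [], "miss": []}
    for name in recalled:
        if name in actual_set:
            result["exact"].append(name)
            continue
        # Similarity to the closest real filename, from 0.0 to 1.0.
        best = max(
            (SequenceMatcher(None, name, real).ratio() for real in actual),
            default=0.0,
        )
        bucket = "near" if best >= near_threshold else "miss"
        result[bucket].append(name)
    return result


if __name__ == "__main__":
    # Placeholder names, not real files from any repo.
    recalled = ["Alpha.c", "BetaDecoder.c", "Zzz.h"]
    actual = ["Alpha.c", "BetaDecode.c", "Gamma.h"]
    print(score_recall(recalled, actual))
```

A high count of exact hits for obscure filenames suggests the repo was in the training data; it says nothing about how much of the file contents the model can reproduce.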

pmarreck • yesterday at 9:28 PM

Because it’s written in an entirely different language, which makes this whole point moot.
