Hacker News

Dylan16807 yesterday at 5:47 AM

I care that it's within the ballpark I spent considerable detail explaining. I don't care where inside the ballpark it is.

You gave an exaggerated upper limit, so extreme there's no ambiguity, of "entire repo".

I gave my own exaggerated upper limit, so extreme there's no ambiguity. And mine has examples of it actually happening. Incidents so extreme they're clear violations.

Maybe an analogy will help: The point at which a collection of sand grains becomes a heap is ambiguous. But when we have documented incidents involving a kilogram or more of sand in a conical shape, we can skip refining the threshold and simply declare that yes, heaps are real. Incidents of major LLMs copying code, in a way that is full-on memorization and not just recreating things via chance and general code knowledge, are real.

You're the only person I've ever seen imply that true copying incidents are a statistical illusion, akin to a random die. Normally the debate is over how often they happen and how impactful they are, who is going to be held responsible, and what to do about them.


Replies

kelseyfrog yesterday at 10:16 PM

To recap, the original statement was, "Llm's do not verbatim disgorge chunks of the code they were trained on." We obviously both disagree with it.

While you keep trying to drag this toward an upper bound, I'm trying to illustrate that a coin with "//" reproduces a chunk of code. Again. I don't see much of a disagreement on that point either. What I continue to fail to elicit from you is the salient difference between the two.

I'm trying to find a scissor that distills your vibes into a consistent rule, and each time it's rebutted as though I'm trying to make an argument. If your system doesn't have consistency, just say so.
