
kelseyfrog yesterday at 10:16 PM

To recap, the original statement was, "Llm's do not verbatim disgorge chunks of the code they were trained on." We obviously both disagree with it.

While you keep trying to drag this toward an upper bound, I'm trying to illustrate that a coin with "//" reproduces a chunk of code. Again. I don't see much of a disagreement on that point either. What I continue to fail to elicit from you is the salient difference between the two.

I'm trying to find a scissor that distills your vibes into a consistent rule, and each time it's rebutted as if I'm trying to make an argument. If your system doesn't have consistency, just say so.


Replies

Dylan16807 today at 12:57 AM

I have a consistent rule. The rule is that if an LLM meets the threshold I set then it definitely violated copyright, and if it doesn't meet the threshold then we need more investigation.

We have proof of LLMs going over the threshold. So that answers the question.

Your illustrations are all in the "needs more investigation" area and they don't affect the conclusion.

We both agree that 1 token by itself is fine, and that some number is too many.

So why do you keep asking about that, as if it makes my argument inconsistent in some way? We both say the same thing!

We don't need to know the exact cutoff, or calculate how it varies. We only need to find violators that are over the cutoff.

How about you tell me what you want me to say? Do you want me to say my system is inconsistent? It's not. Having an area where the answer is unclear means the system is not able to answer every question, but it doesn't need to answer every question.

If you're accusing me of using "vibes" in a way that ruins things, then I counter that no, I give nice, specific, super-rare probabilities that are no more "vibes"-based than your suggestion of an entire repo.

> What I continue to fail to elicit from you is the salient difference between the two.

Between what, "//" and the threshold I said?

The salient difference between the two is that one is too short to be copyright infringement and the other is so long and specific that it's definitely copyright infringement (when the source is an existing file under copyright without permission to copy). What more do you want?

Just like 1 grain of sand is definitely not a heap and 1kg of sand is definitely a heap.

If you ask me about 2, 3, or 20 tokens, my answer is that I don't care and it doesn't matter, so don't pretend it's relevant to the question of whether LLMs have been infringing copyright or not ("verbatim disgorge chunks").