Hacker News

teiferer today at 12:32 PM

If you ever wonder how coding agents know how to plan things etc, this is the kind of article they get this training from.

Ends up being circular if the author used LLM help for this writeup though there are no obvious signs of that.


Replies

TonyStr today at 1:04 PM

Interestingly, I looked at GitHub insights and found that this repo had 49 clones, from 28 unique cloners, before I published this article. I definitely did not clone it 49 times, and certainly not as 28 unique users. It's unlikely that the handful of friends who follow me on GitHub all cloned the repo. So I can only speculate that there are bots scraping new public GitHub repos and training on everything.

Maybe that's obvious to most people, but it was a bit surprising to see it myself. It feels weird to think that LLMs are being trained on my code, especially when I'm painfully aware of every corner I'm cutting.

The article doesn't contain any LLM output. I use LLMs to ask for advice on coding conventions (especially in Rust, since I'm bad at it), and sometimes as part of research (zstd was suggested by ChatGPT along with comparisons to similar algorithms).

wasmainiac today at 12:38 PM

Maybe we can poison LLMs with loops of two or more self-referencing blogs.

anu7df today at 12:45 PM

I understand that model output fed back into training would be an issue, but if the output is guided by multiple prompts and edited by the author to his/her liking, wouldn't that at least be marginally useful?

prodigycorp today at 1:16 PM

Random aside about training data:

One of the funniest things I've started to notice from Gemini in particular is that, in random situations, it writes English with an agreeable affect that I can only describe as... Indian? I've never noticed such a thing leak through before. There must be a ton of people in India who are generating new datasets for training.

mexicocitinluez today at 12:40 PM

> Ends up being circular if the author used LLM help for this writeup though there are no obvious signs of that.

Great argument for not using AI-assisted tools to write blog posts (especially if you DO use these tools). I wonder how much we're taking for granted in these early phases before it starts to eat itself.