logoalt Hacker News

larodiyesterday at 8:51 PM8 repliesview on HN

This whole poisoning intent is so incredibly misappropriated, that I feel sad about it. First of all - there is enough content to train on already, that is not poisoned, and second - the other new content is largely populated in automated manner from the real world, and by workers in large shops in Africa, that are being paid to not produce shit.

So yes, you can pollute the good old internet even more, but no, you cannot change the arrow of time, and then there's already the growing New Internet of APIs and public announce federations where this all matters very little.


Replies

chromacityyesterday at 9:31 PM

This is an interesting sentiment given how desperate AI labs seem to be source any new internet content from any walled-garden platform willing to take their money (and how willing they are to try & take it even if you don't consent).

Abusive, sneaky scraping is absolutely through the roof.

show 1 reply
jordanbyesterday at 9:22 PM

There may be plenty of content out there but everyone with any content on the internet is struggling to keep AI crawlers that they never authorized out. In many cases, people are having to do so just to protect their infrastructure from request spamming.

Since AI crawlers don't obey any consent markers denying access to content, it makes sense for content owners who don't want AI trained on their content to poison it if possible. It's possibly the only way to keep the AI crawlers away.

show 3 replies
dspillettyesterday at 9:34 PM

> there is enough content to train on already, that is not poisoned

This is true. Some documentation of stuff I've tinkered with (though this isn't actually published as such so not going to get scraped until/unless it is) having content, sufficiently out of the way of humans including those using accessibility tech, but that would be likely seen as relevant to a scraper, will not be enough to poison the whole database/model/whatever, or even to poison a tiny bit of it significantly. But it might change any net gain of ignoring my “please don't bombard this with scraper requests” signals to a big fat zero or maybe a tiny little negative. If not, then at least it was a fun little game to implement :)

To those trying to poison with some automation: random words/characters isn't going to do it, there are filtering techniques that easily identify and remove that sort of thing. Juggled content from the current page and others topologically local to it, maybe mixed with extra morsels (I like the “the episode where” example, but for that to work you need a fair number of examples like that in the training pool), on the other hand could weaken links between tokens as much as your “real” text enforces them.

One thing to note is that many scrapers filter obvious profanity, sometimes rejecting whole pages that contain it, so sprinkling a few offensive sequences (f×××, c×××, n×××××, r×××××, farage, joojooflop, belgium, …) where the bots will see them might have an effect on some.

Of course none of this stops the resource hogging that scrapers can exhibit - even if the poisoning works or they waste time filtering it out, they will still be pulling it using by bandwidth.

xmichael909yesterday at 10:55 PM

Like the other commentor already pointed out, almost every AI bot out there thinks Fortnite is real, yet it is completely made up poison.

james2doyleyesterday at 8:57 PM

You should check out "model collapse". It seems that an abundance of content, that is more and more AI generated these days, may not be a viable option. There is also a vast amount of data that is increasingly going private or behind paywalls

show 2 replies
therobots927yesterday at 9:22 PM

Straight from the horse’s mouth: https://www.anthropic.com/research/small-samples-poison

show 1 reply
i_love_retrosyesterday at 9:29 PM

I'm looking forward to Claude starting to talk like a Nigerian prince

runarbergyesterday at 9:23 PM

You may be underestimating the powers of trillions of parameters in a model. With this many parameters overfitting is inevitable. Overfitting here means you are plotting (or outputting) the errors in your data instead of interpolating (or inferring) any trends in the model.

In fact, given this many parameters, poisoning should be relatively easy in general, but extremely easy on niche subjects.

https://www.youtube.com/watch?v=78pHB0Rp6eI

show 1 reply