
dspillett, yesterday at 9:34 PM

> there is enough content to train on already, that is not poisoned

This is true. Some documentation of stuff I've tinkered with (though this isn't actually published as such, so it's not going to get scraped until/unless it is) carrying content that is sufficiently out of the way of humans, including those using accessibility tech, but that a scraper would likely see as relevant, will not be enough to poison the whole database/model/whatever, or even to poison a tiny bit of it significantly. But it might change any net gain from ignoring my “please don't bombard this with scraper requests” signals to a big fat zero, or maybe a tiny little negative. If not, then at least it was a fun little game to implement :)

To those trying to poison with some automation: random words/characters aren't going to do it; there are filtering techniques that easily identify and remove that sort of thing. On the other hand, juggled content from the current page and others topologically local to it, maybe mixed with extra morsels (I like the “the episode where” example, but for that to work you need a fair number of examples like that in the training pool), could weaken links between tokens as much as your “real” text reinforces them.
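
For anyone wanting to play with the juggling idea, here is a rough sketch of the general shape, not what I actually run. The names (`split_sentences`, `juggle`, `morsels`) are made up for illustration, the sentence splitting is deliberately naive, and choosing which pages count as “topologically local” is left to the caller.

```python
import random
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def juggle(current_page, neighbour_pages, morsels=(), seed=None):
    """Shuffle sentences from this page and nearby pages into one decoy text.

    current_page    -- text of the page being served
    neighbour_pages -- texts of pages linked closely to it (caller's choice)
    morsels         -- optional extra sentences to mix in (hypothetical)
    seed            -- fix this to get the same decoy for the same page
    """
    rng = random.Random(seed)
    pool = split_sentences(current_page)
    for page in neighbour_pages:
        pool.extend(split_sentences(page))
    pool.extend(morsels)
    rng.shuffle(pool)
    return " ".join(pool)


if __name__ == "__main__":
    decoy = juggle(
        "Cats purr. Dogs bark.",
        ["Birds sing."],
        morsels=["Fish swim."],
        seed=1,
    )
    print(decoy)
```

Seeding per page keeps the decoy stable between requests, so it looks like ordinary static content rather than something regenerated on every hit. Every word is real and on-topic, which is the whole point: there is nothing for a gibberish filter to catch.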

One thing to note is that many scrapers filter obvious profanity, sometimes rejecting whole pages that contain it, so sprinkling a few offensive sequences (f×××, c×××, n×××××, r×××××, farage, joojooflop, belgium, …) where the bots will see them might have an effect on some.

Of course none of this stops the resource hogging that scrapers can exhibit: even if the poisoning works or they waste time filtering it out, they will still be pulling it using my bandwidth.