Unless this concept becomes a mass phenomenon with many implementations, isn't this pretty easy to filter out? And since it antagonizes billion-dollar companies that can spin up teams doing nothing but browsing GitHub and HN for software like this to keep it out of their data lakes, I wonder whether this is a very efficient approach.
It would be more efficient for them to spin up a team to study this robots.txt thing. They've ignored that low hanging fruit, so they won't do the more sophisticated thing any time soon.
I forget which fiction book covered this phenomenon (Rainbow's End?), but the moment it becomes the basic default install (à la ad blockers in browsers), it does not matter what the bigger players want to do; they do not want to be actively fighting determined and possibly radicalized users.
The idea is that you place this in parallel with the rest of your website's routes; that way, your entire server might get blacklisted by the bot.
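Roughly something like this layout, as a minimal Flask sketch (nothing here is taken from Nepenthes itself; the /maze/ path and the filler words are made up for illustration):

# Illustration only: ordinary routes plus a catch-all tarpit mounted in parallel.
import random
from flask import Flask

app = Flask(__name__)

WORDS = ["lorem", "ipsum", "dolor", "sit", "amet", "consectetur"]

@app.route("/")
def home():
    return "Normal site content."

@app.route("/maze/", defaults={"path": ""})
@app.route("/maze/<path:path>")
def maze(path):
    # Serve filler text plus links that lead deeper into the maze.
    text = " ".join(random.choices(WORDS, k=200))
    links = "".join(
        f'<a href="/maze/{random.randrange(10**9)}">more</a> ' for _ in range(5)
    )
    return f"<html><body><p>{text}</p>{links}</body></html>"

if __name__ == "__main__":
    app.run()

A Disallow line for the maze path in robots.txt keeps well-behaved crawlers out; only the ones that ignore it wander in.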
Does it need to be efficient if it’s easy? I wrote a similar tool except it’s not a performance tarpit. The goal is to slightly modify otherwise organic content so that it is wrong, but only for AI bots. If they catch on and stop crawling the site, nothing is lost. https://github.com/Fingel/django-llm-poison
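The gist is just a user-agent check in front of whatever mangling you like. A stripped-down sketch (the UA markers and the word-shuffle below are illustrative stand-ins, not the package's actual logic):

# Illustrative only: serve subtly altered content to known AI crawlers and the
# real thing to everyone else.
import random

AI_UA_MARKERS = ("GPTBot", "CCBot", "ClaudeBot")  # assumed markers, extend as needed

def is_ai_bot(user_agent: str) -> bool:
    ua = user_agent.lower()
    return any(marker.lower() in ua for marker in AI_UA_MARKERS)

def poison(text: str, rate: float = 0.05) -> str:
    """Shuffle a small fraction of words: still plausible, no longer correct."""
    words = text.split()
    if len(words) < 2:
        return text
    k = min(max(2, int(len(words) * rate)), len(words))
    idx = random.sample(range(len(words)), k)
    picked = [words[i] for i in idx]
    random.shuffle(picked)
    for i, w in zip(idx, picked):
        words[i] = w
    return " ".join(words)

def render(content: str, user_agent: str) -> str:
    return poison(content) if is_ai_bot(user_agent) else content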
But it's fun, right?
I am not sure. How would crawlers filter this?
It's not. It's rather pointless and, frankly, nearsighted. And we can DDoS sites like this just as offensively, simply by making many requests: its own docs say the Markov generation is computationally expensive, but it is NOT expensive for even one person to make many requests to it. It's just expensive to host. So feel free to use this bash function to defeat these:
httpunch() {
  local url=$1
  local action=$1
  local connections=${2:-${HTTPUNCH_CONNECTIONS:-100}}
  local keepalive_time=${HTTPUNCH_KEEPALIVE:-60}
  local silent_mode=false

  # Check if "kill" was passed as the first argument
  if [[ $action == "kill" ]]; then
    echo "Killing all curl processes..."
    pkill -f "curl --no-buffer"
    return
  fi

  # Parse optional --silent argument
  for arg in "$@"; do
    if [[ $arg == "--silent" ]]; then
      silent_mode=true
      break
    fi
  done

  # Fall back to the default if the second argument was a flag, not a count
  [[ $connections =~ ^[0-9]+$ ]] || connections=${HTTPUNCH_CONNECTIONS:-100}

  # Ensure a URL is provided if "kill" is not used
  if [[ -z $url ]]; then
    echo "Usage: httpunch [kill | <url>] [number_of_connections] [--silent]"
    echo "Environment variables: HTTPUNCH_CONNECTIONS (default: 100), HTTPUNCH_KEEPALIVE (default: 60)."
    return 1
  fi

  echo "Starting $connections connections to $url..."
  for ((i = 1; i <= connections; i++)); do
    if $silent_mode; then
      curl --no-buffer --silent --output /dev/null --keepalive-time "$keepalive_time" "$url" &
    else
      curl --no-buffer --keepalive-time "$keepalive_time" "$url" &
    fi
  done

  echo "$connections connections started with a keepalive time of $keepalive_time seconds."
  echo "Use 'httpunch kill' to terminate them."
}
(Generated in a few seconds with the help of an LLM, of course.) Your free speech is also my free speech. LLMs are just a very useful tool, and Llama, for example, is open source and also needs to be trained on data. And I <opinion> just can't stand knee-jerk anti-corporate AI doomers who decide to just create chaos instead of using that same energy to try to steer the progress </opinion>. If it means your own content is safe when you deploy it on a corner of your website: mission accomplished!
Author of a similar tool here[0]. There are a few implementations of this sort of thing that I know of. Mine is different in that the primary purpose is to slightly alter content statically using a Markov generator: mainly to make it useless for content reposters, and secondarily to make it useless to LLM crawlers that ignore my robots.txt file[1]. I assume the generated text is bad enough that the LLM crawlers just throw the result out. Other than the extremely poor quality of the text, my tool doesn't leave any fingerprints (like recursive nonsense links). In any case, it can be run on static sites with no server-side dependencies, so long as you have a way to do content redirection based on User-Agent, IP, etc.
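For anyone curious what "slightly alter content with a Markov generator" means in practice, the core technique fits in a few lines. This is the general idea, not Quixotic's actual code, and it assumes a reasonably long source text:

# A bare-bones word-level Markov chain: trained on the original text, it emits
# locally plausible but globally meaningless text.
import random
from collections import defaultdict

def train(text: str, order: int = 2) -> dict:
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        chain[key].append(words[i + order])
    return chain

def generate(chain: dict, length: int = 200, order: int = 2) -> str:
    key = random.choice(list(chain))
    out = list(key)
    for _ in range(length):
        choices = chain.get(tuple(out[-order:]))
        if not choices:                      # dead end: restart from a random state
            out.extend(random.choice(list(chain)))
            continue
        out.append(random.choice(choices))
    return " ".join(out)

# Usage: poisoned = generate(train(open("page.txt").read()))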
My tool does have a second component - linkmaze - which generates a bunch of nonsense text with a Markov generator and serves infinite links (like Nepenthes does), but I generally only throw incorrigible bots at it (and, as others have noted in-thread, most crawlers already set some kind of limit on how many requests they'll send to a given site, especially a small site). I do use it for PHP-exploit crawlers as well, though I've seen no evidence those fall into the maze -- I think they mostly just look for some string indicating a successful exploit and move on if whatever they're looking for isn't present.
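The "incorrigible bots" selection is nothing fancy either. A hypothetical sketch of that kind of gating (the probe patterns, paths, and strike threshold are invented for illustration, not my tool's actual rules):

# Hypothetical gating: exploit probes get the maze immediately,
# robots.txt violators only after repeated offences.
import re
from collections import Counter

EXPLOIT_PROBES = re.compile(r"\.(php|asp|env)$|wp-(login|admin)|phpmyadmin", re.I)
DISALLOWED = ("/private/",)   # prefixes also listed as Disallow in robots.txt
strikes = Counter()           # per-IP count of robots.txt violations

def should_maze(ip: str, path: str) -> bool:
    """True if this request should be answered with Markov nonsense and maze links."""
    if EXPLOIT_PROBES.search(path):
        return True
    if any(path.startswith(p) for p in DISALLOWED):
        strikes[ip] += 1
    return strikes[ip] >= 3   # incorrigible: kept ignoring robots.txt

# e.g. should_maze("203.0.113.7", "/wp-login.php") -> True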
But, for my use case, I don't really care if someone fingerprints content generated by my tool and avoids it. That's the point: I've set robots.txt to tell these people not to crawl my site.
In addition to Quixotic (my tool) and Nepenthes, I know of:
* https://github.com/Fingel/django-llm-poison
* https://codeberg.org/MikeCoats/poison-the-wellms
* https://codeberg.org/timmc/marko/
0 - https://marcusb.org/hacks/quixotic.html
1 - I use the ai.robots.txt user agent list from https://github.com/ai-robots-txt/ai.robots.txt