
capitainenemo 10/04/2024

So, I have a few objections here. First off, CAPTCHAs are not "by definition" about fingerprinting users. They are "by definition" a Turing test for distinguishing humans from bots. It just turns out that is hard to do, so CAPTCHAs pivoted to fingerprinting instead. Secondly, sites often are unaware or not given the choice. Businesses are sold the idea that they are being protected against bots, when in fact they are turning away real users. Many I contacted were unaware this was happening. In fact, the servers in between are not even integrated in a way to support a reasonable fallback. For example, on some sites (FedEx, Kickstarter) the "captcha" is returned by a JSON API that is completely unable to handle it or present it to the user. Thirdly, the fingerprinting is broadly applied with NO exceptions. You would think a simple heuristic would be "the user has used this IP for the past 5 years to authenticate to this website, with the same browser UA - we can probably let them through" but, no, they kick it over to a third-party automated system, one that can completely break authentication, to fingerprint their users, on pages with personal information at that. They often don't offer any other options either, like additional auth challenges.
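
As a rough illustration of the "known IP and UA" heuristic above, here is a minimal Python sketch. Everything in it is an assumption made for the example: the function name, the shape of the login history (tuples of IP, user agent, and timestamp for past successful logins), and the five-year cutoff taken literally from the quote. It is not how any real site or CAPTCHA vendor does this.

```python
from datetime import datetime, timedelta


def is_long_standing_client(history, ip, user_agent, now,
                            min_age=timedelta(days=5 * 365)):
    """Return True if this exact IP + UA pair first logged in at least min_age ago.

    history is a hypothetical iterable of (ip, user_agent, timestamp) tuples
    describing past successful logins for one account.
    """
    matching = [ts for h_ip, h_ua, ts in history
                if h_ip == ip and h_ua == user_agent]
    # No matching login at all means we know nothing about this client.
    return bool(matching) and now - min(matching) >= min_age
```

A site could consult a check like this before handing the request to a third-party challenge, and fall back to an additional auth challenge when it returns False.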

So, yeah, people are being told "well, we have to fingerprint users, we have no choice" and the ironic thing is the battle is being lost anyway, and real damage is being done in the false positives, esp. if the site is tech savvy.

But whatever. I'm aware I won't convince you, I'm aware I'm in the minority, most people accept the status quo, or are unaware of the abuses, but it's being implemented poorly, it isn't working, it's harming real people and the internet as a whole, and it is not an adequate fix.


Replies

imiric 10/05/2024

Hey, thanks for taking the time to write such a thoughtful reply. I'm always open to counterarguments to what I'm saying, and happy to discuss them in a civil manner. I think such discussions are healthy, even without the expectation that we're going to convince one another.

I think our main disagreement is about what constitutes a "fingerprint", and whether CAPTCHAs can work without it.

Let's start from basic principles...

The "Turing test" in the CAPTCHA acronym is merely a vague historical descriptor of what these tools actually do. For one, the arbiter in the original Turing test was a human. In contrast, the "Completely Automated" part means that the arbiter in CAPTCHAs has to be a machine.

Secondly, the original Turing test involved a natural language conversation. This would be highly impractical in the context of web applications, and would also go against the "Completely Automated" part.

Furthermore, humans can be easily fooled by machines in such a test nowadays, as the original Turing test has been decidedly broken with recent AI advancements.

So taking all of this into account, since machines don't have reasoning capabilities (yet) to make the bot-or-not distinction in the same way that a human would, we have to instead provide them with inputs that they can actually process. This inevitably means that the more information we can gather about the user, the higher the accuracy of their predictions will be.

This is why I say that CAPTCHAs have to involve fingerprints _by definition_. They wouldn't be able to do their job otherwise.

Can we agree on this so far?

Now let's define what a fingerprint actually is. It's just a collection of data points about the user. In your example, the IP address and user agent are a couple of data points. The question is: are just these two alone enough information for a CAPTCHA to accurately do its job? The IP address can be shared by many users, and can be dynamic. The user agent can be easily spoofed, and is not reliable. So I think we can agree that the answer to that question is "no".

This means that we need much more information for a CAPTCHA to work. This is where device information, advanced heuristics and behavioral signals come into play. Is the user interacting with the page? How human-like are their interactions? Are there patterns in this activity that we've seen before? What device are they using (or claim to be using)? Can we detect a browser automation tool being used? All of these data points, and many more, go into making an accurate bot-or-not decision. We can't rely on any single data point in isolation, but all of them in combination give us a better picture.
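
To make the "combination of signals" idea concrete, here's a minimal Python sketch. The signal names, weights and threshold are all made up for illustration; real systems use far more signals and typically a trained model rather than hand-picked weights. The only point is that no single signal decides the outcome.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    # Hypothetical signals, each normalized to 0.0 (bot-like) .. 1.0 (human-like).
    pointer_activity: float         # is the user interacting with the page?
    interaction_naturalness: float  # how human-like are the interactions?
    device_consistency: float       # does the claimed device match what we observe?
    no_automation_detected: float   # 1.0 when no automation tool is detected


# Illustrative weights only; they sum to 1.0 so the score stays in 0.0..1.0.
WEIGHTS = {
    "pointer_activity": 0.2,
    "interaction_naturalness": 0.3,
    "device_consistency": 0.2,
    "no_automation_detected": 0.3,
}


def human_score(signals: Signals) -> float:
    """Weighted average of all signals."""
    return sum(getattr(signals, name) * w for name, w in WEIGHTS.items())


def is_probably_human(signals: Signals, threshold: float = 0.6) -> bool:
    """Bot-or-not decision based on the combined score, not any one signal."""
    return human_score(signals) >= threshold
```

With these made-up weights, a client that looks perfect on one signal but bot-like on the rest still scores well below the threshold, which is the "no single data point in isolation" property described above.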

Now, this inevitably becomes a very accurate "fingerprint" of the user. Advertisers would love to get ahold of this data, and use it for tracking and targeting purposes. The difference is in how it is used. A privacy-conscious CAPTCHA implementation that follows regulations like the GDPR would treat this data as a liability rather than an asset. The data wouldn't be shared with anyone, and would be purged once it's no longer needed.

The other point I'd like to emphasize is that the internet is becoming more difficult and dangerous to use by humans. We're being overrun with bots. As I linked in my previous reply, an estimated 36% of all global traffic comes from bots. This is an insane statistic, which will only grow as AI becomes more accessible.

So all of this is to say that we need automated ways to tell humans and computers apart to make the internet safer and actually usable by humans, and CAPTCHAs are so far the best system we have for it. They're far from being perfect, and I doubt we'll ever reach that point. Can we do a better job at it? Absolutely. But the alternative of not using them is much, much worse. If you can think of a better way of solving these problems without CAPTCHAs, I'm all ears.

The examples you mention are logistical and systemic problems in organizations. Businesses need to be more aware of these issues, and how to best address them. They're not indicators of problems with CAPTCHAs themselves, but with how they're used and configured in organizations.

Sorry for the wall of text, but I hope I clarified some of my thoughts on this, and that we can find a middle ground somewhere. :) Cheers!

Another point I forgot to mention: it's certainly possible to not gather all these signals. We can present an actual puzzle to the user, confirm whether they solve it correctly, and use signals only from the interaction with the puzzle itself. There are two problems with this. The first is that it's incredibly annoying and disruptive to actual humans. Nobody wants to solve puzzles to access some content. This is also far from being a "Completely Automated" test... The second problem is that machines have become increasingly good at solving these puzzles themselves. The standard image macro puzzle has been broken for many years. Object and audio recognition is now broken as well. You see some CAPTCHA implementations coming up with more creative puzzles, but these will all inevitably be broken as well. So puzzles are just not a user-friendly or reliable way of doing bot detection.
