Hacker News

nikitaga yesterday at 11:53 PM (26 replies)

Scraping static content from a website at near-zero marginal cost to its server, vs scraping an expensive LLM service provided for free, are different things.

The former relies on fairly controversial ideas about copyright and fair use to qualify as abuse, whereas the latter is direct financial damage – by your own direct competitors no less.

It's fun to poke at a seeming hypocrisy of the big bad, but the similarity in this case is quite superficial.


Replies

PunchyHamster today at 1:29 AM

> Scraping static content from a website at near-zero marginal cost to its server, vs scraping an expensive LLM service provided for free, are different things.

I bet people being fucking DDoSed by AI bots disagree.

Also the fucking ignorance of assuming it's "static content" and not something that needs code running.

not2b today at 1:10 AM

I understand why OpenAI is trying to reduce its costs, but the claim that AI crawlers aren't creating very significant load simply isn't true, especially for those crawlers that ignore robots.txt and hide their identities. This is direct financial damage, and it's particularly hard on nonprofit sites that have been around a long time.

lm411 today at 3:18 AM

That is ridiculous.

You imply that "an expensive LLM service" is harmed by abuse, but every other service is not? Because their websites are "static" and "near-zero marginal cost"?

You have no clue what you are talking about.

cicko today at 6:19 AM

Interesting how other people's cost is "near-zero marginal cost" while yours is "an expensive LLM service". Also, others' rights are "fairly controversial ideas about copyright and fair use" while yours is "direct financial damage". I like how you frame this.

sandeepkd today at 2:36 AM

Let's not try to excuse the wrongs by picking a metric and evaluating just one side of it. A static website owner could be running on a very small budget, and scraping by bots can bring down their business too. The chances of a static website owner burning through their own life savings are probably higher.

AmbroseBierce today at 4:49 AM

It's not like those models are expensive because of the usefulness they extracted from scraping others without permission, right? You are not even scratching the surface of the hypocrisy.

alsetmusic today at 2:33 AM

Have you not seen the multiple posts that have reached the front page of HN with people taking self-hosted Git repos offline or having their personal blogs hammered to hell? Cause if you haven't, they definitely exist and get voted up by the community.

wolvoleo today at 5:59 AM

It's all the more ironic because without all the scraping OpenAI has done, there would have been no ChatGPT.

Also, it's not just the cost of the bandwidth and processing. Information has value too. Otherwise they wouldn't bother scraping it in the first place. They compete directly with the websites featuring their training data and thus they are taking away value from them just as the bots do from ChatGPT.

In fact the more I think of it, I think it's exactly the same thing.

VadimPR today at 6:10 AM

Getting scraped by abusive bots that bring down the website because they overload the DB with unique queries is not marginal. I spent a good half of last year adding extra layers of caching, Cloudflare, you name it, because our little hobby website kept getting DDoS'd by the bots scraping the web for training data.

Never in 15 years of running the website did we have such issues, and you can be sure that cache layers were in place already for it to last this long.
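The kind of extra caching layer described above can be sketched roughly as follows. This is a minimal, hypothetical in-memory TTL cache, not the commenter's actual setup; all names are illustrative:

```python
import time

class TTLCache:
    """Minimal in-memory cache: absorbs repeated queries before they reach the DB."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = time.monotonic()
        entry = self.store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: no DB work
        value = compute()    # cache miss: run the expensive query
        self.store[key] = (now + self.ttl, value)
        return value
```

Note the limitation the comment hints at: a cache only helps when keys repeat. Scrapers issuing unique queries bypass it entirely, which is why caching alone didn't stop the overload.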

the_sleaze_ today at 4:14 AM

60% of our traffic is bots, on average. Sometimes almost 100%.

not_your_vase today at 5:06 AM

> near-zero marginal cost

Lol, you single-handedly created a market for Anubis, and in the past 3 years Cloudflare captchas have multiplied at least 10-fold; now they are even on websites that were very vocal against them. Many websites are still drowning: the GNU family of sites is regularly accessible only through the Wayback Machine.

Spare me your tears.

razingeden today at 1:25 AM

It is direct financial damage if my server's not on an unmetered connection. After years of bills coming in around $3/mo, I got a surprise >$800 bill for a site nobody on earth appears to care about besides AI scrapers.

It hasn't even been updated in years, so hell if I know why it needs to be fetched constantly and aggressively. But fuck every single one of these companies now whining about bots scraping and victimizing them; here's my violin.

SkiFire13 today at 5:47 AM

> Scraping static content

How do you know the content is static?

bakugo yesterday at 11:55 PM

The cost is so marginal that many, many websites have been forced to add cloudflare captchas or PoW checks before letting anyone access them, because the server would slow to a crawl from 1000 scrapers hitting it at once otherwise.
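The PoW (proof-of-work) checks mentioned here, the mechanism behind tools like Anubis, work by making every request cost the client some CPU before the server does any real work. A minimal sketch, with illustrative difficulty parameters (not any specific tool's actual scheme):

```python
import hashlib
import os

def meets_difficulty(digest: bytes, bits: int) -> bool:
    """True if the digest starts with at least `bits` zero bits."""
    return int.from_bytes(digest, "big") >> (len(digest) * 8 - bits) == 0

def solve(challenge: bytes, bits: int) -> int:
    """Brute-force a nonce: cheap for one visitor, expensive at scraper scale."""
    nonce = 0
    while True:
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        if meets_difficulty(digest, bits):
            return nonce
        nonce += 1

# The server issues a random challenge; the client must burn CPU to answer it,
# and the server verifies the answer with a single cheap hash.
challenge = os.urandom(16)
nonce = solve(challenge, bits=12)  # ~4096 hash attempts on average
```

The asymmetry is the point: verifying costs one hash, solving costs thousands, so a scraper making millions of requests pays millions of times the price a human visitor does.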

gmerc today at 7:04 AM

It's not for techbros to decide at what threshold of theft it's actually theft. "My GPU time is more valuable than your CPU time" isn't a thing, and Wikipedia's latest numbers on scraping show that marginal costs at scale are a valid concern.

heyethan today at 2:41 AM

I think this also explains why the checks are moving up the stack.

If the real cost is in actually running the app or the model, then just verifying that the client is a real browser isn't enough anymore. You need to verify that the expensive part actually happened.

Otherwise you’re basically protecting the cheapest layer while the expensive one is still exposed.

swagmoney1606 today at 2:09 AM

And yet I have to pay in my time and cash to handle the constant DDoSes from the constant LLM scraping.

make3 today at 5:10 AM

Absolutely not, the former relies on controversial ideas to qualify as legal.

Stealing the content from the whole planet & actively reducing the incentive to visit the sites without financial restitution is pretty bad.

nozzlegear today at 2:35 AM

Are they, actually?

platybubsy today at 7:04 AM

Bait or genuine techbro? Hard to say

AtlasBarfed today at 12:59 AM

Because you say it is?

I obviously disagree. I mean, on top of this we are talking about not-open OpenAI.


karlshea today at 12:59 AM

I don’t know what world you live in but it’s not this one.

nslsm yesterday at 11:56 PM

The issue is that there are so many awful webmasters with websites that take hundreds of milliseconds to generate a page and get brought down by a couple of requests per second.
