
tensegrist · yesterday at 10:37 PM

The more time passes, the more I'm convinced that the solution is to somehow force everyone to go through something like Common Crawl.

I don't want people's servers pegged at 100% because a stupid DFS scraper is exhaustively traversing their search facets, but I also want the web to remain scrapable by ordinary people, or rather to go back to how readily scrapable it was before the invention of Cloudflare.

As a middle ground, perhaps we could agree on a new /.well-known/ path meant to contain links to timestamped data dumps?
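
For illustration, here is a minimal sketch of what consuming such a path could look like; the /.well-known/data-dumps.json name and the JSON layout shown in the comments are hypothetical, since nothing like this is standardized today:

    # Sketch: fetch a (hypothetical) well-known index of timestamped data dumps.
    # Assumes the site serves JSON shaped like:
    #   {"dumps": [{"timestamp": "2024-05-01T00:00:00Z",
    #               "url": "https://example.com/dumps/2024-05-01.tar.gz"}]}
    import json
    import urllib.request

    def list_dumps(site: str) -> list[dict]:
        url = f"https://{site}/.well-known/data-dumps.json"  # hypothetical path
        with urllib.request.urlopen(url, timeout=30) as resp:
            index = json.load(resp)
        # Newest first, so a consumer can grab only what it is missing.
        return sorted(index["dumps"], key=lambda d: d["timestamp"], reverse=True)

    if __name__ == "__main__":
        for dump in list_dumps("example.com"):
            print(dump["timestamp"], dump["url"])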


Replies

nostrademons · yesterday at 10:40 PM

That's sorta what MetaBrainz did: they offer their whole DB as a single tarball dump, much like Wikipedia does. Downloading it took on the order of an hour; if I need a MusicBrainz lookup, I just do a local query.

For this strategy to work, people need to actually use the DB dumps instead of just defaulting to scraping. Unfortunately scraping is trivially easy, particularly now that AI code assistants can write a working scraper in ~5-10 minutes.
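
As a rough illustration of the download-once, query-locally pattern: the TSV filename and columns below are placeholders for whatever you extract from a dump, not the real MusicBrainz layout.

    # Load an extracted artist table into SQLite once, then query it locally.
    import csv
    import sqlite3

    def build_local_db(tsv_path: str, db_path: str = "musicbrainz_local.db") -> None:
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS artist (mbid TEXT PRIMARY KEY, name TEXT)")
        with open(tsv_path, newline="", encoding="utf-8") as f:
            rows = ((r["mbid"], r["name"]) for r in csv.DictReader(f, delimiter="\t"))
            conn.executemany("INSERT OR REPLACE INTO artist VALUES (?, ?)", rows)
        conn.commit()
        conn.close()

    def lookup(name: str, db_path: str = "musicbrainz_local.db") -> list[tuple]:
        # Local substring match instead of hitting the MusicBrainz API.
        conn = sqlite3.connect(db_path)
        try:
            return conn.execute(
                "SELECT mbid, name FROM artist WHERE name LIKE ?", (f"%{name}%",)
            ).fetchall()
        finally:
            conn.close()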

tpmoney · yesterday at 10:55 PM

I'll propose my pie-in-the-sky plan here again. We should overhaul the copyright system completely in light of AI and make it mostly win-win for everyone. This is predicated on the idea that the MNIST digit set is sort of the "hello world" dataset for people learning machine vision, and that having a common data set is really handy. The numbers below are made up off the top of my head and subject to tuning, but the basic idea is this:

1) Cut copyright to 15-20 years by default. You can get one extension of an additional 10-15 years if you submit your work to the "National Data Set" within, say, 2-3 years of initial publication.

2) Content in the National set is well categorized and cleaned up; it's the cleanest data set anyone could want. The data set is used to train some public models and is also licensed out to people who want to train their own models. Both the public models and the data sets are licensed for nominal fees.

3) People who use the public models or data sets as part of their AI system are granted immunity from copyright violation claims for content generated by these models, modulo some exceptions for knowing and intentional violations (e.g. regenerating the contents of a book as an epub). People who choose to scrape their own data are subject to the current state of the law with regard to both scraping and use (so you'd probably better be buying a lot of books).

4) The fees generated from licensing the data and the models would be split into royalty payments to people whose works are in the dataset and still under copyright, proportional to the amount of data submitted and inversely proportional to the age of that data. There would be some absolute caps in place to prevent people from slamming the national data set with junk data just to pump their numbers.
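
A toy sketch of how that split might be computed; the weighting, the cap, and the input layout are all made-up tuning knobs rather than part of any real scheme:

    # Shares proportional to data contributed, inversely proportional to age,
    # with a per-contributor cap to blunt junk-data dumping. Any pool money
    # clipped by the cap is simply left undistributed in this sketch.
    def royalty_shares(pool: float, works: list[dict], cap_fraction: float = 0.05) -> dict[str, float]:
        # works: [{"owner": "...", "bytes": int, "age_years": float}, ...]
        weights: dict[str, float] = {}
        for w in works:
            weight = w["bytes"] / (1.0 + w["age_years"])  # newer + bigger => larger share
            weights[w["owner"]] = weights.get(w["owner"], 0.0) + weight
        total = sum(weights.values()) or 1.0
        return {owner: min(pool * wt / total, pool * cap_fraction)
                for owner, wt in weights.items()}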

Everyone gets something out of this. AI folks get clean data that they didn't have to burn a lot of resources scraping. Copyright holders get paid when their works are used by AI and retain most of the protections they have today (just for a shorter time). The public gets usable AI tooling without everyone spending their own resources on building their own data sets, and site owners and the like get reduced bot/scraping traffic. It's not perfect, and I'm sure the devil is in the details, but that's the nature of this sort of thing.

Imustaskforhelp · yesterday at 10:41 PM

If someone wants to scrape, and I mean not the complete internet the way Google does but at a niche level (like a forum you wish to scrape):

I like to create Tampermonkey scripts for this. They're a more lightweight/easier way to build extensions, imo.

Now, I don't like AI, but I don't know anything about scraping, so I used AI to generate the scraping code, pasted it into Tampermonkey, and let it run.

I recently used this to scrape a website that had a list of VPS servers and their prices, and built myself a list of those to analyze, as an example.

Also, I have to say that I usually try to look for databases first, so much so that on a similar website I contacted them about a DB dump but got no response; their DB of server prices was private and only showed the lowest price.

So I picked the other website and did this. I also scraped every LowEndTalk headline ever, with links, partly for archival purposes and partly to parse the headlines with an LLM to find a list of VPS providers as well.
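
For illustration, here is roughly the same idea done outside the browser as a standalone script rather than a Tampermonkey userscript; the URL and table layout below are made up, and a real page would need its own selectors:

    # Pull a (hypothetical) VPS pricing page and dump its table rows to CSV.
    import csv
    import urllib.request
    from html.parser import HTMLParser

    class TableGrabber(HTMLParser):
        def __init__(self):
            super().__init__()
            self.rows, self.row, self.in_cell = [], [], False

        def handle_starttag(self, tag, attrs):
            if tag == "tr":
                self.row = []
            elif tag in ("td", "th"):
                self.in_cell = True
                self.row.append("")

        def handle_endtag(self, tag):
            if tag in ("td", "th"):
                self.in_cell = False
            elif tag == "tr" and self.row:
                self.rows.append(self.row)

        def handle_data(self, data):
            if self.in_cell:
                self.row[-1] += data.strip()

    html = urllib.request.urlopen("https://example.com/vps-prices").read().decode("utf-8")
    parser = TableGrabber()
    parser.feed(html)
    with open("vps_prices.csv", "w", newline="") as f:
        csv.writer(f).writerows(parser.rows)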

crazygringo · yesterday at 11:31 PM

Seriously, I can't help but think this has to be part of the answer.

Just something like /llms.txt which contains a list of .txt or .txt.gz files or something?

Because the problem is that every site is going to have its own data dump format, often in complex XML or SQL or something.

LLMs don't need any of that metadata, and many sites might not want to provide it because e.g. Yelp doesn't want competitors scraping its list of restaurants.

But if it's intentionally limited to only paragraph-style text, and stripped entirely of URLs, IDs, addresses, phone numbers, etc. -- so e.g. a Yelp page would literally just be the cuisine category and reviews of each restaurant, no name, no city, no identifier or anything -- then it gives LLMs what they need much faster, the site doesn't need to be hammered, and it's not in a format for competitors to easily copy your content.

At most, maybe add markup for <item></item> to represent pages, products, restaurants, whatever the "main noun" is, and recursive <subitem></subitem> to represent e.g. reviews on a restaurant, comments on a review, comments one level deeper on a comment, etc. Maybe a couple more like <title> and <author>, but otherwise just pure text. As simple as possible.
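
To make that concrete, a small sketch of emitting such a stripped-down page; the <item>/<subitem>/<title> tags are just the ones proposed above, and everything else is speculative:

    # Render a Yelp-style page as pure text: only category, review text, and
    # nested comments survive; names, URLs, IDs, and identifiers are dropped.
    def render(node: dict, depth: int = 0) -> str:
        tag = "item" if depth == 0 else "subitem"
        parts = [f"<{tag}>"]
        if "title" in node:
            parts.append(f"<title>{node['title']}</title>")
        if "text" in node:
            parts.append(node["text"])
        for child in node.get("children", []):
            parts.append(render(child, depth + 1))
        parts.append(f"</{tag}>")
        return "\n".join(parts)

    restaurant = {
        "title": "Italian",  # cuisine category only; no restaurant name or address
        "children": [
            {"text": "Great carbonara, slow service.",
             "children": [{"text": "Agreed, the wait was 40 minutes."}]},
        ],
    }
    print(render(restaurant))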

The biggest problem is that a lot of sites will create a "dummy" llms.txt without most of the content because they don't care, so the scrapers will scrape anyway...

themafia · yesterday at 11:26 PM

It's not a technical problem you are facing.

It's a monetary one: specifically, large pools of sequestered wealth making extremely bad long-term investments, all in a single dubious technical area.

Any new phenomenon driven by this process will have the same deleterious effects on the rest of computing. There is a market value in ruining your website that's too high for the fruit grabbers to ignore.

In time adaptations will arise. The apparently desired technical future is not inevitable.

fartfeatures · yesterday at 10:49 PM

Good idea, and perhaps a standard that means we only have to grab deltas, or some sort of ETag-based "give me all the database dumps after the one I have" (or only if something has changed).
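
A minimal sketch of the ETag half of that, using standard HTTP conditional requests; the dump index URL is a placeholder:

    # Re-fetch a dump index only when it has changed, via If-None-Match.
    import urllib.request
    from urllib.error import HTTPError

    def fetch_if_changed(url: str, last_etag: str | None) -> tuple[bytes | None, str | None]:
        req = urllib.request.Request(url)
        if last_etag:
            req.add_header("If-None-Match", last_etag)
        try:
            with urllib.request.urlopen(req, timeout=30) as resp:
                return resp.read(), resp.headers.get("ETag")
        except HTTPError as e:
            if e.code == 304:  # Not Modified: nothing new since our last ETag
                return None, last_etag
            raise

    body, etag = fetch_if_changed("https://example.com/dumps/index.json", last_etag=None)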

nikanj · yesterday at 10:42 PM

And then YC funds a startup that plans to leapfrog the competition by doing its own scrape instead of using the standard data everyone else has.
