crazygringo • yesterday at 11:31 PM

Seriously, I can't help but think this has to be part of the answer.

Just something like /llms.txt which contains a list of .txt or .txt.gz files or something?
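To make that concrete, here's a rough sketch of how a crawler might read such an index. The one-path-per-line layout, the file names, and the `parse_index` helper are all my own assumptions for illustration, not an existing spec.

```python
def parse_index(index_text: str) -> list[str]:
    """Return the .txt / .txt.gz entries listed in a hypothetical /llms.txt index."""
    entries = []
    for line in index_text.splitlines():
        path = line.strip()
        # Keep only plain-text dumps; blank lines and other formats are skipped.
        if path.endswith((".txt", ".txt.gz")):
            entries.append(path)
    return entries


# Made-up index contents, one relative path per line (assumed layout).
sample_index = """
reviews-0001.txt.gz
reviews-0002.txt.gz
about.txt
dump.sql
"""
print(parse_index(sample_index))
```

The crawler would then fetch just those few files instead of walking every page on the site.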

Because the problem is that every site is going to have its own data dump format, often in complex XML or SQL or something.

LLMs don't need any of that metadata, and many sites might not want to provide it because e.g. Yelp doesn't want competitors scraping its list of restaurants.

But if it's intentionally limited to only paragraph-style text, and stripped entirely of URLs, IDs, addresses, phone numbers, etc. -- so e.g. a Yelp page would literally just be the cuisine category and reviews of each restaurant, no name, no city, no identifier or anything -- then it gives LLMs what they need much faster, the site doesn't need to be hammered, and it's not in a format for competitors to easily copy your content.
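A very rough sketch of the stripping step, just to show the idea. The two regexes and the `strip_identifiers` name are my assumptions; they only catch URLs and one loose phone-number shape. Names, cities, and street addresses can't be removed reliably this way, so a real site would more likely leave those fields out at export time.

```python
import re

# Assumed patterns for illustration only; real data would need many more cases.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
PHONE_RE = re.compile(r"\(?\b\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}\b")


def strip_identifiers(text: str) -> str:
    """Remove URLs and simple phone numbers, then collapse leftover whitespace."""
    text = URL_RE.sub("", text)
    text = PHONE_RE.sub("", text)
    return re.sub(r"\s{2,}", " ", text).strip()


review = "Great tacos, call 555-123-4567 or see https://example.com/menu for details."
cleaned = strip_identifiers(review)
print(cleaned)
```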

At most, maybe add markup for <item></item> to represent pages, products, restaurants, whatever the "main noun" is, and recursive <subitem></subitem> to represent e.g. reviews on a restaurant, comments on a review, comments one level deeper on a comment, etc. Maybe a couple more like <title> and <author>, but otherwise just pure text. As simple as possible.
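Something like the following could emit that markup from nested content. The `Node` class and `render` function are hypothetical names of mine; only the `<item>`, `<subitem>`, `<title>` and `<author>` tags come from the idea above.

```python
from dataclasses import dataclass, field
from html import escape


@dataclass
class Node:
    """One 'main noun' (restaurant, product, page) or a nested reply under it."""
    text: str
    title: str = ""
    author: str = ""
    children: list["Node"] = field(default_factory=list)


def render(node: Node, depth: int = 0) -> str:
    """Render a node as <item> at the top level and <subitem> at any deeper level."""
    tag = "item" if depth == 0 else "subitem"
    pad = "  " * depth
    lines = [f"{pad}<{tag}>"]
    if node.title:
        lines.append(f"{pad}  <title>{escape(node.title)}</title>")
    if node.author:
        lines.append(f"{pad}  <author>{escape(node.author)}</author>")
    # Escape the body so stray angle brackets can't be mistaken for markup.
    lines.append(f"{pad}  {escape(node.text)}")
    for child in node.children:
        lines.append(render(child, depth + 1))
    lines.append(f"{pad}</{tag}>")
    return "\n".join(lines)


# A cuisine category with one review and one comment on that review.
restaurant = Node(
    text="Thai",
    children=[Node(text="Great pad see ew.", children=[Node(text="Agreed.")])],
)
output = render(restaurant)
print(output)
```

The recursion means comments on comments nest as deep as needed without adding any new tags.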

The biggest problem is that a lot of sites will create a "dummy" llms.txt without most of the content because they don't care, so the scrapers will scrape anyways...
