I'm surprised that Cloudflare hasn't started hosting a pre-scraped version of websites that use Cloudflare's proxy - something like https://www.example.com/cdn-cgi/cached-contents.json. They already have the website content in their cache, so why not cut out the middleman of scraping services and APIs like this and publish it themselves?
Obviously there are good reasons NOT to, but I am surprised they haven't started offering it (as an "on-by-default" option, naturally) yet.
> I'm surprised that Cloudflare hasn't started hosting a pre-scraped version of websites that use Cloudflare's proxy
It's entirely possible that they're already doing this under the hood for cases where they can clearly identify that the content they have cached is public.
Not the same thing, but they have something close (it's not on-by-default, yet) [1]:
> Cloudflare's network now supports real-time content conversion at the source, for enabled zones using content negotiation headers. Now when AI systems request pages from any website that uses Cloudflare and has Markdown for Agents enabled, they can express the preference for text/markdown in the request. Our network will automatically and efficiently convert the HTML to markdown, when possible, on the fly.
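The content-negotiation flow in the quote can be sketched roughly like this. The function names and the trivial HTML-to-markdown converter are made up for illustration; they are not Cloudflare's actual internals. The point is just that the client expresses a preference via the `Accept` header and the edge decides which representation of the cached page to serve:

```python
# Hypothetical sketch of serving cached content via content negotiation:
# if the client's Accept header asks for text/markdown, convert the cached
# HTML on the fly; otherwise serve the HTML as-is. Illustrative names only.

def html_to_markdown(html: str) -> str:
    # Stand-in for a real converter; a real one would parse the DOM properly.
    return html.replace("<h1>", "# ").replace("</h1>", "")

def negotiate(accept_header: str, cached_html: str) -> tuple[str, str]:
    """Return (content_type, body) based on the client's Accept header."""
    wants_markdown = any(
        part.strip().split(";")[0] == "text/markdown"
        for part in accept_header.split(",")
    )
    if wants_markdown:
        return "text/markdown", html_to_markdown(cached_html)
    return "text/html", cached_html
```

An AI crawler would send something like `Accept: text/markdown, text/html;q=0.5` and get the markdown form back, while an ordinary browser keeps getting HTML from the same cached object.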
That would probably work for simple sites, but you'd still need a dedicated scraping service with a real browser to render more complex sites (e.g. SPAs).
It’s a bit more complicated than that: this is their Browser Rendering product, which runs a real browser that loads the page and executes JavaScript. That's considerably more involved than a simple curl scrape.
Well, converting into the JSON representation is going to take CPU, and then you have to store the result, essentially doubling your cache footprint.
Doing it on demand still uses their cached version, so it saves a trip to the origin without doubling the cache size. They can still cache the converted result when the same site is scraped multiple times, but this avoids having to cache things that are never going to be requested.
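The on-demand scheme described here can be sketched as a toy model (class and field names are mine, purely for illustration): one cached copy of the origin HTML, with conversion deferred until a scraper actually asks, and the converted form memoized only at that point:

```python
# Hedged sketch of on-demand conversion at the edge: the cache keeps one
# copy of the origin HTML, converts it only when requested, and memoizes
# the converted form. Pages nobody asks for in converted form cost nothing
# extra. Illustrative only, not Cloudflare's actual design.

class EdgeCache:
    def __init__(self, fetch_origin, convert):
        self.fetch_origin = fetch_origin        # e.g. an HTTP GET to origin
        self.convert = convert                  # e.g. HTML -> markdown/JSON
        self.html_cache: dict[str, str] = {}
        self.converted_cache: dict[str, str] = {}
        self.origin_hits = 0

    def get_html(self, url: str) -> str:
        if url not in self.html_cache:
            self.origin_hits += 1
            self.html_cache[url] = self.fetch_origin(url)
        return self.html_cache[url]

    def get_converted(self, url: str) -> str:
        # Conversion reuses the cached HTML, so there is no extra origin
        # trip, and the result is stored only once someone has asked for it.
        if url not in self.converted_cache:
            self.converted_cache[url] = self.convert(self.get_html(url))
        return self.converted_cache[url]
```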
Cache footprint management is a huge factor in the cost and performance of a CDN: you want to get the most out of your storage and serve as many pages from cache as possible.
I know from my experience working for a CDN that we did all sorts of things to try to maximize our cache hit rate. In fact, one of the easiest and most effective techniques for increasing hit rate is to do the OPPOSITE of what you are suggesting: instead of pre-caching content, you do ‘second-hit caching’, where you only store a copy in the cache when a piece of content is requested a second time.

The idea is that a lot of content is requested once by one user and then never again, so it is a waste to store it in the cache. If you wait until a second request before caching, those single-use pages never enter your cache, and you barely hurt overall performance, because the content that is most useful to cache is requested a lot, and you only pay one extra origin request for it.
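The second-hit admission policy described above is simple enough to sketch in a few lines (the helper below is a made-up toy, not any particular CDN's code): track keys that have been seen once, and only admit an object to the cache on its second request.

```python
# Toy illustration of 'second-hit caching': admit an object to the cache
# only on its second request, so one-hit-wonders never consume cache space.
# Purely illustrative; real CDNs use bloom filters or similar structures
# to track first hits cheaply.

def make_second_hit_cache(fetch):
    seen_once: set[str] = set()    # keys requested exactly once so far
    cache: dict[str, str] = {}
    stats = {"origin": 0, "hits": 0}

    def get(key: str) -> str:
        if key in cache:
            stats["hits"] += 1
            return cache[key]
        stats["origin"] += 1       # cache miss: go to origin
        body = fetch(key)
        if key in seen_once:
            cache[key] = body      # second request: now worth caching
        else:
            seen_once.add(key)     # first request: don't pollute the cache
        return body

    return get, cache, stats
```

The cost of the policy is exactly one extra origin request per popular object, which is negligible next to the cache space reclaimed from content that was only ever fetched once.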