Hacker News

sanqui today at 5:56 PM

Cool concept. I would like to see this combined with mitmproxy for archive-grade fidelity. You could save exactly the data served and, at the same time, the page as rendered by a modern (contemporary) browser after all JS has run. This combination would be my perfect replacement for the WARC format.


Replies

tamnd today at 6:00 PM

I'm working with WARC too, using the format from Common Crawl!

By converting it to Markdown, we save a lot of space, but it is for a different purpose and a different project: https://github.com/tamnd/ccrawl-cli

Dhavidh today at 6:01 PM

Sounds interesting.