Hacker News

blibble · yesterday at 12:31 AM (3 replies)

how is that possible?

tar.gz files don't have a central directory (unlike zip), and they are compressed as a single stream (which is almost always non-seekable)


Replies

Dwedit · yesterday at 1:01 AM

.tar itself gives you enough information to seek forward past each file, though every file must be visited.
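For the plain, uncompressed .tar case, a minimal Python sketch of that forward-seeking walk (assuming basic ustar headers; GNU long names, pax extended headers, sparse files, etc. are ignored, and the function name is just illustrative):

    def iter_tar_members(path):
        # Read each 512-byte header, then seek past the payload,
        # which is padded to a 512-byte boundary.
        with open(path, "rb") as f:
            while True:
                header = f.read(512)
                if len(header) < 512 or header == b"\0" * 512:
                    break  # end-of-archive marker (or truncated file)
                name = header[0:100].rstrip(b"\0").decode("utf-8", "replace")
                size = int(header[124:136].rstrip(b"\0 ") or b"0", 8)  # octal size field
                yield name, size
                f.seek((size + 511) // 512 * 512, 1)  # skip the file's data blocks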

.gz does not give you enough information to seek randomly within the big compressed stream, so you cannot skip past files inside the .tar archive.

But if you consume the entire .gz stream once, keeping periodic checkpoints of the past sliding window (about 64KB) every 1MB or so, you can then get random access with 1MB granularity. You still have to consume the entire stream once to build the lookup table, though.
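A rough sketch of that checkpointing idea in Python, using zlib.decompressobj.copy() to snapshot the decompressor (the snapshot carries the sliding window); the chunk size, checkpoint spacing, and helper names are illustrative, not a reference implementation:

    import zlib

    def build_gzip_index(path, every=1 << 20):
        # One full pass over the .gz stream, snapshotting the decompressor
        # (which carries the sliding window) about every `every` decompressed
        # bytes.  Entries: (uncompressed_pos, compressed_pos, snapshot).
        d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # 16 = gzip wrapper
        index = [(0, 0, d.copy())]
        out_pos, next_cp = 0, every
        with open(path, "rb") as f:
            while chunk := f.read(64 * 1024):
                out_pos += len(d.decompress(chunk))
                if out_pos >= next_cp:
                    index.append((out_pos, f.tell(), d.copy()))
                    next_cp = out_pos + every
        return index

    def read_at(path, index, target, length):
        # Resume from the nearest checkpoint at or before `target`, then
        # decompress and discard until the requested range is covered.
        out_pos, comp_pos, snap = max(
            (e for e in index if e[0] <= target), key=lambda e: e[0]
        )
        d, out = snap.copy(), b""
        with open(path, "rb") as f:
            f.seek(comp_pos)
            while len(out) < length and (chunk := f.read(64 * 1024)):
                data = d.decompress(chunk)
                if out_pos + len(data) > target:
                    out += data[max(0, target - out_pos):]
                out_pos += len(data)
        return out[:length]

Each snapshot holds the decompressor's ~32-64KB window, which is where the "about 64KB every 1MB" memory cost above comes from.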

nine_k · yesterday at 12:42 AM

Decompress, scan as you go, discard. Having to read a few hundred GB and scan a terabyte is a nuisance. Not having to write a terabyte is priceless.
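In Python terms that is roughly tarfile's streaming mode; a sketch, with the filename and filter as placeholders:

    import tarfile

    # "r|gz" treats the input as a non-seekable stream: each member is
    # decompressed, inspected, and discarded as the stream advances, so the
    # expanded terabyte never touches the disk.
    with tarfile.open("huge-archive.tar.gz", mode="r|gz") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(".json"):
                data = tar.extractfile(member).read()  # just this member's bytes
                print(member.name, len(data))
            # members you skip simply stream past without being materialized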

wslh · yesterday at 12:44 AM

I am guessing the gzip is retrieved as a stream, and the tar is then read from that stream in memory?
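One way that guess could look in Python (placeholder URL; the HTTP response is already file-like, so tarfile can gunzip and parse it on the fly, holding only the current member in memory):

    import tarfile
    import urllib.request

    url = "https://example.com/huge-archive.tar.gz"
    with urllib.request.urlopen(url) as resp:
        with tarfile.open(fileobj=resp, mode="r|gz") as tar:  # stream mode, no seeking
            for member in tar:
                print(member.name, member.size)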