
mekenyesterday at 5:13 PM1 replyview on HN

I’m curious about the memory usage of the cat | grep part of the pipeline. I think the author is processing many small files?

In which case it makes the analysis a bit less practical, since the main use case I have for fancy data processing tools is when I can’t load a whole big file into memory.


Replies

dapperdrake · yesterday at 7:58 PM

Memory footprint is tiny:

Unix shell pipelines are task-parallel. Every tool is spun up as its own Unix process — think "program" (fork-exec) — with standard input and standard output (stdin, stdout) hooked up to pipes. A pipe is like a temporary file managed by the kernel (hand-wave), with a small fixed buffer: on the order of a few KB up to 64 KB, depending on the kernel. grep does a blocking read on stdin; cat writes to stdout. Both sit on a kernel I/O boundary, so the kernel can context-switch a process away while it waits for I/O. The data streams through that fixed-size buffer, which is why memory use doesn't grow with input size.
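
A rough C sketch of what the shell does under the hood to wire up something like `cat file.txt | grep pattern` (the file name and pattern are made-up placeholders, and error handling is minimal):

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void) {
        int fds[2];                      /* fds[0] = read end, fds[1] = write end */
        if (pipe(fds) == -1) { perror("pipe"); exit(1); }

        if (fork() == 0) {               /* child 1: cat */
            dup2(fds[1], STDOUT_FILENO); /* cat's stdout -> pipe write end */
            close(fds[0]); close(fds[1]);
            execlp("cat", "cat", "file.txt", (char *)NULL);
            perror("execlp"); _exit(127);
        }

        if (fork() == 0) {               /* child 2: grep */
            dup2(fds[0], STDIN_FILENO);  /* grep's stdin <- pipe read end */
            close(fds[0]); close(fds[1]);
            execlp("grep", "grep", "pattern", (char *)NULL);
            perror("execlp"); _exit(127);
        }

        close(fds[0]); close(fds[1]);    /* parent keeps no pipe ends open */
        while (wait(NULL) > 0) ;         /* reap both children */
        return 0;
    }

Both children run concurrently from the moment they're forked; the only shared state is the kernel's pipe buffer between them.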

In the past this meant time-slicing on a single core. Now, with multiple cores and hardware threads, the stages actually run concurrently.

This is similar to an old-school take on multithreading, except that processes don't share virtual address spaces in the CPU's memory management unit (MMU).
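
A tiny demo of that isolation (my own sketch, not from the thread): after fork(), the child writes to its own copy of a global, and the parent never sees the change:

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int counter = 0;             /* each process gets its own copy after fork() */

    int main(void) {
        pid_t pid = fork();
        if (pid == 0) {          /* child: mutates its private copy */
            counter = 42;
            printf("child sees counter = %d\n", counter);
            _exit(0);
        }
        waitpid(pid, NULL, 0);   /* parent's copy is untouched */
        printf("parent sees counter = %d\n", counter);  /* prints 0 */
        return 0;
    }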

Further details: look up Doug McIlroy's pipeline design.