Hacker News

irskep today at 6:47 AM

Working on mrjob was a big part of my first job out of college. Fun to see it get mentioned more than ten years later.

What some commenters don't realize about these bureaucratic, IO-heavy, expensive tools is that sometimes they're used to apply a familiar way of thinking, which has Business Benefits. Sometimes you don't know whether your task will take seconds, minutes, hours, days, or weeks on one fast machine with a well-thought-out program, but you really need it to take at most hours, and writing well-thought-out programs takes time you could spend on other stuff. If you know in advance that the program will scale, it's lower risk to just write it as a Hadoop job and be done with it. It also helps to have an "easy" pattern for processing Data That Feels Big Even If It Isn't That Big, Although Yelp's Data Actually Was Big. Such was the case with mrjob at Yelp in 2012. They got a lot of mileage out of it!

The other funny thing about mrjob is that it's a layer on Hadoop Streaming, the mechanism whereby the Java process actually running the Hadoop worker opens a subprocess to your Python script, which accepts input on stdin and writes output on stdout rather than working on values in memory. A high I/O price to pay for the convenience of writing Python!
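Concretely, the Streaming contract is just tab-separated lines over pipes. Here is a minimal word-count sketch of it (illustrative only, not mrjob's or Yelp's actual code): Hadoop pipes raw input lines to the mapper on stdin, sorts the mapper's output by key, and then pipes the sorted lines to the reducer on stdin.

    # mapper.py -- minimal Hadoop Streaming mapper sketch (not mrjob internals).
    # Hadoop feeds raw input lines on stdin; we emit tab-separated key/value pairs on stdout.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    # reducer.py -- minimal Hadoop Streaming reducer sketch.
    # Hadoop sorts the mapper output by key before piping it in, so lines with the
    # same key arrive consecutively and can be summed with a simple group-by.
    import sys
    from itertools import groupby

    def parse(stream):
        for line in stream:
            key, _, value = line.rstrip("\n").partition("\t")
            yield key, int(value)

    for key, group in groupby(parse(sys.stdin), key=lambda kv: kv[0]):
        print(f"{key}\t{sum(count for _, count in group)}")

Because a sort between the two scripts is essentially all that Hadoop adds to this picture, the pair can be smoke-tested locally with a shell pipeline:

    cat input.txt | python mapper.py | sort | python reducer.py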


Replies

willtemperley today at 7:22 AM

That's a good point. Hadoop may not be the most efficient approach, but when a deliverable is required, it's a known quantity and it really works.

I did some interesting work ten years ago, building pipelines to create global raster images of the entire OpenStreetMap road network [1]. I was able to process the planet in 25 minutes on a $50k cluster.

I think I had the opposite problem: Hadoop wasn't shiny enough and Java had a terrible reputation in academic tech circles. I wish I'd known about mrjob because that would have kept the Python maximalists happy.

I had lengthy arguments with people who wanted to use Spark, which simply did not have the chops for this; with Spark, even attempting to process OSM for a small country failed.

Another interesting side effect of the map-reduce paradigm came up when processing vector datasets. PostGIS took multiple days to process the million-vertex Norwegian national parks, but by splitting the planet into data-density-sensitive tiles (~2000 vertices each) I could process the whole planet in less than an hour.
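The tiling step can be sketched as a recursive quadtree split that keeps every tile under a vertex budget, so each map task gets a similarly sized piece of work. This is just an illustration of the idea under assumed data structures (a bbox tuple, a list of (x, y) vertices, and a hypothetical split_tile helper), not the actual osm-hadoop code:

    # Hypothetical sketch of density-sensitive tiling (not the actual osm-hadoop code):
    # recursively quarter a bounding box until each tile holds at most ~2000 vertices.
    MAX_VERTICES = 2000

    def split_tile(bbox, vertices, max_vertices=MAX_VERTICES, depth=0, max_depth=24):
        """bbox is (min_x, min_y, max_x, max_y); vertices is a list of (x, y) points."""
        if len(vertices) <= max_vertices or depth >= max_depth:
            return [(bbox, vertices)]

        min_x, min_y, max_x, max_y = bbox
        mid_x, mid_y = (min_x + max_x) / 2, (min_y + max_y) / 2
        quadrants = [
            (min_x, min_y, mid_x, mid_y),
            (mid_x, min_y, max_x, mid_y),
            (min_x, mid_y, mid_x, max_y),
            (mid_x, mid_y, max_x, max_y),
        ]
        tiles = []
        for qx0, qy0, qx1, qy1 in quadrants:
            # Half-open quadrants; a real implementation would also handle points
            # falling exactly on the outer edge of the root bounding box.
            inside = [(x, y) for x, y in vertices if qx0 <= x < qx1 and qy0 <= y < qy1]
            if inside:
                tiles.extend(split_tile((qx0, qy0, qx1, qy1), inside,
                                        max_vertices, depth + 1, max_depth))
        return tiles

Each resulting tile carries a bounded amount of geometry, which is what made the work spread evenly across the cluster instead of a few dense tiles dominating the runtime.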

Then Google Earth Engine came along and I had to either use that or change career. Somewhat ironically, GEE was built in Java.

[1] https://github.com/willtemperley/osm-hadoop