
willtemperley · today at 7:22 AM

That's a good point. Hadoop may not be the most efficient way, but when a deliverable is required, Hadoop is a known quantity and really works.

I did some interesting work ten years ago, building pipelines to create global raster images of the entire OpenStreetMap road network [1]. I was able to process the planet in 25 minutes on a $50k cluster.

I think I had the opposite problem: Hadoop wasn't shiny enough and Java had a terrible reputation in academic tech circles. I wish I'd known about mrjob because that would have kept the Python maximalists happy.

I had lengthy arguments with people who wanted to use Spark, which simply did not have the chops for this: attempts to process the OSM data for even a small country with Spark failed.

Another interesting side effect of the map-reduce paradigm came up when processing vector datasets. PostGIS took multiple days to process the million-vertex Norwegian national parks; by splitting the planet into data-density-sensitive tiles (~2,000 vertices each), I could process the whole planet in less than an hour.
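Roughly, the idea was something like this (a simplified, self-contained sketch rather than the real pipeline code; the class names, the exact 2,000-vertex threshold, and the synthetic data here are just for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of density-sensitive tiling (not the actual osm-hadoop code):
// recursively split a bounding box into quadrants until each tile holds at most
// ~2000 vertices, so dense regions end up with many small tiles and sparse
// regions with a few large ones. Each tile is then an independent unit of work.
public class DensityTiler {

    record Vertex(double lon, double lat) {}

    record Tile(double minLon, double minLat, double maxLon, double maxLat,
                List<Vertex> vertices) {}

    static final int MAX_VERTICES_PER_TILE = 2000; // illustrative threshold
    static final double MIN_TILE_SPAN = 1e-7;      // guard against endless splitting

    static List<Tile> split(Tile t) {
        List<Tile> out = new ArrayList<>();
        if (t.vertices().size() <= MAX_VERTICES_PER_TILE
                || t.maxLon() - t.minLon() < MIN_TILE_SPAN) {
            out.add(t);
            return out;
        }
        double midLon = (t.minLon() + t.maxLon()) / 2;
        double midLat = (t.minLat() + t.maxLat()) / 2;
        // Bucket vertices into the four quadrants, then recurse into each.
        List<List<Vertex>> quads = List.of(new ArrayList<>(), new ArrayList<>(),
                                           new ArrayList<>(), new ArrayList<>());
        for (Vertex v : t.vertices()) {
            int q = (v.lon() >= midLon ? 1 : 0) + (v.lat() >= midLat ? 2 : 0);
            quads.get(q).add(v);
        }
        out.addAll(split(new Tile(t.minLon(), t.minLat(), midLon, midLat, quads.get(0))));
        out.addAll(split(new Tile(midLon, t.minLat(), t.maxLon(), midLat, quads.get(1))));
        out.addAll(split(new Tile(t.minLon(), midLat, midLon, t.maxLat(), quads.get(2))));
        out.addAll(split(new Tile(midLon, midLat, t.maxLon(), t.maxLat(), quads.get(3))));
        return out;
    }

    public static void main(String[] args) {
        // Synthetic data: one dense cluster plus sparse global noise.
        List<Vertex> vertices = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            vertices.add(new Vertex(10.7 + Math.random() * 0.1, 59.9 + Math.random() * 0.1));
        }
        for (int i = 0; i < 500; i++) {
            vertices.add(new Vertex(-180 + Math.random() * 360, -90 + Math.random() * 180));
        }
        List<Tile> tiles = split(new Tile(-180, -90, 180, 90, vertices));
        System.out.println("tiles: " + tiles.size()); // many small tiles around the cluster
    }
}
```

With tiles capped like this, each one becomes a key in the map-reduce job, so no single worker ever gets stuck with a monster geometry the way PostGIS did with the national parks.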

Then Google Earth Engine came along and I had to either use that or change career. Somewhat ironically, GEE was built in Java.

[1] https://github.com/willtemperley/osm-hadoop