juliansimioni • today at 2:50 PM • 0 replies • view on HN

Geocoding is a really fun (and sometimes frustrating) problem I've been lucky enough to have been working to solve for over 10 years now.

I joined Mapzen in 2015, which ostensibly was part of a Samsung startup accelerator, but looking back, it's more descriptive to say it was an open-source mapping software R&D lab. We built what are now foundational open-source geospatial tools like the Pelias geocoder (my team) and the Valhalla routing engine. Many other projects, like the Tangram map renderer, are still really useful post-Mapzen.

A reasonable, but very wrong, first assumption about geocoding is that with a database of places you're almost there. Inputs are often structured, like some addresses, but the structure has so many edge cases that you effectively have to treat it as unstructured too. The data has the same problem, and is sometimes worse, since a lot of data sources are quite bad.

Over the last 10 years we've explored most strategies for full text search, and no ONE solution knocks it out of the park. We started with really simple "bag of words" search, just looking at token matches. That, fairly predictably, was mostly a mess. With billions of places in the world recorded in open datasets, there's going to be something irrelevant somewhere that matches, and it will probably drown out whatever you're looking for.
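To make the bag-of-words failure mode concrete, here is a minimal Python sketch. It is purely illustrative and not how Pelias works: the `PLACES` list, `tokenize`, and `bag_of_words_search` are hypothetical names invented for this example, and the scoring is a plain count of shared tokens.

```python
# Hypothetical toy dataset; real open datasets hold vastly more records.
PLACES = [
    "Paris, France",
    "Paris, Texas, USA",
    "Paris Baguette, 6 W 32nd St, New York, NY, USA",
    "New York Pizza, Rue de Paris, Lille, France",
    "New York, NY, USA",
]


def tokenize(text):
    """Lowercase, split on whitespace, and strip trailing punctuation."""
    tokens = (t.strip(",.").lower() for t in text.split())
    return [t for t in tokens if t]


def bag_of_words_search(query, places=PLACES):
    """Score each place by how many distinct query tokens it contains."""
    query_tokens = set(tokenize(query))
    scored = []
    for place in places:
        overlap = len(query_tokens & set(tokenize(place)))
        if overlap:
            scored.append((overlap, place))
    # Stable sort: ties stay in dataset order, which is exactly the problem.
    scored.sort(key=lambda pair: -pair[0])
    return scored


for score, place in bag_of_words_search("New York"):
    print(score, place)
```

With this toy data, a bakery in Manhattan, a pizza place in Lille, and the city itself all match both tokens of "New York" equally, so token overlap alone cannot tell which one was meant.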

Parsing inputs for structure is an enticing option too, but for any pattern you can come up with, there's either a search query or some data that will defeat that structure (try me).

The previous generation of ML and a lot of sweat by Al Barrentine produced libpostal (https://github.com/openvenues/libpostal), which is a really great full-text address parser. It's fast and accurate, but it doesn't handle partial inputs (like for autocomplete search), doesn't offer multiple parsing interpretations, and still isn't always right.

What we've settled on for now for autocomplete is a pretty sophisticated but manually configured parser, which can return multiple interpretations and is also quick to fall back to "I don't know" (how can you really parse meaning out of a short input like "123": is it the start of a postalcode? a housenumber? the name of a restaurant?). It's also runtime-bounded to make sure it always returns in a few milliseconds or less, since autocomplete is extremely latency sensitive. Then we can either search with the benefit of more structure, or in the worst case fall back to unstructured search, with a LOT of custom logic, weights, filters, and other tricks as well.
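As a rough illustration of that idea (several interpretations, an always-available "I don't know", and a hard time budget), here is a small Python sketch. It is an assumption-laden toy, not the actual parser: the rule set, the labels, the `parse_partial` name, and the 5 ms default are all made up for this example.

```python
import re
import time

# Hypothetical rules; a real parser would have far more, tuned per country.
RULES = [
    ("housenumber", re.compile(r"\d{1,6}[a-z]?", re.IGNORECASE)),
    ("postalcode", re.compile(r"\d{1,5}")),
]
STREET_ADDRESS = re.compile(r"(?P<housenumber>\d{1,6})\s+(?P<street>\D.*)")


def parse_partial(text, budget_ms=5.0):
    """Return every plausible interpretation found within the time budget."""
    deadline = time.monotonic() + budget_ms / 1000.0
    text = text.strip()
    # The unstructured reading is always present: the "I don't know" fallback.
    interpretations = [{"unstructured": text}]
    for label, pattern in RULES:
        if time.monotonic() > deadline:
            return interpretations  # out of time, return what we have
        if pattern.fullmatch(text):
            interpretations.append({label: text})
    if time.monotonic() <= deadline:
        match = STREET_ADDRESS.fullmatch(text)
        if match:
            interpretations.append(match.groupdict())
    return interpretations


print(parse_partial("123"))
print(parse_partial("123 Main St"))
print(parse_partial("pizza"))
```

In this sketch "123" comes back with three readings (unstructured, housenumber, postalcode), "123 Main St" gains a housenumber plus street reading, and "pizza" gets only the unstructured fallback. The caller can then decide how to weight or combine them when searching.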

A big question right now is whether next-generation LLMs will completely solve geocoding, and honestly I'm not sure. Even older ML is really eager to over-generalize rules, and while newer LLMs do that less, they also still hallucinate, which is pretty much a dealbreaker for geocoding. At least for now LLMs are also orders of magnitude too slow, and would never be cost effective at current prices. Personally I think us geocodeurs will be in business a while longer.

There's so much more about geocoding I love talking about. It's truly a niche filled with niches all the way down. This is the sort of stuff we are always iterating on with our business Geocode Earth (https://geocode.earth/). We think we have a really compelling combination of functionality, quality, liberal usage license (hi simonw!), respect for privacy, and open-source commitment. We always love hearing from people interested in anything geocoding, so say hello :)