I have this idea that a tiny LM would be good at canonicalizing entered real estate addresses. We currently buy a data set and software from Experian, but it feels like something an LM might be very good at. There are lots of weirdnesses in address entry that regexes have a hard time with. We know the bulk of addresses a user might be entering, unless it's a totally new property, so we should be able to train it on that.
From my experience (2018): run the LM's output through a beam search over the different canonicalization choices for certain parts of the text. Even 3-gram models (yes, 2018) fare better this way.
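A minimal sketch of that idea: beam search over per-token canonicalization candidates, scored here by a toy bigram model. The candidate table, bigram scores, and tokens below are all made-up illustrations, not a real address dataset; in practice the scores would come from an n-gram model trained on your known-good addresses.

```python
# Hypothetical sketch: canonicalize an address by beam search over
# per-token candidate expansions, scored by a toy bigram model.
# All candidates and log-probs here are illustrative placeholders.

# Each raw input token maps to several candidate canonical forms.
CANDIDATES = {
    "123": ["123"],
    "n": ["N", "North"],
    "main": ["Main"],
    "st": ["St", "Street", "Saint"],
}

# Toy bigram log-probabilities, as if learned from known addresses;
# unseen pairs get a flat penalty.
BIGRAM_LOGP = {
    ("<s>", "123"): -0.1,
    ("123", "N"): -0.3,
    ("123", "North"): -1.2,
    ("N", "Main"): -0.2,
    ("North", "Main"): -0.4,
    ("Main", "St"): -0.2,
    ("Main", "Street"): -0.9,
    ("Main", "Saint"): -5.0,
}
UNSEEN = -8.0

def beam_search(tokens, beam_width=3):
    """Return the highest-scoring canonicalization under the bigram model."""
    beams = [(0.0, ["<s>"])]  # (cumulative log-prob, token history)
    for tok in tokens:
        next_beams = []
        for score, hist in beams:
            for cand in CANDIDATES.get(tok, [tok]):
                lp = BIGRAM_LOGP.get((hist[-1], cand), UNSEEN)
                next_beams.append((score + lp, hist + [cand]))
        # Keep only the top beam_width hypotheses.
        next_beams.sort(key=lambda b: b[0], reverse=True)
        beams = next_beams[:beam_width]
    return beams[0][1][1:]  # best hypothesis, minus the <s> marker

print(beam_search(["123", "n", "main", "st"]))
# → ['123', 'N', 'Main', 'St']
```

With the toy scores above, "st" resolves to "St" rather than "Saint" because the bigram context ("Main", ...) strongly prefers it; that context-sensitivity is exactly what regexes struggle to express.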