Hacker News

linsomniac 01/22/2025

I have this idea that a tiny LM would be good at canonicalizing entered real estate addresses. We currently buy a data set and software from Experian, but it feels like something an LM might be very good at. There are lots of weirdnesses in address entry that regexes have a hard time with. We know the bulk of addresses a user might be entering, unless it's a totally new property, so we should be able to train it on that.


Replies

thesz 01/22/2025

From my experience (2018): run the LLM output through a beam search over the different ways of canonicalizing certain parts of the text. Even 3-gram models (yeah, 2018) fare better this way.
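A minimal sketch of one way to read the suggestion above: each input token gets a few candidate canonical forms, and a beam search keeps the highest-scoring sequences under a trigram model trained on known addresses. This is not thesz's actual setup. The `VARIANTS` table, the toy `CORPUS`, the function names, and the add-one smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical table of canonicalization choices per raw token.
VARIANTS = {
    "st": ["STREET", "SAINT"],
    "st.": ["STREET", "SAINT"],
    "dr": ["DRIVE", "DOCTOR"],
    "ave": ["AVENUE"],
    "n": ["NORTH"],
}

# Toy stand-in for the set of already-known canonical addresses.
CORPUS = [
    "123 NORTH MAIN STREET",
    "45 SAINT JOHN DRIVE",
    "9 SAINT PAUL AVENUE",
    "77 OAK STREET",
]


def train_trigrams(corpus):
    """Count trigrams and their two-token contexts over the corpus."""
    tri, ctx, vocab = Counter(), Counter(), set()
    for line in corpus:
        toks = ["<s>", "<s>"] + line.split() + ["</s>"]
        vocab.update(toks)
        for i in range(2, len(toks)):
            tri[tuple(toks[i - 2:i + 1])] += 1
            ctx[tuple(toks[i - 2:i])] += 1
    return tri, ctx, len(vocab)


def logp(model, a, b, c):
    """Add-one smoothed log P(c | a, b)."""
    tri, ctx, vocab_size = model
    return math.log((tri[(a, b, c)] + 1) / (ctx[(a, b)] + vocab_size))


def canonicalize(text, model, beam_width=3):
    """Beam search over per-token canonical forms, scored by the trigram model."""
    beams = [(0.0, ["<s>", "<s>"])]
    for raw in text.lower().split():
        # Tokens without a known variant are simply upper-cased.
        options = VARIANTS.get(raw, [raw.upper()])
        candidates = [
            (score + logp(model, hist[-2], hist[-1], opt), hist + [opt])
            for score, hist in beams
            for opt in options
        ]
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    # Include the end-of-address transition when picking the winner.
    best = max(
        beams,
        key=lambda b: b[0] + logp(model, b[1][-2], b[1][-1], "</s>"),
    )
    return " ".join(best[1][2:])


MODEL = train_trigrams(CORPUS)
print(canonicalize("45 st john dr", MODEL))
print(canonicalize("77 oak st", MODEL))
```

The point of the beam is that "st" is not resolved in isolation: both STREET and SAINT stay alive until the surrounding tokens make one sequence more probable. In a real system the candidate forms might come from an LM rather than a lookup table, and the scoring model would be trained on the full set of known addresses.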