From my experience (2018): run the LLM's output through beam search over different canonicalization choices for certain parts of the text. Even 3-gram models (yeah, 2018) did better this way.
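
Roughly, a minimal sketch of what I mean, with everything made up for illustration: `TrigramScorer` stands in for whatever LM you rescore with (in 2018 that was a real 3-gram model, e.g. KenLM), and `variants` holds the candidate canonical forms per position. Beam search then keeps the top-k partial sequences by cumulative log-probability.

```python
import math

class TrigramScorer:
    """Toy trigram scorer; a real setup would use a trained 3-gram LM."""
    def __init__(self, counts):
        self.counts = counts  # {(w1, w2): {w3: count}}

    def logprob(self, w1, w2, w3):
        ctx = self.counts.get((w1, w2), {})
        total = sum(ctx.values())
        # Add-one smoothing over an assumed vocab size of 1000.
        return math.log((ctx.get(w3, 0) + 1) / (total + 1000))

def beam_search(variants, scorer, beam_width=4):
    """variants[i] is the list of canonicalization candidates for token i
    (a single-element list where the token is unambiguous). Returns the
    highest-scoring full sequence under the scorer."""
    beams = [(0.0, ("<s>", "<s>"))]  # (cumulative logprob, tokens so far)
    for candidates in variants:
        extended = []
        for score, seq in beams:
            for cand in candidates:
                s = score + scorer.logprob(seq[-2], seq[-1], cand)
                extended.append((s, seq + (cand,)))
        extended.sort(key=lambda b: b[0], reverse=True)
        beams = extended[:beam_width]  # prune to the beam width
    best_score, best_seq = beams[0]
    return list(best_seq[2:]), best_score  # strip the <s> sentinels

# Toy usage: choose between surface forms "5" and "five".
counts = {("<s>", "<s>"): {"it": 5},
          ("<s>", "it"): {"costs": 5},
          ("it", "costs"): {"five": 4, "5": 1},
          ("costs", "five"): {"dollars": 4},
          ("costs", "5"): {"dollars": 1}}
variants = [["it"], ["costs"], ["5", "five"], ["dollars"]]
print(beam_search(variants, TrigramScorer(counts)))
# -> (['it', 'costs', 'five', 'dollars'], <logprob>)
```

The point is that even a weak scorer consistently prefers the canonical form that the surrounding context supports, which is why this helped despite the LM itself being so crude.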