how is it hard, do a A to D, add a filter, do compute, then do D to A.
The article covers that.
In short, audio and visual perception do not map perfectly. Humans don't have a linear perception of either so a perfect A to D then D to A conversion yields unsatisfying results.
Not hard to do, hard to do well. Hiding all complexity with a hand wavey “do compute” doesn’t make that bit easy