Working around the performance concerns by only running this algorithm after a faster algorithm found an error is an interesting idea. I think it could work: in a real compiler, anything that produces an error would be marked as "poisoned" and ignored in further analysis and IR lowering. Thus the two algorithms disagreeing on a type wouldn't cause a noticeable difference in the end result.
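For what it's worth, the "poisoned" idea can be sketched roughly like this (a minimal toy unifier with hypothetical `Ty`/`Error` names, not code from the post): once a mismatch has been reported, the offending node gets an `Error` type that unifies with anything, so later passes, or a second algorithm that disagrees on that node, can't produce new diagnostics.

```python
class Ty: pass
class Int(Ty): pass
class Bool(Ty): pass
class Error(Ty): pass  # "poisoned" type, produced once an error is reported

def unify(a, b, errors):
    # Error absorbs everything: a poisoned node never triggers
    # cascading or duplicate diagnostics downstream.
    if isinstance(a, Error) or isinstance(b, Error):
        return Error()
    if type(a) is type(b):
        return a
    errors.append(f"mismatch: {type(a).__name__} vs {type(b).__name__}")
    return Error()

errors = []
unify(Int(), Bool(), errors)       # real mismatch: one error reported
t = unify(Error(), Int(), errors)  # poisoned operand: silently yields Error
```

So even if the slow algorithm later infers a different type for a poisoned node, the end result is unchanged, since that node was already excluded from further analysis.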
Comparative benchmarks are tricky. I considered making a simpler single-pass inference branch built around the same data structures, to get a more one-to-one comparison, but this algorithm is different enough that porting it wasn't very straightforward. I'm currently integrating it into my real compiler, so from there it'll be easier to estimate the real-world performance impact of this system.
Typo has been fixed, thanks!
Couldn’t the comparison (and also the fast path, I guess) just put all the inference passes into a single big pass, which would avoid the quadratic number of re-applications of passes? Unification looks otherwise the same?