It looks like, overall, this design makes the parser about twice as fast as a straightforward one that builds a pointer-based tree AST.
That's not nothing. But a parser is rarely the most time-intensive part of a production compiler. And the parser does get iterated on a lot in languages that are evolving and adding new syntax.
Given that, I'd be inclined to take the performance hit and stick with a simpler AST representation if that yields a more hackable, maintainable compiler front end.
Usually yes, but it's still a neat trick to be aware of. For interpreted scripting languages, parsing can actually be a significant slowdown. Even more so once you get into text-based network protocols, which also need parsers (is CSS a programming language or a network protocol? :) )
(author here) I agree that it's a lot of complexity, and I acknowledge this in the article: "You can get quite far with just a bump allocator."
I didn't go into this at all, but the main benefit of this design is how well it interacts with CPU cache. This has almost no effect on the parser, because you're typically just writing the AST, not reading it. I believe that subsequent stages benefit much more from faster traversal.
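To make the flat-AST idea concrete, here's a minimal Rust sketch: all nodes live in one `Vec`, and they refer to each other by `u32` indices instead of `Box` pointers. The names (`ExprPool`, `ExprRef`) and the tiny expression language are illustrative, not necessarily the article's exact code.

```rust
// A node reference is just an index into the pool, not a pointer.
#[derive(Copy, Clone)]
struct ExprRef(u32);

// A tiny expression language, enough to show the layout.
enum Expr {
    Lit(i64),
    Add(ExprRef, ExprRef),
}

// All expressions live contiguously in one Vec (the "arena").
struct ExprPool(Vec<Expr>);

impl ExprPool {
    // Appending a node is a bump-allocator-style push; the parser
    // only ever writes to the end of the array.
    fn add(&mut self, e: Expr) -> ExprRef {
        let idx = self.0.len() as u32;
        self.0.push(e);
        ExprRef(idx)
    }

    fn get(&self, r: ExprRef) -> &Expr {
        &self.0[r.0 as usize]
    }

    // A recursive walk still works; it just chases indices, not pointers.
    fn eval(&self, r: ExprRef) -> i64 {
        match self.get(r) {
            Expr::Lit(n) => *n,
            Expr::Add(a, b) => self.eval(*a) + self.eval(*b),
        }
    }
}

fn main() {
    let mut pool = ExprPool(Vec::new());
    let a = pool.add(Expr::Lit(2));
    let b = pool.add(Expr::Lit(3));
    let sum = pool.add(Expr::Add(a, b));
    println!("{}", pool.eval(sum)); // prints 5
}
```

Note that `ExprRef` can be 32 bits where a `Box` would be 64, so the nodes themselves are smaller too, which is part of the cache win.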
(By the way, I am a huge fan of your work. Crafting interpreters was my introduction to programming languages!)
That's a good caution. However, traversing a flat AST (iterating a "struct of arrays" rather than a pointer-based tree) is also going to be faster. So subsequent passes of the compiler, say type checking and code emission, will also be faster. But by how much, or whether it's worth it even then, I'm not sure.
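To illustrate why that traversal can be fast: a parser appends children before their parents, so a later pass can often just walk the array front to back with no recursion or pointer chasing. A hedged Rust sketch (the types are illustrative, and it assumes the root node is last in the pool):

```rust
// Illustrative flat AST: children always appear at lower indices than
// their parents, which is the order a parser naturally appends them in.
#[derive(Copy, Clone)]
struct ExprRef(u32);

enum Expr {
    Lit(i64),
    Add(ExprRef, ExprRef),
}

// Evaluate with one linear pass instead of a recursive tree walk.
// Assumption: node i's children have indices < i, and the root is last.
fn eval_linear(pool: &[Expr]) -> i64 {
    let mut vals = vec![0i64; pool.len()];
    for (i, e) in pool.iter().enumerate() {
        vals[i] = match e {
            Expr::Lit(n) => *n,
            Expr::Add(a, b) => vals[a.0 as usize] + vals[b.0 as usize],
        };
    }
    *vals.last().expect("empty pool")
}

fn main() {
    // (2 + 3) + 4, laid out flat
    let pool = vec![
        Expr::Lit(2),                      // index 0
        Expr::Lit(3),                      // index 1
        Expr::Add(ExprRef(0), ExprRef(1)), // index 2
        Expr::Lit(4),                      // index 3
        Expr::Add(ExprRef(2), ExprRef(3)), // index 4: root
    ];
    println!("{}", eval_linear(&pool)); // prints 9
}
```

Real passes like type checking would compute a type per node instead of an `i64`, but the access pattern is the same: sequential reads the prefetcher loves.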