This is, to me, an odd way to approach parsing. I get the impression the author is relatively inexperienced with Rust and the PL ideas it builds on.
A few notes:
* The AST would, I believe, be much simpler defined as algebraic data types (see the sketch after this list). It's not like the sqlite grammar is going to randomly grow new nodes that require the extensibility their convoluted encoding provides. The encoding they use looks like what someone familiar with OO, but not with algebraic data types, would come up with.
* "Macros work different in most languages. However they are used for mostly the same reasons: code deduplication and less repetition." That could be said for any abstraction mechanism. E.g. functions. The defining features of macros is they run at compile-time.
* The work on parser combinators would be a good place to start to see how to structure parsing in a clean way.
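To make the first point concrete, here is a hedged sketch of the ADT style I mean; the node set is invented for illustration, not sqlite's actual grammar or the article's encoding:

```rust
// A minimal ADT-style AST (invented node set, not sqlite's grammar).
enum Expr {
    Literal(i64),
    Column(String),
    Binary { op: BinOp, lhs: Box<Expr>, rhs: Box<Expr> },
}

enum BinOp { Add, Eq }

// Pattern matching makes traversals total: forget a variant and rustc complains.
fn eval_const(e: &Expr) -> Option<i64> {
    match e {
        Expr::Literal(n) => Some(*n),
        Expr::Column(_) => None,
        Expr::Binary { op: BinOp::Add, lhs, rhs } => Some(eval_const(lhs)? + eval_const(rhs)?),
        Expr::Binary { .. } => None,
    }
}

fn main() {
    let e = Expr::Binary {
        op: BinOp::Add,
        lhs: Box::new(Expr::Literal(1)),
        rhs: Box::new(Expr::Literal(2)),
    };
    assert_eq!(eval_const(&e), Some(3));
}
```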
I don't know. Having written a small parser [0] for Forsyth-Edwards chess notation [1], I'd say Haskell takes the cake here in terms of simplicity and legibility; it reads almost as clearly as BNF, and there is very little technical ceremony involved, letting you focus on the actual grammar of whatever it is you are trying to parse.
[0] https://github.com/ryandv/chesskell/blob/master/src/Chess/Fa...
[1] https://en.wikipedia.org/wiki/Forsyth%E2%80%93Edwards_Notati...
I have experience writing parsers (lexers) in Ragel, using Go, Java, C++, and C. I must say, once you have some boilerplate generator in place, raw C is as good as the Rust code the author describes. Maybe even better, because of its simplicity. For example, this is most of the code necessary for a JSON parser: https://github.com/gritzko/librdx/blob/master/JSON.lex
In fact, that eBNF only produces the lexer. The parser part is not that impressive either: 120 LoC and quite repetitive. https://github.com/gritzko/librdx/blob/master/JSON.c
So, I believe, a parser infrastructure evolves until it only needs the eBNF to produce a parser. That is the saturation point.
So, just to kick this off: I wrote an eBPF disassembler and a (half-hearted) emulator in Rust, and I also found it a pleasant language to do parsing-type stuff in. But I think the author cuts against their argument when they end up needing a macro less than 1/6th of the way into their case study. A macro isn't quite code-gen, but it also doesn't quite feel like working idiomatically within the language, either.
Again: not throwing shade. I think this is a place where Rust is genuinely quite strong.
One mind-blowing experience for me:
I can take my parser combinator library that I use for high-level compiler parsers, and use that same library in a no-std setting and compile it to a micro-controller, and deploy that as a high-performance protocol parser in an embedded environment. Exact same library! Just with fewer String and more &'static str.
So toying around with compilers translates my skill-set rather well into doing embedded protocol parsers.
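A minimal sketch of why that works, assuming (as in my library) the combinators only borrow from the input; `PResult` and `tag` are invented names, not my actual API:

```rust
// A parser takes input and returns (parsed value, rest of input), borrowing
// everything from the input slice: no String, no allocation, so no_std-friendly.
type PResult<'a, T> = Result<(T, &'a str), &'a str>;

/// Consume a fixed keyword off the front of the input.
fn tag<'a>(expected: &str, input: &'a str) -> PResult<'a, &'a str> {
    match input.strip_prefix(expected) {
        Some(rest) => Ok((&input[..expected.len()], rest)),
        None => Err(input),
    }
}

fn main() {
    // The same function runs on a desktop or, in a #![no_std] build, on a microcontroller.
    assert_eq!(tag("GET ", "GET /index"), Ok(("GET ", "/index")));
}
```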
Related: I love Rob Pike's talk on lexical scanning in Go (2011).
An educational and elegant approach.
Something that was hard when I wrote a full AST parser in Rust was representing a hierarchy of concrete AST types, with upcasting and downcasting. I was able to figure out a way, but it required some really weird type shenanigans (e.g. PhantomData) and some macros. Looks like they had to do crazy macros here too.
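For concreteness, here is a rough sketch of the kind of trick I mean; every name is invented for illustration, not the article's actual types:

```rust
use std::marker::PhantomData;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind { Expr, BinaryExpr, Literal }

// One uniform, untyped node representation...
#[derive(Debug, Clone)]
struct RawNode { kind: Kind /* children, tokens, ... */ }

// ...plus a typed view over it; PhantomData carries the type, costing nothing at runtime.
struct Typed<T> { raw: RawNode, _marker: PhantomData<T> }

struct BinaryExpr;

trait AstNode: Sized {
    // Downcast: untyped -> typed view, checked against the node kind.
    fn cast(raw: RawNode) -> Option<Typed<Self>>;
}

impl AstNode for BinaryExpr {
    fn cast(raw: RawNode) -> Option<Typed<Self>> {
        (raw.kind == Kind::BinaryExpr).then(|| Typed { raw, _marker: PhantomData })
    }
}

fn main() {
    let raw = RawNode { kind: Kind::BinaryExpr };
    let bin = BinaryExpr::cast(raw).expect("kind matched");
    // Upcast is trivial: just take the untyped RawNode back out.
    println!("{:?}", bin.raw.kind);
}
```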
Curious what the rest of the prior art looks like
Well, good luck parsing sqlite syntax! I had to write a parser for a (fairly small) subset of sqlite for work a couple of years ago. I really like sqlite; it's always a source of inspiration.
The railroad diagrams are tremendously useful:
https://www.sqlite.org/syntaxdiagrams.html
I don't think the lemon parser generator gets enough credit:
https://sqlite.org/src/doc/trunk/doc/lemon.html
With respect to the choice of language, any language with algebraic data types would work great. Even TypeScript would be great for this.
FWIW I wrote a small introduction to writing parsers by hand in Rust a while ago:
https://www.nhatcher.com/post/a-rustic-invitation-to-parsing...
So how do you debug code written with macros like this, or come into it as a new user of the codebase?
I’m imagining seeing the node! macro used, and seeing the macro definition, but still having a tough time knowing exactly what code is produced.
Do I just use the Example and see what type hints I get from it? Can I hover over it in my IDE and see an expanded version? Do I need to reference the compiled code to be sure?
(I do all my work in JS/TS so I don’t touch any macros; just curious about the workflow here!)
Sorry, my OCD is kicking in, but "Asterisk" is misspelled as "Asteriks" throughout your sample code.
Imperative rust is really good for parsing, but you can also get a long way with regexes. Especially if you are just prototyping or doing Advent of Code.
I do still like declarative parsing over imperative, so I wrote https://docs.rs/inpt on top of the regex crate. But Andrew Gallant gets all the credit, the regex crate is overpowered.
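For instance, a quick prototyping sketch with the regex crate; the line format here is a made-up Advent-of-Code-style input:

```rust
use regex::Regex;

fn main() {
    // "1-3 a: abcde" -> two bounds, a letter, and a candidate string.
    let re = Regex::new(r"^(\d+)-(\d+) ([a-z]): (\w+)$").unwrap();
    let caps = re.captures("1-3 a: abcde").unwrap();
    let lo: usize = caps[1].parse().unwrap();
    let hi: usize = caps[2].parse().unwrap();
    println!("{}..={} of '{}' in {}", lo, hi, &caps[3], &caps[4]);
}
```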
I think that, except for macros, most of these features are ML-family language features as well. Rust stands out because it can implement them as efficient, zero-overhead abstractions.
I find Megaparsec in Haskell quite expressive, based on my limited experience using nom in Rust.
I've found that the logos crate is really nice for writing lexers in rust
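Agreed. A minimal sketch of the flavor (the token set is made up, and this assumes the logos 0.13+ API where the lexer yields Results):

```rust
use logos::Logos;

#[derive(Logos, Debug, PartialEq)]
#[logos(skip r"[ \t\n]+")] // whitespace between tokens is skipped
enum Token {
    #[token("select")]
    Select,
    #[regex("[0-9]+")]
    Number,
}

fn main() {
    let mut lex = Token::lexer("select 42");
    assert_eq!(lex.next(), Some(Ok(Token::Select)));
    assert_eq!(lex.next(), Some(Ok(Token::Number)));
    assert_eq!(lex.next(), None);
}
```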
Mentioning macros as a reason to love Rust goes against my experience with them.
This is the third day in a row this article has been posted here.
This time it got traction. Funny how HN works.
Every rust article: "Look how great this rust feature is and how clean and concise the resulting code is!"
Me: "How can a programming language be so damn complex? Am I just dumb?"
Does anyone have a good EBNF notation for sqlite? I tried to make a tree-sitter grammar, which produces C code and great Rust bindings. But sqlite uses the lemon parser generator, and I'm not sure how to read the grammar from that.
I'll throw in a plug for https://pest.rs/, a PEG-based parser-generator library in Rust. It's delightful to work with and removes so much of the boilerplate involved in a parser.
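A small taste of what it looks like, using pest's inline-grammar derive; this toy grammar is made up for illustration:

```rust
use pest::Parser;
use pest_derive::Parser;

#[derive(Parser)]
#[grammar_inline = r#"
number = { ASCII_DIGIT+ }
list   = { number ~ ("," ~ number)* }
"#]
struct ListParser;

fn main() {
    // `Rule` is generated by the derive from the grammar above.
    let pairs = ListParser::parse(Rule::list, "1,22,333").expect("parse failed");
    for pair in pairs.flatten() {
        println!("{:?}: {:?}", pair.as_rule(), pair.as_str());
    }
}
```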
I cannot agree less, C++ is the best and always will be. You youngsters made up this new dialect that can also compile with the C++ compiler. This is like people putting VS Code in dark mode thinking they're now also working in the Terminal like the Gods of Binary.
About 2 months ago I would have said the same as the author, but I kept running up against the hard edge of Rust: the borrow checker. I realised that while I really liked the algebraic data types (e.g. enums) and pattern matching, the borrow checker and the low-level memory concerns meant I spent a lot of time fighting the compiler instead of the PL issues at the heart of my project. So while tokenising/parsing was nice, interpreting and typechecking became the bane of my existence.
With that realisation I started looking for another, more suitable language. I knew the FP aspects of Rust were what I was after, so at first I considered something like F#, but I didn't like that it's tied to Microsoft/.NET. Looking a bit further, I could have gone with something like Zig or C, but then I'd lose the FP niceness I was looking for. I also spent a fair amount of time looking at Go, but eventually decided against it because 1. I wanted a fair amount of syntax sugar, and 2. Go is a server-side language; a lot of its features and standard library are geared towards that use case.
Finally I found OCaml. What really convinced me was that its syntax is like a friendly version of Haskell, or like Rust without lifetimes. In fact the first Rust compiler was written in OCaml, and OCaml is well known in the programming-language space. I'm still learning it, so I can't give a fair review yet, but so far it's exactly what I was looking for.