logoalt Hacker News

dcrazyyesterday at 6:43 PM2 repliesview on HN

How would one typically implement this tokenization? Pre-pass on the input? My initial thought was to push an operand-terminator token when encountering an operator, but it was unclear to me whether it should be pushed to the stack or the output.


Replies

justinpombrioyesterday at 9:21 PM

Parsing is usually implemented in two steps: first there's lexing where the input is turned into a sequence of tokens, then there's proper parsing where the sequence of tokens is turned into an abstract syntax tree. This shunting yard algorithm is part of parsing. Lexing happens before that.

For example, this input:

    13- 2  *5 + 4
would be lexed into this sequence of tokens:

    "13" "-" "2" "*" "5" "+" "4"
You could use the shunting yard algorithm to put these in a convenient postfix order:

    "13" "2" "5" "*" "-" "4" "+"
It's really easy to then turn it into an abstract syntax tree (AST). Go through each token in order: if it's a number you make it into a node and push it onto a stack; if it's a binary operator you pop two nodes off the stack and make a node out of those and the binary operator. If you write down this AST as an s-expression you get:

    (+ (- 13 (* 2 5)) 4)
fc417fc802yesterday at 10:13 PM

While the adjacent comment provides a proper answer to the broader question I thought I'd answer you more directly. A function that consumes one or more characters and then outputs a token will suffice. You turn the character stream into a token stream (or an altered character stream) on the fly. See lisp reader macros for example.