Appendix A • Hand Rolled Parser - A Character at a Time - 《Build your own Lisp》

A Character at a Time

Reading • A Jolly Good Time (tm).

Implementing a parser by hand calls for quite a different way of thinking from the high-level, abstract view we were given with mpc. Instead of thinking about the language, we need to think about the process.

Usually this process takes a very simple form - a parser is almost always just a loop, which repeatedly reads a character at a time from the input, and each time decides what to do with it. The challenge is in making this process elegant. It all starts to get a little messy when we think about whitespace, and comments, and everything else.

To give an idea of how it might work - in our Lisp, if we encounter the character d in the input, we can store it in some string. We also know we must be reading a symbol, so we can enter a state where we look for more letters, adding each one to the string. Once we’ve found no more letters in the input we can return the whole thing as a symbol (for example def) and start again.
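The "keep collecting letters until they run out" step can be sketched in isolation. The helper below is hypothetical: the name scan_letters and its out and cap parameters are assumptions made for this illustration, and it only recognises letters rather than the full set of symbol characters. It is a minimal sketch of the idea, not the lval_read_sym function used later.

```c
#include <ctype.h>

/* Hypothetical helper, not part of the book's code.
   Copies the run of letters starting at s[i] into out (at most cap-1
   characters, always null terminated; cap must be at least 1) and
   returns the position of the first character that is not a letter. */
int scan_letters(const char* s, int i, char* out, int cap) {
  int n = 0;
  /* isalpha('\0') is false, so the loop also stops at end of input */
  while (isalpha((unsigned char)s[i])) {
    if (n < cap - 1) { out[n++] = s[i]; }
    i++;
  }
  out[n] = '\0';
  return i;
}
```

Called as scan_letters("def x", 0, buf, sizeof buf), this leaves def in buf and returns 3, the position of the space where reading should start again.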

The function lval_read_expr is basically going to work like this. We’re going to take as input some string s, some position i in that string, and decide what to do next. While the next character isn’t the one specified by the argument end we will try to read in whatever thing appears next, create an lval object from it, and append it to the first argument v.

If instead we reach the character specified by end we’re going to stop and return the position just after it to the caller. This return value lets whoever calls lval_read_expr see how much of the string has been consumed and how much is left.

For now let us assume the next character isn’t the end character. The first thing we need to check is that we’ve not reached the end of the input. If we’ve reached the end of the input without encountering the end character then we can record a syntax error, by appending an lval_err to v, and jump to the end of the input to ensure no more is consumed.

int lval_read_expr(lval* v, char* s, int i, char end) {
  while (s[i] != end) {
    /* If we reach end of input then there is some syntax error */
    if (s[i] == '\0') {
      lval_add(v, lval_err("Missing %c at end of input", end));
      return strlen(s)+1;
    }

After this we can check if the next character is whitespace. Any whitespace characters we can just skip over as our language is not whitespace sensitive.

    /* Skip all whitespace */
    if (strchr(" \t\v\r\n", s[i])) {
      i++;
      continue;
    }

Another easy case is if the next character is a semi-colon ;. If it is a semi-colon we are starting a comment and we can ignore the rest of the characters until we reach a new line.

    /* If next char is ; then read comment */
    if (s[i] == ';') {
      /* Stop on the newline or terminator; the loop handles either next */
      while (s[i] != '\n' && s[i] != '\0') { i++; }
      continue;
    }
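To see the whitespace and comment rules on their own, here is a minimal standalone sketch that combines them. The function name skip_blank is an assumption made for this illustration; it is not part of lval_read_expr. One detail worth knowing: strchr treats the terminating '\0' as part of the string it searches, so strchr(" \t\v\r\n", '\0') is not NULL. The function in the text is safe because it checks for '\0' first, and the sketch guards against it explicitly.

```c
#include <string.h>

/* Hypothetical helper, not part of the book's code.
   Returns the position of the first character at or after i that is
   neither whitespace nor inside a ; comment. Never moves past '\0'. */
int skip_blank(const char* s, int i) {
  for (;;) {
    /* Guard against '\0': strchr would otherwise match the terminator */
    if (s[i] != '\0' && strchr(" \t\v\r\n", s[i])) {
      i++;
      continue;
    }
    if (s[i] == ';') {
      /* Leave i on the newline or terminator that ends the comment */
      while (s[i] != '\n' && s[i] != '\0') { i++; }
      continue;
    }
    return i;
  }
}
```

For the input "  ; hi\n  x" starting at position 0 this returns 9, the position of x. For an input that is nothing but a comment it returns the position of the terminator.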

If the next character is an open parenthesis ( or an open curly bracket { we need to parse either an S-Expression or a Q-Expression. For these we can call lval_read_expr recursively, supplying it with a different character to end on and a different expression to write the results to.

    /* If next character is ( then read S-Expr */
    if (s[i] == '(') {
      lval* x = lval_sexpr();
      lval_add(v, x);
      i = lval_read_expr(x, s, i+1, ')');
      continue;
    }
    /* If next character is { then read Q-Expr */
    if (s[i] == '{') {
      lval* x = lval_qexpr();
      lval_add(v, x);
      i = lval_read_expr(x, s, i+1, '}');
      continue;
    }
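The way the recursion and the returned position work together can be hard to picture, so here is a stripped-down sketch that keeps only that part. It builds no lval values and ignores strings and comments; it just consumes input. The name skip_expr and the convention of returning -1 on malformed input are assumptions of this sketch (the function in the text records an lval_err and returns strlen(s)+1 instead).

```c
/* Hypothetical, simplified stand-in for lval_read_expr.
   Returns the position just after the matching end character,
   or -1 if the brackets in the input do not balance. */
int skip_expr(const char* s, int i, char end) {
  while (s[i] != end) {
    /* Ran out of input before seeing end */
    if (s[i] == '\0') { return -1; }
    /* Nested S-Expr: recurse, then carry on from where it stopped */
    if (s[i] == '(') {
      i = skip_expr(s, i + 1, ')');
      if (i < 0) { return -1; }
      continue;
    }
    /* Nested Q-Expr: same idea with a different end character */
    if (s[i] == '{') {
      i = skip_expr(s, i + 1, '}');
      if (i < 0) { return -1; }
      continue;
    }
    /* A closing bracket we were not waiting for */
    if (s[i] == ')' || s[i] == '}') { return -1; }
    i++;
  }
  return i + 1;
}
```

A top-level call passes '\0' as the end character, so skip_expr("(+ 1 {2 3})", 0, '\0') returns 12, one past the terminator of the 11 character input. Each nested call hands back the position just after its own closing bracket, which is exactly where its caller needs to resume.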

Those are all the easy cases done. Now we need to decide what to do if we encounter some letter or number. In this case we need to keep reading letters or numbers until we reach a character that isn’t one. For simplicity we’re going to treat numbers as a special case of symbols, and if we encounter any of these characters we’re going to call the function lval_read_sym, which we’ll define later.

    /* If next character is part of a symbol then read symbol */
    if (strchr(
      "abcdefghijklmnopqrstuvwxyz"
      "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
      "0123456789_+-*\\/=<>!&", s[i])) {
      i = lval_read_sym(v, s, i);
      continue;
    }
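The character test above can be pulled out into a small predicate, which makes it easier to try out on its own. The name is_symbol_char is an assumption made for this sketch. Note that inside a C string literal \\ is a single backslash, so the sequence \\/ adds both the backslash and the forward slash to the set.

```c
#include <string.h>

/* Hypothetical helper, not part of the book's code.
   Returns 1 if c may appear in a symbol (or number), 0 otherwise. */
int is_symbol_char(char c) {
  /* strchr would match the terminator, so rule out '\0' first */
  return c != '\0' && strchr(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789_+-*\\/=<>!&", c) != NULL;
}
```

With this, is_symbol_char('\\') and is_symbol_char('/') are both 1, while is_symbol_char('(') and is_symbol_char('\0') are 0.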

We also have to deal with strings. If we reach a " character we’re going to have to consume everything we encounter up until the next unescaped ". For this we can call a function lval_read_str, which we’ll define later.

    /* If next character is " then read string */
    if (s[i] == '"') {
      i = lval_read_str(v, s, i+1);
      continue;
    }
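What "the next unescaped quote" means can be shown with a short standalone sketch. This is not the lval_read_str function defined later: the name find_closing_quote and the -1 return value for an unterminated string are assumptions made for this illustration, and it only locates the end of the string rather than building an lval.

```c
/* Hypothetical helper, not part of the book's code.
   i is the position just after the opening quote. Returns the index
   of the closing unescaped quote, or -1 if the input ends first. */
int find_closing_quote(const char* s, int i) {
  while (s[i] != '"') {
    if (s[i] == '\0') { return -1; }
    /* A backslash escapes the next character, so step over both */
    if (s[i] == '\\' && s[i + 1] != '\0') { i++; }
    i++;
  }
  return i;
}
```

Given the characters a\"b" rest and a starting position of 0, the escaped quote at position 2 is skipped and the function returns 4.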

Finally, if we somehow encounter something else we had better record an error and skip to the end of the input. And as mentioned before, if we do actually match our end character and the while loop ends, we just need to return the updated position in the input.

    /* Encountered some unknown character */
    lval_add(v, lval_err("Unknown Character %c", s[i]));
    return strlen(s)+1;
  }
  return i+1;
}

That completes the body of our function lval_read_expr. Now we just need to fill in the gaps.