Recap & Introduction

Rea Code: https://github.com/lackhoa/rea-engine

This is a series about a theorem prover using C++. In part 1 we stated the problem as well set up a framework to attack it. In this part, I… attempt to figure out what the heck I did for this program, meanwhile telling you about the only thing I can still make sense of: the parser.

When making a programming language, the parser is the first thing we have to build, because it is directly in charge of reading our program text. You might say “Well wait a minute, isn’t the OS in charge of that?”. And you’d be correct, but that part has nothing specific to do with our project, so I skipped it. In this post I assume we already have the whole program text in memory and work from there. I don’t assume any prior CS compiler knowledge, as I’ll explain everything along the way.

What is parsing and why do we need it?

As you can already know, I want to justify every concept I mention. Parsing is a common concept used in all compilers, but it’s not immediately obvious why it’s there. Well, the reason is in the way we chose to encode the information.

Suppose we read from a file this line of text (a+b)+c which is stored as a one-dimensional stream of bytes, but what we really want is this tree

    +
   / \
  +   c
 / \
a  b

Why do we want this tree? Well because if we have the tree, we can answer these questions:

  • Q: What’s the operator? A: +
  • Q: What’s the first operand? A: a+b
  • Q: What’s the second operand? A: c
  • Q: What’s the operator of the first operand? A: +

In other words, we will have a complete structure of the text and have pointers to jump to different parts of it. If you just work with the text (a+b)+c, you’d have no idea how to answer these questions. To see why, try computing this expression by hand:

3 + (1 - 2) * 3 * 2 / 2 + 2 + 3 - (4 - 2)*2 + 2 - 3 - 1 - 3 * 2

Note that while the arithemtic isn’t complicated, the structure is. The reason why you can’t do it very quickly is because you don’t understand what the expression is saying. Well, computers have the same problem. So that’s what parsers do: they convert text into trees.

A question you might ask is: can we somehow avoid this problem? If we could just give this tree to the computer, we wouldn’t need parsing. I guess that is the case for these visual programming languages. Although I imagine they’d still need to convert the tree into binary data when it’s time to store the program, thus end up solving pretty much the same problem anyway.

First off: Lexing

Lexing is a sub-problem of parsing. The job is even simpler: convert a stream of characters into a stream of tokens.

What are tokens and why do we need them? First let me show you some code, since matters are getting sufficiently detailed.

struct String
{
  char *chars;
  i32   length;                 // note: does not include the nil terminator
  operator bool() {return chars;}
};

struct Token
{
  String    string;
  i32       line;
  i32       column;
  TokenKind kind;
  operator String() {return string;};
};

enum TokenKind
{
  Token_Colon   = ':',
  // 0-255 reserved for single-char ASCII types.

  Token_Special = 256,
  ...
  Token_Arrow,
  Token_StrongArrow,

  Token_Keyword_START,
  Token_Keyword_fn,
  Token_Keyword_union,
  ...
};

Tokens are portions of text that go together. The reason why we need tokens is again due to our linguistic needs. We wouldn’t need tokens if the code we expect always looks like this:

( a + by ) + c

rather than this

(a+by)+c

Do you see the problem now? We want the letter “a” to be its own separate thing and not stick to the “(“ before it or the “+” after it. Since we don’t want to put spaces everywhere, we make the computer figure it out.

The core of the tokenizer (maybe I should have called it the lexer :>) is the code to get the next token, which looks like so

inline char nextChar();

forward_declare inline Token *
eatToken()
{
  auto tk = TK;
  Token out = {};
  out.string.chars = tk->at;
  out.line         = tk->line;
  out.column       = tk->column;

  switch (char first_char = nextChar())
  {
    case '"':
    {
      out.string.chars++; // advance past the opening double quote
      out.kind = Token_StringLiteral;
      while (*tk->at != '"')
        nextChar();
      // handle the closing double quote
      nextChar();
      out.string.length = (i32)(tk->at - out.string.chars - 1);
    } break;

    <... bunch of cases>

    default:
    {
      if (isAlphaNumeric(first_char))
      {
        out.kind = Token_Alphanumeric;
        while (isAlphaNumeric(*tk->at))
          nextChar();
      }
      else if (isSpecial(first_char))
      {
        out.kind = Token_Special;
        while (isSpecial(*tk->at))
          nextChar();
      }
      else
        out.kind = (TokenKind)first_char;
    } break;
  }

  tk->last_token = out;
  // NOTE: :always-eat-spaces We eat spaces afterward, so that we can always
  // check *tk->at to see if there's anything left to parse.
  eatAllSpaces();
  return &tk->last_token;
}

Nothing much to say here… that’s all the tokenizer does. We request the next token, it returns the next token to us if it can. So for the example “(a+by)+c” above, it will return this stream of tokens

- kind: (;          string: (
- kind: identifier; string: a
- kind: special;    string: +
- kind: identifier; string: by
- kind: );          string: )
- kind: special;    string: +
- kind: identifier; string: c

What is “kind”? For example, all keywords have their own kind: “fn”, “union”, “in”, “let”, etc. While technically the token kind can be inferred from the token itself, we don’t do that since it incurs a cost. For example to check if a token is the keyword “union”, we need to compare it against the string “union”; the longer the string, the longer the comparison takes. So I guess we store the kind in the token because we expect to reuse it more than once. Here’s an example of usage for the token kind

inline b32
isExpressionEndMarker(Token *token)
{
  ...
  switch (token->kind)
  {
    case Token_Directive:
    case Token_DoubleColon:
    case Token_ColonEqual:
    case Token_DoubleDash:
    case Token_StrongArrow:
    case Token_Keyword_in:
    case Token_DoubleDot:
    {
      return true;
    }
    default: {}
  }

  return false;
}

Parsing

A parser takes the stream of tokens and builds a tree (remember the tree? That’s our goal!). In compiler lingo we call it the AST (short for “Abstract Syntax Tree”, pronounced “assed”). To be honest I don’t really get (or care about) the “abstract” part, we could just call it “syntax tree”. But since the acronymn AST is very handy to use for naming things in the code, that’s the term we use here.

Note that I used “a parser” instead of “the parser”, because there are many of them. Parsers are just functions, really.

How would I describe a parser function? We already know that it the kind of function that reads tokens and based on the token it read, builds an AST tree. It probably calls other parsers at some point, perhaps even itself. Sometimes it reports parse errors and fails. If there’s no error, it returns the AST tree it has built to the caller.

You don’t write these parsers all at once, you write them as you build out the language. I myself had to continuously write new parsers right to the end of the project. Most of them are terribly boring. But the expression parser is the first one you have to write, and also quite interesting.

An expression is a mathematical expression like (a+b)+c*d. It is composed of operands (such as “a”) that are glued together using binary operators (such as “+”). Operands are also expressions, which is why expressions are tree.

The expression syntax structure is pretty much the building block of our parser (in the sense that they almost always appear at the lowest level of our AST tree). Our requirement is also pretty complicated, for example: how do you correctly parse this?

(a+b)+c*d

Here’s the correct tree

      +
     / \
    /   \
   +     *
  / \   / \
 a   b c   d

Here’s an incorrect one

       *
      / \
     +   \
    / \   \
   +   c   d
  / \
 a   b

Do you know what I’m talking about? It’s like the rule of reading math expressions: parentheses first, then comes multiplication and division, then addition and subtraction. That’s called operator “precedence”.

This is again not a universal requirement of computer languages. For example we could always parenthesize everything like in Lisp

(+ (+ a b) (* c d) e)

which is more pleasant in some ways, but the lack of infix operators and operator precedence is are especially problematic for mathematics, for example

a = b -> c = d

is way easier to read than

(-> (= a b) (= c d))

Alright, so how to resolve this operator precedence business? The solution I’ve found online is pretty simple. Here’s the abridged version of the code.

struct Ast {
  AstKind kind;
  Token token;
  u32   flags;
  u32   serial;
};

struct Identifier : Ast {
  // NOTE: Since the ast has a token, which already has the identifier in it, we
  // don't need to put it in the identifier struct.
};

struct CompositeAst : Ast {
  Ast  *op;
  i32   arg_count;
  Ast **args;
};

struct ParseExpressionOptions {
  i32 min_precedence = -9999;
};

inline b32
isExpressionEndMarker(Token *token)
{
  if (inString(")]}{,;:", token))
  {
    return true;
  }

  return false;
}

internal Ast *
parseOperand()
{
  Ast *operand = 0;
  Token token = nextToken();
  Arena *arena = parse_arena;
  switch (token.kind)
  {
    case '(':
    {
      operand = parseExpression();
      requireChar(')');
    } break;

    case Token_Alphanumeric:
    case Token_Special:
    {
      operand = newAst(arena, Identifier, &token);
    } break;

    default:
    {
      tokenError("expected start of expression");
    }
  }
  NULL_WHEN_ERROR(operand);
  return operand;
}

internal Ast *
parseExpression_(ParseExpressionOptions opt)
{
  Arena *arena = parse_arena;
  Ast *out = 0;
  if (Ast *operand = parseOperand())
  {
    // (a+b) * c
    //     ^
    for (b32 stop = false; !stop && hasMore();)
    {
      Token op_token_ = peekToken(); auto op_token = &op_token_;
      if (isIdentifier(op_token))
      {// infix operator syntax
        // (a+b) * c
        //        ^
        Identifier *op = newAst(arena, Identifier, op_token);
        i32 precedence = precedenceOf(op_token->string);
        if (precedence >= opt.min_precedence)
        {
          // recurse
          eatToken();
          // a + b * c
          //      ^
          ParseExpressionOptions opt1 = opt;
          opt1.min_precedence = precedence;

          if (Ast *recurse = parseExpression_(opt1))
          {
              i32 arg_count = 2;
              Ast **args    = pushArray(arena, arg_count, Ast*);
              args[0] = operand;
              args[1] = recurse;

              CompositeAst *new_operand = newAst(arena, CompositeAst, op_token);
              new_operand->op        = op;
              new_operand->arg_count = arg_count;
              new_operand->args      = args;
              operand = new_operand;
          }
        }
        else
        {
          // we are pulled to the left
          // a * b + c
          //      ^
          stop = true;
        }
      }
      else if (isExpressionEndMarker(op_token))
      {
        stop = true;
      }
      else
      {
        tokenError(op_token, "expected operator token");
      }
    }
    if (noError())
      out = operand;
  }

  NULL_WHEN_ERROR(out);
  return out;
}

To understand what the code does, let’s walk through the process of parseExpression on the input text (a+b)+c*d.

(a+b)*c+d
^
parseExpression calls parseOperand()
We'll describe what this nested call does using one vertical bar.
| parseOperand sees "(", so it calls parseExpression
 (a+b)*c+d
  ^
|| parseExpression calls parseOperand()
||| parseOperand sees "a", so it returns the identifier AST "a"
 (a+b)*c+d
   ^
|| the next token is an identifier "+",
   whose precedence not smaller than "min_precedence",
   so we recurse with the precedence of "+"
 (a+b)*c+d
    ^
||| parseExpression sees "b" and then ")",
    which is an expression end marker,
    so it just returns identifier "b"
 (a+b)*c+d
     ^
|| (coming out of the recursion) we build up a new composite expression "a+b"
|| we see ")", which is an expression end marker, so we stop parsing and return "a+b"
| parseOperand returns the operand "(a+b)"
 (a+b)*c+d
      ^
parseExpression now has one operand "(a+b)", it sees the operator "*",
so we recurse with the "min_percedence" parameter set to "*"!

 (a+b)*c+d
       ^
| we parse the operand "c"
 (a+b)*c+d
        ^
| here comes the interesting part: we see an operator "+".
  Previously we have been recursing, but this time "parseExpression" has been constrained by "*".
  So we stop and return "c".
  If we had continued, we would have returned "c+d",
  which would be wrong since "c+d" is not a sub-expression that appeared anywhere in the input expression.

parseExpression receives the new operand "c",
then combined it with what we have so far to build up the expression "(a+b)*c"

and you can check that we proceed merrily to glue what we have with "d"
to get "((a+b)*c) + d".
Then we return the result since there's nothing more to be parsed.

Please verify that the output AST is correct.

I use these kind of code walk-throughs like the above example a lot during development of the language. It works extremely well in understanding code, developing new algorithms as well as debugging.

I learned about this technique from a video by Jonathan Blow. I don’t know the original source since he didn’t cite one, I myself didn’t read it either.

Putting everything together

Ok so let’s put together what the parser does when it looks at one specific line of code.

Input Text: (ax+by)*c+d

AST
                 +
                / \
               *   \
              / \   \
             /   \   \
            +     \   \
           / \     \   \
          /   \     \   \
         /     \     \   \
Token ( ax  +  by ) * c + d

Note that the fundamental structure of the parsing job arose from our requirements of the language. When the language changes, the parser would look very different. That said, our language requirements are common to the majority of programming languages, which is why I’m writing about this in the first place.

Also there’s one more thing to point out about our specific implementation: in the token struct there is line and column pointer to the source code; and in the AST struct there is a token.

struct Token
{
  String    string;
  i32       line;
  i32       column;
  TokenKind kind;
  operator String() {return string;};
};

struct Ast {
  AstKind kind;
  Token token;
  u32   flags;
  u32   serial;
};

That way you can always trace from the expression all the way down to the source code, which is very useful for debugging and error reporting, like so

(a + "3")
     ^
Error: cannot add to a string

In conclusion, that’s one simple way in which a parser could work. Hopefully this has been comprehensive. If I left out some part, it must have been accidental and not on purpose. I just want to show you that this is very simple stuff and you don’t need a CS degree to write one. You don’t need a CS degree to do anything for that matter.


<
Previous Post
Rea: a theorem prover written in C++ (Part 1: Problem Statement & Motivation)
>
Next Post
Rea: a theorem prover written in C++ (Part 3: The type system)