Scala Packrat Parser Combinators for DSLs

Authors: Raj Bains, Maciej Szpakowski

As a startup focused on building a product for Data Engineers one of the core skills in our team is Parsers and Functional Programming in Scala.

Previously we’ve worked on compilers for CUDA/Graphics at NVIDIA or for Visual Studio at Microsoft. One the other hand, we’ve worked on a DSL for Contract Definition Language (to express computable insurance & re-insurance contracts for catastrophic risk modeling), where we wrote the parser in F#.

The constraints for these two classes of compilers tend to be quite different. A hardware compiler typically takes a large team years to build, and often Lex/Yacc or Bison that’s LALR(*) is used for parsing. Recently, ANTLR that’s LL(*) is the parser of choice for many Java developers. These parsers quickly become complex and inscrutable.

As a startup, we develop at high speed with a lean team, and for we really liked Parser Combinators — since they’re simple, intuitive and fast to develop. We didn’t find comprehensive examples of Parser Combinators on the internet, so we’re adding one that introduces them and addresses practical issues when developing them.

Full code on Github is here. We’ll cover the following topics in the rest of the blog:

  1. What are Scala Parser Combinators?
  2. Solving Left Recursion with Packrat Parser Combinators
  3. Solving Operator Precedence
  4. Debugging

Let’s start with a motivating example where we’ll write a parser that can parse this simple function in pseudo-code, generating the Lexical Tokens and Abstract Syntax Tree (AST) shown below:

Initial code to be parsed

This requires a Lexer (also based on ParserCombinators) that generates these tokens:

Tokens from Lexical Analysis

and a Parser that consumes the tokens and produces the AST:

AST from parsing

What are Parser Combinators?

Parser combinators allow you to create tiny parsers and combine them to build very powerful parsers. Let’s say we want to use regular expressions to create a tokenizer. We can write one large expression and soon that’ll get complicated to comprehend. With parser combinators, each different token can have a regular expression parser that can be composed into one parser for the program.

Lets do this by example — we’ll start by trying to write a simple Lexer that reads the code string and breaks it down into tokens based on regular expressions:

Lexer based on RegexParser, full code
Visual Edits show instantaneously in Code Editor

Here, we note that there are 4 parsers, the identifier parser parses the regular expression, and on successful matching applies the function after ^^ that returns the IDENTIFIER case class (struct in Scala).

Composition of Parsers

Compositions of parsers can be done in a multiple ways. Or Composition as shown above with rep1(literal | identifier) that matches one or more repetitions of two parsers OR’ed together. This means that first the literal parser is applied at the current input position, and on failure the identifier parser is applied, so this gives us ordering of rules.

Sequential composition
matches one parser after the other in sequence. So LET() ~ variable ~ EQUAL() ~ expression4 ~ SEMI() matches let token (implicitly converted to let parser), on match tries to match variable parser, and so on…

Sequential composition of parsers

Solve Left recursion with Packrat Parsers

Recursive descent parsers including Parser combinators do not support left recursive grammars. This is often solved by modifying grammars to remove left recursion which is a painful process.

Therefore if you write a rule such as:

expression = expression ~ operator ~ expression ^^ {...}

you’ll get a stack overflow exception since you’re just recursing back into the same rule. However, Packrat parsers use lazy evaluation inbuilt in Scala to delay the evaluation till the result is required, allowing us to write such left recursive constructs.

Performance of Packrat Parsers

Lazy evaluation also provides memoization that makes sure that the matched rules are not applied again (by different paths), leading to linear time parsing at the expense of some memory. This also gives us infinite lookahead and supports a really large class of grammars. For the people who’re interested in knowing more, see Ford’s paper of Packrat parsing and Manohar’s implementation in Scala.

Operator Precendence

Operator precedence does not often lend itself to a clean solution. If you implement the parser simply you’d do it like this:

Sequential composition of parsers

You run this and quickly find the problem with precedence: the “+” incorrectly binds tighter than the “*”.

AST with incorrect precedence

In recursive descent parsers, the standard solution to this is shown to be a scheme like this:

expression = ["+"|"-"] term {("+"|"-") term} .

term = factor {("*"|"/") factor} .

factor =
ident
    | number
    | "(" expression ")"

This only works for a very simple case, but when you look at a real operator precedence such as the one for Java you see that there are 16 levels of unary and binary expressions that are to be interspersed with all the other expressions (eg. if-else, function calls) at each level.

For this blog, the parser we added shows 4 levels. We broke the overall structure of parsers using class inheritance:

Building Precedence into Lexer and Parser structure

Now, we create multiple levels of expression parsers, each overriding binary operator parser per precedence level.

Precedence with parsers

Debugging Parsers

One thing that doesn’t get talked about much is debugging of parsers. Sometimes a small typo might lead to a rule not being applied where you thought it should’ve been. Here is the debug output, that shows the rules being tried, which ones failed and which ones matched.

ebugging parsers

Finally

We’ve shown how to very quickly develop parsers for real world DSLs using parser combinators. If you’re using these in the field and want to exchange notes, please reach out to us!