Scala Packrat Parser Combinators for DSLs

Authors:
Raj Bains | Maciel Szpakowski

As a startup focused on building a product for Data Engineers one of the core skills in our team is Parsers and Functional Programming in Scala.

Previously we’ve worked on compilers for CUDA/Graphics at NVIDIA or for Visual Studio at Microsoft. One the other hand, we’ve worked on a DSL for Contract Definition Language (to express computable insurance & re-insurance contracts for catastrophic risk modeling), where we wrote the parser in F#.

The constraints for these two classes of compilers tend to be quite different. A hardware compiler typically takes a large team years to build, and often Lex/Yacc or Bison that’s LALR(*) is used for parsing. Recently, ANTLR that’s LL(*) is the parser of choice for many Java developers. These parsers quickly become complex and inscrutable.

As a startup, we develop at high speed with a lean team, and for we really liked Parser Combinators — since they’re simple, intuitive and fast to develop. We didn’t find comprehensive examples of Parser Combinators on the internet, so we’re adding one that introduces them and addresses practical issues when developing them.

Full code on Github is here. We’ll cover the following topics in the rest of the blog:

  • What are Scala Parser Combinators?
  • Solving Left Recursion with Packrat Parser Combinators
  • Solving Operator Precedence
  • Debugging

Let’s start with a motivating example where we’ll write a parser that can parse this simple function in pseudo-code, generating the Lexical Tokens and Abstract Syntax Tree (AST) shown below:

<p> CODE:</p> https://gist.github.com/raj-bains-sdl/c710a24a96c38a5f037b9c6968bf7636.js<p></p>

def foo(a: Int, b: Int, c: Int) : Int = {  let d = a + b * c;  let e = a + b + d;  return e;}

Initial code to be parsed

This requires a Lexer (also based on ParserCombinators) that generates these tokens:

List(  DEF(), IDENTIFIER("foo"), OPEN_PAREN(),    IDENTIFIER("a"), COLON(), INTEGER(), COMMA(),    IDENTIFIER("b"), COLON(), INTEGER(), COMMA(),    IDENTIFIER("c"), COLON(), INTEGER(),   CLOSE_PAREN(), COLON(),INTEGER(), EQUAL(),  OPEN_CURLY(),    LET(), IDENTIFIER("d"), EQUAL(), IDENTIFIER("a"), BIN_OP2("+"), IDENTIFIER("b"), BIN_OP1("*"), IDENTIFIER("c"), SEMI(),    LET(), IDENTIFIER("e"), EQUAL(), IDENTIFIER("a"), BIN_OP2("+"), IDENTIFIER("b"), BIN_OP2("+"), IDENTIFIER("d"), SEMI(),    RETURN(), IDENTIFIER("e"), SEMI(),  CLOSE_CURLY())

Tokens from Lexical Analysis

and a Parser that consumes the tokens and produces the AST:

FunctionDefinition(    "foo",    List(DeclarationTerm("a", "Integer"), DeclarationTerm("b", "Integer"), DeclarationTerm("c", "Integer")),    List(      AssignStatement(        VariableTerm("d"),        BinOpExpression(          "+",          SimpleExpression(None, Some(VariableTerm("a"))),          BinOpExpression(            "*",            SimpleExpression(None, Some(VariableTerm("b"))),            SimpleExpression(None, Some(VariableTerm("c")))          )        )      ),      AssignStatement(        VariableTerm("e"),        BinOpExpression(          "+",          BinOpExpression(            "+",            SimpleExpression(None, Some(VariableTerm("a"))),            SimpleExpression(None, Some(VariableTerm("b")))          ),          SimpleExpression(None, Some(VariableTerm("d")))        )      ),      ReturnStatement(SimpleExpression(None, Some(VariableTerm("e"))))    )  )

What are Parser Combinators?

Parser combinators allow you to create tiny parsers and combine them to build very powerful parsers. Let’s say we want to use regular expressions to create a tokenizer. We can write one large expression and soon that’ll get complicated to comprehend. With parser combinators, each different token can have a regular expression parser that can be composed into one parser for the program.

Lets do this by example — we’ll start by trying to write a simple Lexer that reads the code string and breaks it down into tokens based on regular expressions:

class Lexer extends RegexParsers {  ...    def tokens: Parser[List[Token]] = {    phrase(rep1(literal | identifier | singleLineComment | multiLineComment)) ^^ { rawTokens =>       rawTokens.filter(_ != COMMENT())    }  }  def identifier: Parser[IDENTIFIER] = positioned { "[a-zA-Z_][a-zA-Z0-9_]*".r ^^ { str => IDENTIFIER(str) } }  def literal   : Parser[LITERAL]    = positioned { """[0-9]+""".r             ^^ { dub ⇒ INT_LITERAL(dub.toInt) } }    def singleLineComment: Parser[COMMENT] = "//" ~ rep(not("\n") ~ ".".r) ^^^ COMMENT()  def multiLineComment:  Parser[COMMENT] = "/*" ~ rep(not("*/") ~ "(?s).".r) ~ "*/" ^^^ COMMENT()}

Lexer based on RegexParser, full code

Here, we note that there are 4 parsers, the identifier parser parses the regular expression, and on successful matching applies the function after ^^ that returns the IDENTIFIER case class (struct in Scala).

Composition of Parsers

Compositions of parsers can be done in a multiple ways. Or Composition as shown above with rep1(literal | identifier) that matches one or more repetitions of two parsers OR’ed together. This means that first the literal parser is applied at the current input position, and on failure the identifier parser is applied, so this gives us ordering of rules.

Sequential composition matches one parser after the other in sequence. So LET() ~ variable ~ EQUAL() ~ expression4 ~ SEMI() matches let token (implicitly converted to let parser), on match tries to match variable parser, and so on…

lazy val statement: PackratParser[Statement] = positioned {    val assignStatement = {      LET() ~ variable ~ EQUAL() ~ expression4 ~ SEMI() ^^ {        case _ ~ (variable: VariableTerm) ~ _ ~ (expression: Expression) ~ _ ⇒          if (debugMatch) logger.info(s"PARSE: statement: Assign $variable = $expression")          AssignStatement(variable, expression)      }    }    val returnStatement = {      RETURN() ~ expression4 ~ SEMI() ^^ {        case _ ~ (expression: Expression) ~ _ ⇒          if (debugMatch) logger.info(s"PARSE: statement: Return $expression")          ReturnStatement(expression)      }    }    dbg(assignStatement)(name = "assignStatement") |      dbg(returnStatement)(name = "returnStatement")  }

Sequential composition of parsers

Solve Left recursion with Packrat Parsers

Recursive descent parsers including Parser combinators do not support left recursive grammars. This is often solved by modifying grammars to remove left recursion which is a painful process.

Therefore if you write a rule such as:

expression = expression ~ operator ~ expression ^^ {...}

you’ll get a stack overflow exception since you’re just recursing back into the same rule. However, Packrat parsers use lazy evaluation inbuilt in Scala to delay the evaluation till the result is required, allowing us to write such left recursive constructs.

Performance of Packrat Parsers

Lazy evaluation also provides memoization that makes sure that the matched rules are not applied again (by different paths), leading to linear time parsing at the expense of some memory. This also gives us infinite lookahead and supports a really large class of grammars. For the people who’re interested in knowing more, see Ford’s paper of Packrat parsing and Manohar’s implementation in Scala.

Operator Precendence

Operator precedence does not often lend itself to a clean solution. If you implement the parser simply you’d do it like thus:

// Lexer gives one kind of token for all binary operatorsdef operator: Parser[BIN_OP] = positioned {   val op1 = """(\*|%|/)""".r           ^^ { x ⇒ BIN_OP(x) }   val op2 = """(\+|>>|<<)""".r  ="" ^^="" {="" x="" ⇒="" bin_op(x)="" }=""  val="" op3="" "(="">=|<=|<|>|!=|==)""".r  ^^ { x ⇒ BIN_OP(x) }  val op4 = """(\|\||&&|and|or)""".r   ^^ { x ⇒ BIN_OP(x) }  op1 | op2 | op3 | op4}// A parser then parses all binary expression along with other expressionslazy val binOpExpression: PackratParser[Expression] = {  expression ~ operator ~ expression ^^ {    case (expr1: Expression) ~ BIN_OP(op) ~ (expr2: Expression) ⇒ BinOpExpression(op, expr1, expr2)  }}</=|<|></)""".r>

Parsing without precendence

You run this and quickly find the problem with precedence: the “+” incorrectly binds tighter than the “*”.

// Parsing: a + b * c;BinOpExpression(  "*",  BinOpExpression(    "+",    SimpleExpression(None, Some(VariableTerm("a"))),    SimpleExpression(None, Some(VariableTerm("b")))  )  SimpleExpression(None, Some(VariableTerm("c"))),)

AST with incorrect precendence

In recursive descent parsers, the standard solution to this is shown to be a scheme like this:

expression = ["+"|"-"] term {("+"|"-") term} .

term = factor {("*"|"/") factor} .

factor =
    ident
    | number
    | "(" expression ")"

This only works for a very simple case, but when you look at a real operator precedence such as the one for Java you see that there are 16 levels of unary and binary expressions that are to be interspersed with all the other expressions (eg. if-else, function calls) at each level.

For this blog, the parser we added shows 4 levels. We broke the overall structure of parsers using class inheritance:

/** Lexer split binary operator into multiple levels */class Lexer extends RegexParsers {  ...  def operator1: Parser[BIN_OP1]     = positioned { """(\*|%|/)""".r           ^^ { x ⇒ BIN_OP1(x) } }  def operator2: Parser[BIN_OP2]     = positioned { """(\+|>>|<<)""".r  ="" ^^="" {="" x="" ⇒="" bin_op2(x)="" }=""  def="" operator3:="" parser[bin_op3]="" """(="">=|<=|<|>|!=|==)""".r  ^^ { x ⇒ BIN_OP3(x) } }  def operator4: Parser[BIN_OP4]     = positioned { """(\|\||&&|and|or)""".r   ^^ { x ⇒ BIN_OP4(x) } }  ...}/** Overall parser structure */class ParserBase extends Parsers with PackratParsers {   // simple or low level parsers}class ParserExpression extends ParserBase {  // Expression Handling}class ParserTop extends ParserExpression {  // Statements, Functions - high level parsers}</=|<|></)""".r>

Building Precedence into Lexer and Parser structure

Now, we create multiple levels of expression parsers, each overriding binary operator parser per precedence level.

/***  * Reduced code snippet  * full code: https://github.com/SimpleDataLabsInc/parser_combinator/blob/master/src/main/scala/parser_combinator/compile/ParserExpression.scala  */class ParserExpression extends ParserBase {  // Internal class Expressions handles all expressions, including binOpExpression    private class Expressions(val previousExpressions: Option[PackratParser[Expression]]) {       lazy val expression = generateParser    lazy val ifElseExpression = { /* body here */ }    lazy val binOpExpression = positioned {      expression ~ operator4 ~ expression3 ^^ {        case (expr1: Expression) ~ BIN_OP4(op) ~ (expr2: Expression) ⇒           BinOpExpression(op, expr1, expr2)      }    }  }    // Then we create overrides of the expression parser, each with new binOpExpression    lazy val expression4 = new Expressions(Some(expression3)).expression  private lazy val expression3 = new Expressions(Some(expression2)) {    override lazy val binOpExpression = {      expression ~ operator3 ~ expression2 ^^ {        case (expr1: Expression) ~ BIN_OP3(op) ~ (expr2: Expression) ⇒          BinOpExpression(op, expr1, expr2)      }    }  }.expression  private lazy val expression2 = new Expressions(Some(expression1)) {    override lazy val binOpExpression = { /* body here */ }  }.expression  private lazy val expression1: PackratParser[Expression] = new Expressions(None) {    override lazy val binOpExpression: PackratParser[BinOpExpression] = { /* body here */ }  }.expression}

Precedence with parsers

Debugging Parsers

One thing that doesn’t get talked about much is debugging of parsers. Sometimes a small typo might lead to a rule not being applied where you thought it should’ve been. Here is the debug output, that shows the rules being tried, which ones failed and which ones matched.

/*DEBUG OUTPUTtrying assignStatement at scala.util.parsing.combinator.PackratParsers$PackratReader$$anon$3@3234e239assignStatement --> [5.3] failure: 'LET()' expected but RETURN() found  return e;  ^trying returnStatement at scala.util.parsing.combinator.PackratParsers$PackratReader$$anon$3@3234e239returnStatement --> [6.1] parsed: ReturnStatement(SimpleExpression(None,Some(VariableTerm(e))))*/// whenever a rule is matched, we log it if debugMatch is turned on lazy val statement: PackratParser[Statement] = positioned {  val assignStatement = {    LET() ~ variable ~ EQUAL() ~ expression4 ~ SEMI() ^^ {      case _ ~ (variable: VariableTerm) ~ _ ~ (expression: Expression) ~ _ ⇒        if (debugMatch) logger.info(s"PARSE: statement: Assign $variable = $expression")        AssignStatement(variable, expression)    }  }  val returnStatement = {    RETURN() ~ expression4 ~ SEMI() ^^ {      case _ ~ (expression: Expression) ~ _ ⇒        if (debugMatch) logger.info(s"PARSE: statement: Return $expression")        ReturnStatement(expression)    }  }    // we wrap parsers in dgb function, with name for each parser    dbg(assignStatement)(name = "assignStatement") |    dbg(returnStatement)(name = "returnStatement")}// debug function wraps the rules to debug into log functiondef dbg[T](p: => Parser[T])(name: String): Parser[T] = {  if (!debugRules) p else  name match {    case "assignStatement"    ⇒ log(p)(name)    case "returnStatement"    ⇒ log(p)(name)    case _ ⇒ p  }}// log function in  parser combinator library prints all attempts of the ruledef log[T](p: => Parser[T])(name: String): Parser[T] = Parser{ in =>  println("trying "+ name +" at "+ in)  val r = p(in)  println(name +" --> "+ r)  r}

Debugging parsers

Finally

We’ve shown how to very quickly develop parsers for real world DSLs using parser combinators. If you’re using these in the field and want to exchange notes, please reach out to us!