An introduction to Parsing Expression Grammars with LPeg

What is a PEG

A PEG, or Parsing Expression Grammar, is a way of describing a language (or pattern) for string matching. Unlike regular expressions, a PEG can parse an entire language, including recursive structures. A PEG is most similar in classification to context free grammars, which are implemented by tools like Yacc or Bison.

A grammar is a specification for a language. In practice we use a grammar to convert an input string of some language into objects in memory that we can manipulate.

PEGs have a very different implementation than context free grammars (CFGs). Instead of being reduced to a state machine, PEGs employ sequential parsing. This means that the order in which you write your parsing rules matters. I'll provide an example below. They can be slower than CFGs, but in practice they are quite fast. A PEG is conceptually very similar to a common hand-written parser pattern called recursive descent.

PEGs can be easier to write. You typically don’t need to write a separate scanner: your grammar works directly with the input text without a tokenization step.

Don’t worry if none of this makes sense, you'll get a feel for how things work throughout this guide.

What is LPeg

My first introduction to PEGs was with LPeg, a Lua implementation of the PEG algorithm. LPeg grammars are specified directly in Lua code. This is different than most other parsing tools which employ the compiler compiler pattern.

In the compiler compiler pattern you write a grammar in a custom domain specific language, then compile it to your target language. With LPeg you just write Lua. Each parsing unit (or pattern object) is a first class object in the language. It can be combined with all of Lua’s built in operators and control statements. This makes for a very powerful way to express grammars.

MoonScript’s grammar is an example of a larger LPeg grammar that parses an entire programming language.

Installing LPeg

You can install LPeg from luarocks.org:

luarocks install lpeg

Once installed you can require the module:

local lpeg = require("lpeg")

Some simple grammars

LPeg works by providing you with a series of single and double letter named functions that convert Lua literals into pattern objects. Pattern objects can be combined to make more complex patterns, or invoked against a string to check for a match. The operators for pattern objects are overloaded to provide different ways to combine them.

For the sake of bevity, we'll assume lpeg is imported in all the code examples:

local lpeg = require("lpeg")

I'll try to explain everything used for the examples in this guide, but for a full understanding of all the built in functions I recommend reading the LPeg documentation.

String equality

The simplest pattern we can make is one that checks if a string is equal to another string:

lpeg.P("hello"):match("world") --> no match, returns nil
lpeg.P("hello"):match("hello") --> a match, returns 6
lpeg.P("hello"):match("helloworld") --> a match, returns 6

By default, when there is a match, LPeg will return the number of characters consumed. This is fine if you just need to check if the match took place, but if you're trying to parse structure out of a string you'll have to use some of LPeg’s capturing functions.

A match will still succeed even if it doesn’t get to the end of the string. You can use -1 to avoid this. I'll describe it below.

Combining patterns

The multiplication and addition operators are the two most commonly used overloaded operators for combining patterns.

  • Multiplication can be though of as and, the left operand must match, then the right operand must match
  • Addition can be though of as or, either the left operand matches, or the right operand must match

Both of these operators suggest order. The left operand is always checked before the right. Here are some examples:

local hello = lpeg.P("hello")
local world = lpeg.P("world")

local patt1 = hello * world
local patt2 = hello + world


-- hello followed by world
patt1:match("helloworld") --> matches
patt1:match("worldhello") --> doesn't match

-- either hello or world
patt2:match("hello") --> matches
patt2:match("world") --> matches

Regular Lua operator precedence rules apply to these operators, so you'll have to use parentheses when necessary.

Parsing numbers

With the basics down we can now write a grammar that does something. Let’s write a grammar that extracts all the integers out of an arbitrary string.

The algorithm will work as follows:

  • For each character…
    • If it’s a decimal character, start capturing
      • Consume each char while it’s a decimal character
    • Otherwise skip to next char

The way I like to approach writing a parser in LPeg is to write the most specific patterns first. Using those patterns as building blocks to assemble the final result. Luckily each pattern is a first class object in Lua with LPeg, so it’s very each to test each component in isolation.

First we'll write a pattern that parses an integer:

local integer = lpeg.R("09")^1

This pattern will match any character between 0 and 9 1 or more times. All patterns in LPeg are greedy.

We want the value of the number as the return value, not the character offset where the match ended. We can use the / operator to apply a capture and transformation function at once:

local integer = lpeg.R("09")^1 / tonumber

print(integer:match("2923")) --> The number 2923

This works by capturing the result of the pattern, "2923" as a string, and passing it to the Lua function tonumber. The return value of match is a regular number value parsed from the string.

If a capture is used then the default return value is replaced with the values of the captures when calling match

Now we write a parser that matches either an integer or some other character

local integer_or_char = integer + lpeg.P(1)

When using LPeg’s operator overloading, it will automatically cast any Lua literals into patterns by passing them to P. In the above example we could have just written 1 instead of lpeg.P(1)

The order is important here: had we written it the other way around, lpeg.P(1) would match any character, including integer characters. This would prevent us from ever capturing an integer.

To complete our grammar we just need to repeat what we have, and store the results of the captures into a table using Ct:

local extract_ints = lpeg.Ct(integer_or_char^0)

Here’s the entire grammar and some examples if it in action:

local integer = lpeg.R("09")^1 / tonumber
local integer_or_char = integer + lpeg.P(1)
local extract_ints = lpeg.Ct(integer_or_char^0)

-- Testing it out:

extract_ints:match("hello!") --> {}
extract_ints:match("hello 123") --> {123}
extract_ints:match("5 5 5 yeah 7 7 7") --> {5,5,5,7,7,7}

A calculator expression parser

Next we're going to build a calculator expression parser. I put emphasis on parser because instead of evaluating the expression we'll build a syntax tree of the parsed expression.

If you're ever going to build a programming language you'll almost always parse to a syntax tree. Targeting a syntax tree for this calculator example is a good exercise.

Before writing any code we should define the language that can be parsed. It should parse integers, addition, subtraction, multiplication, and division. It should be aware of operator precedence. It should allow arbitrary whitespace between operators and numbers.

Here are some sample inputs (separated by newlines):

1*2
1 + 2
5 + 5/2
1*2 + 3
1 * 2 - 3 + 3 + 2

Next we design the format of the syntax tree: how can we map these parsed expressions into a Lua friendly representation?

For regular integers, we can map directly to an integer in Lua. For any binary expressions (addition, division, etc.), we'll use a Lisp style S-Expression arrays, where the first item in the array is the operator as a string, and the second and third items are the left and right hand side.

That’s a mouthful, an example is easier to digest:

parse("5") --> 5
parse("1*2") --> {"*", 1, 2}
parse("9+8") --> {"+", 9, 8}
parse("3*2+8") --> {"+", {"*", 3, 2}, 8}

The parse function above will be the grammar we create.

With the specification out of the way, we can start writing the parser. Like before, we start as specific as possible:

local lpeg = require("lpeg")
local white = lpeg.S(" \t\r\n") ^ 0

local integer = white * lpeg.R("09") ^ 1 / tonumber
local muldiv = white * lpeg.C(lpeg.S("/*"))
local addsub = white * lpeg.C(lpeg.S("+-"))

In order to allow arbitrary whitespace I've made a whitespace pattern object, white, that I append to the front of all other pattern objects. Using this approach, I can use pattern objects freely without having to think about if whitespace has been handled.

The integer pattern is reused from above, and the patterns matching the operators are straight forward. I've grouped the operators into two different patterns based on their precedence.

Before trying to write the entire grammar let’s focus on getting a single component working. I've opted to write parsing for integer or multiplication/division.

Program language creators will typically call the multiplication precedence level factor and the addition precedence level term. We'll use that terminology here.

local factor = integer * muldiv * integer + integer

factor:match("5") --> 5
factor:match("2*1") --> 2 "*" 1

What we have above works for multiplication, but there’s a problem. The capture (the return value in this case) of the pattern is wrong. It returns multiple values in the order: left, operator, right. What we want is a table, where the first item is the operator.

In order to fix this we'll create a transform function, the node constructor:

local function node(p)
  return p / function(left, op, right)
    return { op, left, right }
  end
end

local factor = node(integer * muldiv * integer) + integer

factor:match("5") --> 5
factor:match("2*1") --> {"*", 2, 1}

Looks good, now we can start building the rest of the grammar using the node constructor.

Because we're building a recursive grammar, we'll use the grammar form of lpeg.P. Let’s re-write the above code in the grammar form:

local calculator = lpeg.P({
  "exp",
  exp = lpeg.V("factor") + integer,
  factor = node(integer * muldiv * integer)
})

There are a couple new things here. The first item in the grammar’s table, "exp", is the root pattern for our grammar. This means that the pattern named exp will be executed first.

lpeg.V is used to reference non-terminals in our grammar. This is how we'll make things recursive, by referencing patterns that aren’t declared yet. This particular grammar isn’t recursive, but it still demonstrates how V can be used.

In PEGs we can’t use any recursion that would result in the parser entering an infinite loop. We need to be clever about how we structure our patterns in order to achieve the precedence we want.

Since the factor, multiplication and division, has the highest precedence, it should be deepest in our pattern hierarchy.

Let’s redesign our factor parser to handle the case where we have repeated multiplication/division:

local calculator = lpeg.P({
  "exp",
  exp = lpeg.V("factor") + integer,
  factor = node(integer * muldiv * (lpeg.V("factor") + integer))
})

Right recursion is used to allow any number of multiplications to be chained. We can chain operators of the same precedence without any issues. This can now parse:

calculator:match("5*3*2") --> {"*", 5, {"*", 3, 2}}

We work our way to lower precedence until we get to the top of our grammar. Next is parsing the term:

local calculator = lpeg.P({
  "exp",
  exp = lpeg.V("term") + lpeg.V("factor") + integer,
  term = node((lpeg.V("factor") + integer) * addsub * lpeg.V("exp")),
  factor = node(integer * muldiv * (lpeg.V("factor") + integer))
})

Writing the term pattern is just a matter of considering on the scenarios that can happen. The left hand side can either be a higher precedence factor, or an integer. The right hand side can either be the same precedence term, or a higher precedence factor, or integer. (Note that these are listed in the order of their precedence.)

We were able to re-use the exp pattern as the left hand side of the term pattern, since it happened to match all the things we want there.

The final grammar, with the node constructor:

local lpeg = require("lpeg")
local white = lpeg.S(" \t\r\n") ^ 0

local integer = white * lpeg.R("09") ^ 1 / tonumber
local muldiv = white * lpeg.C(lpeg.S("/*"))
local addsub = white * lpeg.C(lpeg.S("+-"))

local function node(p)
  return p / function(left, op, right)
    return { op, left, right }
  end
end

local calculator = lpeg.P({
  "input",
  input = lpeg.V("exp") * -1,
  exp = lpeg.V("term") + lpeg.V("factor") + integer,
  term = node((lpeg.V("factor") + integer) * addsub * lpeg.V("exp")),
  factor = node(integer * muldiv * (lpeg.V("factor") + integer))
})

I've added a line pattern that checks to make sure the parser got to the end of the input.

Closing

That’s it for this guide. Hopefully it’s enough get you started using LPeg in your projects. Grammars written in LPeg make a good alternative to Lua patterns or regular expressions as they are easier to read, debug, and test. Additionally they're powerful enough to implement parsers for entire programming languages!

In the future I hope to write more guides covering more advanced techniques that I picked up while implementing MoonScript.

Related projects

A programming language that compiles to Lua, inspired by CoffeeScript and written in MoonScript.