An introduction to Parsing Expression Grammars with LPeg
What is a PEG
A PEG, or Parsing Expression Grammar, is a way of describing a language (or pattern) for string matching. Unlike regular expressions, a PEG can parse an entire language, including recursive structures. A PEG is most similar in classification to context free grammars, which are implemented by tools like Yacc or Bison.
A grammar is a specification for a language. In practice we use a grammar to convert an input string of some language into objects in memory that we can manipulate.
PEGs have a very different implementation than context free grammars (CFGs). Instead of being reduced to a state machine, PEGs employ sequential parsing. This means that the order in which you write your parsing rules matters. I'll provide an example below. They can be slower than CFGs, but in practice they are quite fast. A PEG is conceptually very similar to a common hand-written parser pattern called recursive descent.
PEGs can be easier to write. You typically don’t need to write a separate scanner: your grammar works directly with the input text without a tokenization step.
Don’t worry if none of this makes sense, you'll get a feel for how things work throughout this guide.
What is LPeg
My first introduction to PEGs was with LPeg, a Lua implementation of the PEG algorithm. LPeg grammars are specified directly in Lua code. This is different than most other parsing tools which employ the compiler compiler pattern.
In the compiler compiler pattern you write a grammar in a custom domain specific language, then compile it to your target language. With LPeg you just write Lua. Each parsing unit (or pattern object) is a first class object in the language. It can be combined with all of Lua’s built in operators and control statements. This makes for a very powerful way to express grammars.
MoonScript’s grammar is an example of a larger LPeg grammar that parses an entire programming language.
Installing LPeg
You can install LPeg from luarocks.org:
luarocks install lpeg
Once installed you can require
the module:
local lpeg = require("lpeg")
Some simple grammars
LPeg works by providing you with a series of single and double letter named functions that convert Lua literals into pattern objects. Pattern objects can be combined to make more complex patterns, or invoked against a string to check for a match. The operators for pattern objects are overloaded to provide different ways to combine them.
For the sake of bevity, we'll assume lpeg
is imported in all the code examples:
local lpeg = require("lpeg")
I'll try to explain everything used for the examples in this guide, but for a full understanding of all the built in functions I recommend reading the LPeg documentation.
String equality
The simplest pattern we can make is one that checks if a string is equal to another string:
lpeg.P("hello"):match("world") --> no match, returns nil
lpeg.P("hello"):match("hello") --> a match, returns 6
lpeg.P("hello"):match("helloworld") --> a match, returns 6
By default, when there is a match, LPeg will return the number of characters consumed. This is fine if you just need to check if the match took place, but if you're trying to parse structure out of a string you'll have to use some of LPeg’s capturing functions.
A match will still succeed even if it doesn’t get to the end of the string. You can use
-1
to avoid this. I'll describe it below.
Combining patterns
The multiplication and addition operators are the two most commonly used overloaded operators for combining patterns.
- Multiplication can be though of as and, the left operand must match, then the right operand must match
- Addition can be though of as or, either the left operand matches, or the right operand must match
Both of these operators suggest order. The left operand is always checked before the right. Here are some examples:
local hello = lpeg.P("hello")
local world = lpeg.P("world")
local patt1 = hello * world
local patt2 = hello + world
-- hello followed by world
patt1:match("helloworld") --> matches
patt1:match("worldhello") --> doesn't match
-- either hello or world
patt2:match("hello") --> matches
patt2:match("world") --> matches
Regular Lua operator precedence rules apply to these operators, so you'll have to use parentheses when necessary.
Parsing numbers
With the basics down we can now write a grammar that does something. Let’s write a grammar that extracts all the integers out of an arbitrary string.
The algorithm will work as follows:
- For each character…
- If it’s a decimal character, start capturing
- Consume each char while it’s a decimal character
- Otherwise skip to next char
- If it’s a decimal character, start capturing
The way I like to approach writing a parser in LPeg is to write the most specific patterns first. Using those patterns as building blocks to assemble the final result. Luckily each pattern is a first class object in Lua with LPeg, so it’s very each to test each component in isolation.
First we'll write a pattern that parses an integer:
local integer = lpeg.R("09")^1
This pattern will match any character between 0
and 9
1 or more times. All
patterns in LPeg are greedy.
We want the value of the number as the return value, not the character offset
where the match ended. We can use the /
operator to apply a capture and
transformation function at once:
local integer = lpeg.R("09")^1 / tonumber
print(integer:match("2923")) --> The number 2923
This works by capturing the result of the pattern, "2923"
as a string, and
passing it to the Lua function tonumber
. The return value of match
is a
regular number value parsed from the string.
If a capture is used then the default return value is replaced with the values of the captures when calling
match
Now we write a parser that matches either an integer or some other character
local integer_or_char = integer + lpeg.P(1)
When using LPeg’s operator overloading, it will automatically cast any Lua literals into patterns by passing them to P. In the above example we could have just written
1
instead oflpeg.P(1)
The order is important here: had we written it the other way around,
lpeg.P(1)
would match any character, including integer characters. This would
prevent us from ever capturing an integer.
To complete our grammar we just need to repeat what we have, and store the
results of the captures into a table using Ct
:
local extract_ints = lpeg.Ct(integer_or_char^0)
Here’s the entire grammar and some examples if it in action:
local integer = lpeg.R("09")^1 / tonumber
local integer_or_char = integer + lpeg.P(1)
local extract_ints = lpeg.Ct(integer_or_char^0)
-- Testing it out:
extract_ints:match("hello!") --> {}
extract_ints:match("hello 123") --> {123}
extract_ints:match("5 5 5 yeah 7 7 7") --> {5,5,5,7,7,7}
A calculator expression parser
Next we're going to build a calculator expression parser. I put emphasis on parser because instead of evaluating the expression we'll build a syntax tree of the parsed expression.
If you're ever going to build a programming language you'll almost always parse to a syntax tree. Targeting a syntax tree for this calculator example is a good exercise.
Before writing any code we should define the language that can be parsed. It should parse integers, addition, subtraction, multiplication, and division. It should be aware of operator precedence. It should allow arbitrary whitespace between operators and numbers.
Here are some sample inputs (separated by newlines):
1*2
1 + 2
5 + 5/2
1*2 + 3
1 * 2 - 3 + 3 + 2
Next we design the format of the syntax tree: how can we map these parsed expressions into a Lua friendly representation?
For regular integers, we can map directly to an integer in Lua. For any binary expressions (addition, division, etc.), we'll use a Lisp style S-Expression arrays, where the first item in the array is the operator as a string, and the second and third items are the left and right hand side.
That’s a mouthful, an example is easier to digest:
parse("5") --> 5
parse("1*2") --> {"*", 1, 2}
parse("9+8") --> {"+", 9, 8}
parse("3*2+8") --> {"+", {"*", 3, 2}, 8}
The parse
function above will be the grammar we create.
With the specification out of the way, we can start writing the parser. Like before, we start as specific as possible:
local lpeg = require("lpeg")
local white = lpeg.S(" \t\r\n") ^ 0
local integer = white * lpeg.R("09") ^ 1 / tonumber
local muldiv = white * lpeg.C(lpeg.S("/*"))
local addsub = white * lpeg.C(lpeg.S("+-"))
In order to allow arbitrary whitespace I've made a whitespace pattern object,
white
, that I append to the front of all other pattern objects. Using this
approach, I can use pattern objects freely without having to think about if
whitespace has been handled.
The integer
pattern is reused from above, and the patterns matching the
operators are straight forward. I've grouped the operators into two different
patterns based on their precedence.
Before trying to write the entire grammar let’s focus on getting a single component working. I've opted to write parsing for integer or multiplication/division.
Program language creators will typically call the multiplication precedence level factor and the addition precedence level term. We'll use that terminology here.
local factor = integer * muldiv * integer + integer
factor:match("5") --> 5
factor:match("2*1") --> 2 "*" 1
What we have above works for multiplication, but there’s a problem. The capture (the return value in this case) of the pattern is wrong. It returns multiple values in the order: left, operator, right. What we want is a table, where the first item is the operator.
In order to fix this we'll create a transform function, the node constructor:
local function node(p)
return p / function(left, op, right)
return { op, left, right }
end
end
local factor = node(integer * muldiv * integer) + integer
factor:match("5") --> 5
factor:match("2*1") --> {"*", 2, 1}
Looks good, now we can start building the rest of the grammar using the node constructor.
Because we're building a recursive grammar, we'll use the grammar form of
lpeg.P
. Let’s re-write the above code in the grammar form:
local calculator = lpeg.P({
"exp",
exp = lpeg.V("factor") + integer,
factor = node(integer * muldiv * integer)
})
There are a couple new things here. The first item in the grammar’s table,
"exp"
, is the root pattern for our grammar. This means that the pattern named
exp
will be executed first.
lpeg.V
is used to reference non-terminals in our grammar. This is how we'll
make things recursive, by referencing patterns that aren’t declared yet. This
particular grammar isn’t recursive, but it still demonstrates how V
can be
used.
In PEGs we can’t use any recursion that would result in the parser entering an infinite loop. We need to be clever about how we structure our patterns in order to achieve the precedence we want.
Since the factor, multiplication and division, has the highest precedence, it should be deepest in our pattern hierarchy.
Let’s redesign our factor parser to handle the case where we have repeated multiplication/division:
local calculator = lpeg.P({
"exp",
exp = lpeg.V("factor") + integer,
factor = node(integer * muldiv * (lpeg.V("factor") + integer))
})
Right recursion is used to allow any number of multiplications to be chained. We can chain operators of the same precedence without any issues. This can now parse:
calculator:match("5*3*2") --> {"*", 5, {"*", 3, 2}}
We work our way to lower precedence until we get to the top of our grammar. Next is parsing the term:
local calculator = lpeg.P({
"exp",
exp = lpeg.V("term") + lpeg.V("factor") + integer,
term = node((lpeg.V("factor") + integer) * addsub * lpeg.V("exp")),
factor = node(integer * muldiv * (lpeg.V("factor") + integer))
})
Writing the term pattern is just a matter of considering on the scenarios that can happen. The left hand side can either be a higher precedence factor, or an integer. The right hand side can either be the same precedence term, or a higher precedence factor, or integer. (Note that these are listed in the order of their precedence.)
We were able to re-use the exp
pattern as the left hand side of the term
pattern, since it happened to match all the things we want there.
The final grammar, with the node
constructor:
local lpeg = require("lpeg")
local white = lpeg.S(" \t\r\n") ^ 0
local integer = white * lpeg.R("09") ^ 1 / tonumber
local muldiv = white * lpeg.C(lpeg.S("/*"))
local addsub = white * lpeg.C(lpeg.S("+-"))
local function node(p)
return p / function(left, op, right)
return { op, left, right }
end
end
local calculator = lpeg.P({
"input",
input = lpeg.V("exp") * -1,
exp = lpeg.V("term") + lpeg.V("factor") + integer,
term = node((lpeg.V("factor") + integer) * addsub * lpeg.V("exp")),
factor = node(integer * muldiv * (lpeg.V("factor") + integer))
})
I've added a
line
pattern that checks to make sure the parser got to the end of the input.
Closing
That’s it for this guide. Hopefully it’s enough get you started using LPeg in your projects. Grammars written in LPeg make a good alternative to Lua patterns or regular expressions as they are easier to read, debug, and test. Additionally they're powerful enough to implement parsers for entire programming languages!
In the future I hope to write more guides covering more advanced techniques that I picked up while implementing MoonScript.
Related projects
A programming language that compiles to Lua, inspired by CoffeeScript and written in MoonScript.