It is time for another diversion, but it is leading towards the language design – honest.
LR grammars are a tool for specifying the grammar for a language, and it is fairly easy to automatically generate a parsing tool from a grammar. So they have often been used for this purpose.
There seems to be some suggestion that LR is no longer the tool of choice (see wikipedia – since redacted), apparently because it is hard to do good error reporting. The gcc compilers converted from an LR grammar to a recursive descent LL grammar, apparently for this reason.
However I like LR grammars and I don’t see the problem. So I plan to start out with an LR grammar approach. Either it will work well, or I will start to see the problem, and either outcome is positive in my view. So I cannot lose.
There are plenty of tools available for converting an LR grammar into code, of which GNU “bison” is probably best known, but I’m going to write my own. Partly this is because it will help me learn, and partly because it will add to the collection of code that I have recent experience with so that when Ocean starts to take shape I will have lots of example code to try translating to it. But mostly because it is fun.
So this post is about LR grammars and building a tool to parse them. You’ve probably read all this before elsewhere, so I suggest you stop reading now and go do something useful like play sudoku. Come back in a few weeks when I will have moved on.
However, it is also possible that you have tried to read about this and, like me, failed to find anything really accessible. There are academic papers that are full of dense mathematics which makes an interesting topic appear unapproachable. And there are lecture notes which seem to skim over the really interesting details. And of course there is wikipedia, which isn’t exactly bad, but isn’t exactly good either. If that is you, then please read on and see if my contribution helps you at all.
LR Grammar Basics
A grammar is a set of rules for mapping between structured data and a linear representation of that data. The grammar can allow you to “produce” a linear form from a structure, or “parse” a linear form to recover the structure.
We use grammar whenever we convert thoughts and idea to sentences and stories. We use the same grammar in reverse when we convert sentences and stories that we hear and read back into thoughts and ideas. Of course the grammar of natural language is very complex and context-dependent, and each individual works with a slightly different version of the grammar. So the conversion of ideas to and from natural language is imperfect and error prone. It is however extremely powerful. It allows me to write this note, and you to get at least some of my ideas while reading it.
For computer languages we use much simpler but much more precise grammars. Noam Chomsky famously did a lot of work in formalising many ideas of grammar and it is him we thank for the “context free” grammars that work so well for programming languages.
A context free grammar (CFG) consists of a set of symbols and a set of productions. The symbols are grouped into two subsets: the terminal symbols and the non-terminal symbols. There is one distinguished non-terminal symbol called the “START” symbol. The grammar describes a “language” which is the set of all possible “sentences” which in turn are lists of terminal symbols. It does this by giving rules for producing parts of sentences. Any sentence which can be produced following these rules is a valid sentence in the language. All other sentences are invalid. To avoid confusion, we normally reserve the word “produce” for when a single symbol produces a series of other symbols. Repeating this multiple times is a “derivation”. A complete derivation starts with the “START” symbol and results in a list of terminal symbols.
Terminal symbols are the elements of a sentence. These could be strokes of a pen, letters, words, or conceptual groups of words such as “noun” and “verb”. For the purposes of computer languages we normally work with “tokens” as terminal symbols where tokens are the things extracted by the scanner described in my previous note.
A grammar takes a sequence of tokens and groups them into semantic units such as expressions and statements. It essentially describes how to extract structure from a linear sequence of tokens.
“Tokens” here are little more than opaque objects represented by numbers. There is often extra detail combined with the number, but the grammar doesn’t care about that. If the grammar needs to know the difference between two things, they need to be different tokens. If it doesn’t then maybe they don’t need to be (though it doesn’t hurt).
The grammar describes how to gather sets of tokens (aka terminal symbols) together as what might be called “virtual tokens”. So the set of tokens “NUMBER PLUS NUMBER” might become a virtual token “EXPRESSION”. It can then be included in some other grouping of tokens. These “virtual tokens” are the things referred to earlier as “non-terminal symbols”.
A grammar uses these terminal and non-terminal symbols in the set of “productions”. Each production declares that some non-terminal symbol may produce a given list of symbols (whether terminal or non-terminal).
This terminology (‘start’ and ‘production’ and ‘terminal symbol’) doesn’t make much sense if you want to parse a program, but it does if you want to generate a program, or “sentence” as described earlier. From the perspective of generation, a symbol is “terminal” if it cannot generate anything more, and non-terminal if it can.
When we talk about parsing, we use the same terminology because that is what everyone else does. I would prefer “real tokens” for terminal symbols, “virtual tokens” for non-terminal symbols, “reductions” for productions and “goal token” for start symbol. But you gotta work with what you have.
LR parsing basics.
The process described above for generating sentences can process symbols in any order. Once you have more than one non-terminal you have a choice. You can produce something from the first, or from the second. So you could build up a sentence from the beginning, or from the end. When parsing you need to be more focused.
The “L” in “LR” means that parsing progresses from Left to right. You certainly could parse in some different direction, but the language you design for that sort of parsing would end up being very different. An “LR” grammar converts the leftmost groups of terminals into fewer non-terminals first and then progresses to include more and more terminals moving to the right.
We will get to the “R” shortly. I suspect that originally “LR” meant “Left to Right”. But there are two distinct ways to parse from Left to Right: one is recursive descent, which requires a different sort of grammar to that described here – an LL grammar. For LR grammars you use a bottom-up shift/reduce parser.
This shift/reduce parser was invented by Donald Knuth and involves a state machine – with a bunch of states – a stack (of symbols and states) and two actions named “SHIFT” and “REDUCE”. One particular instance of “REDUCE” – the one that involves the START symbol – is sometimes referred to as “ACCEPT”. I see no value in making such a strong special case of that, so I keep just “SHIFT” and “REDUCE”.
If we think back to the previous description of deriving a sentence from the start symbol using the grammar, and consider the intermediate stages between “START” and the final sentence, then those lists of symbols (sometimes called “sentential forms”) are represented in the parser by the stack combined with the remainder of the input. The symbols from the bottom of the stack through to the top of the stack, followed by the “next” symbol to read from input through to the last symbol in the input correspond exactly to the sequence of symbols in some intermediate step when deriving the program.
The “production” step in sentence generation is exactly mirrored in the “REDUCE” step in parsing. Specifically if the top symbols on the stack match the symbols in the body of some production, then we might pop them off and replace them with the non-terminal which is the ‘head’ of that production.
The “SHIFT” step simply involves moving one terminal symbol from the input onto the stack. Parsing proceeds by shifting terminal symbols onto the stack until we have a set of symbols that can usefully be reduced, then reducing them. This continues until the entire program has been read (or “SHIFTed”). On each “REDUCE” step the parser will record information about the symbols that were popped so as to build up an abstract syntax tree which records the structure of the program. So each entry on the stack holds not just a symbol and a state, but also some record of what was parsed down to that symbol.
At every “REDUCE” step, the non-terminal that results is the right-most non-terminal in the interim sentence (because it is on the top of the stack, and everything further to the right is terminals still to be read). So if the series of steps that result from LR parsing were played in reverse as a derivation, each step would involve producing something from the right-most non-terminal. This is where the “R” in “LR” comes from. The parse is the reverse of a “Rightmost derivation” from the grammar. This contrasts with “LL” grammars where the parse follows a “Leftmost derivation”.
There is an important word above that needs special attention. It is “usefully” as in “until we have a set of symbols that can usefully be reduced”. It is important that after we perform the REDUCE step, the resulting interim sentence can still be derived in the language. So we always need to know what symbols are permitted at each stage in the parse. This is what the “state” of the state machine provides us with.
A “state” really belongs between two symbols. The “current” state is between the symbol on the top of the stack, and the next symbol to be read from the input. When we push a symbol onto the stack, we push the current state with it, so that when that symbol is later popped, we can collect the state that preceded it too.
LR parsing step by step
Now that we have the big picture, the best way to understand the details is to watch how the parsing works.
The data that we have are:
- The grammar, as a set of productions. Each production has a ‘head’ which is a non-terminal symbol, and a ‘body’ which is zero or more symbols, some may be terminal, others non-terminal. Each non-terminal may be the head of multiple productions, but must be the head of at least one.
- An ‘action’ table which is a mapping from a state plus a lookahead to an action.
The ‘state’ we have already introduced and will be described in more detail shortly. The “lookahead” is the next terminal to be read. As we need to be able to look at the next symbol, the parser is said to have a look-ahead of ‘1’ and is sometimes written “LR(1)”.
The ‘action’ is either “SHIFT”, or “REDUCE(production)”. i.e. the “REDUCE” action must identify which production is being reduced. If the production has “START” as its head, then it is sometimes referred to as “ACCEPT”. Once the ACCEPT action happens we have parsed a complete sentence and must have finished.
- A “go to” table. This is a mapping from state and symbol to a new state. Whenever we push a symbol, whether due to a SHIFT or a REDUCE, we perform a lookup on the current state for the pushed symbol and that gives us the new state.
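To make that concrete, here is a rough sketch in C of how these data might be laid out. The structure and field names are purely illustrative – they are not the ones used in my parser generator.

```
/* Illustrative layout only -- not the representation used in parsergen */
struct production {
	int	head;		/* the non-terminal that is produced */
	int	body_size;	/* number of symbols in the body */
	int	*body;		/* terminal and non-terminal symbol numbers */
};

struct grammar {
	int	num_symbols;	/* terminals and non-terminals together */
	int	num_productions;
	struct production *productions;
};

/* One "go to" entry for a state: on this symbol, move to that state. */
struct goto_entry {
	int	symbol;
	int	state;
};

/* One "action" entry for a state: on this look-ahead terminal,
 * either SHIFT, or REDUCE the given production. */
enum action_type { SHIFT, REDUCE };
struct action {
	int	lookahead;
	enum action_type type;
	int	production;	/* which production, when REDUCing */
};
```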
Given these data the parsing process is quite straightforward.
- Start with an empty stack, with the “0” state current, and with the first symbol from the input in the look-ahead buffer.
- Look in the ‘action’ table for the current state and the symbol in the look-ahead buffer.
- If the action is “SHIFT”, push the token and the current state onto the stack, look up the shifted symbol in the go to table for the current state, and save the result as the current state. Then go back to step 2. Note that we don’t actually need to store the symbol in the stack, but it can be useful for tracing.
- If the action is “REDUCE”, see how many symbols are in the body of the reduced production, and pop that many entries off the stack. The state stored with the last entry popped becomes the current state. The entries popped must match the symbols in the production – the states will guarantee that. This is why we don’t really need to store the symbols.
Having popped those entries, we do any production-specific processing to build an abstract syntax tree and push the result on to the stack together with the current state (which was updated by the ‘pop’).
Finally we look up the current state and the head of the reduced production in the go to table and set the new state. Then if that reduced symbol was ‘start’ we finish, else we go back to step 2.
- But what if the action table doesn’t contain an entry for the current state and the symbol in the look-ahead buffer? That means there is an error. The symbol is “unexpected”. We have to invoke the error handling procedure.
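Before getting to error handling, here is what the loop described by those steps might look like, using the illustrative structures above. The helpers lookup_action(), lookup_goto(), next_token() and build_ast_node() are assumptions standing in for the table searches, the scanner and the tree building – this is a sketch, not parsergen’s actual code.

```
struct frame {
	int	state;		/* the state that was current before this push */
	int	symbol;		/* not strictly needed, but handy for tracing */
	void	*ast;		/* whatever earlier REDUCE steps built */
};

/* Assumed helpers, declared only for illustration */
extern struct action *lookup_action(int state, int lookahead); /* NULL if none */
extern int lookup_goto(int state, int symbol);		/* -1 if no entry */
extern int next_token(void);
extern void *build_ast_node(int production, struct frame *popped);

#define START_SYMBOL 0		/* assumption: symbol 0 is "start" */

int parse(struct grammar *g)
{
	struct frame stack[1000];
	int depth = 0;
	int state = 0;					/* step 1 */
	int look = next_token();

	while (1) {
		struct action *a = lookup_action(state, look);	/* step 2 */
		if (!a)
			return -1;	/* unexpected symbol: see error handling */

		if (a->type == SHIFT) {				/* step 3 */
			stack[depth].state = state;
			stack[depth].symbol = look;
			stack[depth].ast = NULL;
			depth += 1;
			state = lookup_goto(state, look);
			look = next_token();
			continue;
		}

		/* step 4: REDUCE - pop the body, push the head */
		struct production *p = &g->productions[a->production];
		depth -= p->body_size;
		if (p->body_size > 0)
			state = stack[depth].state;
		void *ast = build_ast_node(a->production, &stack[depth]);

		if (p->head == START_SYMBOL)
			return 0;	/* ACCEPT: a complete sentence was parsed */

		stack[depth].state = state;
		stack[depth].symbol = p->head;
		stack[depth].ast = ast;
		depth += 1;
		state = lookup_goto(state, p->head);
	}
}
```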
Error handling in an LR grammar
The easiest way to handle errors is just to throw our hands up in horror, report the error, and abort. But that is not always most useful. The preferred approach is to try to resynchronize with the grammar by throwing away a few symbols. Often the error will be localized and if we can just get past it we can start doing useful parsing again.
Some writers suggest synthesizing symbols as well as discarding them. For example you might guess that a semicolon was missing and so insert it. I don’t believe that idea adds anything useful. As you will see we do synthesise a single symbol which is the “error” token. Various error productions can be used to transform this into any nonterminal which may be needed. So while symbols might be inserted as well as deleted, that is just a side effect of the error productions in the grammar. The parser mainly focusses on discarding unusable symbols.
At the point where we detect an error there are two obvious symbols that could be discarded. We could discard the symbol in the look-ahead buffer, or the symbol on the top of stack. Or both. Or several in each direction. Ultimately we want to get to a situation where the state, possibly popped off the stack, has an action for the symbol in the look-ahead buffer. Choosing what to discard to get there cannot be done reliably without assistance from the language designer.
The language designer should “know” what the key structural elements and syntactic flags are. Some symbols will occur often and have a strong meaning, so aiming for one of those will be most likely to get back on track properly. The language designer can provide this information to the parser by specifying “error productions”.
“Error” is a special terminal symbol that would not normally be found in the input stream. However it can appear in the body of a production and this makes it an “error production”.
A simple example is:
statement -> ERROR ";"
This says that an error followed by a semicolon can reasonably take the place of a statement. i.e. a semicolon is a strong syntactic symbol that nearly always means the end of a statement, even if there is garbage before it.
With a number of these error productions in the grammar, the task of error handling becomes quite easy.
Firstly the parser pops entries off the stack until the current state has an ACTION for ‘ERROR’. That means it is a state which can cope with an error production. In the above example, it is a state which is expecting a statement.
Then the parser shifts in the ERROR symbol (updating the state with the appropriate ‘go to’) and then starts discarding input symbols until it finds one for which the current state has an action. At this point the parser is synchronized again and it can go on to perform the appropriate action as steps 2 and later above.
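As a sketch, the recovery procedure might look like this. ERROR_SYM and EOF_SYM are assumed token numbers, and the frame type and helpers are the ones assumed in the parsing sketch earlier.

```
#define ERROR_SYM 1	/* assumption: the special "error" terminal */
#define EOF_SYM   2	/* assumption: the end-of-file terminal */

int recover(struct frame *stack, int *depth, int *state, int *look)
{
	/* Pop states until we find one that can shift the ERROR symbol,
	 * i.e. one expecting an error production such as
	 *     statement -> ERROR ";"
	 */
	while (lookup_goto(*state, ERROR_SYM) < 0) {
		if (*depth == 0)
			return -1;		/* nowhere left to recover to */
		*depth -= 1;
		*state = stack[*depth].state;
	}

	/* Shift in the ERROR symbol */
	stack[*depth].state = *state;
	stack[*depth].symbol = ERROR_SYM;
	stack[*depth].ast = NULL;
	*depth += 1;
	*state = lookup_goto(*state, ERROR_SYM);

	/* Discard input symbols until the current state has an action for one */
	while (lookup_action(*state, *look) == NULL) {
		if (*look == EOF_SYM)
			return -1;		/* ran out of input */
		*look = next_token();
	}
	return 0;				/* synchronized; parsing resumes */
}
```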
Building the tables
All that remains is to build the state table and the action table. These will require manipulating various sets, so you better dig something suitable out of your ADT library.
State and the go to table.
The purpose of the ‘state’ is to be a record of how we got to where we are. As there are a collection of states on the stack, the current state doesn’t need to be a complete record – a partial record can be sufficient. So as we build a description of each state we may need to throw away information. When we do that we must acknowledge it and be clear why that information can be discarded.
“where we are” is at some point in some production. We have (hopefully) already seen some of the symbols in the body of some production, and we may have some more that we are expecting. However we could concurrently be at different points in different productions. This can happen in two ways.
Firstly, and most simply, if we are in the body of some production just before the non-terminal “x”, then we are simultaneously at the start of every production with “x” as its head.
Secondly, we may not yet know which of several productions we are in. There could easily be two or more productions that start with some set of symbols, but finish differently. While we are in that common prefix we need to assume that we are in all of the productions at once, and as more symbols become apparent, the number of possible productions we are in will decrease until it is just one. That will be about when we REDUCE the production into a non-terminal.
To model these ideas we define a “state” to be a collection of “items”. An item is a particular production combined with an offset in the body of that production. The offset is sometimes referred to as “DOT” so an item might be represented as:
expression -> expression "+" . term
This is a production which says that an “expression” can produce an “expression” followed by a “+” and then a “term”. Here the production has been made into an item by placing a “dot” which you can see between the “+” and the “term”.
This item says that we might be in the process of parsing an expression and have seen an expression and a plus sign. Of course we might be somewhere else where a “+” can follow an expression too, depending on what other items are in the state.
So a state is a collection of these items. They are all the possible positions in a derivation that we could be in. For all items in a state, the symbols before DOT will be the same, unless DOT is at the start of the item. Other symbols might be different.
There are three different sorts of item, each important in their own way. They are distinguished by what follows the DOT.
A “completed item” has nothing following the DOT. DOT is at the end of the production. A “terminal item” has a terminal following the DOT. A “non-terminal item” has (you guessed it) a non-terminal symbol following the DOT. See if you can guess how each of these is used?
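As a data structure an item is tiny, and a state is little more than a set of them. This layout is only an illustration, not necessarily how parsergen stores them.

```
struct item {
	int	production;	/* index into the grammar's production list */
	int	dot;		/* 0 .. body_size: how far through the body */
	/* LALR(1) and LR(1) add a look-ahead set here -- see below */
};

struct state {
	int	num_items;
	struct item *items;
};
```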
We find all possible states by a repetitive process which finds all possible “next” states from a given state, and keeps doing this to the known states until no more new states can be found.
The starting point of this search is the state formed from all productions that have the “start” symbol as the head. For each such production (normally there is only one for simplicity) we form an item with DOT at the start. This collection of items forms the “zero” state, which is the state we start at when parsing.
Given a state we find other states by following a 2-step process.
Firstly we “complete” the state. For every non-terminal item in the state, we find all productions with that non-terminal as the head and add an item for each such production with DOT at the start (if that item isn’t already present). This effects the first way that we can have multiple items in a state.
Once we have added all possible “DOT-at-the-start” items we make a list of all symbols which follow “DOT” in any production. For each of these symbols we can potentially make a new state (though it is possible the state already exists).
To make a new state given a completed state and a symbol we discard all items for which DOT is not followed by the given symbol, and advance DOT forward one for all items where DOT is followed by the given symbol. If this state has not already been added to the state list, it must be added, otherwise the pre-existing state with the same set of items must be found.
For each symbol that produces a state from our completed state, we add an entry to the go to table, from the old state via the symbol to the “new” state.
As noted we discard items when creating a new state. This must be justified.
All the discarded terminal and non-terminal items will still appear in some other state which is created from the current one. So they aren’t lost, just placed elsewhere. This reflects the fact that as we consume more symbols from the input we narrow down the range of possible parses, so we must naturally narrow the breadth of the state.
The completed items which are discarded are different. No matter which new state we go to, all of those items will be gone, so the history they record of how we got here will be gone too. However we still have it further down the stack. If the current state has dot at the end of a production, then the state on top of the stack will have dot before the last symbol in that same production, and the state below that will have dot before the second last symbol, and so on. Somewhere there will be DOT at the start of that production, and in that state will be recorded the production which led to this one. And so up the production chain such that the first state which records the START production will still be on the stack, at the very bottom. So dropping these doesn’t lose any information as it is still on the stack.
Once we have applied the above 2-step process to all states, and doing it to the last state didn’t create any new state, we have a full set of states, and we have a complete go to table.
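In outline the whole construction is a simple work-list loop. In this sketch the helpers (make_start_state, closure, symbols_after_dot, goto_state, find_or_add_state, add_goto) are assumptions that hide the actual set manipulation; the point is only to show the shape of the algorithm.

```
extern int num_states;			/* grows as states are added */
extern void make_start_state(void);	/* items "start -> . body" become state 0 */
extern void closure(int state);		/* step 1: add DOT-at-the-start items */
extern int symbols_after_dot(int state, int *syms);
extern int goto_state(int state, int symbol);	/* advance DOT over symbol */
extern int find_or_add_state(int candidate);	/* reuse an identical state */
extern void add_goto(int from, int symbol, int to);

void build_states(void)
{
	int syms[200];		/* symbols found after DOT in one state */
	int done = 0;

	make_start_state();
	while (done < num_states) {
		int s = done++;

		closure(s);				/* step 1 */

		int n = symbols_after_dot(s, syms);	/* step 2 */
		for (int i = 0; i < n; i++) {
			int candidate = goto_state(s, syms[i]);
			int target = find_or_add_state(candidate);
			add_goto(s, syms[i], target);
		}
	}
	/* The loop only ends when processing the remaining states added
	 * nothing new, so the state list and go to table are complete. */
}
```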
Hopefully the process of creating these states makes sense in the context of how the go to table is used during parsing. A single state records all the productions that could be happening at each point in the parse so far, and the set of states in the stack form a complete record of where we are up to, and what can happen next.
In particular, by examining the current state we can know what actions are possible at each different point. This information is used to construct the “action” table.
The Action table
To produce the action table for a particular state, we start by examining each of the items in that state. We have already done all that we can with the “non-terminal” items, so it is just the “terminal” and “completed” items that interest us now.
For each terminal item a reasonable action is to SHIFT the next input symbol if that symbol is the symbol after DOT. So all the symbols that appear after DOT are added to the action table with “SHIFT” as their action.
Note that it isn’t really necessary to store these in the action table. The “go to” table for that state has exactly the same set of terminal symbols in it (no more, no less), so we could instead look for the look-ahead symbol in the “go to” table. If it is present we assume the “SHIFT” action.
For each completed item, a reasonable action would be to REDUCE that particular production. But we don’t currently know how to store these in the action table – which look-ahead symbol to associate them with. Also it might be appropriate to REDUCE a production rather than SHIFT a new symbol.
It is in the different ways to address this question that the different flavours of LR grammars become apparent. So we will address a few of them individually.
LR(0) grammars
LR grammars are sometimes called LR(k) grammars where “k” is the number of symbols that we look-ahead in order to decide the next action. Most often LR(1) grammars are used.
An LR(0) grammar is one in which no look-ahead is done. For this to work there must be no questions raised by any state. That is, a state may either have exactly one completed item, or some number of terminal items. It may never have both and may never have two different completed items, as these would require a decision and we have no look-ahead and so no basis to make a decision.
LR(0) grammars are too simple to be really interesting. They require every production to end with a terminal which doesn’t occur anywhere but at the end of productions. So if every production was enclosed in brackets that would work. However that nearly limits LR(0) to Lisp, and Lisp doesn’t really need a shift/reduce parser at all.
So LR(0) grammars are interesting primarily as a starting point, not for practical use.
LR(0.5) grammars
This is a label that I made up myself, as I think it describes a useful set of grammars but I have not seen it described elsewhere. Where an LR(0) grammar uses no look-ahead, an LR(0.5) grammar uses one symbol of look-ahead, but only half the time. Look ahead can guide the choice whether to SHIFT or REDUCE, but not the choice of which production to reduce.
So for a grammar to be LR(0.5), each state may have at most one completed item in its item set. When that is the case, all terminal items cause a SHIFT entry for the appropriate terminal to be added to the Action table, and all other possible entries in the Action table are set to REDUCE the one completed item. If there is no completed item then all other entries in the Action table lead to error handling.
As before, you wouldn’t actually bother storing this Action table. You would just record the one REDUCE action as a default. Then if the look-ahead symbol is present in the go to table, the parser performs SHIFT, otherwise it performs the default REDUCE or triggers an error.
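So the whole per-step decision in an LR(0.5) parser can be as small as the following sketch, where default_reduce[] is an assumed per-state array holding either -1 or the one reducible production.

```
enum decision { DO_SHIFT, DO_REDUCE, DO_ERROR };

extern int lookup_goto(int state, int symbol);	/* -1 when no entry */
extern int default_reduce[];		/* -1, or the completed item's production */

enum decision choose_action(int state, int lookahead)
{
	if (lookup_goto(state, lookahead) >= 0)
		return DO_SHIFT;	/* some terminal item wants this symbol */
	if (default_reduce[state] >= 0)
		return DO_REDUCE;	/* the one completed item */
	return DO_ERROR;		/* nothing fits: start error recovery */
}
```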
I have seen it suggested that a weakness of this approach is that it doesn’t detect an error at the earliest possible moment. We may have to wait for several reductions to happen until we hit a state with no possible reductions, and which does not have a ‘go to’ for the current look-ahead token. Again, I don’t really understand this logic – those reductions don’t consume any input symbols so the error is still detected at the correct place in the input text. But maybe I’ll understand better once I’ve had some more experience, so stay tuned.
In my experiments with a grammar for Ocean, I’ve found that I never have two completed items in the one state. So a grammar such as this could well be sufficient to parse the language I will end up with.
This highlights an important aspect of my focus here. I don’t particularly need a grammar parser that is suitable for any language I get handed. I just need a parser which will handle the language that I will develop. And I plan to develop a language with a very simple and regular syntax – one that is easy enough for a mere human to understand. So I don’t want a lot of complexity. So while LR(0.5) may not be good enough for a lot of use cases, it could still be good enough for me.
However it isn’t quite. Not because my language won’t be simple, but because LR(0.5) cannot tell me exactly how simple my language is.
This is a bit like the fact that a lot of syntax in a programming language isn’t needed for correct programs. It is only needed for incorrect programs to help the programmer see exactly where it is incorrect. Similarly if I get my grammar right, then LR(0.5) should be enough. But if I get it wrong, I may not know if I only do an LR(0.5) analysis of it.
So I need more power in my grammar analysis and so must move on.
Precedence LR Grammars
This is a bit of an aside as Precedence grammars don’t help detect errors and inconsistencies in a grammar, they only help to fix them. However they are useful to understand and as this is where they fit in the progression of complexity, this is where I will discuss them.
The idea with a precedence grammar is that the designer can make extra statements about different productions, giving them different precedence. This can apply in two different ways.
Firstly it can be used to choose between two completed items in a state. If we are trying to treat a grammar as LR(0.5) and find that some state has two completed items, we can “fix” the grammar by specifying a precedence of the various productions. I’d like to give an example here but it would end up being a toy example of a completely useless grammar and so wouldn’t really help. I personally cannot see why you would ever want a grammar which had two completed items in the same state. It means that some sequence of input tokens could be treated as one thing or as another depending only on what comes next. That sounds like a design mistake to me. Maybe I’ll eat my words later, but for now this means I cannot find a useful example – sorry.
The other way that precedence can be useful is much easier to explain. It concerns the choice between SHIFT and REDUCE. As noted above, an LR(0.5) parser will always SHIFT when it can, and only reduce when it cannot. That might not be what you want.
Here there is a classic example: expression parsing. Consider this grammar.
expression -> expression "+" expression | expression "*" expression
A state in this grammar could contain the items
expression -> expression "*" expression .
expression -> expression . "+" expression
That is, we have parsed an expression, a multiplication symbol and another expression, and now we are faced with a choice of REDUCING the multiplication to a single expression, or SHIFTing in a plus sign.
The mathematically standard thing to do would be to REDUCE as multiplication has higher precedence than addition. However the LR(0.5) grammar parser would shift by preference which would produce the “wrong” result. So this grammar is “wrong”, but is easily fixed by giving the multiplication production a higher precedence than the addition production. The grammar processing step would then notice that a REDUCE step has a higher precedence than the SHIFT, and would remove the SHIFTed symbol from the “go to” table.
In order to make repeated use of the same operations (e.g. “a + b + c”) group correctly (as “(a + b) + c”) we would also want to reduce an addition before shifting another plus sign. So a production in a “completed item” might need a higher precedence than the same production in a “terminal item”. To get a completely general solution, we require each precedence level to be left-associative or right-associative. If the former, then reduce beats shift. If the latter then shift beats reduce.
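Expressed as code the decision is tiny. This sketch only shows the rule just described; how precedence levels and associativity get attached to productions and terminals is left out.

```
enum assoc { LEFT_ASSOC, RIGHT_ASSOC };

/* Return non-zero if the completed item should be reduced,
 * zero if the look-ahead terminal should be shifted instead. */
int prefer_reduce(int reduce_prec, int shift_prec, enum assoc level_assoc)
{
	if (reduce_prec != shift_prec)
		return reduce_prec > shift_prec; /* e.g. reduce "*" before shifting "+" */
	/* Same level: "a + b + c" should group as "(a + b) + c", so a
	 * left-associative level reduces and a right-associative one shifts. */
	return level_assoc == LEFT_ASSOC;
}
```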
These precedences can be useful, but they aren’t really required. The same effect can be achieved with more non-terminals. For example
expression -> expression "+" term | term
term -> term "*" factor | factor
By giving different names to things that are added and things that are multiplied, and by putting different names on either side of the operations, the precedence and associativity can be specified implicitly and with complete generality.
Which of these forms you prefer is very much a subjective issue. There is little practical difference, so it is best to choose the one that makes most sense. I think I prefer the latter, though the former is briefer and that is not insignificant.
The one practical difference that exists is that the second form involves more REDUCE steps during parsing. Reducing
term -> factor
doesn’t really do much, but is a step which is completely unnecessary in the first grammar which needed precedence to disambiguate. So if you find your parser spending too much time reducing trivial productions, it might be useful to re-write it in a form that doesn’t require them.
SLR(1) Grammars
So far we have looked at what we can do with just the “go to” tables and some simple conventions or annotations. But it is possible to extract more details from the grammar itself and see if we can resolve some potential conflicts automatically.
The first of these is called “SLR” for “Simple LR” grammar.
SLR(1) adds the computation of “FOLLOW” sets. For each non-terminal “s”, “FOLLOW(s)” is the set of all terminals which could possibly follow “s” anywhere in any sentential form derived by the grammar. This is then used to build the ACTION tables.
When we have a completed item, we look at the symbol in the head of that item, find the FOLLOW set for that symbol, and for every terminal in the FOLLOW set, add an Action to REDUCE that production if the given terminal is the look-ahead symbol.
This will often clearly identify which symbols should trigger a reduction (and which reduction to trigger), and which symbols should be shifted. Consequently it also identifies which symbols are in error.
If there is any over-lap between the FOLLOW sets of reducible productions or between a FOLLOW set and the SHIFTable terminals, then the grammar is said to not be SLR(1). If there is no over-lap, the grammar is fine and we have a well defined action table.
Computing the FOLLOW sets requires three simple steps.
Firstly we flag all the symbols that can produce an empty string. Normally relatively few symbols can do this, though any time you have a list that could be empty, you will need a “nullable” symbol (as these are called). These are found by an iterative process.
Firstly we flag all the non-terminals which are the heads of productions with no symbols on the right hand side. Then we flag all non-terminals where all the symbols on the right hand side are flagged. If there were any, then we repeat the process again and again until we cannot flag any more non-terminals as nullable. Then we can stop.
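A sketch of that iteration, using the illustrative grammar layout from earlier (nullable[] is just one flag per symbol):

```
void compute_nullable(struct grammar *g, int *nullable)
{
	int changed = 1;

	while (changed) {
		changed = 0;
		for (int p = 0; p < g->num_productions; p++) {
			struct production *pr = &g->productions[p];
			int all_nullable = 1;

			if (nullable[pr->head])
				continue;
			for (int i = 0; i < pr->body_size; i++)
				if (!nullable[pr->body[i]])
					all_nullable = 0;
			/* empty body, or every body symbol already flagged */
			if (all_nullable) {
				nullable[pr->head] = 1;
				changed = 1;
			}
		}
	}
	/* Terminals never become nullable: only production heads are flagged. */
}
```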
Having done this, we now proceed to the second step which is to build a FIRST set for every non-terminal. You can include terminals as well if it makes it easier, however the FIRST set for a terminal contains simply that terminal, so it is quite easy to make a special case of terminals.
“FIRST(s)” is the set of all terminal symbols which could be the “first” terminal in an expansion of “s”. We compute this using a similar iterative approach to calculating “nullable”.
We walk through all the productions in the grammar and add to FIRST() of the head symbol, the FIRST set of every symbol in the body which is preceded only by nullable symbols. So the FIRST set of the first symbol on the RHS is always added, and the FIRST set of the second symbol is too if the first symbol was flagged as being nullable. And so on.
We continue this process checking each production again and again until no more changes in the FIRST sets can be made. Then we stop.
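In the same style, FIRST might be computed as below. The set type and the helpers add_to_set() and union_into() (which reports whether it added anything new), along with is_terminal(), are assumptions standing in for whatever set ADT is to hand.

```
struct set;
extern int is_terminal(int symbol);
extern void add_to_set(struct set *s, int symbol);
extern int union_into(struct set *dest, struct set *src);	/* 1 if dest grew */

void compute_first(struct grammar *g, int *nullable, struct set *first)
{
	/* FIRST of a terminal is just that terminal */
	for (int s = 0; s < g->num_symbols; s++)
		if (is_terminal(s))
			add_to_set(&first[s], s);

	int changed = 1;
	while (changed) {
		changed = 0;
		for (int p = 0; p < g->num_productions; p++) {
			struct production *pr = &g->productions[p];
			/* each body symbol contributes its FIRST set for as
			 * long as every symbol before it is nullable */
			for (int i = 0; i < pr->body_size; i++) {
				changed |= union_into(&first[pr->head],
						      &first[pr->body[i]]);
				if (!nullable[pr->body[i]])
					break;
			}
		}
	}
}
```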
Once we have the nullable flag and the FIRST sets we can proceed to the third step which is to create the “FOLLOW” sets. This again follows a pattern of repeating a partial calculation until no more change can be made.
To find “FOLLOW(s)” we find every occurrence of “s” in the body of any production, and add FIRST(t) where “t” is the symbol after “s”, or any symbol following “s” but only separated by nullable symbols. If all of the symbols following “s” are nullable, then we also add FOLLOW of the head of the production.
Adding the FIRST of following symbols only needs to be done once, as FIRST is stable and won’t change any more. Adding FOLLOW of the head might need to be done repeatedly as other changes to FOLLOW sets might need to flow back to this one.
Again, once we don’t find any more changes we can stop. At this point we have complete FOLLOW sets and can determine if the grammar is SLR(1), and if so: what the action table is. If the grammar is not SLR(1), it might be fixable with precedence annotations.
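FOLLOW completes the trio, again iterating until nothing changes. This sketch reuses the assumed set helpers from the FIRST sketch.

```
void compute_follow(struct grammar *g, int *nullable,
		    struct set *first, struct set *follow)
{
	int changed = 1;

	while (changed) {
		changed = 0;
		for (int p = 0; p < g->num_productions; p++) {
			struct production *pr = &g->productions[p];
			for (int i = 0; i < pr->body_size; i++) {
				int s = pr->body[i];
				int tail_nullable = 1;

				/* FIRST of whatever can follow s in this body,
				 * stepping over nullable symbols */
				for (int j = i + 1; j < pr->body_size; j++) {
					changed |= union_into(&follow[s],
							      &first[pr->body[j]]);
					if (!nullable[pr->body[j]]) {
						tail_nullable = 0;
						break;
					}
				}
				/* everything after s can vanish, so whatever
				 * follows the head can follow s too */
				if (tail_nullable)
					changed |= union_into(&follow[s],
							      &follow[pr->head]);
			}
		}
	}
}
```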
LALR(1) grammars
SLR(1) grammars are pretty general but not completely so. If a non-terminal is used in two quite different parts of the grammar then it is possible that it is followed by different symbols in different places. This can result in it reporting conflicts that won’t actually happen in practice. This is because we have been too general in generating the FOLLOW sets, assuming that the FOLLOW set of any symbol is the same in every state. Often it is, but not always.
If we are going to bother generating FOLLOW sets, and particularly if we are doing it for the purpose of checking if there are any unexpected possible inconsistencies in the grammar, then we may as well do it properly. Doing it properly is called “LALR(1)”, where “LA” stands for “LOOK AHEAD”.
If you look in the literature you can find lots of different ways of calculating the LALR LOOK-AHEAD sets. Many of these start with the states we have from the LR(0) calculation and go from there. I don’t really like those – I find them confusing and they don’t really make it clear what is happening (without lots of head scratching). So I’m presenting my way of doing it. I didn’t find this in the literature, but maybe that is because I didn’t understand all that I read. I’m sure someone did come up with it before me.
This procedure requires the FIRST sets and the ‘nullable’ flag developed for the SLR grammars, so we will keep them but discard the FOLLOW sets. In their place we will generate lookahead or LA sets, one for each item in each state. So LA(s,i) is a set of terminals which we expect to see after the item ‘i’ is reduced in or following state ‘s’.
If we go back to the description of items, states, and the go to table it is not hard to see in retrospect that we lost some information, at least if we were considering an LR(1) style grammar. We didn’t include the look-ahead symbols in our description of state, but now we must.
So: each item in a state is not only a production and an offset in the body of that production. It must now also contain a set of terminals which can follow after that production (i.e. after the body or, equivalently, after the head). For the items based on the “start” symbol that are used to populate the first state (state zero), the look-ahead sets contain only the special “end-of-file” terminal.
When we complete a state by adding an item for each production from each non-terminal which follows “DOT”, we calculate the LA set for that item in much the same way that we calculated FOLLOW. We add together the “FIRST” sets for all subsequent symbols which are separated from the target non-terminal only by nullable symbols, and we use this sum of FIRST sets as the LA set of each item we make from that target non-terminal.
If all of the symbols following a non-terminal are nullable (or if there are no symbols following it), we need to add in the LA set for the source item.
So if a state contains the item:
A -> B C . D E {L}
Where “{L}” is the look-ahead set, then when we add an item based on a production with “D” in the head, the look-ahead set must be set to FIRST(E), and if E is nullable, this must be unioned with “L”, so
D -> . X Y Z { FIRST(E) + L}
When we create a new state using selected items from an old state, the LA sets simply get copied across with them.
Now we come to the interesting part. If you look back at how we generated states before, you will recall that each time we generated a new state we had to check if it was in fact new, or if it had been seen before. If it had been seen before we just reused the old state.
Now that we are carrying extra information in a state we need to know what to do with that. There are two options.
Firstly, we can decide that the look-ahead set is an important part of the state (which it is) and so if the look-ahead sets are different, we have to create a new state. This will mean that, in general, the LALR process will create more states than the LR(0) or SLR(1) processes. Often many many more. If you choose this approach you actually end up with “canonical LR(1)” which is described next. So let’s not get ahead of ourselves.
If you decide that while important, the look-ahead sets shouldn’t justify making a new state, then you can instead update the old state by adding in (i.e. set-union) the LA sets of the new state with those of the old state. This loses a little bit of information and so there can be grammars which will end up with conflicts after the LALR process, but which will work fine with canonical LR(1). But in many cases there won’t be conflicts, so LALR is enough.
As we have updated the LA sets of a state, and as the LA sets can propagate through to other states, we need to re-process every state where we changed the LA set. This can result in processing each state twice or even more on average. It appears to be only a constant increase in effort, so should not be too burdensome.
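The difference between the two options comes down to what happens when a freshly built state has the same “core” (the items, ignoring look-ahead) as an existing one. A sketch of the LALR choice, with same_core(), union_la_sets(), requeue() and add_state() all assumed:

```
extern int num_states;
extern int same_core(int state, int candidate);	/* items match, LA ignored */
extern int union_la_sets(int state, int candidate); /* 1 if anything was added */
extern void requeue(int state);			/* process this state again */
extern int add_state(int candidate);

int place_state(int candidate)
{
	for (int s = 0; s < num_states; s++) {
		if (!same_core(s, candidate))
			continue;
		/* LALR: reuse the old state, folding in the new LA sets.
		 * Canonical LR(1) would also insist the LA sets match,
		 * and make a brand new state when they do not. */
		if (union_la_sets(s, candidate))
			requeue(s);	/* the change must propagate onwards */
		return s;
	}
	return add_state(candidate);	/* genuinely new */
}
```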
After we have generated all the new states, and propagated all LA set changes as far as they will go, we have complete LALR(1) LA sets from which we can form action tables.
Again, these might result in completely unique actions for each terminal, or might result in conflicts. If the former, we say the grammar is LALR(1). If the latter, it is not.
LALR(1) usually provides enough power without too great a cost and seems to be the best trade off. As noted I am most interested in LALR not for the high quality action tables it produces (I don’t want to use them) but for the fact that it will calculate the LA sets and tell me if there are any cases where both a SHIFT and a REDUCE might be appropriate, and can provide extra information to tell me why a state might have two different possible REDUCE actions. This information will, I think, be invaluable in understanding the grammar as I am creating it.
Canonical LR(1) grammars
The “canonical LR(1)” style of grammar is the original one described by Knuth before others found ways to simplify it and make it more practical. For LR(1) “usefully” (as in “until we have a set of symbols that can usefully be reduced”) requires a lot more states than “LR(0)” and all the intermediate forms. These states reflect the various different possibilities of look-ahead and generally aren’t very interesting.
There certainly are grammars which are LR(1) but not LALR(1), but I don’t think they are very interesting grammars. Still they might be useful to some people.
The only conflicts that LALR(1) will report but that LR(1) will know are false conflicts are REDUCE-REDUCE conflicts. i.e. two different reducible productions in the same state. As I have already said I think this is a design mistake so I have no use at all for LR(1). If LALR(1) reports a conflict, then I think it is an error even if it is not really a conflict.
To construct the states for LR(1) we follow the same approach as LALR, including an LA set with each item in each state. However when we look to see if a new state already exists, we require that both the core items and their LA sets are identical. If they are, then it is the same LR(1) state. If not, we need a new state. No merging of LA sets happens, so we don’t need to repeatedly process each state.
This produces more states. I have one simple grammar which results in 27 LALR(1) states and 55 LR(1) states. I have another I’m playing with which currently has 174 LALR(1) states and 626 LR(1) states.
An Implementation.
Theory is never really interesting without a practical implementation to show how it works. One of the great appeals of computer programming is that creating practical implementations is quite easy (unlike economics for example, where experimenting with national economies requires way too much politics!)
So I have written a parser generator that realises almost all of this theory. It can use any of the different analysis approaches listed above. The generated parser only uses LR(0.5) though. I don’t bother building a list of reducible productions and choosing between them based on look-ahead as I don’t think this will ever actually be useful.
Also the generator does not yet handle precedence annotations. The annotations can be given but they are ignored. I simply ran out of time and wanted to get this published before yet another month disappeared. I hope to add the precedence handling later.
The link above is to the version as it is today. I have other plans to enhance it as will be seen in my next post but I wanted the code linked here to represent only the ideas listed here.
If you want to try this out, you can:
git clone git://ocean-lang.org/ocean
cd ocean/csrc
make; make
./parsergen --LALR calc.cgm
./parsergen --LR1 calc.cgm
This will print a report for the grammar showing all of the states complete with items and LA sets. This might help with following the above explanation which is, at some points, a little dense.
Note that if you want to create your own grammar, you need a “markdown” formatted file prepared for `mdcode` processing. Hopefully `calc.cgm` is a usable example.