In my earlier note about LR parsing I observed that many simple grammars will only ever have at most one REDUCE action in any given state. This means that there is no need for an “action table” to list which of several productions to reduce based on different look-ahead symbols. I even went so far as to say:
I personally cannot see why you would ever want a grammar which had two completed items in the same state. It means that some sequence of input tokens could be treated as one thing or as another depending only on what comes next. That sounds a like a design mistake to me. Maybe I’ll eat my words later, but for now this means I cannot find a useful example – sorry.
I have since found some examples which shed valuable light on this issue and give me a chance to see how my words taste.
In the first case, the “sequence of input tokens” that could have varying interpretations is empty. It hadn’t even occurred to me to consider such a case, but empty sequences can be important. They can be particular important for optional elements. In this case optional prefixes.
Consider the grammar:
Number -> Sign Digits Sign -> - | + |
This says that a “Sign” is a “+” or a “-” or nothing, and that a “Number” is a “Sign” followed by “Digits” (whatever they are). In this case, if the next symbol is whatever “Digits” starts with, then we need to REDUCE the empty sequence to “Sign” before we can SHIFT that symbol. By itself, this doesn’t cause a problem. If we see a “+” or “-” we SHIFT, otherwise we REDUCE.
However suppose that variable names can be prefixed with “$” or “@” with some unspecified meaning, and suppose we have:
Value -> Number | Variable Variable -> Sigil Name Sigil -> $ | @ |
Now if we want a Value and see a ‘$’ or an ‘@’ we assume a “Variable” is coming, while if we see a “+” or a “-” we assume a “Number” is coming. But what if we see something else? If we see something that starts a “Name”, we should reduce the empty sequence to “Sigil” while if we see something that starts “Digits” we should reduce the empty sequence to “Sign”. So here is a case where “some sequence of input tokens could be treated as one thing or as another depending only on what comes next“. The “sequence” is empty, the “one thing” is a “Sigil” and “another” is a “Sign”.
This situation would be even more interesting if there was some symbol that could be either a ‘Sign’ or a ‘Sigil’. A grammar which included that could still be parse with an LR(1) parser (providing Names and Digits could be distinguished in their first token). However that is something that I would consider a design mistake. I don’t need to be able to handle that, but I would like to be able to handle distinct optional prefixes.
A similar situation came up as I was trying to design the “Newline” handling for my “ocean” parser. After an “if Expression Statementlist” I want either 1-or-more newlines, or o-or-more newlines and an ‘else’.
CondStatement -> IfPart IfSuffix IfSuffix -> Newlines | OptNL else Block Newlines -> NEWLINE | Newlines NEWLINE OptNL -> Newlines |
After parsing the “IfPart”, a “Newline” should always be SHIFTed and then reduced to “Newlines”. But otherwise we can either REDUCE that “Newlines” to “IfSuffix”, or to “OptNL”, the latter if “else” is the next symbol. So again the one sequence can be treated as two different things (Newlines or OptNL) depending on what comes next.
These can both be resolved without changing the parser by spelling things out a bit more:
Number -> + Digits | - Digits | Digits Variable -> $ Name | @ Name | Name IfSuffix -> Newlines | Newlines else Block | else Block
This requires “Digits” and “Name” and “else Block” to appear multiple times, but otherwise is reasonably simple and has exactly the desired effect.
So what to do? One option would be to manually apply transformation like the above when the problem arises. That is simplest and what I have done so far, but it isn’t entirely satisfactory. Repetition, as it requires, feels clumsy and is error prone.
Another option is to bite the bullet and change to a full LALR grammar with a more complete “action” table. This would be awkward because it still means that some states allow multiple different REDUCE steps. The algorithm I developed for handing indents and newlines requires that in some circumstances we force a reduction. That algorithm would need to be refined to be able to identify which reduction should be applied. It isn’t immediately obvious how that would work.
We could teach the parser generator to apply the same transformation that I applied above. Whenever any production contains a nullable symbol it is duplicated to have a second production which doesn’t contains that symbol. Then any production that causes the symbol to derive to the empty string gets disabled. This could probably be made to work, but seems like it could make the grammar much more complex, and so make it harder to understand when things go wrong.
I think what I would like to do is find a more minimal transformation of the grammar. One which still results in at most one REDUCE action per state, but which as close as possible to the original.
If we consider the second example with the newlines, then the problem occurs in a state with the following itemsets (as reported by my parsergen program with –LALR enabled).
Newlines -> Newlines . NEWLINE  LOOK AHEAD(6): NEWLINE, $eof, else OptNL -> Newlines .  LOOK AHEAD(5): else IfSuffix -> Newlines .  LOOK AHEAD(0): $eof
If we consider the middle item, effecting the REDUCE operation that it implies would pop the “Newlines” off the stack and replace it with “OptNL” with an updated ‘state’. This state would allow ‘else’ to be SHIFTed, but otherwise would be ignored. As “OptNL” and ‘Newlines” are very similar symbols, neither of which carry any extra ‘AST’ information, there would be no harm in just leaving the “Newlines” on the stack (as we noted previously, the symbols on the stack are hardly used) and directly SHIFTing in the ‘else’.
So if we add to the ‘goto’ table for this state an entry indication that ‘else’ can be shifted and go to whichever state ‘else’ ultimately leads to, and otherwise ignore this item so that the only reducible item (or “completed item”) is “IfSuffix -> Newlines .”, then the parser will be able to correctly parse the language without the multiple REDUCE problem.
The other example is a little different and not so simple. The problem item there contains, among many other items:
Sigil -> .  LOOK AHEAD(1): Name Sign -> .  LOOK AHEAD(2): Digits
There are several differences here. Firstly reducing the productions does change the stack – it pushes a single frame on. Secondly, both production needs to be treated the same. In the first case we could say “OptNL” is similar to “Newlines” while “IfSuffix” is very different. That distinction doesn’t apply here.
We cannot convert either of these into simple SHIFT operations like we could previously, but a “REDUCE+SHIFT” operation where the shift is very simple might be an option. The “REDUCE” part would not pop anything off the stack and just push on a very minimal frame. This would require that there was no code associated with either production. The symbol either has type “void”, or has some small type (say: 8 bytes at most) which is left as zeros.
Pulling this all together, I imagine two changes.
Firstly, an extra bit is added as a flag to each entry in the “go to” tables. We can probably steal a bit from the “state” value and so limit our grammars to 32767 states. If the parser chooses to shift a symbol which leads to a ‘go to’ entry with this bit set, it first shifts in an “empty” stack frame with no symbol, no state, and an ‘asn’ (Abstract Syntax Node) with 8 bytes of zeros. This is the only change made to the parser.
The parser generator is changed to translate some ‘completed items’ into SHIFT actions instead of REDUCE actions. This only applies when the production in the item has no code fragment attached and the structure attached to the head of the time is no larger than 8 bytes. It also only applies if all symbols in the item’s look-ahead set are not present in other terminal items.
If the body of the item is empty, then a ‘go to’ entry may be created with the new flag set. If the body contains a single symbol as the same type as the head of the item, a regular ‘go to’ table entry might be created.
There are various other details about how to determine the target state and how to decide if the transformation is legal. A particular situation that would make the transformation impossible is if the state in the synthesized frame might ever by used. These details would need some careful thought and analysis which I don’t feel is particularly justified just now. After I have had more experience, and possibly found more interesting case, it might then be time to explore extension to the parser generator in greater detail.
So there is no code to show this time, just an admission that I still have things to learn…