In two previous articles I explored an approach to enhancing an LR parser to work with indents and line breaks. While I discovered some useful ideas and produced some code that seemed to work, I’ve subsequently discovered some serious flaws in the reasoning.
For indents, my reasoning about exactly when to REDUCE in the face of an OUT token was flawed and didn’t properly address all cases. I’ve made a few updates to that article to highlight this failing. For linebreaks, I only talked about when they should be ignored and didn’t cover the other important question of their role in terminating things. I hadn’t at the time seen how important that was.
So now I want to rectify these problems and present a more complete solution. As I have explored around these problems I’ve seen other smaller issues and made a number of changes to my approach. The big picture is still much the same but some of the details are different in important ways. I also think I understand it all much better and so will try to explain things more clearly.
The Tokens
The tokenization of indents and linebreaks is unchanged. Every change in indent results in an IN token or an OUT token, which are paired. Every line break results in a NEWLINE token. However the NEWLINE token that would be expected before an IN is moved to after the matching OUT. So every OUT is both preceded and followed by a NEWLINE.
Recursive Symbols.
Understanding recursive symbols is very important for managing indents and linebreaks. A key property of both NEWLINE and OUT tokens is that they might “close” something. i.e. they sometimes indicate that everything since the start of the line, or since the matching IN, forms some unit which cannot be further extended. This “unit” is some (non-terminal) symbol (A) and has some production for this particular instance:
A -> B C D
If the top few symbols on the parse stack are “B C D” and the context is such that an ‘A’ was expected, then we might assume that we are at the end of the production and it cannot be further extended. However if any of B, C, or D are recursive, then that isn’t the case.
A “recursive” symbol here is one with a production which starts with the same symbol. e.g.
D -> D E
If such a production existed then having “B C D” on the stack does not mean we are at the end of the “A”, as an “E” could come along producing “B C D E” which could be reduced to “B C D” and so the “A” has moved further into the input stream.
Obviously if we found “B C D” on the stack when we saw an OUT or a NEWLINE we could reduce that to “A” and be sure that “A” was closed. However if “A” is recursive that isn’t enough. If “A -> A C” were a production, then a “C” could come along and extend the A.
The conclusion is that recursive symbols require special care. We cannot simply disallow them as they are very useful for any “list” like construct. So we need to make sure we can always remove them from the stack.
As suggested in a previous article, the best approach is to create an intermediate symbol “$A” for each recursive symbol “A”, add the production “$A -> A”, and then wherever “A” appears in a non-recursive position in the body of a production, replace it with “$A”. So in the final grammar a recursive symbol will only appear in the body of a production when it is also in the head of that production, or in the synthesized “$A -> A” production. Then when we have ‘A’ on the top of the stack and need to ensure it is closed, we simply reduce it to “$A”.
Understanding the Stack
In my original implementation I constructed the stack in a less than optimal way which made some of the code quite clumsy. An important property about the stack is that it contains symbols and states, and these alternate.
At the start, state “zero” is pushed onto the stack. Then the first token is pushed on as a terminal symbol. The symbol is looked up in the “go to” table for the state below it, and the new state is pushed onto the stack. Then we either REDUCE or SHIFT but in either case we will add a symbol, and then use the “go to” table to find the next state.
My original implementation combined the state with the following symbol into the one frame. This seemed to make sense as a state was on the bottom of the stack, so a state should go on first. However another way to look at it is that each state is intimately connected with the preceding symbol as the symbol leads directly and immediately (though a “go to” table) to the state. The code turns out to be much nicer if a state is combined with that preceding symbol to form a frame, though maybe it is better to say, a symbol together with the subsequent state.
The first frame on the stack then contains state zero and no symbol. This feels a little bit untidy, but makes other things much cleaner. As the state never changes from zero we could possibly do away with that frame altogether, but that isn’t really a net win.
With this clear understanding of the stack, as “symbol -> state” pairs, we need to ask how pending indents are recorded. As we determined before, indents often appear inside symbols and keeping a count together with the symbol is important and unchanged. The interesting question is how to store the indent that might be at the start of a symbol. To put this another way, an IN token appears between two tokens. Should it be counted with the one before, or with the one after?
We could argue that the very first token might be an IN and it would have no previous token to store the count with. However we have already seen that the first frame on the stack has an absent token. If the first parsed token was an IN, it could be counted with this absent token in the first stack frame. We could even pretend that there was a “start-of-file” token to match the “end-of-file” token which terminates the parse. However this is just book keeping and doesn’t really provide a firm answer.
In my previous analysis I counted each IN with the following token and found the need to distinguish the case of an indent being at the start of the symbol, against all the indents being within the symbol. This tells us that the indent immediately before a symbol doesn’t entirely fit with the indents within the symbol. So maybe it is best to count the IN with the preceding symbol.
We are getting a bit ahead of ourselves here, but one of the important places were indents are needed is determining when to ignore NEWLINEs. This determination is based on the states that are in the stack. Some states identify as “starts_line” and if there is such a state on the stack with no subsequent indent, then NEWLINEs are not ignored. If there is an indent since the last “starts_line” state, then NEWLINEs are ignored. However an indent between the same two symbols as the state is between, does not cause newlines to be ignored.
This suggests that if an IN and a state are between the same pair of symbols, the IN should be “before” the state. This is most easily handled if the IN is counted with the preceding symbol. So that is what we will do. It removes all the need to track whether an indent was at the “start” of a symbol, as it never will be.
Which states expect a NEWLINE
One of the important properties of indents is that they cause newlines to be ignored – sometimes. An indent signals continuation, the newline that was before the indent has already been moved to after the indent to reflect this continuation, but what about newlines between the IN and the OUT? If the syntax elements in there don’t expect newlines to terminate or separate them, then any newlines found need to be ignored. But how do we know?
I talked previously about “line-like” symbols in the syntax but wasn’t entirely happy with the definition. They were states that were followed by things that other times followed newlines, and it isn’t really clear what that means.
Both an OUT and a NEWLINE will trigger REDUCE actions to close out pending productions which could otherwise extend beyond the OUT or NEWLINE. They need to eventually get one symbol (non-recursive when an OUT is present) on the stack which represents the indented section (and possibly what lead to it), or the line (or lines). This symbol needs to be found before the NEWLINE is SHIFTed. So a “line-like” symbol must be followed by a newline.
The important difference here is that I was previously imagining that a line-like symbol would end with a newline, but that didn’t seem to work out. It seems to work better if NEWLINEs are treated as separators rather than terminators so the NEWLINE is not part of the line-like symbol, but part of something bigger (block-like?) which contains multiple lines.
Following this observation, the parser generator needs to determine which symbols can start with a NEWLINE (previously it determined which could end with a NEWLINE), and then any symbol which is followed by a starts-with-newline symbol is deemed to be a “line-like” symbol. Finally any state which is followed by a “line-like” symbol is “starts_line” and must expect a NEWLINE.
This seems to make sense. In a state which expects a symbol that is normally followed by a newline, we need to not be ignoring newlines. In any other state it is safe to ignore newlines if there has been an indent.
A core idea here, important when designing the grammar, is that when NEWLINEs are not ignored, each line will be reduced down to a single non-terminal symbol, which is followed by the NEWLINE. A NEWLINE should never be placed, in the grammar, after something that is just part of a line, e.g. `Variable = Expression NEWLINE`. The NEWLINE must be after a line-like thing: `Assignment NEWLINE`.
What, exactly, to reduce when we see a NEWLINE or OUT.
When we see a NEWLINE in a context where a NEWLINE is expected (not ignored) we want to reduce everything back to the start of the line. This means that we need to record where lines started in much the same way that we need to record where indents were found. We don’t need to count how many there were, just whether there was one in, or following, a symbol or not.
This became particularly apparent to me when considering the parse of
if Condition : simple-statement
The grammar is (deliberately) ambiguous in that a NEWLINE at the end of this line could separate the “simple-statement” from a subsequent statement within the body of the “if”, or it could separate the whole “if” statement from a subsequent statement at a higher level. Both the “if” statement and the “simple-statement” are line-like. Clearly (I hope) the whole needs to be reduced (to a “statement”), right back to the start of the line.
So when we process (SHIFT or DISCARD) a NEWLINE or IN (both of which signal the start of a line), we record a line-start in the topmost stack frame (just after the symbol in that frame, and before the state). We also keep track of how many stack frames do not contain a line-start. This is easily done by storing zero when a line-start is present and storing one more than the previous frame when there is no line-start.
Then when we see a (non-ignored) NEWLINE and it cannot be SHIFTed we must obviously REDUCE or trigger an error. If we can SHIFT the NEWLINE we may still want to REDUCE first. Specifically, if the line isn’t already reduced to a single symbol, and if we can perform a REDUCE with the number of symbols being popped not more than the distance to the nearest start-of-line, then we perform the REDUCE. If there is only only symbol in the line, if we cannot REDUCE, or if the reduction would pop more off the stack, then we know that when we SHIFT the NEWLINE it will serve to separate the whole line from whatever comes next.
When we see an OUT, we similarly want to REDUCE until we find the matching IN on the stack. In the original design I would reduce until the top-of-stack contained an IN, and then would cancel that. That is still true but only if it is possible. We need to understand what to do when it isn’t possible to reduce but the top stack frame doesn’t contain an indent. Consider a general line-like symbol with a production:
A -> B C D E
This could be found with an IN after B, then an OUT and a lesser IN after D, as in
B C D E
This would tokenize as “B IN C D NL OUT IN E NL OUT NL”. The first two NL would be ignored as they are indented. The first OUT should reduce back to D, but no further – even though the most recent IN is still two frames away. The second OUT would reduce to E with the nearest IN 1 frame away, but it requires the final NL to reduce the whole thing to A.
If we can reduce so that the top of stack contains an indent it is definitely safe to cancel the OUT. If we cannot reduce that far then sometimes it is safe, as above, but sometimes it will not be – we would instead need to enter error handling because some production is still open that needs to be closed but is not yet complete. To be able to tell if some production is “open” we need to know details about the pending productions but a state doesn’t really carry that information. It is all present in the “itemsets” that are used to build the states. It seems we need to preserve some more information from that parser generator into the parser.
The itemset which is the basis for each state contains a number of productions each with DOT at some point in the body. Those productions where DOT is at the start are not really open yet. There is no sense at all in which they are committed to as no contents have been found. So there is no need to close them. The remainder must all start before the indent that is being canceled, much as in the example above, the “A -> B C D E” production started before the first indent. If a production in the state starts at or after the recent indent then canceling the OUT would be premature. For the parse to be able to detect this case it needs to know the least non-zero location of DOT in any production in the itemset which leads to each state. We will call this the least prefix.
This number is trivial to calculate and pass to the parser. With it the parser can easily make the decision to REDUCE, ERROR, or CANCEL when it sees an OUT.
- If the current state can be reduced and the length of the reduction is not more than the distance from the most recent indent, then we reduce. This means that of all the stack frames popped as part of the reduction, only the first can contain an indent. As a consequence any reduction of length zero or one will happen, and so any recursive symbol will be reduced away.
- Otherwise if the least prefix of the state contains an indent, then the top most indent is canceled and the OUT is discarded.
- Otherwise we cannot close something that is wholly within the indented section and must raise an ERROR. This will allow us to remove things from the stack until the OUT can be canceled.
So there are some similarities between reducing on NEWLINE and OUT, and some differences.
The similarity is that we need to track where the “other end” (start of line, or IN) is and force a REDUCE within that range, even if a SHIFT is possible.
One difference is that for OUT, the range includes the symbol containing the IN and we reduce anything in that range. For NEWLINE, the range excluded that symbol and we only reduce if there are at least 2 symbols in the range. This is related to the fact that recursive symbols are problematic for IN/OUT management but not for NEWLINEs. With newlines, the NEWLINE itself will ultimately be shifted which will stop a recursive symbol from being extended inappropriately. With IN/OUT, the OUT is not shifted, so we must completely close any recursive symbols first.
The other difference is that an inappropriate NEWLINE causes a normal error because it cannot be shifted. An inappropriate OUT causes an error because the state says it cannot be canceled.
Summary of details
Now that we know how it can all work, we need to make sure we have the details straight:
- The parser generator needs to prevent recursive symbols (with productions “A -> A …”) from appearing other than at the start of the recursive production, or as the sole body element in another production. If they do a new symbol is created with a single production “$A -> A” is added. This new symbol is used in place of the recursive symbol where it isn’t allowed.
- The parser generator needs to record which states expect a newline. These are precisely those that can be followed by a line-like symbol. These, in turn, are those which can be followed in a production by a NEWLINE or any symbol which starts with a NEWLINE.
- The parser generator needs to record the minimum non-zero distance from the start of each item to DOT in each itemset. This “minimum prefix” is recorded for each state.
- The state stack needs to contain
- A symbol (possibly ‘$start_of_file’) with associated parse information (e.g. abstract syntax tree fragment),
- the state which follows that symbol (state zero for the first frame on the stack),
- the number of indents within or immediately after the symbol. These are considered to be before the state,
- a count of the number of stack frames since the start of a line. Zero implies an IN or a NEWLINE is within the symbol, or that this is the first frame. A larger number is one more than the number in the previous frame and implies that the symbol contains no NEWLINEs or unpaired IN tokens.
- a count of the number of stack frames since the last unmatched IN. As all states with a non-zero count will have zero for the number of indents, this value can be stored in the same field as the indent count, but with a different sign.
- A flag “expect_nl” indicating whether an indent or a “starts_line” state is closest to the top of the stack. It is true if “starts_line” is closest. If the state in a frame is “starts_line”, then the flag is “true” no matter whether there are indents. Otherwise the flag is true only if it is true in the previous frame, and there are no indents in this frame.
- When an IN token appears in the look-ahead, we increment the indent counter in the top stack frame, set the frames-since-start-of-line counter to zero and set the frames-since-last-IN counter to zero
- When an OUT token appears in the look-ahead we REDUCE if the current reduction uses no more symbols than the frames-since-unmatched-IN count. If the state cannot be reduced or has a longer reduction pending, then cancel the OUT against the top-most indent if the frames-since-last-IN counter is at most the minimum prefix of the current state, otherwise initiate error handling. This decrements an indent counter. If that becomes zero we need to increase the frames-since-last-IN counters in the upper regions of the stack.
- When a NEWLINE token appears in the look-ahead, if “expect_nl” in the top frame is false, then the NEWLINE is discarded.
- When a NEWLINE token appears and “expect_nl” is true, then if the current state can REDUCE and the reduction length is no more symbols than the frames-since-start-of-line count, and that count exceeds 1, then we REDUCE. Otherwise we SHIFT if we can, REDUCE if we can’t, and ERROR if we cannot even REDUCE. In any case we set the (new) topmost frame to have zero frames-since-start-of-line.
- When we reduce a series of frames,
- The number of indents in all the frames are summed and the result saved in the new frame
- If any frames-since-start-of-line value is zero, the new frame gets a zero, else it gets the value from the first frame
- If there is no indent, then the frames-since-unmatched-IN is set to one more than the previous frame.
Implications for grammar design.
The most important implication this has on grammar design is that any symbol that appears immediately before a NEWLINE (or before any symbol that starts with a NEWLINE) is considered be a line-like symbol. As such, NEWLINEs found while collecting the symbol will not be ignored, even if the symbol is on an indented line.
So:
Statement -> Variable = Expression NEWLINE
would cause “Expression” to be treated as line-like, so in
a = b + c + d
The NEWLINE after the first ‘+’ would not be ignored and would cause an error.
Consequently the NEWLINE should be placed after a Statement or similar, possibly like
Statement -> Variable = Expression StatementList -> StatementList NEWLINE Statement | StatementList ; Statement | Statement
This will expect NEWLINES wherever a StatementList is expected, but not after a subsequent indent.
Another effect to be careful of is the handling of optional NEWLINEs. A construct like
OptionalNewline -> NEWLINE |
will sometime work, but often not. In particular, if this occurs at the end of a section that can be indented, an OUT NEWLINE sequence could appear where OptionalNewline is expected. The OUT will likely force a reduction of “empty” to OptionalNewline, where the intent was to capture the following NEWLINE.
One way to handle this is to tie the NEWLINE to the following symbol. So if you wanted a newline before a ‘{‘ to be optional, then instead of “OptionalNewline {” you need
Open -> NEWLINE { | {
and then just use “Open”. In this case, the OUT cannot force a reduction. The reduction won’t happen until the ‘{‘ has been seen.
Error Handling
A couple of places in the preceding text I’ve mentioned error handling without any details. Some details are in previous posts but they are imperfect. In particular, where it is suggested that a symbol to SHIFT might be synthesized, that is unworkable. While the symbol itself could be synthesized automatically, the abstract-syntax-tree fragment for it cannot.
Full details on error handling will have to wait for another post. One observation I can make now though is that symbol synthesis can be probably be managed with productions like:
Expression -> ERROR
The reduction code with this can create some appropriate AST element. Exactly how this will be integrated remains to be seen.
Indents in action.
This article is being published months after I wrote most of it. This is largely because it took too long to get the code working. There were a few little bugs in my design and I found it hard to find the time to concentrate and find them and fix them. This is part of the cost (and benefit) of wanting to have working code for all my ideas…
The current parser generator implements almost all of these ideas and can demonstrate working code. It doesn’t current automatically hide recursive symbols, you have to do that step by hand. Accompanying this I have a trivial grammar for testing indents and newline handling. This will read in something that looks like code with non-line-like expressions and line-like statements and will print out the parsed structure.
If you collect git://ocean-lang.org/ocean
and “make ; make tests
” in “csrc
” you will see it run.