My previous note introduced the problem of parsing a two-dimensional language and the need to pay attention to indenting and line breaks and use them to guide the parsing of the language, both to enhance error detection, and to resolve ambiguity. The key intuitive insight which guides this investigation is that indents imply continuation while line breaks terminate things – sometimes.
That first note only looked in detail at indents. It provided rules by which a reduction in indent level can force the end of a syntactic unit by triggering a REDUCE operation in the LR parser. This note completes the solution by exploring and describing how line breaks can be interpreted when parsing a sentence in a two-dimensional language.
To start getting a feel for how line breaks work it will be useful to have an example to work with
1: if condition: 2: a = b + 3: c + 4: d 5: print a 6: else: 7: print "message" 8: return
Here the numbers at the start of the lines are just for our reference and not part of the program fragment.
The handling of indents described in the previous note will ensure that, for example, the expression that starts on line 2 with “b +” will be terminated before line 5, as line 5 has a lesser indent. However the indents by themselves do not tell us that the assignment starting on line 2 is a separate statement from that on line 5. To determine that, we need to process the line breaks.
Some of the line breaks have little meaning. Those at the end of lines 2 and 3 could be moved back or forward over at least one token with no important effect. Also the line breaks at the end of lines 1 and 6 are only really important in that they introduce an indent. The fact of the line break is otherwise uninteresting.
On the other hand the line break at the end of line 7 closes the “print” statement on that line, as well as the “if” statement which started on line 1. An alternate reading is that it closes both the “print” statement and the “else” clause that started on line 6. With the else clause closed, the “if” statement must be closed.
So it appears that sometimes we need to discard line breaks and other times we need to multiply them. How can this work?
The apparent need to multiply line breaks can be handled by moving some line breaks that are otherwise uninteresting. As noted the line breaks that immediately precede an increase in indent are uninteresting beyond that indent. So we can move them to after the matching decrease in indent.
To make this a little clearer, consider the following which has the same two dimensional structure as the previous example but with simplified content.
A B C D E F G H
When we tokenize this we create “IN” and “OUT” tokens to record the indents, and add a “NL” token (newline) to record the line breaks. Any NL that would be before an IN is moved to after the matching OUT. This reflects the fact that the indented content is in some sense a continuation of the starting line.
So the tokens produced by that above are:
A IN B IN C NL D NL OUT NL E NL OUT NL F IN G NL OUT NL H NL
So there are 8 “NL” tokens as you would expect for 8 lines, and each OUT token is both preceded and followed by an “NL”, allowing both the inner and outer structures to possibly be terminated.
With this parsing of newlines there is no longer any desire to multiply newlines, only to sometimes ignore them.
Taking a new example – similar to the earlier one but with two distinct halves:
1: if condition: 2: print message1 3: print message2 4: a = b + 5: c + 6: d
Here lines 1-3 have the same two dimensional structure as lines 4-6. However the newline token at the end of line 2 must be meaningful while the newline at the end of line 5 should be ignored.
This distinction tells us that we cannot handle newline tokens purely automatically like we do with IN and OUT tokens. Rather the grammar for the language must tell us where newlines are expected, such as at the end of a statement.
So when the parser sees a NL on input it has a number of options.
- It can SHIFT the NL. This is only legal if the current state has a ‘go to’ entry for NL. If that is the case, then SHIFTing is most likely a sensible option, though possibly we should wait until a later NL after an OUT.
- It can REDUCE the production which is reducible at this state, if there is one. This only makes sense if this could lead to a state where NL can be shifted.
- It can discard the newline. This should happen for example at the end of line 5 in the example above. Or
- It can trigger error handling. This would be appropriate at the end of line 4 above if line 5 were not indented.
The important choice here is whether to DISCARD the NL or not. If we choose not to, then choosing between SHIFT, REDUCE, or ERROR is done in the same way we would for any other token.
This choice of discarding newline tokens must be closely related to the level of indent. The intuition is that indents lead to continuation. A block of text that is indented is a continuation of the line that preceded the indented block. Sometimes we ignore line breaks in indented text, other times we don’t. We need to determine how to tell the difference.
Falling back on my intuition again, I think that there are line-like things and word-like things. These may also be referred to as vertical or horizontal objects, to borrow some terminology from TeX. Statements and things composed of them are line-like. Expressions and things they are comprised of are word-like.
My intuition suggests that we would expect a line break to terminate a line-like thing, but not a word-like thing, unless in doing so it also terminates a line-like thing. In the example above, lines 2 and 3 are line-like things, so we expect the line break to terminate them. Lines 5 and 6 aren’t. Lines 4,5,6 together form a line-like thing and so the new line that was moved to after the OUT indent reduction will terminate that whole line-like thing.
Returning to the first example, the assignment starting on line 2 is separate from the statement on line 5 because statements are line-like and there is a linebreak separating them.
Were we using a recursive descent parser these ideas would seem quite easy to implement. We would first deduce which non-terminals were line-like as being precisely those that could ever expand to contain a newline token. Then, as a recursive descent parser always knows what non-terminal it is trying to recognise, it could ignore newlines precisely if there had been an indent since the deepest line-like non-terminal started.
As we are using an LR grammar, and so don’t always know what non-terminal we are trying to recognise, we need to take a slightly different approach. With LR, we do always know what symbol preceded a particular state. This is because each item in the core (i.e. uncompleted) itemset for a state always has the same symbol before “dot”. This is exactly the symbol which leads to the state through the “go to” table.
If we know what the previous symbol is at any state, and if we know which symbols can end in a newline token, then we know which states mark the start of a line-like thing. A symbol which can end with a newline may not always end with a newline as a statement could, for example, end with either newline or a semicolon. However any location where a newline could appear can reasonably be seen as a boundary between two line-like things.
So when we see a newline token in the parser’s look-ahead buffer, we can examine the current state stack looking at states which start line-like things and looking at internal indents. If the line-starting state closest to the top is above the top most internal indent, then we cannot ignore newline tokens: there is no indent to protect them. If, on the other hand, an internal indent is closer to the top of the stack than a line-starting state, then we are currently in an indented continuation of a line-like thing and so newlines should be ignored.
Rather than searching the stack every time we find a newline, we can track whether newlines should be ignored or not.
- Whenever we shift a state which is line-starting, the new stack frame is flagged as permitting newlines. When other states are shifted, we inherit the “newlines_permitted” flag from the previous stack frame.
- When we reduce some states then if the net indent in all the reduced frames in positive, we record the new top-of-stack frame as ignoring newlines (i.e. newlines not permitted).
- When we process an OUT token by decreasing the indent count on the top-of-stack frame, we must update the “newlines_permitted” flag depending on the state and remaining indent level. This may require copying the flag from the previous level of the stack.
Once we track the “newlines_permitted” state this way it becomes trivial to handle newline tokens. If one is seen when “newlines_permitted” on the top of the stack is false, we simply discard the newline. If “newlines_permitted” is true, we process the newline just like any other token.
Updating the grammar processing to track which symbols can end with a newline token, and which states can immediately follow such a token is fairly straight forward. The “ends with newline” flag can be computed when the “nullable” flag is computed with a similar repetitive processes. If the body of any production contains an “ends with newline” symbol followed only by nullable symbols, then the head of that production is flagged as “ends with newline”. Tagging the affected states is even more straight forward: if the symbol which lead to a new state “ends with newline”, the new state can start a line. If not and the symbol is nullable, the new state starts a line if the previous state did. Otherwise the new state does not start a line.
My parser generator, modified to handle line breaks as described here, can be found in `git://neil.brown.name/ocean/` or a this version.
One last issues seems worth mentioned before we depart this topic. That related to what should happen if a line-like object does not start at the beginning of a line. For example:
a = b ; x = y + z + q
This would be perfectly legal with the parser as currently described, but looks wrong and is possibly not what is intended. It seems hard to justify raising an error or changing the obvious parse when this happens, but it is almost certainly cause for a warning, and probably a very stern warning at that.
We would need to track which symbols started at the real start of line (i.e. after NL or IN). If an indent and the previous “starts line” state was not at the start of a physical line, then a warning should be raised. I haven’t implemented this yet, but I probably will.