LR Parsing with line breaks – 2021 edition

Yes, here it is over a year and a half since the last time I wrote about this, and I’m still working on it. Or maybe working on it again. But I really think I’ve got a much better solution this time. It certainly seems to be easier to work with.

I don’t clearly remember the problems that I had with the previous design, but I do know that it seemed fragile and there was particularly some issue with not having a token that clearly closed a list of statements – like a ‘}’ in C or “end” in Pascal. In any case I now have a new approach which is much simpler

IN, OUT, and NEWLINE in the grammar

All of my previous efforts involved the parser having deep knowledge of how indents work, so that the grammar didn’t need to specify them at all. This worked to some extent, but often felt clumsy.

The reason for this was that indents can appear at lots of places in a text that cannot realistically all be specified in the grammar. Indents can be used to signal a continuation of one line onto the next, and for that purpose the indent can appear anywhere in the line – it clearly is not practical to specify all those in the grammar. But that doesn’t mean we cannot specify anything in the grammar.

I had already established the idea that NEWLINE characters could appear in the grammar, and that incoming NEWLINE tokens could be processed or ignored depending, in some why, on what was expected by the grammar. My clever new idea is that the same approach can be taken to indent tokens (IN and OUT). Partly, this required getting a clear understanding of the different uses of indents – some which the grammar handles explicitly, and some which the parser handles implicitly. In some sense all indents are used for continuation and maybe this blinded me at first. While this is true, there a still two very different sorts of continuation.

The first sort of continuation is where text that would normally be on a line can be continued by indenting, so

a = b + c + d + e

can be written on multiple lines by using indenting:

a = b + c +
    d + e

Here the indent indicates simple line continuation, the parser hides the NEWLINE, and the grammar doesn’t notice a difference.

The other sort of indenting is where the continuation is a block of lines that couldn’t all be on one line (without a lot more syntax, like braces or similar):

if a == b:
    c = d + e
    e = d - b

Here the indented text is a continuation of the “if” statement, but importantly the text is multiple distinct lines, and the indent must happen at a specific place: after the colon (“:”). This is the sort of indent where including IN and OUT tokens in the grammar makes sense, as there are specific places they have to go.

This leads to the new rule: If an IN token is seen in a parser state when IN can legally be shifted, then the IN gets shifted, otherwise the IN gets ignored. This means that an IN can never result in a REDUCE step happening, or in error recovery being triggered. Possibly it could cause a reduce if the state didn’t allow anything to be shifted. This might be useful but is currently not implemented.

With IN handled like this, it is fairly obvious how OUT must be handled: We keep track of all unmatched IN tokens and whether they were SHIFTed or ignored. When an OUT token is seen, it is ignored precisely when the corresponding IN was ignored. When an OUT is seen that should not be ignored, it is treated just like any other token – it might be SHIFTed if that is allowed, or it might trigger a REDUCE, or it might cause error recovery to be activated. This means that the OUT at the end of a block of statements acts as a closing token and ensures the block is properly terminated.

An obvious (I hope) follow-on question is: does this approach help with deciding when to ignore NEWLINE tokens. In previous attempts I marked states according to whether a NEWLINE might be expected after this state, and ignored NEWLINES when there was an indent more recent than any marked state. That might still work, but a much easier alternative presents itself. Inside indents that are ignored, NEWLINEs should be ignored too, because we just have a continuation of a single line – and a single line doesn’t contain NEWLINEs. Conversely inside indents which aren’t ignored, NEWLINEs should not be ignored because we expect to have a block if lines. So now the rule for NEWLINEs appears to exactly match the rule for OUT tokens: either must be ignored if the most recent unmatched IN was ignored, and should be treated like normal tokens if the most recent unmatched IN was not ignored. This is simple and elegant, and certainly needs less code that the original design.

With these rules in place, the analysis of the grammar needs no special handling of any tokens. There are no “special” states and no need to do anything beyond the regular analysis required for an LR grammar. All the special handling happens in the parser.

The parser too needs less state than I was using before. I was preciously counting the number of indents in each non-terminal and whether a non-terminal started with an indent, and even where physical lines started. Much of that is now irrelevant. All I need is a stack of bits (or Booleans) which tracks the current indent level and records which unmatched indents were ignored. As the scanner ignored indents deeper than 20, this can be managed with a single integer of at least 20 bits (though currently I use a counter as well).

Implications for the grammar

As I worked on converting my test grammar and my preliminary “Ocean” grammar to follow these new rules I found that the language could be improved and that there was room for an extra feature in the parser that would make the grammar even easier.

Firstly, I had thought that any indented block of statements should be signally be something that would help the reader (and the parser) know that such a block was expected. Following the practice in Python, I used a colon (“:”) to mark such things. But as I experimented I realised that sometimes the colon seemed redundant. For example, in Ocean a “while” statement can express the condition either as an expression or as a statement block which uses the “use” statement to report value. For the second case I had chosen a syntax like:

while:
    statement
    statement
    use value
do: 
    statements

i.e. if there was no expression before the colon, then what came after was a “use block”. Now that IN tokens are full members of the grammar, that seems clumsy. As there is an IN there to mark the expectation of a statement block, there is no need for a colon, so now I have

while
     statement
     statement
     use value
do
     statements

which I think I prefer.

While it is possible to drop the “:” in this case, we still need the colon in the more traditional while which has an expression:

while a < b + c:
    statements

If we dropped the colon here, this might look like a normal continuation of the expression onto the next line. To avoid this confusion, the IN token in the grammar must be preceded by something that is not a recursive non-terminal. In my current implementation, it needs to be preceded by a terminal symbol. I could enhance the parser to work with some non-recursive non-terminals before the IN, but I’m not yet convinced it is worth it.

Introduction of the EOL token.

As I worked on the grammar I found that some productions needed to be terminated by the end of a line, but that I didn’t want to actually consume a NEWLINE character there. This particularly applied to a statement block which contained a list of simple statement that could all be on one line. Such a list needs to end either with a semicolon, or at the end of the line.

Block -> : IN statementlist OUT
       | { IN statementlist OUT NEWLINE }
       | : simplestatements ;
       | : simplestatements NEWLINE

The trouble with this approach is that the Block sometimes ends with a NEWLINE and sometimes doesn’t. So if I have:

IfPart -> if Expression Block

Then I don’t know whether to require a NEWLINE after that or not.

This conundrum led to the invention of the EOL token. EOL is a synthetic token that is not produced by the scanner but is synthesized by the parser when needed. If the parser has a NEWLINE as the look-ahead token, and the current state does not accept NEWLINE but does accept EOL, then a new EOL token is synthesized, and SHIFTed onto the parse stack, advancing to the next state. The original NEWLINE token remains as the look-ahead token so it can either synthesize more EOL tokens, or be shifted when its time comes.

Still to do….

I’m glad that I’ve arrived at this fairly simple solution, but I feel that I’ve sunk way too much time into something that should have been a fairly simple problem. I want to move on, but there is some unfinished business. Rather than spending yet more time on that business, I’m just going to write about it here – maybe I’ll come back and look one day.

As noted, an IN token must come after a non-recursive token in the grammar, preferably after a terminal. This is a requirement that the grammar analyzer could easily check for. So one thing I want is for parsergen to check that IN tokens are used correctly. Maybe I should even check for non-recursive non-terminals and ensure that all symbols before an IN from the most recent non-nullable symbol are all non-recursive. I should probably also enhance the parser to support all non-recursive terminals instead of just non-terminals as is currently the case.

I’m also in two minds about the handling of blank lines. Currently I handle them in the grammar and it isn’t too much hassle. But I think I might be happier if blank lines were absorbed by the scanner so that the parser didn’t even see them. I did try this in some recent explorations and one sticking point was comments. While it is easy to strip out blank lines, it was a little more awkward to make a sequence of blank lines and comments all disappear. Maybe it just needs more determination.

But for now, I’m happy with my handling of indents and line breaks, and it is time to explore more semantics in my language – probably pointers (or should I call them references) first.

This entry was posted in Language Design. Bookmark the permalink.

Leave a Reply

Your email address will not be published.


*