LR parsing with line-breaks – yet again

It has been some years since I last wrote about about parsing with indents and line-breaks. Some of that time I’ve been distracted by other things, some of it has involved working on other areas of language design, but some of it has seen me struggling with line breaks again.

In my last note I had made significant progress over my original design, and there is much in that note that is still valuable. But some of it I got wrong too.

Most particularly, the idea that NEWLINES should be separators rather than terminators was mostly wrong. I now believe they must be mostly thought of as terminators, though there is one area where we must blur our definitions a little.

A related area that I’ve been struggling with is handling blank lines. Since NEWLINES are parsed tokens, we need to be prepared to accept and ignore those that we don’t really care about. As blank lines are usually uninteresting (in program code at least), I need explicit support in our grammar to allow them. What this support must look like has caused a lot of head-scratching, but I seem to be happy with my current approach. Part of the reason that this was a struggle was that at the same time I was trying to make sure there weren’t any newline related conflicts in the grammar. I thought these were somehow related to the multiple newlines issue and while this was probably the case, there was a whole others side to parsing with indents and newlines that I had forgotten to think about. So I’ll need to discuss blank lines and this other issue.

But first to NEWLINES as terminators.

As I hinted in my last note, treating newlines as separators can cause expressions – when they appear at the end of a line – to look line-like, as they are followed by a NEWLINE. I suggested that care should be taken in the grammar to not put the NEWLINE immediately after an expression, only after a whole line-like thing.
e.g.

statement -> IDENTIFIER = Expression
statementline -> statement NEWLINE

However that is not sufficient. The NEWLINE is still assessed as being “after” the Expression (it is in the look-ahead set) so the Expression needs to be treated as line-like, and that is no good as we don’t want line breaks in indented expressions to be syntactically meaningful.

So I went back to using NEWLINES as terminators, but with a simpler measure for “line-like” symbols. Any symbol that can possibly derive a sentential form (i.e. a code fragment) which contains a NEWLINE is line-like. It doesn’t matter where in the fragment the NEWLINE is; start, middle, or end. If there could be a NEWLINE, then it is line-like. This is easy to think about, easy to code, and largely works.  Putting a slightly different perspective on this, if a symbol can “contain” a newline, then newlines must be important when parsing it, so they mustn’t be ignore.  So if we are in any state between starting the parse of a NEWLINE-containing symbol and ending that parse, and we haven’t seen an indent, then we need to pay attention to newlines. This is exactly the consequence of marking the “before a NEWLINE containing symbol” state as “starts line”

I’ve had one problem with this approach and it comes from a syntactic form that I mentioned previously, and is probably the one that pushed me to think of the newline as a separator.

if condition: simple statement

This line contains two statements, an “if” statement and a simple statement. I slight extension could even contain three:

if cond1: if cond2: simple statement

This form might seem unnecessary, but I want to allow it for generality’s sake.

If every statement must be terminated by a NEWLINE (or a semicolon), then this must be followed by two (or three) NEWLINEs. I hope it is obvious that requiring a blank line after this construct would never do.

I experimented with various modifications to the grammar and approaches to NEWLINE handling and most of them failed, often because they got too complicated and I couldn’t really understand the implications of what I was trying to suggest.

My solution is to add a special marking to the production, much like the marking that can be used to indicate precedence of operators. If a production ends with $$NEWLINE then it is handled a bit differently.

Block -> { Statements }
       | : Statements $$NEWLINE
       | : SimpleStatements $$NEWLINE

This means that when a code block is introduced by a ‘:’, it must be REDUCED by a NEWLINE. That is, a NEWLINE must be the look-ahead token. The NEWLINE isn’t shifted, so it can serve to reduce several of these, depending on what indents are in play.

This only works, of course, if Statements and SimpleStatements don’t have to end with a NEWLINE. SimpleStatements doesn’t. It needs a NEWLINE to be part of Statements, but not otherwise. In many cases Statements will end with NEWLINE, but not always. Particularly if the last statement ends with a Block there won’t be a newline required at the end.

Initially I found that this is the only place in the grammar that I need to have $$NEWLINE. Everything else works fine with the other rules that I had developed. However once I realised the big issue I’d been missing, I found a couple more places it was needed.

The other issue I struggled with is parsing blank lines.  I don’t recall all the things I tried and why they didn’t work, but I do know that pattern that ended up working well.  Importantly, this works well for optional NEWLINEs.  It is easy to end up with places in the grammar which could be preceded by something that ends with a NEWLINE, or something that was marked $$NEWLINE.  In the latter case there must be a newline following, in the former there may be.  The pattern that I use works well for both of these cases.  When you have any non-terminal that can, and possibly should, appear one a line by itself, you provide an extra production for that non-terminal which is left-recursive and adds a NEWLINE at the end. So

Line -> Type1line
      | Type2line
      | Line NEWLINE

This approach ensures that each type of line can be followed by arbitrary NEWLINEs, and that if some type of line is only marked as $$NEWLINE – meaning it must be followed by a NEWLINE – then the parser will expect and accept a NEWLINE after it.

This covers all eventualities except for NEWLINEs at the start of a sequence of lines.  For that case, we can use a right-recursive rule to put NEWLINEs at the start.

LineBlock -> Lines
           | NEWLINE LineBlock
Lines -> Line
           | Lines Line

Keeping LineBlock separate from Lines is not essential for handling blank lines, but as mentioned in a previous post, left-recursive productions cannot be closed fully by the end of an indent, unless they can be reduced into some other non-terminal.  Here LineBlock provides that other non-terminal.

One problem with this pattern is that if there are a large number of blank lines at the start of a block, they will use up a lost of that parser stack – that is a property of any right-recursive form in a shift/reduce parser.  If this is at all a concern, it can easily be alleviated by introducing a left-recursive rule for consuming the blank lines:

Newlines -> NEWLINE
          | Newlines NEWLINE

LineBlock -> Lines
          | Newlines Lines

This will recognise the blank lines at the start of a LineBlock equally well, but will use only a fixed amount of the parser stack.

I have started using these two patterns consistently, and my problems with blank lines have gone away.

So now I’m just left with one remaining issue to discuss and I think I’ll be done with parsing newlines.  I need to explain and resolve my blind spot.  This blind spot relates to error detection.

When I started thinking about parsing with indents and line breaks, I was focused on correct programs and finding a way to parse such programs correctly – in part this meant understanding what “correct” means in this context.  So I identified the various rules for when an “OUT” token or “NEWLINE” token should force a a reduction – so that everything in the two-dimension space closed by that token would be fully parsed before we moved on.  There is a flip-side to this that I didn’t consider.  The flip-side of allowing these special tokens to reduce production, is requiring them too.  i.e. in some cases, any other token must be an error.

As an example case where this is important, imagine a language where the word “function” introduces the declaration of a function, and there is no specific terminal to mark the end of the function, it depends on indentation information.  So

function foo(args):
       statements
function bar(args):
       other statements

could be two functions.  In this language, nested functions are not allowed, so the word “function” can only appear at the outer level.  If the parser for such a language is given

function foo(args):
        statements
        function

it needs to see this as an error.  With the initial rules I formed for parsing, that wouldn’t happen.  When the parser sees “function”, that would reduce the statements, and then the function, and then start parsing a new function.

The solution to this is very nearly the solution I discussed at the start of this note, marking some productions as “$$NEWLINE”.  There are three different things that I’ve found that such a marking might need to mean:

  • This production must be reduced by a NEWLINE
  • This production must be reduced by an OUT (end of indent)
  • When reducing this production, we must be at the start of a line, with no pending indents since the start of the line.

It is possible that each of these mean close enough to the same thing.  In my current implementation I have two marks, “$$NEWLINE” and “$$OUT” which both have the same effect, that being to require any one of those three conditions to be true when the production is reduced.

The important part of this last part of the discussion is how I now think about these markings.  They are explicit statements on how as two-dimensional region of text must be closed.  Thinking about it this way lead me to realize that I had placed the markers at the wrong place before.  To support a syntax like

if condition:
      statements

I had defined a Block as

Block -> : statementlist $$NEWLINE

I now think this is wrong.  It requires the “: statementlist” to be terminated, but it is really just the statementlist that needs to be terminated.  So now I want

Block -> : statementblock
statementblock -> statementlist $$OUT

One practical consequence of the difference is that in the new grammar I can indent a following else:

if condition:
       statements
   else:
       statements

In this, the “: statements” is not yet terminated, either by an OUT or a NEWLINE, when the “else” is seen.  But the “statementblock” is fully terminated, which is what I want.

The other place that I found to put “$$NEWLINE” is at the end of any “complex” statement such as if/then/else or while/do.  It is likely that such statements will end with a “Block” (as defined above) and so they don’t have an explicit NEWLINE at the end, though they may still end with a NEWLINE in practice.  By marking it as “$$NEWLINE” I can be sure that it there is no way that another statement can appear in the same 2-dimensional space.  The next thing, whatever it is, must at least be on a new line.

One thing I’m still not completely resolved on if whether it is sufficient to have just one marking (Current spelled either “$$NEWLINE” or “$$OUT”) or whether I need two, or even three, different markings to handle distinct cases.  It seems that just one is enough, but I need a lot more experience working with the grammar to gain any sort of confidence.  For now, I only know that this approach seems to be working more smoothly than what I had before.

This entry was posted in Language Design. Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *


one + = 9


*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>