Parsing a two-dimensional language.

Many programming languages are essentially one dimensional. The parser treats them simply as a linear sequence of tokens. A program could all be written on a single line, or with each token on a separate line and the parser or compiler wouldn’t notice the difference This set of languages includes Algol, Pascal, C, Rust and many others.

Some languages are 2-dimensional in a bad way. FORTRAN is probably the best example, though BASIC is similar. These (at least in their early forms) had strict requirements as to what can go on one line and what needs to go on a separate line.

A few languages are exploring the middle ground. Go will treat Newlines like semi-colons in certain cases which can result in a 2-dimensional feel, but brings with it some rather ad-hoc rules. Python probably makes the best attempt at 2-dimensional parsing of the languages that I have looked at. It allows newlines to terminate statements and also uses indents to indicate the grouping of some language elements, particularly statement groups.

While I find a lot to like in Python, it seems imperfect. I particularly dislike using a backslash to indicate line continuation. You can avoid this in Python by putting brackets around things as a newline inside brackets is ignored. But this feels like a weakness to me.

As I wrote in a recent article for lwn.net:

The recognition of a line-break as being distinct from other kinds of white space seems to be a clear recognition that the two dimensional appearance of the code has relevance for parsing it. It is therefore a little surprising that we don’t see the line indent playing a bigger role in interpretation of code.

This note is the first part of a report on my attempt to translate my intuition about parsing the two dimensional layout into some clear rules and concrete code. This note deals with indents. A subsequent note will look at non-indenting line breaks.

Indenting means continuation, newline means completion.

The two elements that give a section of code a two dimensional appearance are line breaks and indents. Without the line break there wouldn’t be any extra dimension at all. Without indents the code would just look like a block of text and we would be in much the same situation as FORTRAN and BASIC.

Each of these has a fairly well understood meaning. A line break often marks the end of something. This isn’t always true, and exactly what the “something” is might be open to interpretation, but this meaning is fairly well understood and line breaks rarely mean anything else.

An indent means some sort of continuation. The line that is indented with respect to a previous line is in some sense part of that previous line.

To help visualise this, consider:

A B C
   D E
   F
G H

Here “D E” and “F” are clearly part of the thing which starts “A B C”. “G H” may or may not be, much as “F” may or may not be part of “D E”. However if “D E” or even “B C” started something, then it definitely finished by “F”. “G” might be part of something that started a “A”, but if some subordinate thing started at “B” or later, it does not continue into “G”.

As a first step to formalising this we need to be able to encode this two dimensional structure into the stream of tokens which the parser can then process.

Generating a “NEWLINE” token at the end of each line, and maybe a “SPACE” token for each space at the start of a line could capture it but not in a very helpful way. It is not really the presence of an indent that is important but rather the change in indent. So between “C” and “D” we need to know that the indent level increased, while between “F” and “G” it decreased.

Python uses the terms “INDENT” and “DEDENT”. The first is technically inaccurate as every line is indented, not just the first. The second is a neologism presumably derived by analogy with “increase” and “decrease” or similar. Other options might be “UNDENT” or “OUTDENT”. I’m not sure I really like any of these. However “in/out” seems the best fit for what is happening so I’ll just call the tokens which report indent changes “IN” and “OUT”.

Using these two tokens, and ignoring the non-indenting line breaks for now, the above structure can be tokenized as:

A B C IN D E F OUT G H

This is very simple as there is only one indented block. A more complex example is:

A B C
  D E
   F G
 H
I

This would be tokenized as:

A B C IN D E IN F G OUT OUT IN H OUT I

Note the sequence “OUT OUT IN” – we close two indented sections and open another one all at one line break. This is because “H” has a lesser indent than “D”, but a greater indent than “A”.

This tokenization captures the structure very effectively and is what I will be using. Unsurprisingly, these are exactly the tokens that can be generated by the scanner I previously described though the token names have been changed.

What to do with indents?

Now that we have these IN and OUT tokens we need to understand what to do with them. Their significant value is that they are very strong syntactic markers. IN and OUT are strongly paired: it is very obvious which OUT matches which IN. This contrasts with bracketing markers like “{ }” or BEGIN and END or DO and OD. The latter can be accidentally deleted or forgotten. If OUT is forgotten, that omission will be obvious on every line.

I’ve recently be writing C code using the “text” mode of my editor (emacs) rather than the C mode. This is because I’ve been experimenting with literate programming and writing the code inside `markdown` markup.

I have really noticed that lack of help with auto-indenting. Each line is auto-indented to the same level as the previous line, which is certainly useful But typing `}` doesn’t automatically drop back one level of indent. So sometimes I get it wrong.

When I do get it wrong, the compiler complains (as it should) but it is usually a cryptic complaint that is way too late. I would much rather it know there is a problem as soon as the indenting looks wrong. So error detection will be the first focus for exploring the use of indents.

To discuss how this works you really need a working knowledge of grammars and parsing. If you don’t have that yet, let me suggest my earlier note of LR grammars. If you don’t have time for that now, then press on anyway, it shouldn’t be too hard.

Unlike other tokens used in an LR grammar, IN and OUT cannot simply appear in the grammar where ever they are expected. This is because there are really too many places that an IN can appear. An OUT must be at the end of something, but that matching IN can often be anywhere in the middle. So the parser will need to have direct knowledge of IN and OUT tokens. It will not “SHIFT” them onto the stack as it does with other tokens. It has to handle them directly.

As an OUT enforces the end of something the natural thing to do when an OUT token is seen in the look-ahead is to perform REDUCE actions until the OUT is balanced by some IN that was previously seen. This is clearly a very imprecise rule. We need to explore some examples before we can make the rule suitably precise.

We can start looking at examples supposing a grammar containing:

Statementlist -> Statementlist Statement
               |
Statement -> if Condition : Statementlist
           | print String

A program fragment like:

if a == b :
   print "Hello"
   print "World"
print "End"

would then result in the following LR stack just as the OUT token appears in the look ahead:

if Condition : Statementlist print String   [OUT]

The IN appeared just before the Statementlist and the only possible reduction is to reduce”Statement -> print String” so we would do that which results in

if Condition  : Statementlist Statement    [OUT]

Again we can reduce and still stay after the indent so now we reduce “Statementlist -> Statementlist Statement” and get to:

if Condition : Statementlist   [OUT]

Now the symbol on the top of the stack is where the indent started, so we need to stop reducing. It might seem appropriate to reduce this fully to a “Statement” but we cannot do that based on just the indent. Consider a different language with a fragment like:

if ( a == b ) {
    print "Hello";
    print "World";
}

In this case it is clear that the OUT itself is not enough to close the whole statement. Resolving this will have to wait until we explore how line-structure works.

So the approach for handling IN and OUT seems to be that when we see an IN, we mark the next symbol SHIFTed as being indented. Then when we see OUT we stop SHIFTing and REDUCE until the symbol on the stop of the stack is marked as indented. We can then clear that marking, discard the OUT, and continue.

This is close to a working solution but there are a few complications that need to be resolved first. One relates directly to the fact that “Statementlist” is recursive.

Statementlist -> Statementlist Statement

This means that reducing until we just have “Statementlist” on the top of the stack doesn’t stop more things being added to that Statementlist. The idea was that the OUT should complete close the Statementlist but because of its recursive nature that isn’t really possible.

This is easily fixed by introducing a second “dummy” nonterminal.

Statementlist -> _Statementlist
_Statementlist -> _Statementlist Statement
                |

This set of productions will parse exactly the same language but will introduce an extra state. If we see an OUT token we can reduce back to a “_Statementlist” as we did before, and then go one more step and reduce to the new “Statementlist”. Once there we really have closed the list. Another statement cannot extend it like it could in the simpler grammar.

These extra “dummy” non terminals could easily be added by the parser generator itself. However they solve one problem by providing another. Previously we could simply REDUCE until the top of stack carries an IN, and then cancel the IN. Now we might need to REDUCE one more step – the “Statement -> _Statement” production above. How can we know when to stop?

The answer to this requires that we store one more bit (literally) of information in the state stack. We currently have a count of the number of indents which appear “within” (the full expansion of) the symbol in the stack entry. By that we mean the number of terminals which were reduced into this symbol and were immediately preceded by an IN. To that we must add a flag to record whether the first terminal was preceded by an IN. i.e. is the whole symbol indented or just part of it.

Now the rule will be that if we have an OUT token in the lookahead buffer and the top-of-stack token contains an IN that was not at the start then we can simply cancel the IN and discard the OUT. Further, if the top-of-stack contains only one IN and it was at the start then we need to look at the production that we can reduce. If that production contains more than one symbol in its body then again we can cancel the IN and discard the OUT. In this case, that IN would be not at the start of the symbol we would reduce to, so it is somewhat similar to the first case.

UPDATE—-

The reasoning in the above paragraph is flawed. It talks about looking at “the production that we can reduce”. That isn’t well defined (there may not be anything to reduce at some states) and it isn’t really the right thing to be talking about.

The important production is the one that will ultimately be reduce to consume the symbol on the top of the stack. If that production starts earlier than the top-of-stack symbol, then it is perfectly OK to cancel the IN with the OUT in the look-ahead as the top-of-stack symbol is exactly the text that was indented. If the production consists of precisely the top-of-stack symbol (i.e. the body of the production only has one symbol), then now is clearly the time to reduce that production, and we need to ask this question again. If the production which will ultimately consume the top-of-stack symbol starts with that symbol but needs other symbols too, then we have a problem. Those other symbols will be at a lower indent level than the start of the top-of-stack symbol, and that isn’t allowed. In that case we need to raise an error and find a way to reduce that production early.

A serious problem with the above is that we don’t necessarily know what production that ultimate one will be. LR shift/reduce parsing doesn’t identify a production until it is time to reduce it. So we need a different approach. If we ensure that all recursive non-terminals have a dummy parent non-terminal as described above, that approach is fairly easy, but does require extra information in each state. We need to know which states have any item with ‘DOT’ after the first symbol. If we find ourselves in such a state and it cannot reduce that first symbol, then we know that there is a real possibility of the indent being interpreted wrongly and need to raise an error.

END-OF-UPDATE

Another way to look at this is to say that when we see an OUT token, we reduce until the next reduction would contain an internal (i.e. not at the start) IN. Once we find this internal IN we cancel it and discard the OUT.

Indenting affects error handling in two ways. Firstly we need to know what happens if an OUT implies a reduction is needed, but the current state does not allow a reduction. Secondly we need to know how to handle IN and OUT tokens if they are found while discarding input tokens during normal error recovery.

In the first case, the fact that we want to reduce suggests that we must have part of the right hand side of a production but not all of it. It would be nice to keep what we have, so to complete it we need to synthesize the extra tokens needed to make a complete production. If there is no reducible production at this state then there must be symbols that can be shifted. So we can synthesise one of these – the one that gets us as close to the end of a production as possible. This may result in an ‘unknown variable’ or similar which would need to be carefully suppressed, but would allow any valid units with the production so far to be properly analysed.

UPDATE: Synthesizing a symbol might sound easy, but it isn’t. As well as the symbol we need to synthesis the data (typically part of an Abstract Syntax Tree) that needs to accompany it. Synthesizing that automatically is not possible. A possible approach is to use productions like “SomeSymbol -> ERROR” to synthesize symbols. The reduction code with that production can synthesize the necessary AST data. This may introduce other complications. END-UPDATE.

For the second case we can simply count the INs and matching OUTs until we find a SHIFT-able symbol or find more OUTs than INs. If we find a SHIFT-able symbol, we set the IN count for that symbol to the net number of INs. If we find an extra OUT it means that we must have come to the end of whatever high level non-terminal was in error Then we just leave the ERROR token on the stack without shifting anything else and move on to normal handling for an OUT token. This will either reduce whatever is on the stack, or will synthesize some token to allow parsing to continue.

It should be noted here that I haven’t experimented a lot with this error handling process. It seems to make sense, but often things which seem to make sense actually don’t. So this might need to be revised.

This then is the complete description for how to handle IN and OUT tokens. In summary:

Each frame in the LR state stack must contain a count of the number of IN tokens it contains, and a flag to record if one of these is at the start.
When we see an IN token we record the fact and discard it. When we SHIFT in a new terminal we check if we just saw an IN. If so the new entry is the stack records that one IN is present, and that it is at the start.
When we reduce a series of symbols to a single non-terminal we add up the number of included IN tokens in each and use that as the count for the new token. The “one at the start” flag from the first symbol is inherited by the new symbol.
When we see an OUT token we check if the top-of-stack contains a “not at the start” IN token, or if it contains an “at the start” IN but the next reduction would make it “not at the start”. If either of these are true we decrement the IN counter and discard the OUT. Otherwise we REDUCE a production if we can, and synthesize some suitable symbol to SHIFT if we cannot.
When we discard stack states during error handling, we count the number of indents and add them to the “ERROR” state that we ultimate SHIFT.
When we discard look-ahead tokens during error handling we keep count of any IN and OUT tokens. If we see an OUT token when the count is zero, we assume a sane state has been found and return to where step 4 above was handled. Otherwise the IN count of the token which ultimately resolves the error is set to the balance of INs and OUTs.

The net result of this is that an OUT always closes anything that was started at or after the matching IN, but nothing that was started before the matching IN.

This adjunct to LR parsing can be used in two different but related ways. Firstly as already noted it can improve error reporting. If a ‘closing’ token like “}” is missing then that will be noticed immediately The syntactic unit that it is meant to close (e.g. Statementlist) will already have been reduced by an OUT and if the “}” doesn’t follow a regular LR parsing error will occur.

Secondly, and more interestingly for language design, an ambiguous grammar can be cleanly resolved by treating indentation as meaningful.

The classic “dangling else” case can be easily resolved by the indentation of the ‘else’.

if (COND1)
    if (COND2)
        StatementA;
else
    StatementB;

is no longer ambiguous. The “else” clearly relates to the first “if”, not the second.

Ocean will make use of this to have a syntax for “blocks” of statements which does not require a closing token. They will be closed by a reduction in indenting. As explicit closing can sometimes be preferred, Ocean will also allow C-style blocks. Possibly that should be “go-style” blocks as a block must not be a single statement.

The syntax I’m experimenting with at the moment includes:

Block -> : Statementlist
       | { Statementlist }
Statement -> if Expression Block
           | if Expression Block ElsePart
           | ...
ElsePart -> else Block
          | else if Expression Block
          | else if Expression Block ElsePart

This allows structured statements that look like Python or like Go, and will use indentation to resolve ambiguities in the Python style, and report errors in the Go style.

My intention is that several other structures will allow either “{}” or “:..” formats. “struct” type declarations is an obvious one. Others might occur as design proceeds. However before we can proceed with those details we need to finish the two-dimensional parsing which is only half done. This note dealt with indents, the next note deals with end-of-line marker.

Meanwhile code that manages indents as described here can be found in `git://neil.brown.name/ocean` or this version of the parser generator.

Indenting means continuation, newline means completion.

What to do with indents?

Leave a Reply Cancel reply

Recent Posts

Recent Comments

Archives

Categories

Meta

OpenPhoenux