Structured statements and expressions in Ocean

Now that I have my general parsing worked out and understand how I want to use indents and line breaks to give a two dimensional structure to my language, I need to think about the details of some of the elements of the language.  The part of a language that we use the most is the part which does stuff: statements and expressions.  So that is where I will start, particularly with statements.

As has previously been hinted, blocks of statements will have  two different syntaxes that the programmer can choose between, one Python-like and one C-like.

Block -> : StatementList
       | { StatementList }

This is a simplified version of the grammar, but the idea is that a StatementList can follow a colon, or be enclosed in braces.  In the first case the StatementList must be terminated by a decrease in indent.  i.e. all of the StatementList must be on the same line as the colon, or indented with respect to the line containing the colon.  When we find a line with the same indent as the line with the colon, we know the StatementList is finished.
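To give a rough picture of the two options (the statements inside are purely illustrative, as the details of expressions and assignments come later), the same conditional might be written either as

if a > b:
    biggest = a
    print "a is bigger"

or as

if a > b {
    biggest = a
    print "a is bigger"
}

In the first form the decrease in indent ends the block; in the second the closing brace does.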

In C, any expression can be used as a statement simply by appending a semicolon, and various things that you might think of as statements – such as assignment – are syntactically expressions.  In GNU C (i.e. C as implemented by GCC) you can even convert arbitrary statements to expressions by enclosing them in “({” and “})”.  This synergy between statements and expressions can be very powerful, but can also become ugly and confusing.  It works well for small statements placed where expressions are expected, but quickly breaks down as statements grow.

“Rust” takes this synergy to its logical conclusion and every statement is an expression – there is no difference between statements and expressions.  So you can write things like

a = if some_test {
        some_value
    } else {
        some_other_value
    };

I personally find this becomes clumsy.  Also as I wrote for lwn.net, it creates some oddities in how semicolons are handled.

“Go” deviates from C in the other direction and allows a much more limited interplay between statements and expressions.  Assignments and increments become true statements, but in a few places where you would expect an expression, a simple statement is allowed as well.  So (Go spells this loop “for”, but writing it as a while):

while line, ok = readline(file); ok {
         do something with line;
}

allows a simple statement before the condition in a while. This is certainly useful, but seems a bit artificial.  It solves many cases, but isn’t really a general solution to the problem.

I like allowing general statements in the head of a while and similar places, but I don’t want to make them look like expressions – I still want a clear distinction.

So a while statement will have two forms

WhileStatement -> while Expression Block
                | while Block do Block

The first form is the one we are familiar with, though it is more like “Go” than “C” as the expression does not need to be in parentheses.  The second form allows for more complex calculation of the loop test.

Naturally the first Block in the second form must return a value somehow to guide the progress of the loop.  This is somewhat like the “return” statement in “C” which can return a value from any point in a statement block.  Using “return” in this “while” Block would be confusing as the normal expectation would suggest the meaning was to leave the containing function.  Instead I’m planning to use “use” as a keyword.  That is certainly open to change after experimentation, but it is close to what I want.  So:

while:
     line = readline(file)
     use not line.matches("END")
do:
     print line

is a contrived example which shows how the “use” keyword might be used.

A noteworthy aspect of this example is that the “condition” part is larger than the “body” part.  One could go even further and imagine a body part which was simply “pass”.  This would result in a loop much like C’s “do {} while()” loop.  It may not be quite as good, as the terminating condition doesn’t syntactically stand out, but careful placing of the “use” statement can make it stand out well enough.
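For example, using the provisional syntax and some invented helpers, a loop which must do some work before it can test anything, in the style of C’s “do {} while()”, could put everything in the condition block and make the body just “pass”:

while:
     n = random()
     total = total + n
     use total < limit
do:
     pass

Here the test is the final “use” in the condition block, so it can be made to stand out reasonably well even without special syntax.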

This generalises to having largish “condition” and “body” parts, thus allowing loops which exit in the middle and answering Dijkstra’s “n and a half loop” problem, as discussed particularly by Knuth in his Structured Programming with go to Statements.

In that article, Knuth also talks about “event indicators” which he credits to C. T. Zahn.  These are an attractive idea and fit quite well if we allow these expression statements in a switch construct.

switch:
      some code which can "use" multiple different values
case value1:
      code
case value2:
      code

Naturally the type of each of the values listed with “case” clauses must be the same as the types of values given in “use” statements.  For “event indicators” Knuth suggests that the list of allowed events might be given at the top of the statement.  This might make life easier for the compiler, but seems unnecessary.  We could allow that if all of the “use” statements give currently undefined words, and if all of those words appear after a “case” tag, then the words are taken to be values in a new anonymous enumerated type.

These enumerated values would behave a lot like “goto” labels, in that they must appear once as a target and must be used, but otherwise do not need to be declared.  Indeed, there is a lot of similarity between these “event indicators” and goto targets, and I would not consider adding a general “goto” until experimentation with the event indicators showed they weren’t enough.
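To make this concrete, here is a sketch with invented names (“table”, “contains” and “insert” are not part of any design yet):

switch:
      if table.contains(target):
            use found
      else:
            use missing
case found:
      print "already present"
case missing:
      table.insert(target)

Neither “found” nor “missing” is declared anywhere; each appears once in a “use” statement and once after a “case”, so together they would form a new anonymous enumerated type.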

In Knuth’s paper, the  event indicators can be used in loops as well as non-looping statements.  In “C”, a switch doesn’t loop,  so we clearly haven’t attained complete functionality yet.  Without detailing all the steps, all these above thoughts lead to the idea of a general structured statement with several components:

for  BLOCK or simple statement
[if | while | switch] BLOCK or condition
then BLOCK
do BLOCK
case VALUE BLOCK
else BLOCK
  • “for” provides preliminary code which is only run once.  It may declare variables which remain for the entire statement, but it contains no “use” statements.
  • “if”, “switch” and “while” are much as you would expect and if they have BLOCKs, those BLOCKs contain “use” statements, the values of which guide the execution of the other components.
  • “do” is only present with “while” and just runs code providing the condition doesn’t fail.
  • “then” and “else” are run if the condition or BLOCK returns “True” or “False” respectively.  As such, they are somewhat equivalent to “case True” and “case False”.
  • the “case” stanzas allow the condition to return a variety of values.  If none of the “case” values match, then the “else” clause is used if present.

If the values returned by the condition in a “while” are not Boolean, then it isn’t immediately clear which values cause the “do” block to run and the statement to loop.  My current thinking is that if the condition does not return a value at all, then that is equivalent to “True”, in that the “do” block is run and the statement loops.  If the condition returns any value other than “True”, then the loop aborts.
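So, as a contrived sketch with invented details, a condition block which “use”s no value keeps the loop going, while “use False” stops it:

while:
     n = readnumber(file)
     if n < 0:
          use False
do:
     sum = sum + n

Each non-negative number falls out of the bottom of the condition block without any “use”, which counts as “True”, so the “do” part runs and the loop continues; a negative number ends the loop.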

If a (“while“) loop starts with a “for” clause, then there can also be a “then” clause after the “for” and before the “while“.  Despite its positioning it is run after the “do” part if the condition didn’t fail.  This allows an effect similar to the C for loop:

for a = 0
then a += 1
while a < 10
do sum += a

There are aspects of this that feel clumsy, though I suspect they can be resolved.

  1. Requiring the ‘then’ keyword isn’t good. If the condition is a block then it makes sense, but if it is just a condition, then it isn’t wanted. So:
    if condition:
        statement
    --or--
    if:
        use condition
    then:
        statement
  2. A large switch statement has no obvious end.  If a new line at the base indent level starts ‘case’ or ‘else’ then the switch statement continues, otherwise it is terminated.  This fact has some precedent in that languages with an “elseif” keyword can have a list of clauses each beginning with “elseif” which is only really terminated by a line that starts with something else.  That doesn’t seem to cause confusion, so maybe “case” with no termination won’t either.
  3. The word “then” isn’t quite perfect for the “increment” part of a “for” loop.  It isn’t too bad as one could say “while this do that and then the other” so “then” suggests something to be done a bit later.  It might just take some getting used to.
  4. I’m not even sure about “use“.  I like the concept.  I don’t much like the word.  I really don’t want a bare expression where you might expect a statement, as that could be syntactically confusing.  I wonder if “select” would work, as it selects the case to exit through.

At this stage I am not planning on including “break” or “continue“.  “break” can sometimes be achieved with “use false“.  In a few contexts, “use true” will do the same as “continue“.  There might be times where these directives are still wanted in the “body” of a loop, but I hope not so I will wait until I find some convincing evidence.

This certainly isn’t the final word on loops.  At the very least I expect to add a “foreach” loop to work with some sort of “iterator” type.  However I cannot really explore that until I have a type system.

For now, I have  put together a simple interpreter which allows some experimentation with the loops and conditionals discussed here.  Now I need to think of some interesting loops and selection code that I can try it out with.

The interpreter is in my ocean git tree (git://ocean-lang.org/ocean), or you can read the literate source code.


A case for multiple completed items in LR parser states.

In my earlier note about LR parsing I observed that  many simple grammars will only ever have at most one REDUCE action in any given state. This means that there is no need for an “action table” to list which of several productions to reduce based on different look-ahead symbols.  I even went so far as to say:

I personally cannot see why you would ever want a grammar which had two completed items in the same state.  It means that some sequence of input tokens could be treated as one thing or as another depending only on what comes next.  That sounds like a design mistake to me.  Maybe I’ll eat my words later, but for now this means I cannot find a useful example – sorry.

I have since found some examples which shed valuable light on this issue and give me a chance to see how my words taste.

In the first case, the “sequence of input tokens” that could have varying interpretations is empty. It hadn’t even occurred to me to consider such a case, but empty sequences can be important.  They can be particularly important for optional elements.  In this case, optional prefixes.

Consider the grammar:

Number -> Sign Digits
Sign -> -
      | +
      |

This says that a “Sign” is a “+” or a “-” or nothing, and that a “Number” is a “Sign” followed by “Digits” (whatever they are).  In this case, if the next symbol is whatever “Digits” starts with, then we need to REDUCE the empty sequence to “Sign” before we can SHIFT that symbol.  By itself, this doesn’t cause a problem.  If we see a “+” or “-” we SHIFT, otherwise we REDUCE.

However suppose that variable names can be prefixed with “$” or “@” with some  unspecified meaning, and suppose we have:

Value -> Number
       | Variable
Variable -> Sigil Name
Sigil -> $
       | @
       |

Now if we want a Value and see a ‘$’ or an ‘@’ we assume a “Variable” is coming, while if we see a “+” or a “-” we assume a “Number” is coming.  But what if we see something else?  If we see something that starts a “Name”, we should reduce the empty sequence to “Sigil” while if we see something that starts “Digits” we should reduce the empty sequence to “Sign”.  So here is a case where “some sequence of input tokens could be treated as one thing or as another depending only on what comes next“.  The “sequence” is empty, the “one thing” is a “Sigil” and “another” is a “Sign”.

This situation would be even more interesting if there was some symbol that could be either a ‘Sign’ or a ‘Sigil’.  A grammar which included that could still be parsed with an LR(1) parser (providing Names and Digits could be distinguished in their first token).  However that is something that I would consider a design mistake.  I don’t need to be able to handle that, but I would like to be able to handle distinct optional prefixes.

A similar situation came up as I was trying to design the “Newline” handling for my “ocean” parser.  After an “if Expression Statementlist” I want either 1-or-more newlines, or 0-or-more newlines and an ‘else’.

CondStatement -> IfPart IfSuffix
IfSuffix -> Newlines
          | OptNL else Block
Newlines -> NEWLINE
          | Newlines NEWLINE
OptNL -> Newlines
          |

After parsing the “IfPart”, a “Newline” should always be SHIFTed and then reduced to “Newlines”.  But otherwise we can either REDUCE that “Newlines” to “IfSuffix”, or to “OptNL”, the latter if “else” is the next symbol.  So again the one sequence can be treated as two different things (Newlines or OptNL) depending on what comes next.

These can both be resolved without changing the parser by spelling things out a bit more:

Number -> + Digits
        | - Digits
        | Digits
Variable -> $ Name
          | @ Name
          | Name
IfSuffix -> Newlines
          | Newlines else Block
          | else Block

This requires “Digits” and “Name” and “else Block” to appear multiple times, but otherwise is reasonably simple and has exactly the desired effect.

So what to do?  One option would be to manually apply transformations like the above when the problem arises.  That is simplest and what I have done so far, but it isn’t entirely satisfactory.  The repetition it requires feels clumsy and is error prone.

Another option is to bite the bullet and change to a full LALR parser with a more complete “action” table.  This would be awkward because it still means that some states allow multiple different REDUCE steps.  The algorithm I developed for handling indents and newlines requires that in some circumstances we force a reduction.  That algorithm would need to be refined to be able to identify which reduction should be applied.  It isn’t immediately obvious how that would work.

We could teach the parser generator to apply the same transformation that I applied above.  Whenever any production contains a nullable symbol it is duplicated to have a second production which doesn’t contain that symbol.  Then any production that causes the symbol to derive to the empty string gets disabled.  This could probably be made to work, but seems like it could make the grammar much more complex, and so make it harder to understand when things go wrong.

I think what I would like to do is find a more minimal transformation of the grammar.  One which still results in at most one REDUCE action per state, but which is as close as possible to the original.

If we consider the second example with the newlines, then the problem occurs in a state with the following itemsets (as reported by my parsergen program with --LALR enabled).

    Newlines -> Newlines . NEWLINE [14]
        LOOK AHEAD(6): NEWLINE, $eof, else
    OptNL -> Newlines . [15]
        LOOK AHEAD(5): else
    IfSuffix -> Newlines . [18]
        LOOK AHEAD(0): $eof

If we consider the middle item, effecting the REDUCE operation that it implies would pop the “Newlines” off the stack and replace it with “OptNL” with an updated ‘state’.  This state would allow ‘else’ to be SHIFTed, but otherwise would be ignored.  As “OptNL” and ‘Newlines” are very similar symbols, neither of which carry any extra ‘AST’ information, there would be no harm in just leaving the “Newlines” on the stack (as we noted previously, the symbols on the stack are hardly used)  and directly SHIFTing in the ‘else’.

So if we add to the ‘goto’ table for this state an entry indicating that ‘else’ can be shifted and go to whichever state ‘else’ ultimately leads to, and otherwise ignore this item so that the only reducible item (or “completed item”) is “IfSuffix -> Newlines .”, then the parser will be able to correctly parse the language without the multiple REDUCE problem.

The other example is a little different and not so simple.  The problem item there contains, among many other items:

    Sigil -> . [8]
        LOOK AHEAD(1): Name
    Sign -> . [12]
        LOOK AHEAD(2): Digits

There are several differences here.  Firstly, reducing the productions does change the stack – it pushes a single frame on.  Secondly, both productions need to be treated the same.  In the first case we could say “OptNL” is similar to “Newlines” while “IfSuffix” is very different.  That distinction doesn’t apply here.

We cannot convert either of these into simple SHIFT operations like we could previously, but a “REDUCE+SHIFT” operation where the shift is very simple might be an option.  The “REDUCE” part would not pop anything off the stack and just push on a very minimal frame.  This would require that there was no code associated with either production.  The symbol either has type “void”, or has some small type (say: 8 bytes at most) which is left as zeros.

Pulling this all together, I imagine two changes.

Firstly, an extra bit is added as a flag to each entry in the “go to” tables.  We can probably steal a bit from the “state” value and so limit our grammars to 32767 states.  If the parser chooses to shift a symbol which leads to a ‘go to’ entry with this bit set, it first shifts in an “empty” stack frame with no symbol, no state, and an ‘asn’ (Abstract Syntax Node) with 8 bytes of zeros.  This is the only change made to the parser.
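As a rough sketch of this first change, in C with names and an encoding invented for illustration (this is not the actual parsergen data layout):

#include <stdint.h>

/* Top bit of a 16-bit 'go to' entry: push a synthesized empty frame
 * before entering the state held in the low 15 bits. */
#define GOTO_EMPTY_FRAME  0x8000u
#define GOTO_STATE_MASK   0x7fffu

struct frame {
        short   sym;            /* symbol number; 0 means "no symbol"  */
        short   state;          /* state entered after that symbol     */
        char    asn[8];         /* small abstract syntax value         */
};

/* Shift 'sym', first inserting an empty frame if the entry is flagged. */
static int do_shift(struct frame *stack, int sp, uint16_t entry, short sym)
{
        if (entry & GOTO_EMPTY_FRAME) {
                struct frame empty = {0};   /* no symbol, no state, zeroed asn */
                stack[sp++] = empty;
        }
        stack[sp].sym = sym;
        stack[sp].state = entry & GOTO_STATE_MASK;
        /* the caller fills in stack[sp].asn from the token or reduction */
        return sp + 1;
}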

The parser generator is changed to translate some ‘completed items’ into SHIFT actions instead of REDUCE actions.  This only applies when the production in the item has no code fragment attached and the structure attached to the head of the item is no larger than 8 bytes.  It also only applies if all symbols in the item’s look-ahead set are not present in other terminal items.

If the body of the item is empty, then a ‘go to’ entry may be created with the new flag set.  If the body contains a single symbol with the same type as the head of the item, a regular ‘go to’ table entry might be created.

There are various other details about how to determine the target state and how to decide if the transformation is legal.  A particular situation that would make the transformation impossible is if the state in the synthesized frame might ever be used.  These details would need some careful thought and analysis which I don’t feel is particularly justified just now.  After I have had more experience, and possibly found more interesting cases, it might then be time to explore extensions to the parser generator in greater detail.

So there is no code to show this time, just an admission that I still have things to learn…


Line breaks – a reprise.

In two previous articles I explored an approach to enhancing an LR parser to work with indents and line breaks.  While I discovered some useful ideas and produced some code that seemed to work, I’ve subsequently discovered some serious flaws in the reasoning.

For indents, my reasoning about exactly when to REDUCE in the face of an OUT token was flawed and didn’t properly address all cases.  I’ve made a few updates to that article to highlight this failing.  For linebreaks, I only talked about when they should be ignored and didn’t cover the other important question of their role in terminating things.  I hadn’t at the time seen how important that was.

So now I want to rectify these problems and present a more complete solution.  As I have explored around these problems I’ve seen other smaller issues and made a number of changes to my approach.  The big picture is still much the same but some of the details are different in important ways.  I also think I understand it all much better and so will try to explain things more clearly.

The Tokens

The tokenization of indents and linebreaks is unchanged.  Every change in indent results in an IN token or an OUT token, which are paired.  Every line break results in a NEWLINE token.  However the NEWLINE token that would be expected before an IN is moved  to after the matching OUT.  So every OUT is both preceded and followed by a NEWLINE.
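So, for example, the fragment

a = b +
    c
print a

would produce the tokens

a = b + IN c NL OUT NL print a NL

The line break after the “+”, which came before the indent, now follows the matching OUT, so the indented continuation is not terminated early, and the OUT still has a NEWLINE on each side.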

Recursive Symbols.

Understanding recursive symbols is very important for managing indents and linebreaks.  A key property of both NEWLINE and OUT tokens is that they might “close” something. i.e. they sometimes  indicate that everything since the start of the line, or since the matching IN, forms some unit which cannot be further extended.  This “unit” is some (non-terminal) symbol (A) and has some production for this particular instance:

A -> B C D

If the top few symbols on the parse stack are “B C D” and the context is such that an ‘A’ was expected, then we might assume that we are at the end of the production and it cannot be further extended.  However if any of B, C, or D are recursive, then that isn’t the case.

A “recursive” symbol here is one with a production which starts with the same symbol. e.g.

D -> D E

If such a production existed then having “B C D” on the stack does not mean we are at the end of the “A”, as an “E” could come along producing “B C D E” which could be reduced to “B C D” and so the “A” has moved further into the input stream.

Obviously if we found “B C D” on the stack when we saw an OUT or a NEWLINE we  could reduce that to “A” and be sure that “A” was closed.  However if “A” is recursive that isn’t enough.  If “A -> A C” were a production, then a “C” could come along and extend the A.

The conclusion is that recursive symbols require special care.  We cannot simply disallow them as they are very useful for any “list” like construct.  So we need to make sure we can always remove them from the stack.

As suggested in a previous article, the best approach is to create an intermediate symbol “$A” for each recursive symbol “A”, add the production “$A -> A”, and then wherever “A” appears in a non-recursive position in the body of a production, replace it with “$A”.  So in the final grammar a recursive symbol will only appear in the body of a production when it is also in the head of that production, or in the synthesized “$A -> A” production.  Then when we have ‘A’ on the top of the stack and need to ensure it is closed, we simply reduce it to “$A”.
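For example, given a list-like construct

Statementlist -> Statementlist Statement
               | Statement
Block -> { Statementlist }

the transformed grammar becomes

Statementlist -> Statementlist Statement
               | Statement
$Statementlist -> Statementlist
Block -> { $Statementlist }

Now a “Statementlist” on the top of the stack can always be closed by reducing it to “$Statementlist”, after which no following “Statement” can extend it.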

Understanding the Stack

In my original implementation I constructed the stack in a less than optimal way which made some of the code quite clumsy.  An important property about the stack is that it contains symbols and states, and these alternate.

At the start, state “zero” is pushed onto the stack.  Then the first token is pushed on as a terminal symbol.  The symbol is looked up in the “go to” table for the state below it, and the new state is pushed onto the stack.  Then we either REDUCE or SHIFT but in either case we will add a symbol, and then use the “go to” table to find the next state.

My original implementation combined the state with the following symbol into the one frame.  This seemed to make sense as a state was on the bottom of the stack, so a state should go on first.  However another way to look at it is that each state is intimately connected with the preceding symbol as the symbol leads directly and immediately (through a “go to” table) to the state.  The code turns out to be much nicer if a state is combined with that preceding symbol to form a frame, though maybe it is better to say, a symbol together with the subsequent state.

The first frame on the stack then contains state zero and no symbol.  This feels a little bit untidy, but makes other things much cleaner.  As the state never changes from zero we could possibly do away with that frame altogether, but that isn’t really a net win.

With this clear understanding of the stack, as “symbol -> state” pairs, we need to ask how pending indents are recorded.  As we determined before, indents often appear inside symbols and keeping a count together with the symbol is important and unchanged.  The interesting question is how to store the indent that might be at the start of a symbol.  To put this another way, an IN token appears between two tokens.  Should it be counted with the one before, or with the one after?

We could argue that the very first token might be an IN and it would have no previous token to store the count with.  However we have already seen that the first frame on the stack has an absent token.  If the first parsed token was an IN, it could be counted with this absent token in the first stack frame.  We could even pretend that there was a “start-of-file” token to match the “end-of-file” token which terminates the parse.  However this is just book keeping and doesn’t really provide a firm answer.

In my previous analysis I counted each IN with the following token and found the need to distinguish the case of an indent being at the start of the symbol, against all the indents being within the symbol.  This tells us that the indent immediately before a symbol doesn’t entirely fit with the indents within the symbol.  So maybe it is best to count the IN with the preceding symbol.

We are getting a bit ahead of ourselves here, but one of the important places where indents are needed is determining when to ignore NEWLINEs.  This determination is based on the states that are in the stack.  Some states identify as “starts_line” and if there is such a state on the stack with no subsequent indent, then NEWLINEs are not ignored.  If there is an indent since the last “starts_line” state, then NEWLINEs are ignored.  However an indent that falls between the same two symbols as the state does not cause newlines to be ignored.

This suggests that if an IN and a state are between the same pair of symbols, the IN should be “before” the state.  This is most easily handled if the IN is counted with the preceding symbol.  So that is what we will do. It removes all the need to track whether an indent was at the “start” of a symbol, as it never will be.
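In C this might look something like the following (the field names are mine, not those of the actual parser):

struct frame {
        void    *asn;           /* abstract syntax for the symbol             */
        short   sym;            /* the symbol; absent in the very first frame */
        short   state;          /* state reached from that symbol via 'go to' */
        short   indents;        /* INs within or immediately after the symbol,*/
                                /* considered to come before 'state'          */
};

Later sections add a little more per-frame bookkeeping (a start-of-line marker and an “expect newline” flag), but the symbol-then-state pairing is the core of it.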

Which states expect a NEWLINE

One of the important properties of indents is that they cause newlines to be ignored – sometimes.  An indent signals continuation, the newline that was before the indent has already been moved to after the indent to reflect this continuation, but what about newlines between the IN and the OUT?  If the syntax elements in there don’t expect newlines to terminate or separate them, then any newlines found need to be ignored.  But how do we know?

I talked previously about “line-like” symbols in the syntax but wasn’t entirely happy with the definition.  They were states that were followed by things that other times followed newlines, and it isn’t really clear what that means.

Both an OUT and a NEWLINE will trigger REDUCE actions to close out pending productions which could otherwise extend beyond the OUT or NEWLINE.  They need to eventually get one symbol (non-recursive when an OUT is present) on the stack which represents the indented section (and possibly what led to it), or the line (or lines).  This symbol needs to be found before the NEWLINE is SHIFTed.  So a “line-like” symbol must be followed by a newline.

The important difference here is that I was previously imagining that a line-like symbol would end with a newline, but that didn’t seem to work out.  It seems to work better if NEWLINEs are treated as separators rather than terminators so the NEWLINE is  not part of the line-like symbol, but part of something bigger (block-like?) which contains multiple lines.

Following this observation, the parser generator needs to determine which symbols can start with a NEWLINE (previously it determined which could end with a NEWLINE), and then any symbol which is followed by a starts-with-newline symbol is deemed to be a “line-like” symbol.  Finally any state which is followed by a “line-like” symbol is “starts_line” and must expect a NEWLINE.

This seems to make sense.  In a state which expects a symbol that is normally followed by a newline, we need to not be ignoring newlines.  In any other state it is safe to ignore newlines if there has been an indent.

A core idea here, important when designing the grammar, is that when NEWLINEs are not ignored, each line will be reduced down to a single non-terminal symbol, which is followed by the NEWLINE.  A NEWLINE should never be placed, in the grammar, after something that is just part of a line, e.g. `Variable = Expression NEWLINE`.  It must be after a line-like thing: `Assignment NEWLINE`.

What, exactly, to reduce when we see a NEWLINE or OUT.

When we see a NEWLINE in a context where a NEWLINE is expected (not ignored) we want to reduce everything back to the start of the line.  This means that we need to record where lines started in much the same way that we need to record where indents were found.  We don’t need to count how many there were, just whether there was one in, or following, a symbol or not.

This became particularly apparent to me when considering the parse of

if Condition : simple-statement

The grammar is (deliberately) ambiguous in that a NEWLINE at the end of this line could separate the “simple-statement” from a subsequent statement within the body of the “if”, or it could separate the whole “if” statement from a subsequent statement at a higher level.  Both the “if” statement and the “simple-statement” are line-like.  Clearly (I hope) the whole needs to be reduced (to a “statement”), right back to the start of the line.

So when we process (SHIFT or DISCARD) a NEWLINE or IN (both of which signal the start of a line), we record a line-start in the topmost stack frame  (just after the symbol in that frame, and before the state).  We also keep track of how many stack frames do not contain a line-start.  This is easily done by storing zero when a line-start is present and storing one more than the previous frame when there is no line-start.

Then when we see a (non-ignored) NEWLINE and it cannot be SHIFTed we must obviously REDUCE or trigger an error.  If we can SHIFT the NEWLINE we may still want to REDUCE first.  Specifically, if the line isn’t already reduced to a single symbol, and if we can perform a REDUCE with the number of symbols being popped not more than the distance to the nearest start-of-line, then we perform the REDUCE.  If there is only one symbol in the line, if we cannot REDUCE, or if the reduction would pop more off the stack, then we know that when we SHIFT the NEWLINE it will serve to separate the whole line from whatever comes next.

When we see an OUT, we similarly want to REDUCE until we find the matching IN on the stack.  In the original design I would reduce until the top-of-stack contained an IN, and then would cancel that.  That is still true but only if it is possible.  We need to understand what to do when it isn’t possible to reduce but the top stack frame doesn’t contain an indent. Consider a general line-like symbol with a production:

A -> B C D E

This could be found with an IN after B, then an OUT and a lesser IN after D, as in

 B
    C D
  E

This would tokenize as “B IN C D NL OUT IN E NL OUT NL”.  The first two NL would be ignored as they are indented.  The first OUT should reduce back to D, but no further – even though the most recent IN is still two frames away. The second OUT would reduce to E with the nearest IN 1 frame away, but it requires the final NL to reduce the whole thing to A.

If we can reduce so that the top of stack contains an indent it is definitely safe to cancel the OUT.  If we cannot reduce that far then sometimes it is safe, as above, but sometimes it will not be – we would instead need to enter error handling because some production is still open that needs to be closed but is not yet complete.  To be able to tell if some production is “open” we need to know details about the pending productions, but a state doesn’t really carry that information.  It is all present in the “itemsets” that are used to build the states.  It seems we need to preserve some more information from the parser generator into the parser.

The itemset which is the basis for each state contains a number of productions, each with DOT at some point in the body.  Those productions where DOT is at the start are not really open yet.  There is no sense at all in which they are committed to and no contents have been found.  So there is no need to close them.  The remainder must all start before the indent that is being cancelled, much as in the example above, the “A -> B C D E” production started before the first indent.  If a production in the state starts at or after the recent indent then cancelling the OUT would be premature.  For the parser to be able to detect this case it needs to know the least non-zero location of DOT in any production in the itemset which leads to each state.  We will call this the least prefix.

This number is trivial to calculate and pass to the parser.  With it the parser can easily make the decision to REDUCE, ERROR, or CANCEL when it sees an OUT.

  1. If the current state can be reduced and the length of the reduction is not more than the distance from the most recent indent, then we reduce.  This means that of all the stack frames popped as part of the reduction, only the first can contain an indent.  As a consequence any reduction of length zero or one will happen, and so any recursive symbol will be reduced away.
  2. Otherwise if the least prefix of the state contains an indent, then the top most indent is cancelled and the OUT is discarded.
  3. Otherwise we cannot close something that is wholly within the indented section and must raise an ERROR.  This will allow us to remove things from the stack until the OUT can be cancelled.

So there are some similarities between reducing on NEWLINE and OUT, and some differences.

The similarity is that we need to track where  the “other end” (start of line, or IN) is and force a REDUCE within that range, even if a SHIFT is possible.

One difference is that for OUT, the range includes the symbol containing the IN and we reduce anything in that range.  For NEWLINE, the range excludes that symbol and we only reduce if there are at least 2 symbols in the range.  This is related to the fact that recursive symbols are problematic for IN/OUT management but not for NEWLINEs.  With newlines, the NEWLINE itself will ultimately be shifted, which will stop a recursive symbol from being extended inappropriately.  With IN/OUT, the OUT is not shifted, so we must completely close any recursive symbols first.

The other difference is that an inappropriate NEWLINE causes a normal error because it cannot be shifted. An inappropriate OUT causes an error because the state says it cannot be cancelled.
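As a sketch of the shape of these two decisions (in C, with invented names; the counts are the per-frame values described above, and the real code has more cases to handle):

enum action { SHIFT, REDUCE, CANCEL_INDENT, ERROR };

/* A non-ignored NEWLINE: force a reduction while it fits within the
 * current line and the line still holds more than one symbol;
 * otherwise fall back to ordinary shift/reduce/error handling. */
static enum action newline_action(int can_shift, int can_reduce,
                                  int reduce_len, int frames_since_sol)
{
        if (can_reduce && frames_since_sol > 1 &&
            reduce_len <= frames_since_sol)
                return REDUCE;
        if (can_shift)
                return SHIFT;
        return can_reduce ? REDUCE : ERROR;
}

/* An OUT: reduce while the reduction uses no more frames than have
 * been pushed since the most recent unmatched IN; otherwise cancel
 * the indent if it lies within the least prefix of the current state;
 * otherwise it is an error. */
static enum action out_action(int can_reduce, int reduce_len,
                              int frames_since_in, int min_prefix)
{
        if (can_reduce && reduce_len <= frames_since_in)
                return REDUCE;
        if (frames_since_in <= min_prefix)
                return CANCEL_INDENT;
        return ERROR;
}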

Summary of details

Now that we know how it can all work, we need to make sure we have the details straight:

  • The parser generator needs to prevent recursive symbols (with productions “A -> A …”) from appearing other than at the start of the recursive production, or as the sole body element in another production.  If they do, a new symbol is created and a single production “$A -> A” is added.  This new symbol is used in place of the recursive symbol where it isn’t allowed.
  • The parser generator needs to record which states expect a newline.  These are precisely those that can be followed by a line-like symbol.  These, in turn, are those which can be followed in a production by a NEWLINE or any symbol which starts with a NEWLINE.
  • The parser generator needs to record the minimum non-zero distance from the start of each item to DOT in each itemset.  This “minimum prefix” is recorded for each state.
  • The state stack needs to contain
    • A symbol (possibly ‘$start_of_file’) with associated parse information (e.g. abstract syntax tree fragment),
    • the state which follows that symbol (state zero for the first frame on the stack),
    • the number of indents within or immediately after the symbol.  These are considered to be before the state,
    • a count of the number of stack frames since the start of  a line.  Zero implies an IN or a NEWLINE is within the symbol, or that this is the first frame. A larger number is one more than the number in the previous frame and implies that the symbol contains no NEWLINEs or unpaired IN tokens.
    • a count of the number of stack frames since the last unmatched  IN.  As all states with a non-zero count will have zero for the number of indents, this value can be stored in the same field as the indent count, but with a different sign.
    • A flag “expect_nl” indicating whether an indent or a “starts_line” state is closest to the top of the stack. It is true if “starts_line” is closest.  If the state in a frame is “starts_line”, then the flag is “true” no matter whether there are indents.  Otherwise the flag is true only if it is true in the previous frame, and there are no indents in this frame.
  • When an IN token appears in the look-ahead, we increment the indent counter in the top stack frame, set the frames-since-start-of-line counter to zero and set the frames-since-last-IN counter to zero
  • When an OUT token appears in the look-ahead we REDUCE if the current reduction uses no more symbols than the frames-since-unmatched-IN count. If the state cannot be reduced or has a longer reduction pending, then cancel the OUT against the top-most indent if the frames-since-last-IN counter is at most the minimum prefix of the current state, otherwise initiate error handling.  This decrements an indent counter.  If that becomes zero we need to increase the frames-since-last-IN counters in the upper regions of the stack.
  • When a NEWLINE token appears in the look-ahead, if “expect_nl” in the top frame is false, then the NEWLINE is discarded.
  • When a NEWLINE token appears and “expect_nl” is true, then if the current state can REDUCE and the reduction length is no more symbols than the frames-since-start-of-line count, and that count exceeds 1, then we REDUCE.  Otherwise we SHIFT if we can, REDUCE if we can’t, and ERROR if we cannot even REDUCE. In any case we set the (new) topmost frame to have zero frames-since-start-of-line.
  • When we reduce a series of frames,
    • The number of indents in all the frames are summed and the result saved in the new frame
    • If any frames-since-start-of-line value is zero, the new frame gets a zero, else it gets the value from the first frame
    • If there is no indent, then the frames-since-unmatched-IN is set to one more than the previous frame.

Implications for grammar design.

The most important implication this has on grammar design is that any symbol that appears immediately before a NEWLINE (or before any symbol that starts with a NEWLINE) is considered to be a line-like symbol.  As such, NEWLINEs found while collecting the symbol will not be ignored, even if the symbol is on an indented line.

So:

Statement -> Variable = Expression NEWLINE

would cause “Expression” to be treated as line-like, so in

a =
  b + c +
  d

The NEWLINE after the first ‘+’ would not be ignored and would cause an error.

Consequently the NEWLINE should be placed after a Statement or similar, possibly like

Statement -> Variable = Expression
StatementList -> StatementList NEWLINE Statement
               | StatementList ; Statement
               | Statement

This will expect NEWLINES wherever a StatementList is expected, but not after a subsequent indent.

Another effect to be careful of is the handling of optional NEWLINEs.  A construct like

OptionalNewline -> NEWLINE
                 |

will sometimes work, but often not.  In particular, if this occurs at the end of a section that can be indented, an OUT NEWLINE sequence could appear where OptionalNewline is expected.  The OUT will likely force a reduction of “empty” to OptionalNewline, where the intent was to capture the following NEWLINE.

One way to handle this is to tie the NEWLINE to the following symbol.  So if you wanted a newline before a ‘{‘ to be optional, then instead of “OptionalNewline {” you need

Open -> NEWLINE {
       | {

and then just use “Open”.  In this case, the OUT cannot force a reduction.  The reduction won’t happen until the ‘{‘ has been seen.

Error Handling

A couple of places in the preceding text I’ve mentioned error handling without any details.  Some details are in previous posts but they are imperfect.  In particular, where it is suggested  that a symbol to SHIFT might be synthesized, that is unworkable.  While the symbol itself could be synthesized automatically, the abstract-syntax-tree fragment for it cannot.

Full details on error handling will have to wait for another post.  One observation I can make now though is that symbol synthesis can probably be managed with productions like:

Expression -> ERROR

The reduction code with this can create some appropriate AST element.  Exactly how this will be integrated remains to be seen.

Indents in action.

This article is being published months after I wrote most of it.  This is largely because it took too long to get the code working. There were a few little bugs in my design and I found it hard to find the time to concentrate and find them and fix them.  This is part of the cost (and benefit) of wanting to have working code for all my ideas…

The current parser generator implements almost all of these ideas and can demonstrate working code.  It doesn’t currently hide recursive symbols automatically; you have to do that step by hand.  Accompanying this I have a trivial grammar for testing indents and newline handling.  This will read in something that looks like code with non-line-like expressions and line-like statements and will print out the parsed structure.

If you clone git://ocean-lang.org/ocean and run “make ; make tests” in “csrc” you will see it run.

 


Linebreaks in a two-dimensional language

My previous note introduced the problem of parsing a two-dimensional language and the need to pay attention to indenting and line breaks and use them to guide the parsing of the language, both to enhance error detection, and to resolve ambiguity. The key intuitive insight which guides this investigation is that indents imply continuation while line breaks terminate things – sometimes.

That first note only looked in detail at indents.  It  provided rules by which a reduction in indent level can force the end of a syntactic unit by triggering a REDUCE operation in the LR parser. This note completes the solution by exploring and describing how line breaks can be interpreted when parsing a sentence in a two-dimensional language.

To start getting a feel for how line breaks work it will be useful to have an example to work with

1: if condition:
2:    a = b +
3:        c +
4:        d
5:    print a
6: else:
7:    print "message"
8: return

Here the numbers at the start of the lines are just for our reference and not part of the program fragment.

The handling of indents described in the previous note will ensure that, for example, the expression that starts on line 2 with “b +” will be terminated before line 5, as line 5 has a lesser indent.  However the indents by themselves do not tell us that the assignment starting on line 2 is a separate statement from that on line 5.  To determine that, we need to process the line breaks.

Some of the line breaks have little meaning.  Those at the end of lines 2 and 3 could be moved back or forward over at least one token with no important effect.  Also the line breaks at the end of lines 1 and 6 are only really important in that they introduce an indent.  The fact of the line break is otherwise uninteresting.

On the other hand the line break at the end of line 7 closes the “print” statement on that line, as well as the “if” statement which started on line 1.  An alternate reading is that it closes both the “print” statement and the “else” clause that started on line 6.  With the else clause closed, the “if” statement must be closed.

So it appears that sometimes we need to discard line breaks and other times we need to multiply them.  How can this work?

The apparent need to multiply line breaks can be handled by moving some line breaks that are otherwise uninteresting.  As noted the line breaks that immediately precede an increase in indent are uninteresting beyond that indent. So we can move them to after the matching decrease in indent.

To make this a little clearer, consider the following which has the same two dimensional structure as the previous example but with simplified content.

A
  B
    C
    D
  E
F
  G
H

When we tokenize this we create “IN” and “OUT” tokens to record the indents, and  add a “NL” token (newline) to record the line breaks. Any NL that would be before an IN is moved to after the matching OUT.  This reflects the fact that the indented content is in some sense a continuation of the starting line.

So the tokens produced by that above are:

A IN B IN C NL D NL OUT NL E NL OUT NL F IN G NL OUT NL H NL

So there are 8 “NL” tokens as you would expect for 8 lines, and each OUT  token is both preceded and followed by an “NL”, allowing both the inner and outer structures to possibly be terminated.

With this parsing of newlines there is no longer any desire to multiply newlines, only to sometimes ignore them.

Taking a new example – similar to the earlier one but with two distinct halves:

1: if condition:
2:    print message1
3:    print message2
4: a = b +
5:     c +
6:     d

Here lines 1-3 have the same two dimensional structure as lines 4-6.  However the newline token at the end of line 2 must be meaningful while the newline at the end of line 5 should be ignored.

This distinction tells us that we cannot handle newline tokens purely automatically like we do with IN and OUT tokens.  Rather the grammar for the language must tell us where newlines are expected, such as at the end of a statement.
So when the parser sees a NL on input it has a number of options.

  1. It can SHIFT the NL.  This is only legal if the current state has a ‘go to’  entry for NL.  If that is the case, then SHIFTing is most likely a sensible option, though possibly we should wait until a later NL after an OUT.
  2. It can REDUCE the production which is reducible at this state, if there is one.  This only makes sense if  this could lead to a state where NL  can be shifted.
  3. It can discard the newline.  This should happen for example at the end of line 5 in the example above.  Or
  4. It can trigger error handling.  This would be appropriate at the end of line 4 above if line 5 were not indented.

The important choice here is whether to DISCARD the NL or not.  If we choose not to, then choosing between SHIFT, REDUCE, or ERROR is done in the same way we would for any other token.

This choice of discarding newline tokens must be closely related to the level of indent.  The intuition is that indents lead to continuation.  A block of text that is indented is a continuation of the line that preceded the indented block.  Sometimes we ignore line breaks in indented text, other times we don’t.  We need to determine how to tell the difference.

Falling back on my intuition again, I think that there are line-like things and word-like things.  These may also be referred to as vertical or horizontal objects, to borrow some terminology from TeX.  Statements and things composed of them are line-like.  Expressions and things they are comprised of are word-like.

My intuition suggests that we would expect a line break to terminate a line-like thing, but not a word-like thing, unless in doing so it also terminates a line-like thing.  In the example above, lines 2 and 3 are line-like things, so we expect the line break to terminate them.  Lines 5 and 6 aren’t.  Lines 4, 5 and 6 together form a line-like thing, and so the newline that was moved to after the OUT (when the indent was reduced) will terminate that whole line-like thing.

Returning to the first example, the assignment starting on line 2 is separate from the statement on line 5 because statements are line-like and there is a linebreak separating them.

Were we using a recursive descent parser these ideas would seem quite easy to implement.  We would first deduce which non-terminals were line-like as being precisely those that could ever expand to contain a newline token.  Then, as a recursive descent parser always knows what non-terminal it is trying to recognise, it could ignore newlines precisely if there had been an indent since the deepest line-like non-terminal started.
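As a very rough sketch in C, with everything invented for illustration (the ocean parser is not recursive descent, and this glosses over the error checking an indent-aware scanner would want):

enum token { TK_NEWLINE, TK_IN, TK_OUT, TK_WORD, TK_EOF };

extern enum token raw_token(void);  /* assumed scanner, not shown */

static int indent_depth;     /* IN tokens currently open */
static int line_like_start;  /* indent_depth when the innermost
                              * line-like non-terminal started */

/* Fetch the next token, counting indents and discarding NEWLINEs that
 * are protected by an indent opened since the line-like thing began. */
static enum token get_token(void)
{
        for (;;) {
                enum token t = raw_token();
                if (t == TK_IN)  { indent_depth += 1; continue; }
                if (t == TK_OUT) { indent_depth -= 1; continue; }
                if (t == TK_NEWLINE && indent_depth > line_like_start)
                        continue;               /* inside an indent: ignore */
                return t;
        }
}

/* Each function that recognises a line-like non-terminal records the
 * indent depth at which it started. */
static void parse_statement(void)
{
        int saved = line_like_start;

        line_like_start = indent_depth;
        /* ... recognise the statement, ending at a TK_NEWLINE ... */
        line_like_start = saved;
}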

As we are using an LR grammar, and so don’t always know what non-terminal we are trying to recognise, we need to take a slightly different approach.  With LR, we do always know what symbol preceded a particular state.  This is because each item in the core (i.e. uncompleted) itemset for a state always has the same symbol before “dot”.  This is exactly the symbol which leads to the state through the “go to” table.

If we know what the previous symbol is at any state, and if we know which symbols can end in a newline token, then we know which states mark the start of a line-like thing.  A symbol which can end with a newline may not always end with a newline as a statement could, for example, end with either newline or a semicolon.  However any location where a newline could appear can reasonably be seen as a boundary between two line-like things.

So when we see a newline token in the parser’s look-ahead buffer, we can examine the current state stack looking at states which start line-like things and looking at internal indents.  If the line-starting state closest to the top is above the top most internal indent, then we cannot ignore newline tokens: there is no indent to protect them.  If, on the other hand, an internal indent is closer to the top of the stack than a line-starting state, then we are currently in an indented continuation of a line-like thing and so newlines should be ignored.

Rather than searching the stack every time we find a newline, we can track whether newlines should  be ignored or not.

  • Whenever we shift a state which is line-starting, the new stack frame is flagged as permitting newlines.  When other states are shifted, we inherit the “newlines_permitted” flag from the previous stack frame.
  • When we reduce some frames, then if the net indent in all the reduced frames is positive, we record the new top-of-stack frame as ignoring newlines (i.e. newlines not permitted).
  • When we process an OUT token by decreasing the indent count on the top-of-stack frame, we must update the “newlines_permitted” flag depending on the state and remaining indent level.  This  may require copying the flag from the previous level of the stack.

Once we  track the “newlines_permitted” state this way it becomes trivial to handle newline tokens.  If one is seen when “newlines_permitted” on the top of the stack is false, we simply discard the newline.  If “newlines_permitted” is true, we process the newline just like any other token.

Updating the grammar processing to track which symbols can end with a newline token, and which states can immediately follow such a token, is fairly straightforward.  The “ends with newline” flag can be computed when the “nullable” flag is computed, with a similar repetitive process.  If the body of any production contains an “ends with newline” symbol followed only by nullable symbols, then the head of that production is flagged as “ends with newline”.  Tagging the affected states is even more straightforward: if the symbol which led to a new state “ends with newline”, the new state can start a line.  If not and the symbol is nullable, the new state starts a line if the previous state did.  Otherwise the new state does not start a line.
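A sketch of that repetitive process in C, with an invented representation of the grammar (the NEWLINE terminal itself is assumed to be flagged before this runs):

struct symbol     { int nullable; int ends_with_nl; };
struct production { int head; int len; int *body; };

static void compute_ends_with_nl(struct symbol *sym,
                                 struct production *prod, int nprod)
{
        int changed = 1;

        while (changed) {       /* repeat until nothing new is learned */
                changed = 0;
                for (int p = 0; p < nprod; p++) {
                        if (sym[prod[p].head].ends_with_nl)
                                continue;
                        /* Walk backwards over trailing nullable symbols;
                         * if an ends-with-newline symbol is reached first,
                         * the head can end with a newline too. */
                        for (int i = prod[p].len - 1; i >= 0; i--) {
                                if (sym[prod[p].body[i]].ends_with_nl) {
                                        sym[prod[p].head].ends_with_nl = 1;
                                        changed = 1;
                                        break;
                                }
                                if (!sym[prod[p].body[i]].nullable)
                                        break;
                        }
                }
        }
}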

My parser generator, modified to handle line breaks as described here, can be found in `git://neil.brown.name/ocean/` or at this version.

One last issue seems worth mentioning before we depart this topic.  That relates to what should happen if a line-like object does not start at the beginning of a line.  For example:

a = b ; x = y 
   + z
   + q

This would be perfectly legal with the parser as currently described, but looks wrong and is possibly not what is intended.  It seems hard to justify raising an error or changing the obvious parse when this happens, but it is almost certainly cause for a warning, and probably a very stern warning at that.

We would need to track which symbols started at the real start of line (i.e. after NL or IN).  If an indent and the previous “starts line” state was not at the start of a physical line, then a warning should be raised.  I haven’t implemented this yet, but I probably will.

 


Parsing a two-dimensional language.

Many programming languages are essentially one dimensional.  The parser treats them simply as a linear sequence of tokens.  A program could all be written on a single line, or with each token on a separate line and the parser or compiler wouldn’t notice the difference.  This set of languages includes Algol, Pascal, C, Rust and many others.

Some languages are 2-dimensional in a bad way. FORTRAN is probably the best example, though BASIC is similar.  These (at least in their early forms) had strict requirements as to what can go on one line and what needs to go on a separate line.

A few languages are exploring the middle ground.  Go will treat Newlines like semi-colons in certain cases which can result in a 2-dimensional feel, but brings with it some rather ad-hoc rules. Python probably makes the best attempt at 2-dimensional parsing of the languages that I have looked at.  It allows newlines to terminate statements and also uses indents to indicate the grouping of some language elements, particularly statement groups.

While I find a lot to like in Python, it seems imperfect.  I particularly dislike using a backslash to indicate line continuation.  You can avoid this in Python by putting brackets around things as a newline inside brackets is ignored.  But this feels like a weakness to me.

As I wrote in a recent article for lwn.net:

The recognition of a line-break as being distinct from other kinds of white space seems to be a clear recognition that the two dimensional appearance of the code has relevance for parsing it. It is therefore a little surprising that we don’t see the line indent playing a bigger role in interpretation of code.

This note is the first part of a report on my attempt to translate my intuition about parsing the two dimensional layout into some clear rules and concrete code.  This note deals with indents.  A subsequent note will look at non-indenting line breaks.

Indenting means continuation, newline means completion.

The two elements that give a section of code a two dimensional appearance are line breaks and indents.  Without the line break there wouldn’t be any extra dimension at all.  Without indents the code would just look like a block of text and we would be in much the same situation as FORTRAN and BASIC.

Each of these has a fairly well understood meaning.  A line break often marks the end of something.  This isn’t always true, and exactly what the “something” is might be open to interpretation, but this meaning is fairly well understood and line breaks rarely mean anything else.

An indent means some sort of continuation.  The line that is indented with respect to a previous line is in some sense part of that previous line.

To help visualise this, consider:

A B C
   D E
   F
G H

Here “D E” and “F” are clearly part of the thing which starts “A B C”. “G H” may or may not be, much as “F” may or may not be part of “D E”.  However if “D E” or even “B C” started something, then it definitely finished by “F”.  “G” might be part of something that started at “A”, but if some subordinate thing started at “B” or later, it does not continue into “G”.

As a first step to formalising this we need to be able to encode this two dimensional structure into the stream of tokens which the parser can then process.

Generating a “NEWLINE” token at the end of each line, and maybe a “SPACE” token for each space at the start of a line could capture it but not in a very helpful way.  It is not really the presence of an indent that is important but rather the change in indent.  So between “C” and “D” we need to know that the indent level increased, while between “F” and “G” it decreased.

Python uses the terms “INDENT” and “DEDENT”.  The first is technically inaccurate as every line is indented, not just the first.  The second is a neologism presumably derived by analogy with “increase” and “decrease” or similar.  Other options might be “UNDENT” or “OUTDENT”.  I’m not sure I really like any of these.  However “in/out” seems the best fit for what is happening so I’ll just call the tokens which report indent changes “IN” and “OUT”.

Using these two tokens, and ignoring the non-indenting line breaks for now, the above structure can be tokenized as:

A B C IN D E F OUT G H

This is very simple as there is only one indented block. A more complex example is:

A B C
  D E
   F G
 H
I

This would be tokenized as:

A B C IN D E IN F G OUT OUT IN H OUT I


Note the sequence “OUT OUT IN” – we close two indented sections and open another one all at one line break.  This is because “H” has a lesser indent than “D”, but a greater indent than “A”.

This tokenization captures the structure very effectively and is what I will be using. Unsurprisingly, these are exactly the tokens that can be generated by the scanner I previously described though the token names have been changed.
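To make that concrete, here is a small sketch of the indent-handling step of a scanner, written in Python purely for illustration (it is not the real scanner described in the earlier note).  The stack-of-indent-widths approach and the token names follow the description above; everything else is an assumption of the sketch.

def tokenize_indents(lines):
    # A stack of the widths of currently-open indents.  Each increase
    # pushes a width and emits IN; each decrease pops and emits OUT.
    # Non-indenting newlines and mismatched partial dedents are ignored
    # here to keep the sketch short.
    indents = [0]
    for line in lines:
        if not line.strip():
            continue
        width = len(line) - len(line.lstrip(' '))
        while width < indents[-1]:
            indents.pop()
            yield 'OUT'
        if width > indents[-1]:
            indents.append(width)
            yield 'IN'
        for word in line.split():
            yield word
    while len(indents) > 1:          # close anything still open at end of input
        indents.pop()
        yield 'OUT'

text = ["A B C", "  D E", "   F G", " H", "I"]
print(' '.join(tokenize_indents(text)))
# prints: A B C IN D E IN F G OUT OUT IN H OUT I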

What to do with indents?

Now that we have these IN and OUT tokens we need to understand what to do with them.  Their value is that they are very strong syntactic markers.  IN and OUT are strongly paired: it is very obvious which OUT matches which IN.  This contrasts with bracketing markers like “{ }” or BEGIN and END or DO and OD.  The latter can be accidentally deleted or forgotten.  If an OUT is missing, that omission will be obvious on every line.

I’ve recently been writing C code using the “text” mode of my editor (emacs) rather than the C mode.  This is because I’ve been experimenting with literate programming and writing the code inside `markdown` markup.

I have really noticed the lack of help with auto-indenting.  Each line is auto-indented to the same level as the previous line, which is certainly useful.  But typing `}` doesn’t automatically drop back one level of indent.  So sometimes I get it wrong.

When I do get it wrong, the compiler complains (as it should) but it is usually a cryptic complaint that comes way too late.  I would much rather it knew there was a problem as soon as the indenting looks wrong.  So error detection will be the first focus for exploring the use of indents.

To discuss how this works you really need a working knowledge of grammars and parsing.  If you don’t have that yet, let me suggest my earlier note on LR grammars. If you don’t have time for that now, then press on anyway; it shouldn’t be too hard.

Unlike other tokens used in an LR grammar, IN and OUT cannot simply appear in the grammar wherever they are expected.  This is because there are really too many places that an IN can appear.  An OUT must be at the end of something, but the matching IN can often be anywhere in the middle.  So the parser will need to have direct knowledge of IN and OUT tokens.  It will not “SHIFT” them onto the stack as it does with other tokens.  It has to handle them directly.

As an OUT enforces the end of something the natural thing to do when an OUT token is seen in the look-ahead is to perform REDUCE actions until the OUT is balanced by some IN that was previously seen.  This is clearly a very imprecise rule.  We need to explore some examples before we can make the rule suitably precise.

We can start looking at examples supposing a grammar containing:

Statementlist -> Statementlist Statement
               |
Statement -> if Condition : Statementlist
           | print String

A program fragment like:

if a == b :
   print "Hello"
   print "World"
print "End"

would then result in the following LR stack just as the OUT token appears in the look ahead:

if Condition : Statementlist print String   [OUT]

The IN appeared just before the Statementlist and the only possible reduction is to reduce “Statement -> print String”, so we do that, which results in

if Condition : Statementlist Statement    [OUT]

Again we can reduce and still stay after the indent so now we reduce “Statementlist -> Statementlist Statement” and get to:

if Condition : Statementlist   [OUT]

Now the symbol on the top of the stack is where the indent started, so we need to stop reducing.  It might seem appropriate to reduce this fully to a “Statement” but we cannot do that based on just the indent.  Consider a different language with a fragment like:

if ( a == b ) {
    print "Hello";
    print "World";
}

In this case it is clear that the OUT itself is not enough to close the whole statement.  Resolving this will have to wait until we explore how line-structure works.

So the approach for handling IN and OUT seems to be that when we see an IN, we mark the next symbol SHIFTed as being indented.  Then when we see OUT we stop SHIFTing and REDUCE until the symbol on the top of the stack is marked as indented.  We can then clear that marking, discard the OUT, and continue.

This is close to a working solution but there are a few complications that need to be resolved first.  One relates directly to the fact that “Statementlist” is recursive.

Statementlist -> Statementlist Statement

This means that reducing until we just have “Statementlist” on the top of the stack doesn’t stop more things being added to that Statementlist. The idea was that the OUT should completely close the Statementlist, but because of its recursive nature that isn’t really possible.

This is easily fixed by introducing a second “dummy” nonterminal.

Statementlist -> _Statementlist
_Statementlist -> _Statementlist Statement
                |

This set of productions will parse exactly the same language but will introduce an extra state.  If we see an OUT token we can reduce back to a “_Statementlist” as we did before, and then go one more step and reduce to the new “Statementlist”.  Once there we really have closed the list.  Another statement cannot extend it like it could in the simpler grammar.

These extra “dummy” non-terminals could easily be added by the parser generator itself.  However they solve one problem by introducing another.  Previously we could simply REDUCE until the top of stack carries an IN, and then cancel the IN.  Now we might need to REDUCE one more step – the “Statementlist -> _Statementlist” production above. How can we know when to stop?

The answer to this requires that we store one more bit (literally) of information in the state stack.  We currently have a count of the number of indents which appear “within” (the full expansion of) the symbol in the stack entry.  By that we mean the number of terminals which were reduced into this symbol and were immediately preceded by an IN. To that we must add a flag to record whether the first terminal was preceded by an IN.  i.e. is the whole symbol indented or just part of it.

Now the rule will be that if we have an OUT token in the lookahead buffer and the top-of-stack token contains an IN that was not at the start then we can simply cancel the IN and discard the OUT.  Further, if the top-of-stack contains only one IN and it was at the start then we need to look at the production that we can reduce.  If that production contains more than one symbol in its body then again we can cancel the IN and discard the OUT.  In this case, that IN would not be at the start of the symbol we would reduce to, so it is somewhat similar to the first case.

UPDATE: The reasoning in the above paragraph is flawed.  It talks about looking at “the production that we can reduce”.  That isn’t well defined (there may not be anything to reduce at some states) and it isn’t really the right thing to be talking about.

The important production is the one that will ultimately be reduced to consume the symbol on the top of the stack.  If that production starts earlier than the top-of-stack symbol, then it is perfectly OK to cancel the IN with the OUT in the look-ahead, as the top-of-stack symbol is exactly the text that was indented.  If the production consists of precisely the top-of-stack symbol (i.e. the body of the production only has one symbol), then now is clearly the time to reduce that production, and we need to ask this question again.  If the production which will ultimately consume the top-of-stack symbol starts with that symbol but needs other symbols too, then we have a problem.  Those other symbols will be at a lower indent level than the start of the top-of-stack symbol, and that isn’t allowed.  In that case we need to raise an error and find a way to reduce that production early.

A serious problem with the above is that we don’t necessarily know what production that ultimate one will be.  LR shift/reduce parsing doesn’t identify a production until it is time to reduce it.  So we need a different approach.  If we ensure that all recursive non-terminals have a dummy parent non-terminal as described above, that approach is fairly easy, but does require extra information in each state.  We need to know which states have any item with ‘DOT’ after the first symbol.  If we find ourselves in such a state and it cannot reduce that first symbol, then we know that there is a real possibility of the indent being interpreted wrongly and need to raise an error.

END-OF-UPDATE

Another way to look at this is to say that when we see an OUT token, we reduce until the next reduction would contain an internal (i.e. not at the start) IN.  Once we find this internal IN we cancel it and discard the OUT.

Indenting affects error handling in two ways.  Firstly we need to know what happens if an OUT implies a reduction is needed, but the current state does not allow a reduction.  Secondly we need to know how to handle IN and OUT tokens if they are found while discarding input tokens during normal error recovery.

In the first case, the fact that we want to reduce suggests that we must have part of the right hand side of a production but not all of it.  It would be nice to keep what we have, so to complete it we need to synthesize the extra tokens needed to make a complete production.  If there is no reducible production at this state then there must be symbols that can be shifted.  So we can synthesise one of these – the one that gets us as close to the end of a production as possible.  This may result in an ‘unknown variable’ or similar which would need to be carefully suppressed, but would allow any valid units within the production so far to be properly analysed.

UPDATE: Synthesizing a symbol might sound easy, but it isn’t.  As well as the symbol we need to synthesize the data (typically part of an Abstract Syntax Tree) that needs to accompany it.  Synthesizing that automatically is not possible.  A possible approach is to use productions like “SomeSymbol -> ERROR” to synthesize symbols.  The reduction code with that production can synthesize the necessary AST data.  This may introduce other complications.  END-UPDATE.

For the second case we can simply count the INs and matching OUTs until we find a SHIFTable symbol or find more OUTs than INs.  If we find a SHIFTable symbol, we set the IN count for that symbol to the net number of INs.  If we find an extra OUT it means that we must have come to the end of whatever high level non-terminal was in error.  Then we just leave the ERROR token on the stack without shifting anything else and move on to normal handling for an OUT token.  This will either reduce whatever is on the stack, or will synthesize some token to allow parsing to continue.

It should be noted here that I haven’t experimented a lot with this error handling process.  It seems to make sense, but often things which seem to make sense actually don’t.  So this might need to be revised.

This then is the complete description for how to handle IN and OUT tokens.  In summary:

  1. Each frame in the LR state stack must contain a count of the number of IN tokens it contains, and a flag to record if one of these is at the start.
  2. When we see an IN token we record the fact and discard it.  When we SHIFT in a new terminal we check if we just saw an IN.  If so, the new entry in the stack records that one IN is present, and that it is at the start.
  3. When we reduce a series of symbols to a single non-terminal we add up the number of included IN tokens in each and use that as the count for the new token.  The “one at the start” flag from the first symbol is inherited by the new symbol.
  4. When we see an OUT token we check if the top-of-stack contains a “not at the start” IN token, or if it contains an “at the start” IN but the next reduction would make it “not at the start”.  If either of these are true we decrement the IN counter and discard the OUT.  Otherwise we REDUCE a production if we can, and synthesize some suitable  symbol to SHIFT if we cannot.
  5. When we discard stack states during error handling, we count the number of indents and add them to the “ERROR” state that we ultimately SHIFT.
  6. When we discard look-ahead tokens during error handling we keep count of any IN and OUT tokens.  If we see an OUT token when the count is zero, we assume a sane state has been found and return to where step  4 above was handled.  Otherwise the IN count of the token which ultimately resolves the error is set to the balance of INs and OUTs.

The net result of this is that an OUT always closes anything that was started at or after the matching IN, but nothing that was started before the matching IN.
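As a data-structure sketch, the summary above amounts to two extra fields per stack frame and a simple rule for combining them on REDUCE.  The Python below is illustrative only; the field and function names are mine, not taken from the Ocean code.

from dataclasses import dataclass

@dataclass
class Frame:
    state: int                     # LR state pushed with this symbol
    symbol: str                    # kept only to make tracing easier
    value: object                  # AST fragment built for this symbol
    indents: int = 0               # how many IN tokens this symbol contains (rule 1)
    starts_indented: bool = False  # was its first terminal preceded by an IN?

def reduce_frames(stack, length, head):
    # Rule 3: sum the indent counts of the popped frames and inherit
    # the "at the start" flag from the first of them.
    popped = [stack.pop() for _ in range(length)][::-1]
    frame = Frame(state=0, symbol=head, value=None,
                  indents=sum(f.indents for f in popped),
                  starts_indented=popped[0].starts_indented if popped else False)
    stack.append(frame)            # a real parser would also set state via the 'go to' table
    return frame

Rule 4 would then inspect the top frame’s indents count and starts_indented flag when an OUT token appears in the look-ahead.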

This adjunct to LR parsing can be used in two different but related ways.  Firstly as already noted it can improve error reporting.  If a ‘closing’ token like “}” is missing then that will be noticed immediately.  The syntactic unit that it is meant to close (e.g. Statementlist) will already have been reduced by an OUT and if the “}” doesn’t follow a regular LR parsing error will occur.

Secondly, and more interestingly for language design, an ambiguous grammar can be cleanly resolved by treating indentation as meaningful.

The classic “dangling else” case can be easily resolved by the indentation of the ‘else’.

if (COND1)
    if (COND2)
        StatementA;
else
    StatementB;

is no longer ambiguous.  The “else” clearly relates to the first “if”, not the second.

Ocean will make use of this to have a syntax for “blocks” of statements which does not require a closing token.  They will be closed by a reduction in indenting.  As explicit closing can sometimes be preferred, Ocean will also allow C-style blocks.  Possibly that should be “go-style” blocks as a block must not be a single statement.

The syntax I’m experimenting with at the moment includes:

Block -> : Statementlist
       | { Statementlist }
Statement -> if Expression Block
           | if Expression Block ElsePart
           | ...
ElsePart -> else Block
          | else if Expression Block
          | else if Expression Block ElsePart

This allows structured statements that look like Python or like Go, and will use indentation to resolve ambiguities in the Python style, and report errors in the Go style.

My intention is that several other structures will allow either “{}” or “:..” formats.  “struct” type declarations is an obvious one.  Others might occur as design proceeds.  However before we can proceed with those details we need to finish the two-dimensional parsing which is only half done.  This note dealt with indents; the next note deals with end-of-line markers.

Meanwhile code that manages indents as described here can be found in `git://neil.brown.name/ocean` or this version of the parser generator.


Wiggle 1.0

About 11 years ago I started writing “wiggle”.  I have finally released version 1.0.

Wiggle is a tool for applying patches with conflicts.  I do quite a bit of applying patches to a release different to the one they were created for.  This often works, and often fails completely so that the task must be done completely by hand.

But sometimes – and often enough to be worth automating – the ‘patch’ program cannot apply the patch successfully, but the reason is due to small differences that can automatically be detected and ignored.  This is where “wiggle” comes in.  It looks at the patch word-by-word, and wiggles that patch in if at all possible.

I got wiggle to the point where this basic functionality worked a long time ago.  But I found it less than satisfactory.  Often “patch” would fail and “wiggle” would succeed and I would want to know why.  And “wiggle” didn’t really tell me why.

This led me on the path of writing “browse” mode for wiggle. This allows me to look at the merge and see the different bits highlighted so I can see what confused “patch” and decide if it was reasonable for wiggle to silently fix it.

I then found that I wanted to be able to “fix” things up while in browse mode.  Some little conflicts that wiggle cannot fix automatically still have simple fixes that I could determine based on my understanding of the code.  Figuring out how to do that was not easy.  I knew in general terms what I wanted to achieve, but had no real idea of the concrete steps.

So much of the last few years, when I have had the opportunity, has seen me improving various aspects of the browser, making it easy to move around, handling various corner cases, and getting enough groundwork in place that I could usefully experiment with editing. Only recently have I taught wiggle some very basic commands for editing.  I don’t really know if they are enough, but they will do for now.

There are basically 3 things you can do.

  1. You can mark a conflict or change as “unchanged”.  This effectively ignores part of the patch and leaves the original as it is.
  2. You can mark a conflict or ‘unmatched’ as ‘changed’.  This effectively discards the original and keeps what was in the patch, even though it might not have lined up properly.
  3. After the above has proven to be insufficient, you can open the merge result in an editor and fix the rest up by hand.

Obviously the last possibility has always been an option, though not directly from wiggle.  This is how I most often work with failed patches: I apply “wiggle” then look at the results in an editor.  Allowing the editor to be invoked directly from wiggle should streamline the process a little.  Allowing some simple changes beforehand should make it even better.  Only time will tell.

So wiggle 1.0 is out there, in git://neil.brown.name/wiggle/ or http://neil.brown.name/wiggle/.  If you want to know more, you can see the talk I recently gave at LCA2013.


Parsing LR grammars

It is time for another diversion, but it is leading towards the language design – honest.

LR grammars are a tool for specifying the grammar for a language, and it is fairly easy to automatically generate a parsing tool from a grammar.  So they have often been used for this purpose.

There seems to be some suggestion that LR is no longer the tool of choice (see Wikipedia), apparently because it is hard to do good error reporting.  The gcc compilers converted from an LR grammar to a recursive descent LL grammar, apparently for this reason.

However I like LR grammars and I don’t see the problem.  So I plan to start out with an LR grammar approach.  Either it will work well, or I will start to see the problem, and either outcome is positive in my view. So I cannot lose.

There are plenty of tools available for converting an LR grammar into code, of which GNU “bison” is probably the best known, but I’m going to write my own.  Partly this is because it will help me learn, and partly because it will add to the collection of code that I have recent experience with, so that when Ocean starts to take shape I will have lots of example code to try translating to it.  But mostly because it is fun.

So this post is about LR grammars and building a tool to parse them.  You’ve probably read all this before elsewhere, so I suggest you stop reading now and go do something useful like play sudoku.  Come back in a few weeks when I will have moved on.

However, it is also possible that you have tried to read about this and, like me, failed to find anything really accessible.  There are academic papers that are full of dense mathematics which makes an interesting topic appear unapproachable.  And there are lecture notes which seem to skim over the really interesting details.  And of course there is wikipedia, which isn’t exactly bad, but isn’t exactly good either.  If that is you, then please read on and see if my contribution helps you at all.

LR Grammar Basics

A grammar is a set of rules for mapping between structured data and a linear representation of that data.  The grammar can allow you to “produce” a linear form from a structure, or “parse” a linear form to recover the structure.

We use grammar whenever we convert thoughts and idea to sentences and stories.  We use the same grammar in reverse when we convert sentences and stories that we hear and read back into thoughts and ideas.  Of course the grammar of natural language is very complex and context-dependent, and each individual works with a slightly different version of the grammar.  So the conversion of ideas to and from natural language is imperfect and error prone.  It is however extremely powerful.  It allows me to write this note, and you to get at least some of my ideas while reading it.

For computer  languages we use much simpler but much more precise grammars.  Noam Chomsky famously did a lot of work in formalising many ideas of grammar and it is him we thank for the “context free” grammars that work so well for programming languages.

A context free grammar (CFG) consists of a set of symbols and a set of productions.  The symbols are grouped into two subsets: the terminal symbols and the non-terminal symbols. There is one distinguished non-terminal symbol called the “START” symbol.  The grammar describes a “language” which is the set of all possible “sentences” which in turn are lists of terminal symbols.  It does this by giving rules for producing parts of sentences.  Any sentence which can be produced following these rules is a valid sentence in the language. All other sentences are invalid.  To avoid confusion, we normally reserve the word “produce” for when a single symbol produces a series of other symbols.  Repeating this multiple times is a “derivation”.  A complete derivation starts with the “START” symbol and results in a list of terminal symbols.

Terminal symbols are the elements of a sentence.  These could be strokes of a pen, letters, words, or conceptual groups of words such as “noun” and “verb”.  For the purposes of computer languages we normally work with “tokens” as terminal symbols where tokens are the things extracted by the scanner described in my previous note.

A grammar takes a sequence of tokens and groups them into semantic units such as expressions and statements.  It essentially describes how to extract structure from  a linear sequence of tokens.

“Tokens” here are little more than opaque objects represented by numbers.  There is often extra detail combined with the number, but the grammar doesn’t care about that.  If the grammar needs to know the difference between two things, they need to be different tokens.  If it doesn’t then maybe they don’t need to be (though it doesn’t hurt).

The grammar describes how to gather sets of tokens (aka terminal symbols) together as what might be called “virtual tokens”.  So the set of tokens “NUMBER PLUS NUMBER” might become a virtual token “EXPRESSION”.  It can then be included in some other grouping of tokens.  These “virtual tokens” are the things referred to earlier as “non-terminal symbols”.

A grammar uses these terminal and non-terminal symbol in the set of “productions”.  Each production declares that some non-terminal symbol may produce a given list of symbols (whether terminal or non-terminal).

This terminology (‘start’ and ‘production’ and ‘terminal symbol’) doesn’t make much sense if you want to parse a program, but it does if you want to generate a program, or “sentence” as described earlier.  From the perspective of generation, a symbol is “terminal” if it cannot generate anything more, and non-terminal if it can.

When we talk about parsing, we use the same terminology because that is what everyone else does.  I would prefer “real tokens” for terminal symbols, “virtual tokens” for non-terminal symbols, “reductions” for productions and “goal token” for start symbol.  But you gotta work with what you have.

LR parsing basics.

The process described above for generating sentences can process symbols in any order. Once you have more than one non-terminal you have a choice. You can produce something from the first, or from the second.  So you could build up a sentence from the beginning, or from the end.  When parsing you need to be more focused.

The “L” in “LR” means that parsing progresses from Left to right.  You certainly could parse in some different direction, but the language you design for that sort of parsing would end up being very different.  An “LR” grammar converts the leftmost groups of terminals into fewer non-terminals first and then progresses to include more and more terminals moving to the right.

We will get to the “R” shortly.  I suspect that originally “LR” meant “Left to Right”. But there are two distinct ways to parse from left to right: one is recursive descent, which requires a different sort of grammar to that described here – an LL grammar.  For LR grammars you use a bottom-up shift/reduce parser.

This shift/reduce parser was invented by Donald Knuth and involves a state machine – with a bunch of states – a stack (of symbols and states) and two actions named “SHIFT” and “REDUCE”. One particular instance of “REDUCE” – the one that involves the START symbol – is sometimes referred to as “ACCEPT”.  I see no value in making such a strong special case of that, so I keep just “SHIFT” and “REDUCE”.

If we think back to the previous description of deriving a sentence from the start symbol using the grammar, and consider the intermediate stages between “START” and the final sentence, then those lists of symbols (sometimes called “sentential forms”) are represented in the parser by the stack combined with the remainder of the input.  The symbols from the bottom of the stack through to the top of the stack, followed by the “next” symbol to read from input through to the last symbol in the input, correspond exactly to the sequence of symbols in some intermediate step when deriving the program.

The “production” step in sentence generation is exactly mirrored in the “REDUCE” step in parsing.  Specifically if the top symbols on the stack match the symbols in the body of some production, then we might pop them off and replace them with the non-terminal which is the ‘head’ of that production.

The “SHIFT” step simply involves moving one terminal symbol from the input onto the stack.   Parsing proceeds by shifting terminal symbols onto the stack until we have a set of symbols that can usefully be reduced, then reducing them. This continues until the entire program has been read (or “SHIFTed”).  On each “REDUCE” step the parser will record information about the symbols that were popped so as to build up an abstract syntax tree which records the structure of the program.  So each entry on the stack holds not just a symbol and a state, but also some record of what was parsed down to that symbol.

At every “REDUCE” step, the non-terminal that results is the right-most non-terminal in the interim sentence (because it is on the top of the stack, and everything further to the right is terminals still to be read).  So if the series of steps that result from LR parsing were played in reverse as a derivation, each step would involve producing something from the right-most non-terminal.  This is where the “R” in “LR” comes from. The parse is the reverse of a “Rightmost derivation” from the grammar.  This contrasts with “LL” grammars where the parse follows a “Leftmost derivation”.

There is an important word above that needs special attention.  It is “usefully” as in “until we have a set of symbols that can usefully be reduced”.  It is important that after we perform the REDUCE step, the resulting interim sentence can still be derived in the language.  So we always need to know what symbols are permitted at each stage in the parse.  This is what the “state” of the state machine provides us with.

A “state” really belongs between two symbols.  The “current” state is between the symbol on the top of the stack, and the next symbol to be read from the input.  When we push a symbol onto the stack, we push the current state with it, so that when that symbol is later popped, we can collect the state that preceded it too.

LR parsing step by step

Now that we have the big picture, the best way to understand the details is to watch how the parsing works.

The data that we have are:

  • The grammar, as a set of productions.  Each production has a ‘head’ which is a non-terminal symbol, and a ‘body’ which is zero or more symbols, some may be terminal, others non-terminal.  Each non-terminal may be the head of multiple productions, but must be the head of at least one.
  • An ‘action’ table which is a mapping from a state plus a lookahead to an action.
    The ‘state’ we have already introduced and will be described in more detail shortly. The “lookahead” is the next terminal to be read.  As we need to be able to look at the next symbol, the parser is said to have a look-ahead of “1” and is sometimes written “LR(1)”.
    The ‘action’ is either “SHIFT”, or “REDUCE(production)”.  i.e. the “REDUCE” action must identify which production is being reduced.  If the production has “START” as its head, then it is sometimes referred to as “ACCEPT”.  Once the ACCEPT action happens we have parsed a complete sentence and must have finished.
  • A “go to” table.  This is a mapping from state and symbol to a new state.  Whenever we push a symbol, whether due to a SHIFT or a REDUCE, we perform a lookup on the current state for the pushed symbol and that gives us the new state.

Given these data the parsing process is quite straightforward (a sketch of the loop in code follows the list below).

  1. Start with an empty stack, with the “0” state current, and with the first symbol from the input in the look-ahead buffer.
  2. Look in the ‘action’ table for the current state and the symbol in the look-ahead buffer.
  3. If the action is “SHIFT”, push the token and the current state onto the stack, look up the shifted symbol in the go to table for the current state, and save the result as the current state.  Then go back to step 2.  Note that we don’t actually need to store the symbol in the stack, but it can be useful for tracing.
  4. If the action is “REDUCE”, see how many symbols are in the body of the reduced production, and pop that many entries off the stack.  The state stored with the last entry popped becomes the current state.  The entries popped must match the symbols in the production – the states will guarantee that.  This is why we don’t really need to store the symbols.
    Having popped those entries, we do any production-specific processing to build an abstract syntax tree and push the result on to the stack together with the current state (which was updated by the ‘pop’).
    Finally we look up the current state and the head of the reduced production in the go to table and set the new state.  Then if that reduced symbol was ‘start’ we finish, else we go back to step 2.
  5. But what if the action table doesn’t contain an entry for the current state and the symbol in the look-ahead buffer?  That means there is an error.  The symbol is “unexpected”.  We have to invoke the error handling procedure.
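Here is that loop as a sketch in Python (purely illustrative; the table shapes are assumptions I am making for the example, not the format any real generator produces): “action” maps (state, terminal) to either 'SHIFT' or the index of the production to reduce, “goto” maps (state, symbol) to the next state, and each production is a (head, body, build) triple where “build” constructs the AST value.

def parse(tokens, action, goto, productions, start='start'):
    # Steps 1-5 above.  `tokens` yields (terminal, value) pairs and ends
    # with an end-of-input terminal that the tables know about.
    stack = []                        # entries are (saved state, AST value)
    state = 0
    tokens = iter(tokens)
    lookahead, value = next(tokens)
    while True:
        act = action.get((state, lookahead))
        if act == 'SHIFT':            # step 3
            stack.append((state, value))
            state = goto[(state, lookahead)]
            lookahead, value = next(tokens)
        elif act is not None:         # step 4: act is a production index
            head, body, build = productions[act]
            popped = [stack.pop() for _ in range(len(body))][::-1]
            if popped:
                state = popped[0][0]  # state stored with the last entry popped
            ast = build([v for _, v in popped])
            if head == start:
                return ast            # the ACCEPT case
            stack.append((state, ast))
            state = goto[(state, head)]
        else:                         # step 5: no action means an error
            raise SyntaxError("unexpected %r in state %d" % (lookahead, state))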

Error handling in an LR grammar

The easiest way to handle errors is just to throw our hands up in horror, report the error, and abort.  But that is not always most useful.  The preferred approach is to try to resynchronize with the grammar by throwing away a few symbols.  Often the error will be localized and if we can just get past it we can start doing useful parsing again.

Some writers suggest synthesizing symbols as well as discarding them.  For example you might guess that a semicolon was missing and so insert it.  I don’t believe that idea adds anything useful.  As you will see we do synthesise a single symbol which is the “error” token.  Various error productions can be used to transform this into any nonterminal which may be needed.  So while symbols might be inserted as well as deleted, that is just a side effect of the error productions in the grammar.  The parser mainly focusses on discarding unusable symbols.

At the point where we detect an error there are two obvious symbols that could be discarded.  We could discard the symbol in the look-ahead buffer, or the symbol on the top of stack.  Or both.  Or several in each direction.  Ultimately we want to get to a situation where the state, possibly popped off the stack, has an action for the symbol in the look-ahead buffer.  Choosing what to discard to get there cannot be done reliably without assistance from the language designer.

The language designer should “know” what the key structural elements and syntactic flags are.  Some symbols will occur often and have a strong meaning, so aiming for one of those will be most likely to get back on track properly.  The language designer can provide this information to the parser by specifying “error productions”.

“Error” is a special terminal symbol that would not normally be found in the input stream. However it can appear in the body of a production and this makes it an “error production”.

A simple example is:

statement -> ERROR ";"

This says that an error followed by a semicolon can reasonably take the place of a statement. i.e. a semicolon is a strong syntactic symbol that nearly always means the end of a statement, even if there is garbage before it.

With a number of these error productions in the grammar, the task of error handling becomes quite easy.

Firstly the parser pops entries off the stack until the current state has an ACTION for ‘ERROR’.  That means it is a state which can cope with an error production.  In the above example, it is a state which is expecting a statement.

Then the parser shifts in the ERROR symbol (updating the state with the appropriate ‘go to’)  and then starts discarding input symbols until it finds one for which the current state has an action.  At this point the parser is synchronized again and it can go on to perform the appropriate action as steps 2 and later above.
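Sticking with the same hypothetical table shapes as the parsing sketch above, the recovery procedure is only a few lines.  'ERROR' here stands for the special error terminal; the function name and signature are my own.

def recover(stack, state, action, goto, tokens, lookahead, value):
    # Pop states until one can shift the ERROR token.
    while action.get((state, 'ERROR')) != 'SHIFT':
        if not stack:
            raise SyntaxError("no state can handle ERROR; giving up")
        state, _ = stack.pop()
    stack.append((state, None))        # shift ERROR (it carries no value)
    state = goto[(state, 'ERROR')]
    # Discard input symbols until the new state has an action for one.
    while (state, lookahead) not in action:
        lookahead, value = next(tokens)
    return state, lookahead, value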

Building the tables

All that remains is to build the state table and the action table.  These will require manipulating various sets, so you better dig something suitable out of your ADT library.

State and the go to table.

The purpose of the ‘state’ is to be a record of how we got to where we are.  As there are a collection of states on the stack, the current state doesn’t need to be a complete record – a partial record can be sufficient. So as we build a description of each state we may need to throw away information. When we do that we must acknowledge it and be clear why that information can be discarded.

“where we are” is at some point in some production.  We have (hopefully) already seen some of the symbols in the body of some production, and we may have some more that we are expecting.  However we could concurrently be at different points in different productions.  This can happen in two ways.

Firstly, and most simply, if we are in the body of some production just before the non-terminal “X”, then we are simultaneously at the start of every production with “X” as its head.

Secondly, we may not yet know which of several productions we are in.  There could easily be two or more productions that start with some set of symbols, but finish differently.  While we are in that common prefix we need to assume that we are in all of the productions at once, and as more symbols become apparent, the number of possible productions we are in will decrease until it is just one.  That will be about when we REDUCE the production into a non-terminal.

To model these ideas we define a “state” to be a collection of “items”. An item is a particular production combined with an offset in the body of that production.  The offset is sometimes referred to as “DOT” so an item might be represented as:

expression -> expression "+" . term

This is a production which says that an “expression” can produce an “expression” followed by a “+” and then a “term”.  Here the production has been made into an item by placing a “dot” which you can see between the “+” and the “term”.

This item says that we might be in the process of parsing an expression and have seen an expression and a plus sign.  Of course we might be somewhere else where a “+” can follow an expression too, depending on what other items are in the state.

So a state is a collection of these items.  They are all the possible positions in a derivation that we could be in.  For all items in a state, the symbol immediately before DOT will be the same, unless DOT is at the start of the item.  Other symbols might be different.

There are three different sorts of item, each important in their own way.  They are distinguished by what follows the DOT.

A “completed item” has nothing following the DOT.  DOT is at the end of the production.  A “terminal item” has a terminal following the DOT.  A “non-terminal item” has (you guessed it) a non-terminal symbol following the DOT.  See if you can guess how each of these is used?

We find all possible states by a repetitive process which finds all possible “next” states from a given state, and keeps doing this to the known states until no more new states can be found.

The starting point of this search is the state formed from all productions that have the “start” symbol as the head.  For each such production (normally there is only one for simplicity) we form an item with DOT at the start.  This collection of items form the “zero” state, which is the state we start at when parsing.

Given a state we find other states by following a 2-step process.

Firstly we “complete” the state. For every non-terminal item in the state, we find all productions with that non-terminal as the head and add an item for each such production with DOT at the start (if that item isn’t already present). This captures the first way that we can have multiple items in a state.

Once we have added all possible “DOT-at-the-start” items we make a list of all symbols which follow “DOT” in any production.  For each of these symbols we can potentially make a new state (though it is possible the state already exists).

To make a new state given a completed state and a symbol we discard all items for which DOT is not followed by the given symbol, and advance DOT forward one for all items where DOT is followed by the given symbol.  If this state has not already been added to the state list, it must be added, otherwise the pre-existing state with the same set of items must be found.

For each symbol that produces a state from our completed state, we add an entry to the go to table, from the old state via the symbol to the “new” state.
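Here is that two-step process as a sketch in Python (again illustrative only, and not the real generator’s representation): productions are assumed to be (head, body) pairs with the body a tuple of symbols, and an item is a (head, body, dot) triple.

def close(items, productions):
    # Complete a state: for every item with DOT before a non-terminal,
    # add a DOT-at-the-start item for each production of that non-terminal.
    state = set(items)
    added = True
    while added:
        added = False
        for head, body, dot in list(state):
            if dot < len(body):
                for h, b in productions:
                    if h == body[dot] and (h, b, 0) not in state:
                        state.add((h, b, 0))
                        added = True
    return frozenset(state)

def successors(state, productions):
    # For each symbol after DOT, keep only the items with that symbol
    # after DOT, advance DOT, and complete the resulting state.
    nxt = {}
    for head, body, dot in state:
        if dot < len(body):
            nxt.setdefault(body[dot], set()).add((head, body, dot + 1))
    return {sym: close(items, productions) for sym, items in nxt.items()}

def build_states(productions, start='start'):
    # Repeat until no new states appear, recording the 'go to' table.
    state0 = close({(h, b, 0) for h, b in productions if h == start}, productions)
    states, goto, todo = [state0], {}, [state0]
    while todo:
        st = todo.pop()
        for sym, new in successors(st, productions).items():
            if new not in states:
                states.append(new)
                todo.append(new)
            goto[(states.index(st), sym)] = states.index(new)
    return states, goto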

As noted we discard items when creating a new state.  This must be justified.

All the discarded terminal and non-terminal items will still appear in some other state which is created from the current one.  So they aren’t lost, just elsewhere.  This reflects the fact that as we consume more symbols from the input we narrow down the range of possible parses, so we must naturally narrow the breadth of the state.

The completed items which are discarded are different.  No matter which new state we go to, all of those items will be gone, so the history they record of how we got here will be gone too.  However we still have it further up the stack.  If the current state has dot at the end of a production, then the state on top of the stack will have dot before the last symbol in that same production, and the state below that will have dot before the second last symbol, and so on.  Somewhere there will be DOT at the start of that production, and in that state will be recorded the production which led to this one.  And so on up the production chain, such that the first state, which records the START production, will still be on the stack, at the very bottom.  So dropping these doesn’t lose any information as it is still on the stack.

Once we have applied the above 2-step process to all states, and doing it to the last state didn’t create any new state, we have a full set of states, and we have a complete go to table.

Hopefully the process of creating these states makes sense in the context of how the go to table is used during parsing.  A single state records all the productions that could be happening at each point in the parse so far, and the set of states in the stack form a complete record of where we are up to, and what can happen next.

In particular, by examining the current state we can know what actions are possible at each different point. This information is used to construct the “action” table.

The Action table

To produce the action table for a particular state, we start by examining each of the items in that state.  We have already done all that we can with the “non-terminal” items, so it is just the “terminal” and “completed” items that interest us now.

For each terminal item a reasonable action is to SHIFT the next input symbol if that symbol is the symbol after DOT.  So all the symbols that appear after DOT are added to the action table with “SHIFT” as their action.

Note that it isn’t really necessary to store  these in the action table.  The “go to” table for that state has exactly the same set of terminal symbols in it (no more, no less), so we could instead look for the look-ahead symbol in the “go to” table.   If it is present we assume the “SHIFT” action.

For each completed item, a reasonable action would be to REDUCE that particular production.  But we don’t currently know how to store these in the action table – which look-ahead symbol to associate them with.  Also it might be appropriate to REDUCE a production rather than SHIFT a new symbol.

It is in the different ways to address this question that the different flavours of LR grammars become apparent.  So we will address a few of them individually.

LR(0) grammars

LR grammars are sometimes called LR(k) grammars where “k” is the number of symbols that we look ahead in order to decide the next action.  Most often LR(1) grammars are used.

An LR(0) grammar is one in which no look ahead is done.  For this to work there must be no questions raised by any state.  That is, a state may either have exactly one completed item, or some number of terminal items.  It may never have both and may never have two different completed items, as these would require a decision and we have no look ahead and so no basis to make a decision.

LR(0) grammars are too simple to be really interesting.  They require every production to end with a terminal which doesn’t occur anywhere  but at the end of productions.  So if every production was enclosed in brackets that would work.  However that nearly limits LR(0) to Lisp, and Lisp doesn’t really need a shift/reduce parser at all.

So LR(0) grammars are interesting primarily as a starting point, not for practical use.

LR(0.5) grammars

This is a label that I made up myself, as I think it describes a useful set of grammars but I have not seen it described elsewhere.  Where an LR(0) grammar uses no look-ahead, an LR(0.5) grammar uses one symbol of look-ahead, but only half the time.  Look-ahead can guide the choice of whether to SHIFT or REDUCE, but not the choice of which production to reduce.

So for a grammar to be LR(0.5), each state may have at most one completed item in its item set.  When that is the case, all terminal items cause a SHIFT entry for the appropriate terminal to be added to the Action table, and all other possible entries in the Action table are set to REDUCE  the one completed item.  If there is no completed item then all other entries in the Action table lead to error handling.

As before, you wouldn’t actually bother storing this Action table. You would just record the one REDUCE action as a default.  Then if the look-ahead symbol is present in the go to table, the parser performs SHIFT, otherwise it performs the default REDUCE or triggers an error.
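That per-token decision collapses to something like the following sketch.  In this illustrative Python, “goto” is the go to table keyed by (state, symbol) and “default_reduce” maps each state to the production of its single completed item, or None; both names are assumptions of the sketch.

def next_action(state, lookahead, goto, default_reduce):
    # SHIFT if the look-ahead can be shifted from this state,
    # otherwise take the state's one default reduction, otherwise error.
    if (state, lookahead) in goto:
        return 'SHIFT'
    if default_reduce.get(state) is not None:
        return ('REDUCE', default_reduce[state])
    return 'ERROR'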

I have seen it suggested that a weakness of this approach is that it doesn’t detect an error at the earliest possible moment.  We may have to wait for several reductions to happen until we hit a state with no possible reductions, and which does not have a ‘go to’ for the current look-ahead token.  Again, I don’t really understand this logic – those reductions don’t consume any input symbols so the error is still detected at the correct place in the input text. But maybe I’ll understand better once I’ve had some more experience, so stay tuned.

In my experiments with a grammar for Ocean, I’ve found that I never have two completed items in the one state.  So a grammar such as this could well be sufficient to parse the language I will end up with.

This highlights an important aspect of my focus here.  I don’t particularly need a grammar parser that is suitable for any language I get handed.  I just need a parser which will handle the language that I will develop.  And I plan to develop a language with a very simple and regular syntax – one that is easy enough for a mere human to understand.  So I don’t want a lot of complexity.  So while LR(0.5) may not be good enough for a lot of use cases, it could still be good enough for me.

However it isn’t quite.  Not because my language won’t be simple, but because LR(0.5) cannot tell me exactly how simple my language is.

This is a bit like the fact that a lot of syntax in programming language isn’t needed for correct programs.  It is only needed for incorrect programs to help the programmer see exactly where it is incorrect.  Similarly if I get my grammar right, then LR(0.5) should be enough.  But if I get it wrong, I may not know if I only do an LR(0.5) analysis of it.

So I need more power in my grammar analysis and so must move on.

Precedence LR Grammars

This is a bit of an aside as Precedence grammars don’t help detect errors and inconsistencies in a grammar, they only help to fix them.  However they are useful to understand and as this is where they fit in the progression of complexity, this is where I will discuss them.

The idea with a precedence grammar is that the designer can make extra statements about different productions, giving them different precedence.  This can apply in two different ways.

Firstly it can be used to choose between two completed items in a state.  If we are trying to treat a grammar as LR(0.5) and find that some state has two completed items, we can “fix” the grammar by specifying a precedence for the various productions.  I’d like to give an example here but it would end up being a toy example of a completely useless grammar and so wouldn’t really help. I personally cannot see why you would ever want a grammar which had two completed items in the same state.  It means that some sequence of input tokens could be treated as one thing or as another depending only on what comes next.  That sounds like a design mistake to me.  Maybe I’ll eat my words later, but for now this means I cannot find a useful example – sorry.

The other way that precedence can be useful is much easier to explain.  It concerns the choice between SHIFT and REDUCE.  As noted above, an LR(0.5) will always SHIFT when it can, and only reduce when it cannot.  That might not be what you want.

Here there is a classic example: expression parsing.  Consider this grammar.

expression -> expression "+" expression
            | expression "*" expression

A state in this grammar could have an item set containing

expression -> expression "*" expression .
expression -> expression . "+" expression

That is, we have parsed an expression, a multiplication symbol and another expression, and now we are faced with a choice of REDUCING the multiplication to a single expression, or SHIFTing in a plus sign.

The mathematically standard thing to do would be to REDUCE as multiplication has higher precedence than addition.  However the LR(0.5) grammar parser would shift by preference which would produce the “wrong” result.  So this grammar is “wrong”, but is easily fixed by giving the multiplication production a higher precedence than the addition production.  The grammar processing step would then notice that a REDUCE step has a higher precedence than the SHIFT, and would remove the SHIFTed symbol from the “go to” table.

In order to make repeated use of the same operations (e.g. “a + b + c”) group correctly (as “(a + b) + c”) we would also want to reduce an addition before shifting another plus sign. So a production in a “completed item” might need a higher precedence than the same production in a “terminal item”.  To get a completely general solution, we require each precedence level to be left-associative or right-associative.  If the former, then reduce beats shift.  If the latter, then shift beats reduce.

These precedences can be useful, but they aren’t really required.  The same effect can be achieved with more non-terminals. For example

expression -> expression "+" term
            | term
term -> term "*" factor
      | factor

By giving different names to things that are added and things that are multiplied, and by putting different names on either side of the operations, the precedence and associativity can be specified implicitly and with complete generality.

Which of these forms you prefer is very much a subjective issue.  There is little practical difference, so it is best to choose the one that makes most sense.  I think I prefer the latter, though the former is briefer and that is not insignificant.

The one practical difference that exists is that the second form involves more REDUCE steps during parsing.  Reducing

term -> factor

doesn’t really do much, but is a step which is completely unnecessary in the first grammar which needed precedence to disambiguate.  So if you find your parser spending too much time reducing trivial productions, it might be useful to re-write it in a form that doesn’t require them.

SLR(1) Grammars

So far we have looked at what we can do with just the “go to” tables and some simple conventions or annotations.  But it is possible to extract more details from the grammar itself and see if we can resolve some potential conflicts automatically.

The first of these is called “SLR” for “Simple Left-Right” grammar.

SLR(1) adds the computation of “FOLLOW” sets.  For each non-terminal “s”, “FOLLOW(s)” is the set of all terminals which could possibly follow “s” anywhere in any sentential form derived by the grammar.  This is then used to build the ACTION tables.

When we have a completed item, we look at the symbol in the head of that item, find the FOLLOW set for that symbol, and for every terminal in the FOLLOW set, add an Action to REDUCE that production if the given terminal is the look-ahead symbol.

This will often clearly identify which symbols should trigger a reduction (and which reduction to trigger), and which symbols should be shifted.  Consequently it also identifies which symbols are in error.

If there is any overlap between the FOLLOW sets of reducible productions, or between a FOLLOW set and the SHIFTable terminals, then the grammar is said to not be SLR(1).  If there is no conflict, the grammar is fine and we have a well defined action table.

Computing the FOLLOW sets requires three simple steps.

Firstly we flag all the symbols that can produce an empty string.  Normally relatively few symbols can do this, though any time you have a list that could be empty, you will need a “nullable” symbol (as these are called).  These are found by an iterative process.

Firstly we flag all the non-terminals which are the heads of productions with no symbols on the right hand side.  Then we flag all non-terminals where all the symbols on the right hand side are already flagged.  If there were any, then we repeat the process again and again until we cannot flag any more non-terminals as nullable.  Then we can stop.

Having done this, we now proceed to the second step which is to build a FIRST set for every non-terminal.  You can include terminals as well if it makes it easier, however the FIRST set for a terminal contains simply that terminal, so it is quite easy to make a special case of terminals.

“FIRST(s)” is the set of all terminal symbols which could be the “first” terminal in an expansion of “s”.  We compute this using a similar iterative approach to calculating “nullable”.

We walk through all the productions in the grammar and add to FIRST() of the head symbol, the FIRST set of every symbol in the body which is preceded only by nullable symbols.  So the FIRST set of the first symbol on the RHS is always added, and the FIRST set of the second symbol is too if the first symbol was flagged as being nullable.  And so on.

We continue this process checking each production again and again until no more changes in the FIRST sets can be made.  Then we stop.

Once we have the nullable flag and the FIRST sets we can proceed to the third step which is to create the “FOLLOW” sets.  This again follows a pattern of repeating a partial calculation until no more change can be made.

To find “FOLLOW(s)” we find every occurrence of “s” in the body of any production, and add FIRST(t) where “t” is the symbol after “s”, or any symbol following “s” but only separated by nullable symbols.  If all of the symbols following “s” are nullable, then we also add FOLLOW of the head of the production.

Adding the FIRST of following symbols only needs to be done once, as FIRST is stable and won’t change any more.  Adding FOLLOW of the head might need to be done repeatedly as other changes to FOLLOW sets might need to flow back to this one.

Again, once we don’t find any more changes we can stop.  At this point we have complete FOLLOW sets and can determine if the grammar is SLR(1), and if so: what the action table is.  If the grammar is not SLR(1), it might be fixable with precedence annotations.
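Pulling the three steps together, a self-contained sketch of the computation might look like this.  It is illustrative Python only: productions are assumed to be (head, body) pairs, anything that never appears as a head is treated as a terminal, and '$eof$' is a made-up end-of-input marker.

def nullable_first_follow(productions, start):
    heads = {h for h, _ in productions}
    nullable = set()
    first = {h: set() for h in heads}
    follow = {h: set() for h in heads}
    follow[start].add('$eof$')            # end of input can follow the start symbol
    def first_of(sym):
        return first[sym] if sym in heads else {sym}   # FIRST of a terminal is itself
    changed = True
    while changed:                        # step 1: the nullable flags
        changed = False
        for h, body in productions:
            if h not in nullable and all(s in nullable for s in body):
                nullable.add(h)
                changed = True
    changed = True
    while changed:                        # step 2: the FIRST sets
        changed = False
        for h, body in productions:
            for s in body:
                if first_of(s) - first[h]:
                    first[h] |= first_of(s)
                    changed = True
                if s not in nullable:
                    break
    changed = True
    while changed:                        # step 3: the FOLLOW sets
        changed = False
        for h, body in productions:
            for i, s in enumerate(body):
                if s not in heads:
                    continue              # only non-terminals get FOLLOW sets
                new, all_nullable = set(), True
                for t in body[i + 1:]:
                    new |= first_of(t)
                    if t not in nullable:
                        all_nullable = False
                        break
                if all_nullable:
                    new |= follow[h]      # everything after "s" was nullable
                if new - follow[s]:
                    follow[s] |= new
                    changed = True
    return nullable, first, follow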

LALR(1) grammars

SLR(1) grammars are pretty general but not completely so.  If a non-terminal is used in two quite different parts of the grammar then it is possible that it is followed by different symbols in different places.  This can result in it reporting conflicts that won’t actually happen in practice.  This is because we have been too general in generating the FOLLOW sets, assuming that the FOLLOW set of any symbol is the same in every state.  Often it is, but not always.

If we are going to bother generating FOLLOW sets, and particularly if we are doing it for the purpose of checking if there are any unexpected possible inconsistencies in the grammar, then we may as well do it properly.  Doing it properly is called “LALR(1)”, where “LA” stands for “LOOK AHEAD”.

If you look in the literature you can find lots of different ways of calculating the LALR LOOK-AHEAD sets.  Many of these start with the states we have from the LR(0) calculation and go from there.  I don’t really like those – I find them confusing and they don’t really make it clear what is happening (without lots of head scratching).  So I’m presenting my way of doing it.  I didn’t find this in the literature, but maybe that is because I didn’t understand all that I read.  I’m sure someone did come up with it before me.

This procedure requires the FIRST sets and the ‘nullable’ flag developed for the SLR grammars, so we will keep them but discard the FOLLOW sets.  In their place we will generate lookahead or LA sets, one for each item in each state. So LA(s,i) is a set of terminals which we expect to see after the item ‘i’ is reduced in or following state ‘s’.

If we go back to the description of items and state and go to table it is not hard to see in retrospect that we lost some information, at least if we were considering an LR(1) style grammar.  We didn’t include the look-ahead symbols in our description of state, but now we must.

So: each item in a state is not only a production and an offset in the body of that production.  It must now also contain a set of terminals which can follow after that production (i.e. after the body or, equivalently, after the head).  For the items based on the “start” symbol that are used to populate the first state (state zero), the look-ahead sets contain only the special “end-of-file” terminal.

When we complete a state by adding an item for each production from each non-terminal which follows “DOT”, we calculate the LA set for that item in much the same way that we calculated FOLLOW.  We add together the “FIRST” sets for all subsequent symbols which are separated from the target non-terminal only by nullable symbols, and use this sum of FIRST sets as the LA set of each item we make from that target non-terminal.

If all of the symbols following a non-terminal are nullable (or if there are no symbols following it), we need to add in the LA set for the source item.

So if a state contains the item:

A -> B C . D E {L}

Where “{L}” is the look-ahead set, then when we add an item based on a production with “D” in the head, the look-ahead set must be set to FIRST(E), and if E is nullable, this must be unioned with “L”, so

D -> . X Y Z { FIRST(E) + L}
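
As a sketch of just this step, using a hypothetical item representation of my own (a head, a body, a dot offset counting the symbols already matched, and that item's LA set), the look-ahead for items added during closure might be computed like this:

# Sketch: body[dot] is the non-terminal being expanded, body[dot+1:] are the
# symbols after it, and 'la' is the source item's own look-ahead set.
def closure_la(item, grammar, nullable, first):
    head, body, dot, la = item
    new_la = set()
    for sym in body[dot + 1:]:
        new_la |= first[sym] if sym in grammar else {sym}
        if not (sym in grammar and sym in nullable):
            return frozenset(new_la)        # a non-nullable symbol stops the scan
    return frozenset(new_la | la)           # everything after was nullable: include L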

When we create a new state using selected items  from an old state, the LA sets simply get copied across with them.

Now we come to the interesting part.  If you look back at how we generated states before, you will recall that each time we generated a new state we had to check whether it was in fact new, or had been seen before.  If it had been seen before we just reused the old state.

Now that we are carrying extra information in a state we need to know what to do with that.  There are two options.

Firstly, we can decide that the look-ahead set is an important part of the state (which it is) and so if the look-ahead sets are different, we have to create a new state.  This will mean that, in general, the LALR process will create more states than the LR(0) or SLR(1) processes.  Often many many more.  If you choose this approach you actually end up with “canonical LR(1)” which is described next.  So let’s not get ahead of ourselves.

If you decide that, while important, the look-ahead sets shouldn’t justify making a new state, then you can instead update the old state by merging in (i.e. set-union) the LA sets of the new state with those of the old state.  This loses a little information, so there are grammars which will end up with conflicts after the LALR process but which would work fine with canonical LR(1).  But in many cases there won’t be conflicts, so LALR is enough.

As we have updated the LA sets of a state, and as the LA sets can propagate through to other states, we need to re-process every state where we changed the LA set.  This can result in processing each state twice or even more on average.  It appears to be only a constant increase in effort, so should not be too burdensome.
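
A sketch of that merging step, under the same assumptions as the earlier fragments (a state here is just a dictionary holding a core and per-item LA sets; the real parsergen code is organised differently):

# Sketch of the LALR choice: if a newly built state has the same core (a
# frozenset of (production, dot) pairs) as an existing one, union the LA sets
# into the old state; the caller re-processes that state if its LA sets grew.
def add_state(states, core, la_sets):
    for s in states:
        if s["core"] == core:
            grew = False
            for item, la in la_sets.items():
                if not la <= s["la"][item]:
                    s["la"][item] |= la
                    grew = True
            return s, grew
    s = {"core": core, "la": {item: set(la) for item, la in la_sets.items()}}
    states.append(s)
    return s, True                       # a brand new state always needs processing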

After we have generated all the new states, and propagated all LA set changes as far as they will go, we have complete LALR(1) LA sets from which we can form action tables.

Again, these might result in completely unique actions for each terminal, or might result in conflicts.  If the former, we say the grammar is LALR(1). If the latter, it is not.

LALR(1) usually provides enough power without too great a cost and seems to be the best trade off.  As noted I am most interested in LALR not for the high quality action tables it produces (I don’t want to use them) but for the fact that it will calculate the LA sets and tell me if there are any cases where both a SHIFT and a REDUCE might be appropriate, and can provide extra information to tell me why a state might have two different possible REDUCE actions.  This information will, I think, be invaluable in understanding the grammar as I am creating it.

Canonical LR(1) grammars

The “canonical LR(1)” style of grammar is the original one described by Knuth before others found ways to simplify it and make it more practical.  For LR(1), “usefully” (as in “until we have a set of symbols that can usefully be reduced”) requires a lot more states than “LR(0)” and all the intermediate forms.  These states reflect the various different possibilities of look-ahead and generally aren’t very interesting.

There certainly are grammars which are LR(1) but not LALR(1), but I don’t think they are very interesting grammars.  Still they might be useful to some people.

The only conflicts that LALR(1) will report but that LR(1) will know are false conflicts are REDUCE-REDUCE conflicts.  i.e. two different reducible productions in the same state.  As I have already said I think this is a design mistake so I have no use at all for LR(1).  If LALR(1) reports a conflict, then I think it is an error even if it is not really a conflict.

To construct the states for LR(1) we follow the same approach as LALR, including an LA set with each item in each state.  However when we look to see if a new state already exists, we require that both the core items and their LA sets are identical.  If they are, then it is the same LR(1) state.  If not,  we need a new state.  No merging of LA sets happens, so we don’t need to repeatedly process each state.

This produces more states.  I have one simple grammar which results in 27 LALR(1) states and 55 LR(1) states.  I have another I’m playing with which currently has 174 LALR(1) states and 626 LR(1) states.

An Implementation.

Theory is never really interesting without a practical implementation to show how it works.  One of the great appeals of computer programming is that creating practical implementations is quite easy (unlike economics for example, where experimenting with national economies requires way too much politics!)

So I have written a parser generator that realises almost all of this theory.  It can use any of the different analysis approaches listed above.  The generated parser only uses LR(0.5) though.  I don’t bother building a list of reducible productions and choosing between them based on look-ahead as I don’t think this will ever actually be useful.

Also the generator does not yet handle precedence annotations.  The annotations can be given but they are ignored.  I simply ran out of time and wanted to get this published before yet another month disappeared.  I hope to add the precedence handling later.

The link above is to the version as it is today.  I have other plans to enhance it as will be seen in my next post but I wanted the code linked here to represent only the ideas listed here.

If you want to try this out, you can:

git clone git://ocean-lang.org/ocean
cd ocean/csrc
make; make
./parsergen --LALR calc.cgm
./parsergen --LR1 calc.cgm

This will print a report for the grammar showing all of the states complete with items and LA sets.  This might help with following the above explanation which is, at some points, a little dense.

Note that if you want to create your own grammar, you need a “markdown” formatted file prepared for `mdcode` processing.  Hopefully `calc.cgm` is a usable example.

 

Posted in Language Design | 2 Comments

Lexical Structure

It is traditional to divide the parsing of a program into two conceptually separate stages – the lexical analysis which extracts a stream of tokens from a stream of characters, and the grammar matching stage which gathers those tokens into grammatical units such as declarations, statements, and expressions.  Recognising the value of tradition, Ocean will similarly have separate lexical and grammatical stages.  This note will focus on the lexical stage.

Future language design decisions will refine many details of the lexical structure of the language, however there is a lot of commonality among current languages and we can build on that commonality to establish a base approach to lexical analysis which can then be fine tuned when the details of the language are known.

This commonality suggests that the grammars of programming languages are composed of identifiers, reserved words, special symbols, numeric literals, and string literals.  Comments need to be recognised but don’t form tokens for the grammar.  White space is mostly treated like comments – recognised but not tokenised.  However line breaks and leading indents are sometimes significant in different ways, so we need to decide whether and how they might be significant in Ocean.

Identifiers

Identifiers are mostly strings of letters and digits, though they don’t start with a digit. Sometimes other characters are allowed.  This pattern is most easily summarised as a set of characters that can start an identifier and another set of characters that can continue an identifier.  Fortunately Unicode provides well defined character classes “ID_Start” and “ID_Continue” which are suitable for exactly this purpose, providing we base the language on Unicode.

While ASCII was the old standard and is still very useful, Unicode does seem to be the way of the future and it would be foolish to ignore that in the design of a new language.  So an Ocean program will be written in Unicode, probably in the UTF-8 encoding (though that is an implementation detail) and will use ID_Start and ID_Continue as the basis for identifiers.

Beyond that it may be desirable to add to these character sets.  ID_Start does not contain the underscore which many languages do allow.  Neither contains the dollar sign which, for example, Javascript allows in identifiers.

At this stage it seems best to decide that the lexical analysis engine should take a list of “start” characters and a list of “continue” characters to be added to ID_Start and ID_Continue, which can then be used to create identifiers.  “start” will contain at least the underscore.  The fate of ‘$’ and any other character will have to await future language decisions.
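
As an illustration only, Python’s unicodedata module can approximate ID_Start and ID_Continue from general categories, with the extra “start” and “continue” characters supplied as described above.  This is a rough approximation of the Unicode properties, not their exact definitions, and the extra-character lists are hypothetical.

import unicodedata

EXTRA_START = set("_")      # extra characters allowed to start an identifier
EXTRA_CONT  = set("_")      # extra characters allowed to continue one ('$' undecided)

def is_id_start(ch):
    # rough stand-in for Unicode ID_Start: letters and letter-like numbers
    return unicodedata.category(ch) in ("Lu", "Ll", "Lt", "Lm", "Lo", "Nl") \
        or ch in EXTRA_START

def is_id_continue(ch):
    # rough stand-in for ID_Continue: ID_Start plus marks, digits and connectors
    return is_id_start(ch) \
        or unicodedata.category(ch) in ("Mn", "Mc", "Nd", "Pc") \
        or ch in EXTRA_CONT

print(is_id_start("_"), is_id_start("3"), is_id_continue("3"))   # True False True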

Reserved Words

Reserved words are a specific subset of the class of words that can be used as identifiers. They are reserved for grammatical use, such as “if” and “while” and “else”.  We cannot know at this stage what the list will be, just that there will be a list which will need to be provided to the lexical tokenizer.  These words must only contain valid identifier characters.

Special Symbols

These are like reserved words, but they are made up from characters other than those used in identifiers (or those which mark numbers and strings which come later).  One could imagine having tokens which combine identifier characters and non-identifier characters, such as “+A”, but that seems like a bad idea.  It is more natural to see that as two tokens “+” and “A”, and to let the grammar join them together as needed.

There is an important difference between reserved words and special symbols.  Words that aren’t reserved can still be valid identifiers, while symbols that aren’t special are probably illegal.  So if “while” is a reserved word but “whilenot” isn’t, then the latter is simply an unrelated identifier.  However if “=” is a special symbol and “=/” isn’t, then the latter is not valid as a single token at all, and we must parse it as two tokens “=” and “/”.

So for identifiers and reserved words, we take the longest string of characters that is valid for an identifier and then decide if it is actually a reserved word.  For special symbols we take the longest string of characters which is a valid symbol and stop the current token as soon as we see a character which would not result in a valid token.  Thus any prefix of a special symbol must also be a special symbol.  This should not be a hardship as longer symbols are only chosen when the length is needed to distinguish from something shorter.
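
A small sketch of that difference, with hypothetical reserved-word and symbol lists (a real scanner would use the ID_Start/ID_Continue sets discussed earlier rather than isalnum):

RESERVED = {"if", "else", "while", "do", "use"}
SYMBOLS  = {"=", "==", "<", "<=", "+", "+=", "{", "}"}

def scan_word(text, i):
    # longest run of identifier characters, then check the reserved list
    j = i
    while j < len(text) and (text[j].isalnum() or text[j] == "_"):
        j += 1
    word = text[i:j]
    return ("reserved" if word in RESERVED else "identifier"), word, j

def scan_symbol(text, i):
    # extend only while the text so far is itself a known symbol, so every
    # prefix of a multi-character symbol must also be a symbol
    j = i
    while j < len(text) and text[i:j + 1] in SYMBOLS:
        j += 1
    if j == i:
        raise ValueError("not a known symbol: %r" % text[i])
    return "symbol", text[i:j], j

print(scan_word("whilenot x", 0))   # ('identifier', 'whilenot', 8)
print(scan_symbol("<=3", 0))        # ('symbol', '<=', 2)
print(scan_symbol("=/ a", 0))       # ('symbol', '=', 1) -- "=/" becomes two tokens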

So together with the list of reserved words, the lexical engine will need a list of special symbols.  These will probably be ASCII rather  than full Unicode.  While permitting Unicode is important, requiring it would probably be inconvenient.

Numeric literals

There are two questions we need to resolve for numeric literals: what character strings can be used, and what do they mean.  We only need to answer the first here, but it might be useful to dwell a little on the second too.

I really like the idea in the Go language that numeric literals don’t carry strong type information.  They are primarily a value, and a value can be used wherever it fits.  So “1e9” and “4.0” might be integers, and “34” might be a float.  I intend to copy this idea in Ocean and possibly carry it even further so that “34” and “34.0” are completely indistinguishable to the rest of the language (Go can see a difference in some cases).

Given my principle that any feature available to built-in types should also be available to user-defined types, it follows that if a numeric literal might be useful as an integer or a float, it should also be usable for any other type.  An important example is the “complex” type which might benefit from literal constants.  To express these fully we need to distinguish the “real” from the “imaginary” components and this is typically done using a trailing ‘i’ or ‘j’.

Generalising this suggests that a numeric constant should allow a short suffix which can be used by user-defined types in arbitrary ways.  The exact details of this will have to be left for a later stage of the language design.  For now, numeric constants will be allowed 1 or 2 following letters, which as yet have no specified meaning.

There is plenty of  precedent to follow for the numeric part:

  • “0x” means that the following digits are hexadecimal.
  • “0b” means that the following digits are binary.
  • A leading “0” implies octal, unless there is a decimal point or exponent.
  • An exponent is “e” or “E” followed by a decimal number which multiplies the preceding number by ten to the given power.
  • A “decimal point” can be used to specify a floating point number.

Not all of this precedent is necessarily useful though.  In particular the leading “0” meaning “octal” is an irregularity in the design as it is really “a leading 0 followed only by digits”.  This usage is also error prone as regular mathematics ignores leading zeros.  So Ocean will follow Python 3 and other languages and use “0o” for octal, making a leading zero only legal when followed by an ‘x’, ‘o’, ‘b’, or ‘.’.

Other possibilities don’t have such strong precedents but are still useful.  For example, “0x” can also introduce a floating point number with a hexadecimal point, where a “pNN” suffix gives a power of 2.  There is also the question of whether a decimal point can start a number, or whether a digit is required.

Another possibility to add to the mix is that, as long strings of digits can be hard to read, it is sometimes permitted to insert underscores to aid readability, much as spaces are often used when writing large numbers in natural-language text.  Some languages allow the underscores anywhere, others impose stricter requirements.  For example in Ceylon a decimal number can only have them every 3 digits, and in hexadecimal or binary they can be every 2 or every 4 digits.

Finally, some languages allow negative literals, whether by including a leading “-” in the number token, or by allowing a leading “_” to mean a negative literal.

To meet the goal of being expressive, Ocean should allow as many different ways to encode numbers as could be useful.  To meet the goal of minimising errors, it should exclude possibilities that could be confusing and don’t actually add expressivity.

So Ocean will allow “0x” hexadecimal numbers, “0o” octal numbers, and “0b” binary numbers, with or without a “decimal” point and a “p” exponent, and decimal numbers with or without a decimal point and an “e” exponent.  A decimal can only start with a zero if that is immediately followed by a decimal point and in that case the zero is required: a number cannot start with a point.  The “p” or “e” exponent marker can also be “P” or “E”, and the number following it is in decimal, with no leading zero, but possibly with a leading + or -.

As this is an experiment as much as a language I will try allowing the use of a comma for the decimal marker as well as a period.  This means that commas used in lists might need to be followed by a space.  Requiring that is a bit ugly, but actually doing it is always a good idea.  If this turns out to be problematic it can be changed.

All of these can have single underscores appearing between any two digits (including hexadecimal digits).  As another experiment, spaces will be allowed to separate digits.  Only a single space (as with underscores) and only between two digits.  Though this might seem radical it is less of a problem than commas as numbers are almost always followed by punctuation in a grammar.
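
A rough sketch of a recogniser for these forms, as a Python regular expression.  It covers the 0x/0o/0b prefixes, the optional point and p/e exponents, the short suffix, and single underscores or spaces between digits; it ignores the comma-as-decimal-point experiment and only gives the hexadecimal form a fractional part.

import re

DEC = r"[0-9](?:[_ ]?[0-9])*"               # digits with single underscores or spaces
HEX = r"[0-9a-fA-F](?:[_ ]?[0-9a-fA-F])*"
INT = r"(?:0|[1-9](?:[_ ]?[0-9])*)"         # no leading zeros on plain decimals

NUMBER = re.compile((
    "(?:"
    "0[xX]{HEX}(?:\\.{HEX})?(?:[pP][+-]?[0-9]+)?"   # hex, optional point and p exponent
    "|0[oO][0-7](?:[_ ]?[0-7])*"                    # octal
    "|0[bB][01](?:[_ ]?[01])*"                      # binary
    "|{INT}(?:\\.{DEC})?(?:[eE][+-]?[0-9]+)?"       # decimal, optional point and e exponent
    ")[A-Za-z]{{0,2}}"                              # optional 1- or 2-letter suffix
).format(HEX=HEX, DEC=DEC, INT=INT))

for text in ["1_000_000", "0x1fp-3", "0o755", "3.14e10i", "0.5", "007"]:
    m = NUMBER.match(text)
    print(text, "->", m.group(0) if m else None)    # "007" only matches its leading "0"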

All numeric literals are non-negative – the grammar will allow a prefix negation operator which will be evaluated at compile time.

This numeric part will be evaluated precisely, as can be achieved with the GNU “libgmp” bignum library.  Conversion to machine format will only happen at the last possible point in language processing.

String Literals

String literals are simply arbitrary sequences of characters surrounded by some quote character.  Or at least, almost arbitrary.

It can improve language safety to not allow literal newline characters inside string literals. Thus a missing close quote is easily recognised.  When multi-line string constants are required, one effective approach is the triple-quote used by Python and other languages. This is common and well understood, so Ocean will borrow the idea.

Inside a string, escapes may or may not be permitted.  These are typically a backslash (“\”) followed by one or more characters.  The minimum required would be “\n” for a newline, “\t” for a tab, and “\ooo” with octal digits for an arbitrary ASCII character.  We could go to extremes and allow “\U{Name of Unicode Character}” which allows the use of long Unicode character names.

This is an area where extension might be useful, so it is important that any construct which might be given a meaning in the future is illegal at first.  So all escapes not explicitly permitted are errors.

One particular feature of strings that I personally have never liked is that to include a quote inside a quoted string, you escape the quote.  This means that the simple rule “characters between quotes make a string” doesn’t apply and any parsing engine must recognise the escape character.  As this is my language I will choose that quotes cannot be escaped.  Instead Ocean will allow “\q” which will expand to whichever quote character starts and ends the string.

The escape sequences understood will initially be:

  • “\\” – the “backslash” escape character.
  • “\n” – newline
  • “\r” – return
  • “\t” – tab
  • “\b” – backspace
  • “\q” – quote
  • “\f” – formfeed
  • “\v” – vertical tab
  • “\a” – alert, bell.
  • “\NNN” – precisely 3 octal digits for an ASCII character (0-377)
  • “\xNN” – precisely 2 hexadecimal digits
  • “\uXXXX” – precisely 4 hexadecimal digits for a Unicode codepoint.
  • “\UXXXXXXXX”- precisely 8 hexadecimal digits for a Unicode codepoint.
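
A sketch of how these might be decoded, with “\q” expanding to whichever quote character opened the string.  This is illustrative only: a real scanner would report positions and handle truncated escapes rather than simply raising an error.

# Sketch: expand the escapes listed above; unknown escapes are errors.
SIMPLE = {"\\": "\\", "n": "\n", "r": "\r", "t": "\t", "b": "\b",
          "f": "\f", "v": "\v", "a": "\a"}

def unescape(body, quote):
    out, i = [], 0
    while i < len(body):
        if body[i] != "\\":
            out.append(body[i]); i += 1; continue
        e = body[i + 1]
        if e in SIMPLE:
            out.append(SIMPLE[e]); i += 2
        elif e == "q":
            out.append(quote); i += 2
        elif e in "01234567":                              # \NNN: exactly 3 octal digits
            out.append(chr(int(body[i + 1:i + 4], 8))); i += 4
        elif e == "x":                                     # \xNN: exactly 2 hex digits
            out.append(chr(int(body[i + 2:i + 4], 16))); i += 4
        elif e == "u":                                     # \uXXXX
            out.append(chr(int(body[i + 2:i + 6], 16))); i += 6
        elif e == "U":                                     # \UXXXXXXXX
            out.append(chr(int(body[i + 2:i + 10], 16))); i += 10
        else:
            raise ValueError("unknown escape: \\" + e)
    return "".join(out)

print(unescape(r"say \q hi \q\n", '"'))   # say " hi " followed by a newline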

There are three quote characters that are sometimes used: single quote, double quote, and back quote.  Ocean will almost certainly allow all three.  The lexical analysis code will certainly recognise any of the three that haven’t been claimed as a special symbol.  Though I’m leaving this open at present, I expect that strings surrounded by backquotes will not have any escape processing done, while the other strings will.

If two quote symbols appear together, that is simply an empty string.  If three appear, then it is a multi-line string.  The next character must be a newline and the closing triple quote must appear on a line by itself.  The newline after the first triple quote is not included in the string, the newline before the second triple quote is.

Escape processing happens inside multi-line strings in the same way as with single-line strings, so a triple-backquote should be used to avoid all escape interpretation.  In a multi-line string that allows escapes, an escape character at the end of a line hides that newline character.  This is the only way to achieve a multi-line string that does not end with a newline character.

With multi-line strings comes the question of indenting.  It would be nice to be able to indent the whole string to line up with surrounding code, and to have the indent stripped when the parsed token is turned into a string constant.  But the key question is: how exactly to specify what to strip.

One option would be to strip the same indent as was on the line that contains the start of the token.  However it is nice for values to be more indented than – say – the assignment that uses them.

Another option would be to strip the same indent as was on the line that contains the end of the token.  This places the programmer in more control and probably makes lots of sense.  It would only be safe to strip the exact sequence of characters that precede the close quote.  So if there is a mix of tabs and spaces, it must be the same mix for every line.

I’m not 100% comfortable with this, but it is certainly worth a try, so it will be how the first  attempt works.
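
So the first attempt amounts to something like this sketch: take the whitespace that precedes the closing triple quote and require it, verbatim, at the start of every line of the string.  Treating a mismatch as an error is my assumption here; blank lines are allowed through unchanged.

def strip_closing_indent(lines, closing_indent):
    out = []
    for line in lines:
        if line.startswith(closing_indent):
            out.append(line[len(closing_indent):])   # strip exactly the closing-quote indent
        elif not line.strip():
            out.append("")                           # blank lines need not carry the indent
        else:
            raise ValueError("line does not start with the closing-quote indent: %r" % line)
    return out

print(strip_closing_indent(["    hello", "      world", ""], "    "))
# ['hello', '  world', '']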

Comments

There are three popular styles of comments:

  • /* Block comments from C and others */
  • // single-line comments from C++ and BCPL
  • # single-line comments from C-shell and Bourne shell.

While others exist, these seem to be most popular, and others don’t add any extra functionality.  For the block comment, it is important to note that nesting is not permitted.  That is: the character pair “/*” may not appear inside a comment.  Allowing it does not significantly improve expressivity, and excluding it guards against forgetting or mis-typing the closing sequence “*/”.

It seems best to simply allow all of these forms of comments.  The only negative impact this might have is that it means the 3 introductory notations could not be used elsewhere, and of these, the “#” is the only symbol with significant uses in existing languages.  Where it is used, it is often for extra-language functions such as macro pre-processing. Features provided by these extra-language notations are probably best included in the main language in some way.  So at this stage, the plan for Ocean is to treat all three of these forms as comments.

Line breaks and indents

The use of context-free grammars to describe languages led to line breaks being largely ignored as syntactic elements.  Rather, a program is seen as a linear sentence in the language, and spacing is just used to improve readability.

However this approach allows the possibility of the visual formatting becoming out-of-sync with the actual structure of the program.  Further it results in syntactic elements, particularly semicolons, which are required by the language but appear unnecessary to the human reader.

Various approaches to address this have been tried in different languages.  Of these I find the approach taken by Python the nicest.  It may not be perfect, but recognising both the presence of line breaks and the level of indents seems valuable.

We will explore this issue more deeply when we look at using a grammar to parse the token sequence, but for now we need to know what tokens to include, and what they mean.

The key observation needed at this stage is that when a block of text is indented, it is somehow subordinate to the preceding line that was not indented.  Thus that first line doesn’t completely end until all the indented text has been processed.

So every line break will generate a “newline” token, however if subsequent lines are indented, the “newline” token will be delayed until just before the next line which is not indented.  Together with this, “Indent” tokens will be generated at the start of each line which is indented more than the previous line, with matching “Undent” tokens when that indent level terminates.  In consequence of this, indenting can be used to bracket syntactic objects, and to whatever extent newlines terminate anything, they only do so after any indented continuation is complete.

So for example

if A
   and B {
       X
       Y
       while C do
            Z
}

would parse as

if A INDENT and B { INDENT X NEWLINE Y NEWLINE while C do INDENT Z
NEWLINE UNDENT NEWLINE UNDENT NEWLINE UNDENT NEWLINE } NEWLINE

Note that INDENT and UNDENT come in pairs, and the NEWLINE you would expect before an INDENT comes after the matching UNDENT.  This will allow a multiple UNDENT, such as that following the “Z” to terminate a number of syntactic units.
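
A small sketch of a tokenizer that produces exactly this sequence for the example above.  It keeps a stack of indent widths and holds back each line’s trailing NEWLINE while deeper lines follow; it counts leading spaces only and splits words naively, unlike the real scanner linked below.

def indent_tokens(text):
    tokens, stack, pending = [], [0], False
    for line in text.splitlines():
        if not line.strip():
            continue                              # ignore blank lines in this sketch
        indent = len(line) - len(line.lstrip())
        if indent > stack[-1]:
            stack.append(indent)
            tokens.append("INDENT")               # deeper: open a level, keep NEWLINE pending
        else:
            while indent < stack[-1]:             # shallower: close levels, releasing NEWLINEs
                stack.pop()
                tokens += ["NEWLINE", "UNDENT"]
            if pending:
                tokens.append("NEWLINE")
        tokens += line.split()
        pending = True
    while len(stack) > 1:                         # close anything still open at end of input
        stack.pop()
        tokens += ["NEWLINE", "UNDENT"]
    if pending:
        tokens.append("NEWLINE")
    return tokens

src = "if A\n   and B {\n       X\n       Y\n       while C do\n            Z\n}\n"
print(" ".join(indent_tokens(src)))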

Interaction with literate programming

As previously discussed an Ocean program is, at least potentially, presented as a literate program which may have several code blocks from various parts of the document.  The above discussion of lexical analysis has assumed a simple sequence of characters, while in actual fact we have a sequence of blocks of characters.  Does that make a difference?

As blocks are always terminated by newlines, and as indenting is carefully preserved across blocks, the only issues that could arise would be with lexical elements which contain newlines but aren’t newlines themselves.  Of these there are two.

Firstly, block comments can contain newlines.  Therefore, is it acceptable for a comment that starts in one code block to be continued in another?  Almost certainly this would be a mistake and would lead to confusion.  It is possible that it might be useful to “comment out” a block of code, however if that is important, the language should have a more focused means to achieve the same without confusion.  Our literate programming syntax makes it possible to mark a block as an “Example” that is not included in the code, and a section reference line can easily be commented out with a single-line comment.  So comments must not span literate program code blocks.

Secondly, multi-line strings can contain newlines.  For much the same reasons as with comments, these strings will not be permitted to span literate program code blocks.  The fact that we strip the indent of the closing quote from all lines makes interaction with literate programming much cleaner than leaving all the indentation in.

Trying it out.

Having lots of ideas without any code isn’t much fun.  So to accompany this blog is my first release of a token scanner – written as a literate program of course.

It is somewhat configurable so that if I change my mind about spaces in numbers they are easy to remove.

This program is one of the reasons that this post took so long to arrive.  I had written a scanner prior to starting on the post, but wanted to re-write it as a literate program and wanted to incorporate ideas I had gained since writing it.  This took quite a while as I’m still new to the literate programming approach and my first draft had a rather confused structure.  However I’m really happy with the result and I think the code and design are better for the extra effort put in.

Posted in Language Design | 4 Comments

Literate programming?

I was going to start out by describing the lexical structure of Ocean, but as I thought more about it, I realised there was something that needed to come first.  And that something is literate programming.

Literate programming is the idea – popularised by Donald Knuth – of writing programs like pieces of literature.  They can be structured as a story or an exposition, and presented in a way that makes sense to the human rather than just to a computer.

The original implementation – known as “web” with two tools, weave and tangle – took a source document and converted either to TeX for elegant printing, or to Pascal for execution.  So the “web” input was not read directly by the compiler.

I first came across the idea of the language being prepared for literate programming with the language “Miranda”.  A Miranda program can come in a form where most text is treated as comments, and only suitably marked pieces of text (with a “>” at the start of the line) are treated as code.  I like this idea – the language should acknowledge the possibility of literate programming from the start.  I want Ocean to acknowledge it too.

Obviously a language cannot enforce a literate style, and I don’t think I would want it to try.  I am a big believer in “write one to throw away”.  The first time you write code, you probably don’t know what you are doing.  You should consider the first attempt to be a prototype or first draft, and be ready to throw it away.  There probably isn’t a lot of point trying to weave a prototype into a story.

But once you have working code and have got the algorithms and data structures sorted out in your mind, I think there is a lot of value in re-writing from scratch and presenting the whole more clearly – as a piece of literature.  So I see two sorts of programs: prototypes that just have occasional comments, and presentation code which is written in full literate style.  I find that having an audience to write to – even an imaginary one – motivates me to write coherently and close the gaps.  Presenting code as a piece of literature may equally motivate good quality.

Having two different forms at first suggested to me the idea of two different file suffixes to maintain a clear difference, but I no longer lean that way.  It might be nice to allow two different suffixes so a programmer could create “hello.ocd” – the “Ocean Draft” – first, and then create “hello.ocn”, the final program, as a separate file, but having the compiler impose different rules on the two files would likely make programmers grumpy, or at least confused.  So as long as there is a simple syntax marker at the top of the file which effectively says “this is all code”, that should be all that the compiler needs to care about.

The  key elements of literate programming are to have a language for presenting the text, a way to extract the code from among that text, and some sort of cross-reference mechanism for the code.

This last is important because presenting code as a story doesn’t always follow a simple linear sequence – it is good to be able to jump around a bit.  We might want to tell the overall story of a function, then come back and explain some detail.  The extra detailed code should appear with the extra detailed explanation.  It may be possible to use language features to do this, but at this stage of the design I am far from certain.  So I want the literate-programming model to allow rearrangement of code through cross references.

While TeX has many advantages for presenting text and produced beautiful output, I do find it a bit clumsy to use.  It is maybe too powerful and a little bit arcane.  A popular language for simple markup currently is “markdown”.  The input is quite easy to read as plain text, and there are plenty of tools to convert it into HTML which, while not particularly elegant, is widely supported and can be tuned with style sheets.

“markdown” has the added advantage of having a concept of “code blocks” already built in, so it should be easy to extract code.  Using “markdown” for literate programming is not at all a new idea.  A quick search of the Internet finds quite a few projects for extracting executable code from a markdown document.  Several of them allow for cross references and non-linear code extraction, but unfortunately there is nothing that looks much like a standard.

This lack of a standard seems to be “par for the course” for markdown.  There is some base common functionality, but each implementation does things a bit differently.  One of these things is how code sections are recognised.

The common core is that a new paragraph (after a blank line)  which is indented 4 or more spaces is code.  However if the previous paragraph was part of a list, then 4-spaces means a continuation of that list and 8 spaces are needed for code.   In the Perl implementation of markdown, lists can nest so in a list-with-a-list, 12 spaces are needed for code.

Then there are other implementations like “github markdown” where a paragraph starting with ``` is code, and it continues to the next ```.  And pandoc is similar, but code paragraphs start with 3 or more ~~~ and end with at least as many again.

On the whole it is a mess.  The github and pandoc markings are probably unique enough that accepting them won’t conflict with other standards.  However while the Perl markdown recognises nested lists, the python markdown doesn’t.  So code extracted from markdown written for one could be different from markdown written for the other.

But if I want Ocean language tools to be able to work with markdown literate programs (and I do) then I need to make some determination on how to recognise code.  So:

  • A paragraph starting ``` or ~~~ starts code, which continues to a matching line
  • An indented paragraph after a list paragraph is also a list paragraph.
  • An indented (4 or more spaces) paragraph after a non-list paragraph is code

This means that code inside lists will be ignored.  I think this is safest – for now at least.  If experience suggests this is a problem, I can always change it.
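
As a rough, line-based sketch of those three rules (the real extraction tool works on paragraphs, and the section tracking anticipates the labelling described below):

import re

def extract_code(text):
    blocks = {}                       # section name -> list of code lines
    section = "(top)"
    in_fence, fence, prev_was_list = False, "", False
    for line in text.splitlines():
        if in_fence:
            if line.strip().startswith(fence):
                in_fence = False      # a matching line ends the fenced block
            else:
                blocks.setdefault(section, []).append(line)
        elif line.startswith("#"):
            section = line.lstrip("#").strip()
            prev_was_list = False
        elif line.lstrip().startswith(("```", "~~~")):
            in_fence, fence = True, line.strip()[:3]
        elif line.startswith("    ") and line.strip():
            if not prev_was_list:     # indented code, unless continuing a list
                blocks.setdefault(section, []).append(line[4:])
        elif line.strip():
            prev_was_list = bool(re.match(r"\s*([-*+]|\d+\.)\s", line))
    return blocks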

This just leaves the need for cross-references.  Of the examples I’ve looked at, the one I like best is to use section headings as keys.  markdown can have arbitrarily deep section headings, so a level 6 or 7 section heading could cheaply be placed on each piece of code if needed.  So I’ll decide that all code blocks in a section are concatenated and labelled for that section.

While cross references are very valuable, it can sometimes be easier to allow simple linear concatenation.  An obvious example is function declarations.  Each might go in a separate section, but then having some block that explicitly lists all of those sections would be boring.  If anything, such a “table of contents” block should be automatically generated, not required as input.  So it will be useful if two sections with the same, or similar, names cause the code in those two (or more) sections to be simply concatenated.

It is unclear at this stage whether it should be possible to simply reuse a section heading, or whether some sort of marker such as (cont.) should be required.  The latter seems elegant but might be unnecessary.  So in the first instance at least we will leave such a requirement out.  Section names for code can be freely reused.

Inside a code block we need a way to reference other code blocks.  My current thinking is that giving the section name preceded by two hash characters (##)  would be reasonable.  They are mnemonic of the section heading they point to, and are unlikely to be required at the start of a line in real code.

These references are clearly not new section headings themselves as they will be indented or otherwise clearly in a code block, whereas section headings are not indented and definitely not in code blocks.

So this is what literate Ocean code will look like: markdown with code blocks which can contain references to other code blocks by giving the section name preceded by a pair  of hashes.
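
A sketch of the corresponding “tangle” step, expanding “##” references recursively from a section-name-to-lines mapping like the one built above (the helper and its names are mine, not the md2c code):

def tangle(blocks, name, seen=()):
    if name in seen:
        raise ValueError("circular reference through " + name)
    out = []
    for line in blocks.get(name, []):
        ref = line.strip()
        if ref.startswith("##"):
            target = ref.lstrip("#").strip()
            indent = line[:len(line) - len(line.lstrip())]      # keep the reference's indent
            out += [indent + body for body in tangle(blocks, target, seen + (name,))]
        else:
            out.append(line)
    return out

blocks = {
    "main": ["int main(void)", "{", "    ## print a greeting", "    return 0;", "}"],
    "print a greeting": ['printf("hello\\n");'],
}
print("\n".join(tangle(blocks, "main")))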

A non-literate Ocean file is simply one which starts with ``` or ~~~ and never has a recurrence of that string.

To make this fully concrete, and as a first step towards working code in which to experiment with Ocean, I have implemented (yet another) program to extract code from markdown.  Following my own advice I wrote a prototype first (in C of course, because I can’t write things in Ocean yet) and then re-wrote it as a literate program.

I found that the process more than met my expectations.  Re-writing motivated me to clean up various loose ends and to create more meaningful data structures and in general produced a much better result than the first prototype.  Forcing myself to start again from scratch made it easier to discard things that were only good in exchange for things that were better.

So it seems likely that all the code I publish for this experiment will be in literate style using markdown.  It will require my “md2c” program to compile, until Ocean is actually usable (if ever), in which case it will just need the Ocean interpreter/compiler, which will read the markdown code directly.

I find that my key design criteria of enhancing expression and reducing errors certainly argue in favour of the literate style, as the literate version of my first program is both easier to understand and more correct than the original.

This program can be read here or downloaded from my “Ocean” git repository at git://ocean-lang.org/ocean/.

Posted in Language Design | Leave a comment

The naming of a language

“The naming of cats is a difficult matter, It isn’t just one of your holiday games.”  The naming of programming languages is also important.  As with any project a name is needed to be able to refer to, and it inevitably will set expectations and flavour to some degree.

I’ve had a few different thoughts about names.  My first idea was “plato”.  Plato was a philosopher and is particularly known for drawing a distinction between the real and  the ideal.  All things in the real world, circles and squares and so forth, are just poor shadows of the perfect circles and the perfect squares that exist in the ideal, or “Platonic” plane.

I actually think Plato had this backwards (though I haven’t read his work so quite possibly misunderstand him and  misrepresent his ideas here).  To my mind the ideals that we think about are poor approximations to the real world which, after all, is the reality.  The process of thinking (which is what philosophy tries to understand) involves creating abstractions that map on to real world objects and events, and in trying to find the abstractions that are both general enough to be useful, and precise enough to be truthful.

I see the role of a programming language being to fill exactly this gap.  It needs to address real world problems and tasks, but does so by generalising and abstracting and treating them mathematically.  In a sense the program exists in the Platonic plane while the implementation exists in the real world, and the language has to ensure effective communication between the two.

So “plato” is not a bad choice, but it isn’t the one I’m going to use.  I actually think “plato” would be a great name for a “platform” – like “Android” or “Gnome” or whatever.  They both start “plat”…

My next thought was to name it “Knuth” after Donald Knuth who has had some influence on my thinking as you will see in future articles.  The naming of languages after dead mathematicians has some history (with Pascal and Ada at least), but as Mr Knuth is still alive, using the name “Knuth” doesn’t fit that pattern.  And it would probably be a bit pretentious to try to use such a name for a little project such as this.  So that name is out.

While dwelling on my real motivation for this language, I realised that it really is quite strongly influenced by my personal experience of the last few decades of programming.  This should be no surprise, but it is worth acknowledging.  It is easy to pretend that one is being broad minded and considering all possibilities and creating a near-universal design, but that is only a pretence.  The reality is that our values are shaped largely by our past hurts, and these can only come from our past experience.  I must admit that I am escaping from something, and that something is primarily “C”.

I’ve used C quite happily since the mid ’80s and enjoyed it, but have always been aware of deficiencies and it is really these that I want to correct.  I’ve watched other languages appear and evolve, and there have been good ideas, but I’ve not found any really convincing.  Python has a lot going for it and I tend to use it for GUI programming, but when I do it just reminds me how much I like static typing.

So this language is to be my escape from C (at least in my dreams) and should be named as such.

C is seen to be a successor of B, which in turn grew out of BCPL.  So the joke at one time was to ask whether the next language could be “P” or “D”.  Of course it turned out to be “C++”, a joke of a different kind.  And then “D” came along anyway.

What do I want to call my successor of “C”?  The answer is easily “Ocean”.  Oceans are critical to life in many ways, but dangerous too – they need to be understood  and tamed.  Oceans are big and wide with many unknown and unexpected inhabitants.  If I want an arbitrary name for something else related to “Ocean”, I can use “Pacific” or “Indian” or “Atlantic”.  And of course an “Ocean” is like a “C”, but more so.

Having admitted that Ocean will follow on from C in some ways, I should explore a little what that means.

Primarily it means that Ocean will be a compilable language.  I’m not at all against interpreting and JIT compiling but I don’t like to require them.  The runtime support code should not need to include a language parser, unless explicitly requested.  This means for example that a function like “eval”, which can be given some program text, is completely out.  Similarly interpolating variable names into strings with “…${var}…” is not an option.

Some degree of introspection is probably a good idea – I haven’t really decided yet – so it may be possible for a program to manipulate language objects.  But this must not be part of the core language and it should only exist as a library for programmers with particular needs who are willing to pay the cost.

It also means that the programmer should have a lot of control.  I’m not sure exactly what this means yet, but in general the programmer should feel fairly close to the hardware, and have an easy clear idea of when runtime support will help out and when it will stay out of the way.  Certainly the programmer should have a fairly clear idea about how their constructs use memory and CPU.

Static typing is a “must have” for me.  This is essential for the compiler to be able to find bugs, and I trust compiler coverage a lot more than test coverage (though that is important too).  There is certainly room for runtime type flexibility such as variant records, or values which can be real or NULL.  These need to be available, but they should not be the default.

So that is what C means to me: static typing, compilable, and fine control.  And that is what “Ocean” must contain – at least.

Now to be fair I must address the question of whether these early design decisions fit with my philosophy stated earlier – particularly aiding clarity and minimising errors.

Static typing is almost entirely about minimising errors.  By having types declared that the compiler can check, fewer mistakes will make it to running code.  Type declarations equally enhance clarity by making clear to the reader what type is intended for each value.

“Fine control” is sufficiently vague that it could mean anything.  I justify it by saying that it allows clear expression of precise low-level intention.

“compilability” really hinges on the lack of “eval”, though static typing is often related.  “eval” effectively permits self-modifying code, and this is extremely hard for the compiler to assert anything concrete about at all.  So I feel fairly comfortable asserting that “eval” is a great way to introduce hard-to-detect errors, so it should be avoided where possible.  If some limited form of “eval” turns out to be particularly valuable, that can certainly be revisited when the time comes.

So while my language has no content, it now has a name: Ocean, and even a website: http://ocean-lang.org/.  Anything could happen next… but it will probably be something lexical.

Posted in Language Design | 4 Comments