Parsing a two-dimensional language.

Many programming languages are essentially one-dimensional.  The parser treats them simply as a linear sequence of tokens.  A program could all be written on a single line, or with each token on a separate line, and the parser or compiler wouldn’t notice the difference.  This set of languages includes Algol, Pascal, C, Rust and many others.

Some languages are 2-dimensional in a bad way. FORTRAN is probably the best example, though BASIC is similar.  These (at least in their early forms) had strict requirements as to what could go on one line and what needed to go on a separate line.

A few languages are exploring the middle ground.  Go will treat newlines like semi-colons in certain cases, which can result in a 2-dimensional feel but brings with it some rather ad-hoc rules. Python probably makes the best attempt at 2-dimensional parsing of the languages that I have looked at.  It allows newlines to terminate statements and also uses indents to indicate the grouping of some language elements, particularly statement groups.

While I find a lot to like in Python, it seems imperfect.  I particularly dislike using a backslash to indicate line continuation.  You can avoid this in Python by putting brackets around things as a newline inside brackets is ignored.  But this feels like a weakness to me.

As I wrote in a recent article for lwn.net:

The recognition of a line-break as being distinct from other kinds of white space seems to be a clear recognition that the two dimensional appearance of the code has relevance for parsing it. It is therefore a little surprising that we don’t see the line indent playing a bigger role in interpretation of code.

This note is the first part of a report on my attempt to translate my intuition about parsing the two dimensional layout into some clear rules and concrete code.  This note deals with indents.  A subsequent note will look at non-indenting line breaks.

Indenting means continuation, newline means completion.

The two elements that give a section of code a two dimensional appearance are line breaks and indents.  Without the line break there wouldn’t be any extra dimension at all.  Without indents the code would just look like a block of text and we would be in much the same situation as FORTRAN and BASIC.

Each of these has a fairly well understood meaning.  A line break often marks the end of something.  This isn’t always true, and exactly what the “something” is might be open to interpretation, but this meaning is fairly well understood and line breaks rarely mean anything else.

An indent means some sort of continuation.  The line that is indented with respect to a previous line is in some sense part of that previous line.

To help visualise this, consider:

A B C
   D E
   F
G H

Here “D E” and “F” are clearly part of the thing which starts “A B C”. “G H” may or may not be, much as “F” may or may not be part of “D E”.  However if “D E” or even “B C” started something, then it is definitely finished by the end of “F”.  “G” might be part of something that started at “A”, but if some subordinate thing started at “B” or later, it does not continue into “G”.

As a first step to formalising this we need to be able to encode this two dimensional structure into the stream of tokens which the parser can then process.

Generating a “NEWLINE” token at the end of each line, and maybe a “SPACE” token for each space at the start of a line could capture it but not in a very helpful way.  It is not really the presence of an indent that is important but rather the change in indent.  So between “C” and “D” we need to know that the indent level increased, while between “F” and “G” it decreased.

Python uses the terms “INDENT” and “DEDENT”.  The first is technically inaccurate, as every line of an indented block is indented, not just the first.  The second is a neologism presumably derived by analogy with “increase” and “decrease” or similar.  Other options might be “UNDENT” or “OUTDENT”.  I’m not sure I really like any of these.  However “in/out” seems the best fit for what is happening so I’ll just call the tokens which report indent changes “IN” and “OUT”.

Using these two tokens, and ignoring the line structure for now, the above structure can be tokenized as:

A B C IN D E F OUT G H

This is very simple as there is only one indented block. A more complex example is:

A B C
  D E
   F G
 H
I

This would be tokenized as:

A B C IN D E IN F G OUT OUT IN H OUT I


Note the sequence “OUT OUT IN” – we close two indented sections and open another one all at one line break.  This is because “H” has a lesser indent than “D”, but a greater indent than “A”.

This tokenization captures the structure very effectively and is what I will be using. Unsurprisingly, these are exactly the tokens that can be generated by the scanner I previously described though the token names have been changed.
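
To make the indent tracking concrete, here is a minimal sketch, in Python for brevity, of how a stack of indent levels can turn lines into this token stream.  The names and the line-based input are mine, invented for this sketch; the actual scanner described in the earlier note works differently.

# Illustrative sketch only: turn indented lines into a token stream with
# IN and OUT markers, using a stack of currently open indent levels.

def tokenize(lines):
    indents = [0]                     # stack of indent levels currently open
    tokens = []
    for line in lines:
        stripped = line.lstrip(' ')
        if not stripped:
            continue                  # ignore blank lines
        indent = len(line) - len(stripped)
        while indent < indents[-1]:   # less indented: close blocks
            indents.pop()
            tokens.append('OUT')
        if indent > indents[-1]:      # more indented: open a block
            indents.append(indent)
            tokens.append('IN')
        tokens.extend(stripped.split())
    while len(indents) > 1:           # close anything still open at end of input
        indents.pop()
        tokens.append('OUT')
    return tokens

print(' '.join(tokenize(["A B C", "  D E", "   F G", " H", "I"])))
# prints: A B C IN D E IN F G OUT OUT IN H OUT I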

What to do with indents?

Now that we have these IN and OUT tokens we need to understand what to do with them.  Their significant value is that they are very strong syntactic markers.  IN and OUT are strongly paired: it is very obvious which OUT matches which IN.  This contrasts with bracketing markers like “{ }” or BEGIN and END or DO and OD.  The latter can be accidentally deleted or forgotten.  If OUT is forgotten, that omission will be obvious on every line.

I’ve recently been writing C code using the “text” mode of my editor (emacs) rather than the C mode.  This is because I’ve been experimenting with literate programming and writing the code inside `markdown` markup.

I have really noticed the lack of help with auto-indenting.  Each line is auto-indented to the same level as the previous line, which is certainly useful.  But typing `}` doesn’t automatically drop back one level of indent.  So sometimes I get it wrong.

When I do get it wrong, the compiler complains (as it should) but it is usually a cryptic complaint that is way too late.  I would much rather be told there is a problem as soon as the indenting looks wrong.  So error detection will be the first focus for exploring the use of indents.

To discuss how this works you really need a working knowledge of grammars and parsing.  If you don’t have that yet, let me suggest my earlier note on LR grammars. If you don’t have time for that now, then press on anyway, it shouldn’t be too hard.

Unlike other tokens used in an LR grammar, IN and OUT cannot simply appear in the grammar wherever they are expected.  This is because there are really too many places that an IN can appear.  An OUT must be at the end of something, but the matching IN can often be anywhere in the middle.  So the parser will need to have direct knowledge of IN and OUT tokens.  It will not “SHIFT” them onto the stack as it does with other tokens.  It has to handle them directly.

As an OUT enforces the end of something the natural thing to do when an OUT token is seen in the look-ahead is to perform REDUCE actions until the OUT is balanced by some IN that was previously seen.  This is clearly a very imprecise rule.  We need to explore some examples before we can make the rule suitably precise.

We can start looking at examples supposing a grammar containing:

Statementlist -> Statementlist Statement
               |
Statement -> if Condition : Statementlist
           | print String

A program fragment like:

if a == b :
   print "Hello"
   print "World"
print "End"

would then result in the following LR stack just as the OUT token appears in the look ahead:

if Condition : Statementlist print String   [OUT]

The IN appeared just before the Statementlist and the only possible reduction is to reduce “Statement -> print String”, so we would do that, which results in:

if Condition  : Statementlist Statement    [OUT]

Again we can reduce and still stay after the indent so now we reduce “Statementlist -> Statementlist Statement” and get to:

if Condition : Statementlist   [OUT]

Now the symbol on the top of the stack is where the indent started, so we need to stop reducing.  It might seem appropriate to reduce this fully to a “Statement” but we cannot do that based on just the indent.  Consider a different language with a fragment like:

if ( a == b ) {
    print "Hello";
    print "World";
}

In this case it is clear that the OUT itself is not enough to close the whole statement.  Resolving this will have to wait until we explore how line-structure works.

So the approach for handling IN and OUT seems to be that when we see an IN, we mark the next symbol SHIFTed as being indented.  Then when we see OUT we stop SHIFTing and REDUCE until the symbol on the top of the stack is marked as indented.  We can then clear that marking, discard the OUT, and continue.

This is close to a working solution but there are a few complications that need to be resolved first.  One relates directly to the fact that “Statementlist” is recursive.

Statementlist -> Statementlist Statement

This means that reducing until we just have “Statementlist” on the top of the stack doesn’t stop more things being added to that Statementlist. The idea was that the OUT should completely close the Statementlist but because of its recursive nature that isn’t really possible.

This is easily fixed by introducing a second “dummy” nonterminal.

Statementlist -> _Statementlist
_Statementlist -> _Statementlist Statement
                |

This set of productions will parse exactly the same language but will introduce an extra state.  If we see an OUT token we can reduce back to a “_Statementlist” as we did before, and then go one more step and reduce to the new “Statementlist”.  Once there we really have closed the list.  Another statement cannot extend it like it could in the simpler grammar.

These extra “dummy” non-terminals could easily be added by the parser generator itself.  However they solve one problem by introducing another.  Previously we could simply REDUCE until the top of stack carries an IN, and then cancel the IN.  Now we might need to REDUCE one more step – the “Statementlist -> _Statementlist” production above. How can we know when to stop?

The answer to this requires that we store one more bit (literally) of information in the state stack.  We currently have a count of the number of indents which appear “within” (the full expansion of) the symbol in the stack entry.  By that we mean the number of terminals which were reduced into this symbol and were immediately preceded by an IN. To that we must add a flag to record whether the first terminal was preceded by an IN.  i.e. is the whole symbol indented or just part of it.

Now the rule will be that if we have an OUT token in the lookahead buffer and the top-of-stack token contains an IN that was not at the start then we can simply cancel the IN and discard the OUT.  Further, if the top-of-stack contains only one IN and it was at the start then we need to look at the production that we can reduce.  If that production contains more than one symbol in its body then again we can cancel the IN and discard the OUT.  In this case, that IN would be not at the start of the symbol we would reduce to, so it is somewhat similar to the first case.

Another way to look at this is to say that when we see an OUT token, we reduce until the next reduction would contain an internal (i.e. not at the start) IN.  Once we find this internal IN we cancel it and discard the OUT.

Indenting affects error handling in two ways.  Firstly we need to know what happens if an OUT implies a reduction is needed, but the current state does not allow a reduction.  Secondly we need to know how to handle IN and OUT tokens if they are found while discarding input tokens during normal error recovery.

In the first case, the fact that we want to reduce suggests that we must have part of the right hand side of a production but not all of it.  It would be nice to keep what we have, so to complete it we need to synthesize the extra tokens needed to make a complete production.  If there is no reducible production at this state then there must be symbols that can be shifted.  So we can synthesise one of these – the one that gets us as close to the end of a production as possible.  This may result in an ‘unknown variable’ or similar which would need to be carefully suppressed, but would allow any valid units within the production so far to be properly analysed.

For the second case we can simply count the INs and matching OUTs until we find a SHIFTable symbol or find more OUTs than INs.  If we find a SHIFTable symbol, we set the IN count for that symbol to the net number of INs.  If we find an extra OUT it means that we must have come to the end of whatever high level non-terminal was in error.  We just leave the ERROR token on the stack without shifting anything else and move on to normal handling for an OUT token.  This will either reduce whatever is on the stack, or will synthesize some token to allow parsing to continue.

It should be noted here that I haven’t experimented a lot with this error handling process.  It seems to make sense, but often things which seem to make sense actually don’t.  So this might need to be revised.

This then is the complete description for how to handle IN and OUT tokens.  In summary:

  1. Each frame in the LR state stack must contain a count of the number of IN tokens it contains, and a flag to record if one of these is at the start.
  2. When we see an IN token we record the fact and discard it.  When we SHIFT in a new terminal we check if we just saw an IN.  If so the new entry in the stack records that one IN is present, and that it is at the start.
  3. When we reduce a series of symbols to a single non-terminal we add up the number of included IN tokens in each and use that as the count for the new token.  The “one at the start” flag from the first symbol is inherited by the new symbol.
  4. When we see an OUT token we check if the top-of-stack contains a “not at the start” IN token, or if it contains an “at the start” IN but the next reduction would make it “not at the start”.  If either of these is true we decrement the IN counter and discard the OUT.  Otherwise we REDUCE a production if we can, and synthesize some suitable symbol to SHIFT if we cannot.
  5. When we discard stack states during error handling, we count the number of indents and add them to the “ERROR” state that we ultimately SHIFT.
  6. When we discard look-ahead tokens during error handling we keep count of any IN and OUT tokens.  If we see an OUT token when the count is zero, we assume a sane state has been found and return to where step 4 above is handled.  Otherwise the IN count of the token which ultimately resolves the error is set to the balance of INs and OUTs.

The net result of this is that an OUT always closes anything that was started at or after the matching IN, but nothing that was started before the matching IN.
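
As a sketch of how rules 1 to 4 might look inside the parser, here is some illustrative Python.  The names (Frame, indents, starts_indented and so on) are mine, invented for this sketch; they are not taken from the actual implementation, which keeps the equivalent information in its LR state stack.

# Illustrative sketch of summary rules 1-4.  A stack frame counts the IN
# tokens contained in its symbol and remembers whether one was at the start.

class Frame:
    def __init__(self, sym, state, indents=0, starts_indented=False):
        self.sym = sym
        self.state = state
        self.indents = indents                    # rule 1: count of INs within
        self.starts_indented = starts_indented    # rule 1: is one at the start?

def shift(stack, sym, new_state, saw_in):
    # rule 2: a newly SHIFTed terminal carries any pending IN, at its start
    stack.append(Frame(sym, new_state, 1 if saw_in else 0, saw_in))

def reduce(stack, head, body_len, new_state):
    # rule 3: the new non-terminal sums the INs of the popped symbols and
    # inherits the "at the start" flag of the first of them
    popped = stack[len(stack) - body_len:]
    del stack[len(stack) - body_len:]
    indents = sum(f.indents for f in popped)
    starts = popped[0].starts_indented if popped else False
    stack.append(Frame(head, new_state, indents, starts))

def handle_out(stack, next_reduction_len):
    # rule 4: cancel an IN that is internal to the top-of-stack symbol, or
    # one that the next reduction would make internal
    top = stack[-1]
    internal = top.indents > (1 if top.starts_indented else 0)
    would_become_internal = (top.indents == 1 and top.starts_indented
                             and next_reduction_len > 1)
    if internal or would_become_internal:
        top.indents -= 1
        if would_become_internal:
            top.starts_indented = False
        return True        # the OUT has been consumed
    return False           # caller must REDUCE (or synthesize a symbol) and retry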

This adjunct to LR parsing can be used in two different but related ways.  Firstly as already noted it can improve error reporting.  If a ‘closing’ token like “}” is missing then that will be noticed immediately.  The syntactic unit that it is meant to close (e.g. Statementlist) will already have been reduced by an OUT and if the “}” doesn’t follow a regular LR parsing error will occur.

Secondly, and more interestingly for language design, an ambiguous grammar can be cleanly resolved by treating indentation as meaningful.

The classic “dangling else” case can be easily resolved by the indentation of the ‘else’.

if (COND1)
    if (COND2)
        StatementA;
else
    StatementB;

is no longer ambiguous.  The “else” clearly relates to the first “if”, not the second.

Ocean will make use of this to have a syntax for “blocks” of statements which does not require a closing token.  They will be closed by a reduction in indenting.  As explicit closing can sometimes be preferred, Ocean will also allow C-style blocks.  Possibly that should be “Go-style” blocks, as a block cannot be just a single statement.

The syntax I’m experimenting with at the moment includes:

Block -> : Statementlist
       | { Statementlist }
Statement -> if Expression Block
           | if Expression Block ElsePart
           | ...
ElsePart -> else Block
          | else if Expression Block
          | else if Expression Block ElsePart

This allows structured statements that look like Python or like Go, and will use indentation to resolve ambiguities in the Python style, and report errors in the Go style.

My intention is that several other structures will allow either “{}” or “:..” formats.  “struct” type declarations are an obvious one.  Others might occur as design proceeds.  However before we can proceed with those details we need to finish the two-dimensional parsing which is only half done.  This note dealt with indents, the next note deals with end-of-line markers.

Meanwhile code that manages indents as described here can be found in `git://neil.brown.name/ocean` or this version of the parser generator.

Wiggle 1.0

About 11 years ago I started writing “wiggle”.  I have finally released version 1.0.

Wiggle is a tool for applying patches with conflicts.  I do quite a bit of applying patches to a release different to the one they were created for.  This often works, but it also often fails completely so that the task must be done entirely by hand.

But sometimes – and often enough to be worth automating – the ‘patch’ program cannot apply the patch successfully, but the failure is due to small differences that can automatically be detected and ignored.  This is where “wiggle” comes in.  It looks at the patch word-by-word, and wiggles that patch in if at all possible.

I got wiggle to the point where this basic functionality worked a long time ago.  But I found it less than satisfactory.  Often “patch” would fail and “wiggle” would succeed and I would want to know why.  And “wiggle” didn’t really tell me why.

This led me on the path of writing “browse” mode for wiggle. This allows me to look at the merge and see the different bits highlighted so I can see what confused “patch” and decide if it was reasonable for wiggle to silently fix it.

I then found that I wanted to be able to “fix” things up while in browse mode.  Some little conflicts that wiggle cannot fix automatically still have simple fixes that I could determine based on my understanding of the code.  Figuring out how to do that was not easy.  I knew in general terms what I wanted to achieve, but had no real idea of the concrete steps.

So much of the last few years, when I have had the opportunity, has seen me improving various aspects of the browser, making it easier to move around, handling various corner cases, and getting enough groundwork in place that I could usefully experiment with editing. Only recently have I taught wiggle some very basic commands for editing.  I don’t really know if they are enough, but they will do for now.

There are basically 3 things you can do.

  1. You can mark a conflict or change as “unchanged”.  This effectively ignores part of the patch and leaves the original as it is.
  2. You can mark a conflict or ‘unmatched’ as ‘changed’.  This effectively discards the original and keeps what was in the patch, even though it might not have lined up properly.
  3. After the above has proven to be insufficient, you can open the merge result in an editor and fix the rest up by hand.

Obviously the last possibility has always been an option, though not directly from wiggle.  This is how I most often work with failed patches.  I apply “wiggle” then look at the results in an editor.  Allowing the editor to be invoked directly from wiggle should streamline the process a little.  Allowing some simple changes beforehand should make it even better.  Only time will tell.

So wiggle 1.0 is out there, in git://neil.brown.name/wiggle/ or http://neil.brown.name/wiggle/.  If you want to know more, you can see the talk I recently gave at LCA2013.

Parsing LR grammars

It is time for another diversion, but it is leading towards the language design – honest.

LR grammars are a tool for specifying the grammar for a language, and it is fairly easy to automatically generate a parsing tool from a grammar.  So they have often been used for this purpose.

There seems to be some suggestion that LR is no longer the tool of choice (see Wikipedia), apparently because it is hard to do good error reporting.  The gcc compilers converted from an LR grammar to a recursive descent LL grammar, apparently for this reason.

However I like LR grammars and I don’t see the problem.  So I plan to start out with an LR grammar approach.  Either it will work well, or I will start to see the problem, and either outcome is positive in my view. So I cannot lose.

There are plenty of tools available for converting an LR grammar into code, of which GNU “bison” is probably the best known, but I’m going to write my own.  Partly this is because it will help me learn, and partly because it will add to the collection of code that I have recent experience with so that when Ocean starts to take shape I will have lots of example code to try translating to it.  But mostly because it is fun.

So this post is about LR grammars and building a tool to parse them.  You’ve probably read all this before elsewhere, so I suggest you stop reading now and go do something useful like play sudoku.  Come back in a few weeks when I will have moved on.

However, it is also possible that you have tried to read about this and, like me, failed to find anything really accessible.  There are academic papers that are full of dense mathematics which makes an interesting topic appear unapproachable.  And there are lecture notes which seem to skim over the really interesting details.  And of course there is Wikipedia, which isn’t exactly bad, but isn’t exactly good either.  If that is you, then please read on and see if my contribution helps you at all.

LR Grammar Basics

A grammar is a set of rules for mapping between structured data and a linear representation of that data.  The grammar can allow you to “produce” a linear form from a structure, or “parse” a linear form to recover the structure.

We use grammar whenever we convert thoughts and idea to sentences and stories.  We use the same grammar in reverse when we convert sentences and stories that we hear and read back into thoughts and ideas.  Of course the grammar of natural language is very complex and context-dependent, and each individual works with a slightly different version of the grammar.  So the conversion of ideas to and from natural language is imperfect and error prone.  It is however extremely powerful.  It allows me to write this note, and you to get at least some of my ideas while reading it.

For computer languages we use much simpler but much more precise grammars.  Noam Chomsky famously did a lot of work in formalising many ideas of grammar, and we have him to thank for the “context free” grammars that work so well for programming languages.

A context free grammar (CFG) consists of a set of symbols and a set of productions.  The symbols are grouped into two subsets: the terminal symbols and the non-terminal symbols. There is one distinguished non-terminal symbol called the “start” symbol.  The grammar describes a “language” which is the set of all possible “sentences” which in turn are lists of terminal symbols.  It does this by giving rules for producing parts of sentences.  Any sentence which can be produced following these rules is a valid sentence in the language. All other sentences are invalid.  To avoid confusion, we normally reserve the word “produce” for when a single symbol produces a series of other symbols.  Repeating this multiple times is a “derivation”.  A complete derivation starts with the “start” symbol and results in a list of terminal symbols.

Terminal symbols are the elements of a sentence.  These could be strokes of a pen, letters, words, or conceptual groups of words such as “noun” and “verb”.  For the purposes of computer languages we normally work with “tokens” as terminal symbols where tokens are the things extracted by the scanner described in my previous note.

A grammar takes a sequence of tokens and groups them into semantic units such as expressions and statements.  It essentially describes how to extract structure from  a linear sequence of tokens.

“Tokens” here are little more than opaque objects represented by numbers.  There is often extra detail combined with the number, but the grammar doesn’t care about that.  If the grammar needs to know the difference between two things, they need to be different tokens.  If it doesn’t then maybe they don’t need to be (though it doesn’t hurt).

The grammar describes how to gather sets of tokens (aka terminal symbols) together as what might be called “virtual tokens”.  So the set of tokens “NUMBER PLUS NUMBER” might become a virtual token “EXPRESSION”.  It can then be included in some other grouping of tokens.  These “virtual tokens” are the things referred to earlier as “non-terminal symbols”.

A grammar uses these terminal and non-terminal symbols in the set of “productions”.  Each production declares that some non-terminal symbol may produce a given list of symbols (whether terminal or non-terminal).

This terminology (‘start’ and ‘production’ and ‘terminal symbol’) doesn’t make much sense if you want to parse a program, but it does if you want to generate a program, or “sentence” as described earlier.  From the perspective of generation, a symbol is “terminal” if it cannot generate anything more, and non-terminal if it can.

When we talk about parsing, we use the same terminology because that is what everyone else does.  I would prefer “real tokens” for terminal symbols, “virtual tokens” for non-terminal symbols, “reductions” for productions and “goal token” for start symbol.  But you gotta work with what you have.
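
Before moving on it may help to see a grammar written down as plain data.  This is only a sketch in Python; the representation (a list of (head, body) pairs) is mine, not the one the parser generator actually uses.  The grammar shown is the expression/term/factor grammar that appears again later.

# Illustrative sketch: a context-free grammar as a list of productions.
# Terminals are simply the symbols that never appear as the head of a
# production.  The start symbol here is "start".

productions = [
    ("start",      ("expression", "EOF")),
    ("expression", ("expression", "+", "term")),
    ("expression", ("term",)),
    ("term",       ("term", "*", "factor")),
    ("term",       ("factor",)),
    ("factor",     ("NUMBER",)),
]

nonterminals = {head for head, body in productions}
terminals = {sym for _, body in productions for sym in body} - nonterminals
print(sorted(terminals))        # ['*', '+', 'EOF', 'NUMBER']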

LR parsing basics.

The process described above for generating sentences can process symbols in any order. Once you have more than one non-terminal you have a choice. You can produce something from the first, or from the second.  So you could build up a sentence from the beginning, or from the end.  When parsing you need to be more focused.

The “L” in “LR” means that parsing progresses from Left to right.  You certainly could parse in some different direction, but the language you design for that sort of parsing would end up being very different.  An “LR” grammar converts the leftmost groups of terminals into fewer non-terminals first and then progresses to include more and more terminals moving to the right.

We will get to the “R” shortly.  I suspect that originally “LR” meant “Left to Right”. But there are two distinct ways to parse from left to right: one is recursive descent, which requires a different sort of grammar to the one described here – an LL grammar.  For LR grammars you use a bottom-up shift/reduce parser.

This shift/reduce parser was invented by Donald Knuth and involves a state machine – with a bunch of states – a stack (of symbols and states) and two actions named “SHIFT” and “REDUCE”. One particular instance of “REDUCE” – the one that involves the START symbol – is sometimes referred to as “ACCEPT”.  I see no value in making such a strong special case of that, so I keep just “SHIFT” and “REDUCE”.

If we think back to the previous description of deriving a sentence from the start symbol using the grammar, and consider the intermediate stages between “start” and the final sentence, then those lists of symbols (sometimes called “sentential forms”) are represented in the parser by the stack combined with the remainder of the input.  The symbols from the bottom of the stack through to the top of the stack, followed by the “next” symbol to read from input through to the last symbol in the input, correspond exactly to the sequence of symbols in some intermediate step when deriving the program.

The “production” step in sentence generation is exactly mirrored in the “REDUCE” step in parsing.  Specifically if the top symbols on the stack match the symbols in the body of some production, then we might pop them off and replace them with the non-terminal which is the ‘head’ of that production.

The “SHIFT” step simply involves moving one terminal symbol from the input onto the stack.   Parsing proceeds by shifting terminal symbols onto the stack until we have a set of symbols that can usefully be reduced, then reducing them. This continues until the entire program has been read (or “SHIFTed”).  On each “REDUCE” step the parser will record information about the symbols that were popped so as to build up an abstract syntax tree which records the structure of the program.  So each entry on the stack holds not just a symbol and a state, but also some record of what was parsed down to that symbol.

At every “REDUCE” step, the non-terminal that results is the right-most non-terminal in the interim sentence (because it is on the top of the stack, and everything further to the right is terminals still to be read).  So if the series of steps that result from LR parsing were played in reverse as a derivation, each step would involve producing something from the right-most non-terminal.  This is where the “R” in “LR” comes from. The parse is the reverse of a “Rightmost derivation” from the grammar.  This contrasts with “LL” grammars where the parse follows a “Leftmost derivation”.

There is an important word above that needs special attention.  It is “usefully” as in “until we have a set of symbols that can usefully be reduced”.  It is important that after we perform the REDUCE step, the resulting interim sentence can still be derived in the language.  So we always need to know what symbols are permitted at each stage in the parse.  This is what the “state” of the state machine provides us with.

A “state” really belongs between two symbols.  The “current” state is between the symbol on the top of the stack, and the next symbol to be read from the input.  When we push a symbol onto the stack, we push the current state with it, so that when that symbol is later popped, we can collect the state that preceded it too.

LR parsing step by step

Now that we have the big picture, the best way to understand the details is to watch how the parsing works.

The data that we have are:

  • The grammar, as a set of productions.  Each production has a ‘head’ which is a non-terminal symbol, and a ‘body’ which is zero or more symbols, some may be terminal, others non-terminal.  Each non-terminal may be the head of multiple productions, but must be the head of at least one.
  • An ‘action’ table which is a mapping from a state plus a lookahead to an action.
    The ‘state’ we have already introduced and will be described in more detail shortly. The “lookahead” is the next terminal to be read.  As we need to be able to look at the next symbol, the parser is said to have a look-ahead of “1” and is sometimes written “LR(1)”.
    The ‘action’ is either “SHIFT”, or “REDUCE(production)”.  i.e. the “REDUCE” action must identify which production is being reduced.  If the production has “START” as its head, then it is sometimes referred to as “ACCEPT”.  Once the ACCEPT action happens we have parsed a complete sentence and must have finished.
  • A “go to” table.  This is a mapping from state and symbol to a new state.  Whenever we push a symbol, whether due to a SHIFT or a REDUCE, we perform a lookup on the current state for the pushed symbol and that gives us the new state.

Given these data the parsing process is quite straightforward.

  1. Start with an empty stack, with the “0” state current, and with the first symbol from the input in the look-ahead buffer.
  2. Look in the ‘action’ table for the current state and the symbol in the look-ahead buffer.
  3. If the action is “SHIFT”, push the token and the current state onto the stack, look up the shifted symbol in the go to table for the current state, and save the result as the current state.  Then go back to step 2.  Note that we don’t actually need to store the symbol in the stack, but it can be useful for tracing.
  4. If the action is “REDUCE”, see how many symbols are in the body of the reduced production, and pop that many entries off the stack.  The state stored with the last entry popped becomes the current state.  The entries popped must match the symbols in the production – the states will guarantee that.  This is why we don’t really need to store the symbols.
    Having popped those entries, we do any production-specific processing to build an abstract syntax tree and push the result on to the stack together with the current state (which was updated by the ‘pop’).
    Finally we look up the current state and the head of the reduced production in the go to table and set the new state.  Then if that reduced symbol was ‘start’ we finish, else we go back to step 2.
  5. But what if the action table doesn’t contain an entry for the current state and the symbol in the look-ahead buffer?  That means there is an error.  The symbol is “unexpected”.  We have to invoke the error handling procedure.
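
Written out as code, the loop above might look like the following sketch (Python, in the same illustrative style as the earlier grammar sketch).  The ‘actions’ and ‘go to’ tables are assumed to have been built already; how that is done is the subject of the rest of this note, and step 5 is reduced to a bare exception here.

# Illustrative sketch of steps 1-5.  'actions' maps (state, lookahead) to
# either "SHIFT" or ("REDUCE", (head, body)); 'goto' maps (state, symbol)
# to a state.  The token stream is expected to end with an end-of-file token.

def parse(tokens, actions, goto, start="start"):
    stack = []                              # entries: (state before symbol, tree node)
    state = 0                               # step 1
    tokens = iter(tokens)
    lookahead = next(tokens)
    while True:
        action = actions.get((state, lookahead))        # step 2
        if action == "SHIFT":                           # step 3
            stack.append((state, lookahead))
            state = goto[(state, lookahead)]
            lookahead = next(tokens)
        elif action is not None:                        # step 4: a REDUCE
            head, body = action[1]
            popped = stack[len(stack) - len(body):]
            del stack[len(stack) - len(body):]
            if popped:
                state = popped[0][0]        # state stored with the last entry popped
            node = (head, [p[1] for p in popped])       # crude syntax-tree node
            stack.append((state, node))
            if head == start:
                return node                 # the "ACCEPT" case
            state = goto[(state, head)]
        else:                                           # step 5: error
            raise SyntaxError("unexpected symbol %r" % lookahead)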

Error handling in an LR grammar

The easiest way to handle errors is just to throw our hands up in horror, report the error, and abort.  But that is not always most useful.  The preferred approach is to try to resynchronize with the grammar by throwing away a few symbols.  Often the error will be localized and if we can just get past it we can start doing useful parsing again.

Some writers suggest synthesizing symbols as well as discarding them.  For example you might guess that a semicolon was missing and so insert it.  I don’t believe that idea adds anything useful.  As you will see we do synthesise a single symbol which is the “error” token.  Various error productions can be used to transform this into any nonterminal which may be needed.  So while symbols might be inserted as well as deleted, that is just a side effect of the error productions in the grammar.  The parser mainly focusses on discarding unusable symbols.

At the point where we detect an error there are two obvious symbols that could be discarded.  We could discard the symbol in the look-ahead buffer, or the symbol on the top of stack.  Or both.  Or several in each direction.  Ultimately we want to get to a situation where the state, possibly popped off the stack, has an action for the symbol in the look-ahead buffer.  Choosing what to discard to get there cannot be done reliably without assistance from the language designer.

The language designer should “know” what the key structural elements and syntactic flags are.  Some symbols will occur often and have a strong meaning, so aiming for one of those will be most likely to get back on track properly.  The language designer can provide this information to the parser by specifying “error productions”.

“Error” is a special terminal symbol that would not normally be found in the input stream. However it can appear in the body of a production and this makes it an “error production”.

A simple example is:

statement -> ERROR ";"

This says that an error followed by a semicolon can reasonably take the place of a statement. i.e. a semicolon is a strong syntactic symbol that nearly always means the end of a statement, even if there is garbage before it.

With a number of these error productions in the grammar, the task of error handling becomes quite easy.

Firstly the parser pops entries off the stack until the current state has an ACTION for ‘ERROR’.  That means it is a state which can cope with an error production.  In the above example, it is a state which is expecting a statement.

Then the parser shifts in the ERROR symbol (updating the state with the appropriate ‘go to’)  and then starts discarding input symbols until it finds one for which the current state has an action.  At this point the parser is synchronized again and it can go on to perform the appropriate action as steps 2 and later above.
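
In the style of the earlier parsing sketch, the recovery procedure might look like this.  Again the table layout and names are my own assumptions for illustration.

# Illustrative sketch of the error recovery just described.  ERROR is a
# special terminal that only appears in error productions.

def recover(stack, state, lookahead, tokens, actions, goto, ERROR="ERROR"):
    # pop entries until we reach a state with an action for ERROR
    while (state, ERROR) not in actions and stack:
        state = stack.pop()[0]
    if (state, ERROR) not in actions:
        raise SyntaxError("cannot recover")
    # shift in the synthesized ERROR symbol and follow the go to table
    stack.append((state, (ERROR, [])))
    state = goto[(state, ERROR)]
    # discard input symbols until the current state has an action for one
    while (state, lookahead) not in actions:
        lookahead = next(tokens)    # StopIteration here means we hit end of input
    return state, lookahead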

Building the tables

All that remains is to build the state table and the action table.  These will require manipulating various sets, so you better dig something suitable out of your ADT library.

State and the go to table.

The purpose of the ‘state’ is to be a record of how we got to where we are.  As there are a collection of states on the stack, the current state doesn’t need to be a complete record – a partial record can be sufficient. So as we build a description of each state we may need to throw away information. When we do that we must acknowledge it and be clear why that information can be discarded.

“where we are” is at some point in some production.  We have (hopefully) already seen some of the symbols in the body of some production, and we may have some more that we are expecting.  However we could concurrently be at different points in different productions.  This can happen in two ways.

Firstly, and most simply, if we are in the body of some production just before the non-terminal “X”, then we are simultaneously at the start of every production with “X” as its head.

Secondly, we may not yet know which of several productions we are in.  There could easily be two or more productions that start with some set of symbols, but finish differently.  While we are in that common prefix we need to assume that we are in all of the productions at once, and as more symbols become apparent, the number of possible productions we are in will decrease until it is just one.  That will be about when we REDUCE the production into a non-terminal.

To model these ideas we define a “state” to be a collection of “items”. An item is a particular production combined with an offset in the body of that production.  The offset is sometimes referred to as “DOT” so an item might be presented as:

expression -> expression "+" . term

This is a production which says that an “expression” can produce an “expression” followed by a “+” and then a “term”.  Here the production has been made into an item by placing a “dot” which you can see between the “+” and the “term”.

This item says that we might be in the process of parsing an expression and have seen an expression and a plus sign.  Of course we might be somewhere else where a “+” can follow an expression too, depending on what other items are in the state.

So a state is a collection of these items.  They are all the possible positions in a derivation that we could be in.  For all items in a state, the symbol immediately before DOT will be the same, unless DOT is at the start of the item.  Other symbols might be different.

There are three different sorts of item, each important in their own way.  They are distinguished by what follows the DOT.

A “completed item” has nothing following the DOT.  DOT is at the end of the production.  A “terminal item” has a terminal following the DOT.  A “non-terminal item” has (you guessed it) a non-terminal symbol following the DOT.  See if you can guess how each of these is used?

We find all possible states by a repetitive process which finds all possible “next” states from a given state, and keeps doing this to the known states until no more new states can be found.

The starting point of this search is the state formed from all productions that have the “start” symbol as the head.  For each such production (normally there is only one for simplicity) we form an item with DOT at the start.  This collection of items form the “zero” state, which is the state we start at when parsing.

Given a state we find other states by following a 2-step process.

Firstly we “complete” the state. For every non-terminal item in the state, we find all productions with that non-terminal as the head and add an item for each such production with DOT at the start (if that item isn’t already present). This covers the first way that we can have multiple items in a state.

Once we have added all possible “DOT-at-the-start” items we make a list of all symbols which follow “DOT” in any production.  For each of these symbols we can potentially make a new state (though it is possible the state already exists).

To make a new state given a completed state and a symbol we discard all items for which DOT is not followed by the given symbol, and advance DOT forward one for all items where DOT is followed by the given symbol.  If this state has not already been added to the state list, it must be added, otherwise the pre-existing state with the same set of items must be found.

For each symbol that produces a state from our completed state, we add an entry to the go to table, from the old state via the symbol to the “new” state.

As noted we discard items when creating a new state.  This must be justified.

All the discarded terminal and non-terminal items will still appear in some other state which is created from the current one.  So they aren’t lost, just elsewhere.  This reflects the fact that as we consume more symbols from the input we narrow down the range of possible parses, so we must naturally narrow the breadth of the state.

The completed items which are discarded are different.  No matter which new state we go to, all of those items will be gone, so the history they record of how we got here will be gone too.  However we still have it further up the stack.  If the current state has dot at the end of a production, then the state on top of the stack will have dot before the last symbol in that same production, and the state below that will have dot before the second last symbol, and so on.  Somewhere there will be DOT at the start of that production, and in that state will be recorded the production which led to this one.  And so on up the production chain, such that the first state, which records the START production, will still be on the stack at the very bottom.  So dropping these doesn’t lose any information as it is still on the stack.

Once we have applied the above 2-step process to all states, and doing it to the last state didn’t create any new states, we have a full set of states, and we have a complete go to table.
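
The construction can again be sketched in a few lines of Python.  An item is represented as a (production-index, DOT-offset) pair and a state as a frozen set of items; these representations are mine, chosen for brevity.

# Illustrative sketch of the 2-step state construction.  'productions' is
# a list of (head, body) pairs as in the earlier sketch.

def closure(items, productions):
    # step 1: complete the state - add a DOT-at-the-start item for every
    # production of every non-terminal that appears just after a DOT
    items = set(items)
    added = True
    while added:
        added = False
        for prod, dot in list(items):
            head, body = productions[prod]
            if dot < len(body):
                for i, (h, _) in enumerate(productions):
                    if h == body[dot] and (i, 0) not in items:
                        items.add((i, 0))
                        added = True
    return frozenset(items)

def build_states(productions, start="start"):
    states = [closure({(i, 0) for i, (h, _) in enumerate(productions)
                       if h == start}, productions)]
    goto = {}
    for state in states:                      # 'states' grows while we loop
        # step 2: for each symbol after a DOT, advance the DOT and form a
        # (possibly already known) new state
        symbols = {productions[p][1][d] for p, d in state
                   if d < len(productions[p][1])}
        for sym in symbols:
            nxt = closure({(p, d + 1) for p, d in state
                           if d < len(productions[p][1])
                           and productions[p][1][d] == sym}, productions)
            if nxt not in states:
                states.append(nxt)
            goto[(states.index(state), sym)] = states.index(nxt)
    return states, goto

Feeding in the productions list from the earlier sketch gives the states and go to table for that small expression grammar.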

Hopefully the process of creating these states makes sense in the context of how the go to table is used during parsing.  A single state records all the productions that could be happening at each point in the parse so far, and the set of states in the stack form a complete record of where we are up to, and what can happen next.

In particular, by examining the current state we can know what actions are possible at each different point. This information is used to construct the “action” table.

The Action table

To produce the action table for a particular state, we start by examining each of the items in that state.  We have already done all that we can with the “non-terminal” items, so it is just the “terminal” and “completed” items that interest us now.

For each terminal item a reasonable action is to SHIFT the next input symbol if that symbol is the symbol after DOT.  So all the symbols that appear after DOT are added to the action table with “SHIFT” as their action.

Note that it isn’t really necessary to store  these in the action table.  The “go to” table for that state has exactly the same set of terminal symbols in it (no more, no less), so we could instead look for the look-ahead symbol in the “go to” table.   If it is present we assume the “SHIFT” action.

For each completed item, a reasonable action would be to REDUCE that particular production.  But we don’t currently know how to store these in the action table – which look-ahead symbol to associate them with.  Also it might be appropriate to REDUCE a production rather than SHIFT a new symbol.

It is in the different ways to address this question that the different flavours of LR grammars become apparent.  So we will address a few of them individually.

LR(0) grammars

LR grammars are sometimes called LR(k) grammars where “k” is the number of symbols that we look ahead in order to decide the next action.  Most often LR(1) grammars are used.

An LR(0) grammar is one in which no look ahead is done.  For this to work there must be no questions raised by any state.  That is, a state may either have exactly one completed item, or some number of terminal items.  It may never have both and may never have two different completed items, as these would require a decision and we have no look ahead and so no basis to make a decision.

LR(0) grammars are too simple to be really interesting.  They require every production to end with a terminal which doesn’t occur anywhere  but at the end of productions.  So if every production was enclosed in brackets that would work.  However that nearly limits LR(0) to Lisp, and Lisp doesn’t really need a shift/reduce parser at all.

So LR(0) grammars are interesting primarily as a starting point, not for practical use.

LR(0.5) grammars

This is a label that I made up myself, as I think it describes a useful set of grammars but I have not seen it described elsewhere.  Where an LR(0) grammar uses no look-ahead, an LR(0.5) grammar uses one symbol of look-ahead, but only half the time.  Look ahead can guide  the choice whether to SHIFT or REDUCE, but not the choice of which production to reduce.

So for a grammar to be LR(0.5), each state may have at most one completed item in its item set.  When that is the case, all terminal items cause a SHIFT entry for the appropriate terminal to be added to the Action table, and all other possible entries in the Action table are set to REDUCE  the one completed item.  If there is no completed item then all other entries in the Action table lead to error handling.

As before, you wouldn’t actually bother storing this Action table. You would just record the one REDUCE action as a default.  Then if the look-ahead symbol is present in the go to table, the parser performs SHIFT, otherwise it performs the default REDUCE or triggers an error.
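
Continuing the sketch, choosing the per-state default action for an LR(0.5) grammar is tiny.  A state here is the set of (production-index, DOT) items built in the earlier sketch.

# Illustrative sketch: the default (REDUCE) action for one LR(0.5) state.

def default_action(state, productions):
    completed = [p for p, d in state if d == len(productions[p][1])]
    if len(completed) > 1:
        raise ValueError("not LR(0.5): two completed items in one state")
    if completed:
        return ("REDUCE", productions[completed[0]])
    return None        # no default: SHIFT if the go to table allows, else error

# At parse time: SHIFT when the look-ahead is in the go to table for the
# current state; otherwise take the default action; otherwise report an error.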

I have seen it suggested that a weakness of this approach is that it doesn’t detect an error at the earliest possible moment.  We may have to wait for several reductions to happen until we hit a state with no possible reductions, and which does not have a ‘go to’ for the current look-ahead token.  Again, I don’t really understand this logic – those reductions don’t consume any input symbols so the error is still detected at the correct place in the input text. But maybe I’ll understand better once I’ve had some more experience, so stay tuned.

In my experiments with a grammar for Ocean, I’ve found that I never have two completed items in the one state.  So a grammar such as this could well be sufficient to parse the language I will end up with.

This highlights an important aspect of my focus here.  I don’t particularly need a grammar parser that is suitable for any language I get handed.  I just need a parser which will handle the language that I will develop.  And I plan to develop a language with a very simple and regular syntax – one that is easy enough for a mere human to understand.  So I don’t want a lot of complexity.  So while LR(0.5) may not be good enough for a lot of use cases, it could still be good enough for me.

However it isn’t quite.  Not because my language won’t be simple, but because LR(0.5) cannot tell me exactly how simple my language is.

This is a bit like the fact that a lot of syntax in programming language isn’t needed for correct programs.  It is only needed for incorrect programs to help the programmer see exactly where it is incorrect.  Similarly if I get my grammar right, then LR(0.5) should be enough.  But if I get it wrong, I may not know if I only do an LR(0.5) analysis of it.

So I need more power in my grammar analysis and so must move on.

Precedence LR Grammars

This is a bit of an aside as Precedence grammars don’t help detect errors and inconsistencies in a grammar, they only help to fix them.  However they are useful to understand and as this is where they fit in the progression of complexity, this is where I will discuss them.

The idea with a precedence grammar is that the designer can make extra statements about different productions, giving them different precedence.  This can apply in two different ways.

Firstly it can be used to choose between two completed items in a state.  If we are trying to treat a grammar as LR(0.5) and find that some state has two completed items, we can “fix” the grammar by specifying a precedence for the various productions.  I’d like to give an example here but it would end up being a toy example of a completely useless grammar and so wouldn’t really help. I personally cannot see why you would ever want a grammar which had two completed items in the same state.  It means that some sequence of input tokens could be treated as one thing or as another depending only on what comes next.  That sounds like a design mistake to me.  Maybe I’ll eat my words later, but for now this means I cannot find a useful example – sorry.

The other way that precedence can be useful is much easier to explain.  It concerns the choice between SHIFT and REDUCE.  As noted above, an LR(0.5) will always SHIFT when it can, and only reduce when it cannot.  That might not be what you want.

Here there is a classic example: expression parsing.  Consider this grammar.

expression -> expression "+" expression
            | expression "*" expression

A state in this grammar could have an item set containing

expression -> expression "*" expression .
expression -> expression . "+" expression

That is, we have parsed an expression, a multiplication symbol and another expression, and now we are faced with a choice of REDUCING the multiplication to a single expression, or SHIFTing in a plus sign.

The mathematically standard thing to do would be to REDUCE as multiplication has higher precedence than addition.  However the LR(0.5) grammar parser would shift by preference which would produce the “wrong” result.  So this grammar is “wrong”, but is easily fixed by giving the multiplication production a higher precedence than the addition production.  The grammar processing step would then notice that a REDUCE step has a higher precedence than the SHIFT, and would remove the SHIFTed symbol from the “go to” table.

In order to make repeated use of the same operations (e.g. “a + b + c”) group correctly (as “(a + b) + c”) we would also want to reduce an addition before shifting another plus sign. So a production in a “completed item” might need a higher precedence than the same production in a “terminal item”.  To get a completely general solution, we require each precedence level to be declared as either left-associative or right-associative.  If the former, then reduce beats shift.  If the latter, then shift beats reduce.
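
A sketch of the resulting SHIFT-versus-REDUCE decision, with precedence levels and associativity supplied by the grammar designer, might look like this.  The ‘prec’ table and its layout are assumptions for illustration, loosely in the style of yacc-like tools rather than anything taken from Ocean.

# Illustrative sketch: resolving a SHIFT/REDUCE choice using precedence
# annotations.  'prec' maps a production or a terminal to a
# (level, associativity) pair; a higher level binds more tightly.

def prefer_reduce(production, lookahead, prec):
    reduce_level, assoc = prec[production]
    shift_level, _ = prec[lookahead]
    if reduce_level != shift_level:
        return reduce_level > shift_level   # e.g. reduce "*" before shifting "+"
    return assoc == "left"                  # equal levels: left-assoc means reduce first

# With the multiplication production at a higher level than "+", and "+"
# declared left-associative, "a + b + c" groups as "(a + b) + c".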

These precedences can be useful, but they aren’t really required.  The same effect can be achieved with more non-terminals. For example

expression -> expression "+" term
            | term
term -> term "*" factor
      | factor

By giving different names to things that are added and things that are multiplied, and by putting different names on either side of the operations, the precedence and associativity can be specified implicitly and with complete generality.

Which of these forms you prefer is very much a subjective issue.  There is little practical difference, so it is best to choose the one that makes most sense.  I think I prefer the latter, though the former is briefer and that is not insignificant.

The one practical difference that exists is that the second form involves more REDUCE steps during parsing.  Reducing

term -> factor

doesn’t really do much, but is a step which is completely unnecessary in the first grammar which needed precedence to disambiguate.  So if you find your parser spending too much time reducing trivial productions, it might be useful to re-write it in a form that doesn’t require them.

SLR(1) Grammars

So far we have looked at what we can do with just the “go to” tables and some simple conventions or annotations.  But it is possible to extract more details from the grammar itself and see if we can resolve some potential conflicts automatically.

The first of these is called “SLR”, for “Simple LR” grammar.

SLR(1) adds the computation of “FOLLOW” sets.  For each non-terminal “s”, “FOLLOW(s)” is the set of all terminals which could possibly follow “s” anywhere in any sentential form derived by the grammar.  This is then used to build the ACTION tables.

When we have a completed item, we look at the symbol in the head of that item, find the FOLLOW set for that symbol, and for every terminal in the FOLLOW set, add an Action to REDUCE that production if the given terminal is the look-ahead symbol.

This will often clearly identify which symbols should trigger a reduction (and which reduction to trigger), and which symbols should be shifted.  Consequently it also identifies which symbols are in error.

If there is any overlap between the FOLLOW sets of reducible productions, or between a FOLLOW set and the SHIFTable terminals, then the grammar is said to not be SLR(1).  If there is no conflict, the grammar is fine and we have a well defined action table.

Computing the FOLLOW sets requires three simple steps.

Firstly we flag all the symbols that can produce an empty string.  Normally relatively few symbols can do this, though any time you have a list that could be empty, you will need a “nullable” symbol (as these are called).  These are found by an iterative process.

We start by flagging all the non-terminals which are the heads of productions with no symbols on the right hand side.  Then we flag all non-terminals where all the symbols on the right hand side are flagged.  If there were any, then we repeat the process again and again until we cannot flag any more non-terminals as nullable.  Then we can stop.

Having done this, we now proceed to the second step which is to build a FIRST set for every non-terminal.  You can include terminals as well if it makes it easier, however the FIRST set for a terminal contains simply that terminal, so it is quite easy to make a special case of terminals.

“FIRST(s)” is the set of all terminals which could be the “first” terminal in an expansion of “s”.  We compute this using a similar iterative approach to calculating “nullable”.

We walk through all the productions in the grammar and add to FIRST() of the head symbol, the FIRST set of every symbol in the body which is preceded only by nullable symbols.  So the FIRST set of the first symbol on the RHS is always added, and the first set of the second symbol is too if the first symbol was flagged as being nullable.  And so on.

We continue this process checking each production again and again until no more changes in the FIRST sets can be made.  Then we stop.

Once we have the nullable flag and the FIRST sets we can proceed to the third step which is to create the “FOLLOW” sets.  This again follows a pattern of repeating a partial calculation until no more change can be made.

To find “FOLLOW(s)” we find every occurrence of “s” in the body of any production, and add FIRST(t) where “t” is the symbol after “s”, or any symbol following “s” but only separated from it by nullable symbols.  If all of the symbols following “s” are nullable, then we also add FOLLOW of the head of the production.

Adding the FIRST of following symbols only needs to be done once, as FIRST is stable and won’t change any more.  Adding FOLLOW of the head might need to be done repeatedly as other changes to FOLLOW sets might need to flow back to this one.

Again, once we don’t find any more changes we can stop.  At this point we have complete FOLLOW sets and can determine if the grammar is SLR(1), and if so: what the action table is.  If the grammar is not SLR(1), it might be fixable with precedence annotations.
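
The three steps translate into code quite directly.  Here is a sketch in the same style as before, iterating each step to a fixed point; it reuses the (head, body) production lists from the earlier sketches.

# Illustrative sketch: nullable flags, FIRST sets and FOLLOW sets, each
# computed by repeating a pass until nothing more changes.

def compute_sets(productions):
    heads = {head for head, _ in productions}
    terminals = {s for _, body in productions for s in body} - heads

    nullable = set()
    changed = True
    while changed:                            # step 1: nullable symbols
        changed = False
        for head, body in productions:
            if head not in nullable and all(s in nullable for s in body):
                nullable.add(head)
                changed = True

    first = {t: {t} for t in terminals}       # FIRST of a terminal is itself
    first.update({h: set() for h in heads})
    changed = True
    while changed:                            # step 2: FIRST sets
        changed = False
        for head, body in productions:
            for sym in body:
                new = first[sym] - first[head]
                if new:
                    first[head] |= new
                    changed = True
                if sym not in nullable:
                    break                     # later symbols cannot be first

    follow = {h: set() for h in heads}
    changed = True
    while changed:                            # step 3: FOLLOW sets
        changed = False
        for head, body in productions:
            for i, sym in enumerate(body):
                if sym in terminals:
                    continue
                found = set()
                for later in body[i + 1:]:    # FIRST of what can follow 'sym'
                    found |= first[later]
                    if later not in nullable:
                        break
                else:
                    found |= follow[head]     # everything after 'sym' is nullable
                new = found - follow[sym]
                if new:
                    follow[sym] |= new
                    changed = True
    return nullable, first, follow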

LALR(1) grammars

SLR(1) grammars are pretty general but not completely so.  If a non-terminal is used in two quite different parts of the grammar then it is possible that it is followed by different symbols in different places.  This can result in it reporting conflicts that won’t actually happen in practice.  This is because we have been too general in generating the FOLLOW sets, assuming that the FOLLOW set of any symbol is the same in every state.  Often it is, but not always.

If we are going to bother generating FOLLOW sets, and particularly if we are doing it for the purpose of checking if there are any unexpected possible inconsistencies in the grammar, then we may as well do it properly.  Doing it properly is called “LALR(1)”, where “LA” stands for “LOOK AHEAD”.

If you look in the literature you can find lots of different ways of calculating the LALR LOOK-AHEAD sets.  Many of these start with the states we have from the LR(0) calculation and go from there.  I don’t really like those – I find them confusing and they don’t really make it clear what is happening (without lots of head scratching).  So I’m presenting my way of doing it.  I didn’t find this in the literature, but maybe that is because I didn’t understand all that I read.  I’m sure someone did come up with it before me.

This procedure requires the FIRST sets and the ‘nullable’ flag developed for the SLR grammars, so we will keep them but discard the FOLLOW sets.  In their place we will generate look-ahead or LA sets, one for each item in each state.  So LA(s,i) is the set of terminals which we expect to see after the item ‘i’ is reduced in, or following, state ‘s’.

If we go back to the description of items, states, and the go to table, it is not hard to see in retrospect that we lost some information, at least if we were considering an LR(1) style grammar.  We didn’t include the look-ahead symbols in our description of state, but now we must.

So: each item in a state is not only a production and an offset in the body of that production.  It must now also contain a set of terminals which can follow after that production (i.e. after the body or, equivalently, after the head).  For the items based on the “start” symbol that are used to populate the first state (state zero), the look-ahead set contains only the special “end-of-file” terminal.

When we complete a state by adding an item for each production from each non-terminal which follows “DOT”, we calculate the LA set for that item in much the same way that we calculated FOLLOW.  We add together the “FIRST” sets for all subsequent symbols which are separated from the target non-terminal only by nullable symbols, and we use this union of FIRST sets as the LA set of each item we make from that target non-terminal.

If all of the symbols following a non-terminal are nullable (or if there are no symbols following it), we need to add in the LA set for the source item.

So if a state contains the item:

A -> B C . D E {L}

Where “{L}” is the look-ahead set, then when we add an item based on a production with “D” in the head, the look-ahead set must be set to FIRST(E), and if E is nullable, this must be unioned with “L”, so

D -> X Y Z { FIRST(E) + L}

When we create a new state using selected items from an old state, the LA sets simply get copied across with them.
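
As a sketch of that closure step, using the invented grammar structures from above plus an equally invented “item” structure: given an item whose DOT sits in front of a non-terminal, this computes the LA set that every item made from that non-terminal’s productions should receive.

struct item {
    int prod;           /* index into g->p[] */
    int dot;            /* how many body symbols are before DOT */
    uint64_t la;        /* look-ahead set for this item */
};

/* LA set for the items added when DOT sits in front of a non-terminal:
 * FIRST of the following symbols, plus our own LA if they are all
 * nullable (or absent).
 */
static uint64_t closure_la(struct grammar *g, struct item *it)
{
    struct production *pr = &g->p[it->prod];
    uint64_t la = 0;

    for (int k = it->dot + 1; k < pr->body_len; k++) {
        la |= g->first[pr->body[k]];    /* FIRST of what follows */
        if (!g->nullable[pr->body[k]])
            return la;
    }
    /* everything after the target was nullable (or there was nothing) */
    return la | it->la;
}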

Now we come to the interesting part.  If you look back at how we generated states before, you will recall that each time we generated a new state we had to check whether it was in fact new, or had been seen before.  If it had been seen before we just reused the old state.

Now that we are carrying extra information in a state we need to know what to do with that.  There are two options.

Firstly, we can decide that the look-ahead set is an important part of the state (which it is) and so if the look-ahead sets are different, we have to create a new state.  This will mean that, in general, the LALR process will create more states than the LR(0) or SLR(1) processes.  Often many many more.  If you choose this approach you actually end up with “canonical LR(1)” which is described next.  So let’s not get ahead of ourselves.

If you decide that, while important, the look-ahead sets shouldn’t justify making a new state, then you can instead update the old state by adding in (i.e. set-union) the LA sets of the new state with those of the old state.  This loses a little bit of information and so there can be grammars which will end up with conflicts after the LALR process, but which will work fine with canonical LR(1).  But in many cases there won’t be conflicts, so LALR is enough.

As we have updated the LA sets of a state, and as the LA sets can propagate through to other states, we need to re-process every state where we changed the LA set.  This can result in each state being processed twice or even more on average.  It appears to be only a constant increase in effort, so should not be too burdensome.

After we have generated all the new states, and propagated all LA set changes as far as they will go, we have complete LALR(1) LA sets from which we can form action tables.
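
A sketch of that merge-and-propagate step might look like the following.  It reuses the “item” structure from the closure sketch, assumes the items of a state are kept in a canonical order so they can be compared by index, and the fixed-size tables and the “needs_processing” flag are inventions of the sketch rather than anything taken from parsergen.

struct state {
    struct item items[32];      /* core plus closure, in a canonical order */
    int nitems;
    int needs_processing;       /* set when LA sets here have changed */
};

static struct state states[1024];
static int nstates;

static int same_core(struct state *a, struct state *b)
{
    if (a->nitems != b->nitems)
        return 0;
    for (int i = 0; i < a->nitems; i++)
        if (a->items[i].prod != b->items[i].prod ||
            a->items[i].dot != b->items[i].dot)
            return 0;
    return 1;
}

/* Either record 'cand' as a brand new state, or merge its LA sets into
 * the existing state with the same core, flagging that state for
 * re-processing if anything actually changed.
 */
static int add_or_merge(struct state *cand)
{
    for (int s = 0; s < nstates; s++) {
        if (!same_core(&states[s], cand))
            continue;
        int changed = 0;
        for (int i = 0; i < cand->nitems; i++) {
            uint64_t old = states[s].items[i].la;
            states[s].items[i].la |= cand->items[i].la;
            if (states[s].items[i].la != old)
                changed = 1;
        }
        if (changed)
            states[s].needs_processing = 1;   /* change must propagate */
        return s;
    }
    states[nstates] = *cand;
    states[nstates].needs_processing = 1;
    return nstates++;
}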

Again, these might result in completely unique actions for each terminal, or might result in conflicts.  If the former, we say the grammar is LALR(1). If the latter, it is not.

LALR(1) usually provides enough power without too great a cost and seems to be the best trade off.  As noted, I am most interested in LALR not for the high quality action tables it produces (I don’t want to use them) but for the fact that it will calculate the LA sets and tell me if there are any cases where both a SHIFT and a REDUCE might be appropriate, and can provide extra information to tell me why a state might have two different possible REDUCE actions.  This information will, I think, be invaluable in understanding the grammar as I am creating it.

Canonical LR(1) grammars

The “canonical LR(1)” style of grammar is the original one described by Knuth before others found ways to simplify it and make it more practical.  For LR(1), “usefully” (as in “until we have a set of symbols that can usefully be reduced”) requires a lot more states than “LR(0)” and all the intermediate forms.  These states reflect the various different possibilities of look-ahead and generally aren’t very interesting.

There certainly are grammars which are LR(1) but not LALR(1), but I don’t think they are very interesting grammars.  Still they might be useful to some people.

The only conflicts that LALR(1) will report, but which LR(1) will know to be false conflicts, are REDUCE-REDUCE conflicts, i.e. two different reducible productions in the same state.  As I have already said, I think this is a design mistake, so I have no use at all for LR(1).  If LALR(1) reports a conflict, then I think it is an error even if it is not really a conflict.

To construct the states for LR(1) we follow the same approach as LALR, including an LA set with each item in each state.  However when we look to see if a new state already exists, we require that both the core items and their LA sets are identical.  If they are, then it is the same LR(1) state.  If not, we need a new state.  No merging of LA sets happens, so we don’t need to repeatedly process each state.

This produces more states.  I have one simple grammar which results in 27 LALR(1) states and 55 LR(1) states.  I have another I’m playing with which currently has 174 LALR(1) states and 626 LR(1) states.

An Implementation.

Theory is never really interesting without a practical implementation to show how it works.  One of the great appeals of computer programming is that creating practical implementations is quite easy (unlike economics, for example, where experimenting with national economies requires way too much politics!)

So I have written a parser generator that realises all this theory.  It can use any of the different analysis approaches listed above.  The generated parser only uses LR(0.5) though.  I don’t bother building a list of reducible productions and choosing between them based on look-ahead as I don’t think this will ever actually be useful.

Also the generator does not yet handle precedence annotations.  The annotations can be given but they are ignored.  I simply ran out of time and wanted to get this published before yet another month disappeared.  I hope to add the precedence handling later.

The link above is to the version as it is today.  I have other plans to enhance it as will be seen in my next post but I wanted the code linked here to represent only the ideas listed here.

If you want to try this out, you can:

git clone git://ocean-lang.org/ocean
cd ocean/csrc
make; make
./parsergen --LALR calc.cgm
./parsergen --LR1 calc.cgm

This will print a report for the grammar showing all of the states, complete with items and LA sets.  This might help with following the above explanation which is, at some points, a little dense.

Note that if you want to create your own grammar, you need a “markdown” formatted file prepared for `mdcode` processing.  Hopefully `calc.cgm` is a usable example.

 

Lexical Structure

It is traditional to divide the parsing of a program into two conceptually separate stages – the lexical analysis which extracts a stream of tokens from a stream of characters, and the grammar matching stage which gathers those tokens into grammatical units such as declarations, statements, and expressions.  Recognising the value of tradition, Ocean will similarly have separate lexical and grammatical stages.  This note will focus on the lexical stage.

Future language design decisions will refine many details of the lexical structure of the language, however there is a lot of commonality among current languages and we can build on that commonality to establish a base approach to lexical analysis which can then be fine tuned when the details of the language are known.

This commonality suggests that the grammars of programming languages are comprised of identifiers, reserved words, special symbols, numeric literals, and string literals.  Comments need to be recognised but don’t form tokens for the grammar.  White space is mostly treated like comments – recognised but not tokenised.  However line breaks and leading indents sometimes are significant in different ways, so we need to decide whether and how they might be significant in Ocean.

Identifiers

Identifiers are mostly strings of letters and digits, though they don’t start with a digit. Sometimes other characters are allowed.  This pattern is most easily summarised as a set of characters that can start an identifier and another set of characters that can continue an identifier.  Fortunately Unicode provides well defined character classes “ID_Start” and “ID_Continue” which are suitable for exactly this purpose, providing we base the language on Unicode.

While ASCII was the old standard and is still very useful, Unicode does seem to be the way of the future and it would be foolish to ignore that in the design of a new language.  So an Ocean program will be written in Unicode, probably in the UTF-8 encoding (though that is an implementation detail) and will use ID_Start and ID_Continue as the basis for identifiers.

Beyond that it may be desirable to add to these character sets.  ID_Start does not contain the underscore, which many languages do allow.  Neither contains the dollar sign which, for example, Javascript allows in identifiers.

At this stage it seems best to decide that the lexical analysis engine should take a list of “start” characters and a list of “continue” characters to be added to ID_Start and ID_Continue, which can then be used to create identifiers.  “start” will contain at least the underscore.  The fate of ‘$’ and any other character will have to await future language decisions.
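
A sketch of that configurable test, reduced to ASCII: isalpha() and isalnum() stand in for the ID_Start and ID_Continue tables, which the real scanner must consult, and the structure and function names here are mine rather than the scanner’s.

#include <ctype.h>
#include <string.h>

struct id_config {
    const char *extra_start;    /* e.g. "_" */
    const char *extra_cont;     /* e.g. "_", perhaps "_$" */
};

static int is_id_start(int ch, const struct id_config *cf)
{
    return isalpha(ch) ||
           (cf->extra_start && ch && strchr(cf->extra_start, ch));
}

static int is_id_continue(int ch, const struct id_config *cf)
{
    return isalnum(ch) ||
           (cf->extra_cont && ch && strchr(cf->extra_cont, ch));
}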

Reserved Words

Reserved words are a specific subset of the class of words that can be used as identifiers. They are reserved for grammatical use, such as “if” and “while” and “else”.  We cannot know at this stage what the list will be, just that there will be a list which will need to be provided to the lexical tokenizer.  These words must only contain valid identifier characters.

Special Symbols

These are like reserved words, but they are made up from characters other than those used in identifiers (or those which mark numbers and strings, which come later).  One could imagine having tokens which combine identifier characters and non-identifier characters, such as “+A”, but that seems like a bad idea.  It is more natural to see that as two tokens, “+” and “A”, and to let the grammar join them together as needed.

There is an important difference between reserved words and special symbols.  Words that aren’t reserved can still be valid identifiers, while symbols that aren’t special are probably illegal.  So if “while” is a reserved word but “whilenot” isn’t, then the latter is simply an unrelated identifier.  However if “=” is a special symbol and “=/” isn’t, then the latter is not valid as a single token at all, and we must parse it as two tokens “=” and “/”.

So for identifiers and reserved words, we take the longest string of characters that is valid for an identifier and then decide if it is actually a reserved word.  For special symbols we take the longest string of characters which is a valid symbol and stop the current token as soon as we see a character which would not result in a valid token.  Thus any prefix of a special symbol must also be a special symbol.  This should not be a hardship as longer symbols are only chosen when the length is needed to distinguish from something shorter.

So together with the list of reserved words, the lexical engine will need a list of special symbols.  These will probably be ASCII rather than full Unicode.  While permitting Unicode is important, requiring it would probably be inconvenient.
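
A sketch of the longest-match rule for special symbols; the symbol table shown is purely illustrative.

#include <string.h>

/* Longest match over the symbol table.  Because every prefix of a
 * longer symbol is itself required to be a symbol, scanning could also
 * proceed a character at a time with the same result.
 */
static const char *symbols[] = {
    "=", "==", "<", "<=", "+", "+=", "{", "}", "(", ")", NULL
};

static size_t match_symbol(const char *text)
{
    size_t best = 0;
    for (int i = 0; symbols[i]; i++) {
        size_t len = strlen(symbols[i]);
        if (len > best && strncmp(text, symbols[i], len) == 0)
            best = len;
    }
    return best;        /* characters consumed; 0 means no symbol here */
}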

Numeric literals

There are two questions we need to resolve for numeric literals: what character strings can be used, and what do they mean.  We only need to answer the first here, but it might be useful to dwell a little on the second too.

I really like the idea in the Go language that numeric literals don’t carry strong type information.  They are primarily a value, and a value can be used wherever the value fits.  So “1e9” and “4.0” might be integers, and “34” might be a float.  I intend to copy this idea in Ocean and possibly carry it even further so that “34” and “34.0” are completely indistinguishable to the rest of the language (Go can see a difference in some cases).

Given my principle that any feature available to built-in types should also be available to user-defined types, it follows that if a numeric literal might be useful as an integer or a float, it should also be usable for any other type.  An important example is the “complex” type which might benefit from literal constants.  To express these fully we need to distinguish the “real” from the “imaginary” components and this is typically done using a trailing ‘i’ or ‘j’.

Generalising this suggests that a numeric constant should allow a short suffix which can be used by user-defined types in arbitrary ways.  The exact details of this will have to be left for a later stage of the language design.  For now, numeric constants will be allowed 1 or 2 following letters, which as yet have no specified meaning.

There is plenty of precedent to follow for the numeric part:

  • “0x” means that the following digits are hexadecimal.
  • “0b” means following digits are binary.
  • A leading “0” implies octal, unless there is a decimal point or exponent.
  • An exponent is “e” or “E” followed by a decimal number which multiplies the preceding number by ten to the given power.
  • A “decimal point” can be used to specify a floating point number.

Not all of this precedent is necessarily useful though.  In particular the leading “0” meaning “octal” is an irregularity in the design as it is really “a leading 0 followed only by digits”.  This usage is also error prone as regular mathematics ignores leading zeros.  So Ocean will follow Python 3 and other languages and use “0o” for octal, making a leading zero only legal when followed by an ‘x’, ‘o’, ‘b’, or ‘.’.

Other possibilities don’t have such strong precedents but are still useful.  For example, “0x” can also introduce a floating point number with a hexadecimal point, where a “pNN” suffix gives a power of 2.  There is also the question of whether a decimal point can start a number, or whether a digit is required.

Another possibility to add to the mix is that, as long strings of digits can be hard to read, it is sometimes permitted to insert underscores to aid readability, much as spaces are often used when writing large numbers in natural-language text.  Some languages allow the underscores anywhere; others impose stricter requirements.  For example in Ceylon a decimal number can only have them every 3 digits, and in hexadecimal or binary they can be every 2 or every 4 digits.

Finally, some languages allow negative literals, whether by including a leading “-” in the number token, or by allowing a leading “_” to mean a negative literal.

To meet the goal of being expressive, Ocean should allow as many different ways to encode numbers as could be useful.  To meet the goal of minimising errors, it should exclude possibilities that could be confusing and don’t actually add expressivity.

So Ocean will allow “0x” hexadecimal numbers, “0o” octal numbers, and “0b” binary numbers, with or without a “decimal” point and a “p” exponent, and decimal numbers with or without a decimal point and an “e” exponent.  A decimal number can only start with a zero if that is immediately followed by a decimal point, and in that case the zero is required: a number cannot start with a point.  The “p” or “e” exponent marker can also be “P” or “E”, and the number following it is in decimal, with no leading zero, but possibly with a leading + or -.

As this is an experiment as much as a language I will try allowing the use of a comma for the decimal marker as well as a period.  This means that commas used in lists might need to be followed by a space.  Requiring that is a bit ugly, but actually doing it is always a good idea.  If this turns out to be problematic it can be changed.

All of these can have single underscores appearing between any two digits (including hexadecimal digits).  As another experiment, spaces will be allowed to separate digits.  Only a single space (as with underscores) and only between two digits.  Though this might seem radical it is less of a problem than commas as numbers are almost always followed by punctuation in a grammar.

All numeric literals are non-negative – the grammar will allow a prefix negation operator which will be evaluated at compile time.

This numeric part will be evaluated precisely, such as can be achieved with the GNU “libgmp” bignum library.  Conversion to machine format will only happen at the last possible point in language processing.
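
A simplified sketch of scanning the numeric part follows.  It handles the 0x/0o/0b prefixes, a “decimal” point, a p or e exponent, and single underscores between digits, but leaves out the comma decimal marker, the single-space separator, the no-leading-zero rules, and the 1 or 2 letter suffix.

#include <ctype.h>

static int digit_ok(int ch, int base)
{
    if (base == 16) return isxdigit(ch);
    if (base == 10) return isdigit(ch);
    if (base == 8)  return ch >= '0' && ch <= '7';
    return ch == '0' || ch == '1';              /* base 2 */
}

/* Scan digits, allowing a single underscore between any two digits. */
static const char *scan_digits(const char *s, int base)
{
    while (digit_ok(*s, base)) {
        s++;
        if (*s == '_' && digit_ok(s[1], base))
            s++;                                /* skip one separator */
    }
    return s;
}

/* Return a pointer just past the numeric token starting at 's',
 * or 's' itself if there is no number here.
 */
static const char *scan_number(const char *s)
{
    int base = 10;
    char expmark = 'e';

    if (s[0] == '0' && (s[1] == 'x' || s[1] == 'o' || s[1] == 'b')) {
        base = s[1] == 'x' ? 16 : s[1] == 'o' ? 8 : 2;
        expmark = 'p';                          /* non-decimal bases use p */
        s += 2;
    }
    if (!digit_ok(*s, base))
        return s;                               /* not a number after all */
    s = scan_digits(s, base);
    if (*s == '.' && digit_ok(s[1], base))      /* the "decimal" point */
        s = scan_digits(s + 1, base);
    if (tolower(*s) == expmark &&
        (isdigit(s[1]) ||
         ((s[1] == '+' || s[1] == '-') && isdigit(s[2])))) {
        s += (s[1] == '+' || s[1] == '-') ? 2 : 1;
        s = scan_digits(s, 10);                 /* exponent is decimal */
    }
    return s;
}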

String Literals

String literals are simply arbitrary sequences of characters surrounded by some quote character.  Or at least, almost arbitrary.

It can improve language safety to not allow literal newline characters inside string literals. Thus a missing close quote is easily recognised.  When multi-line string constants are required, one effective approach is the triple-quote used by Python and other languages. This is common and well understood, so Ocean will borrow the idea.

Inside a string, escapes may or may not be permitted.  These are typically a backslash (“\”) followed by one or more characters.  The minimum required would be “\n” for a newline, “\t” for a tab, and “\ooo” with octal digits for an arbitrary ASCII character.  We could go to extremes and allow “\U{Name of Unicode Character}” which allows the use of long Unicode character names.

This is an area where extension might be useful so it is important that any construct that might be useful in the future should be illegal at first.  So all escapes not explicitly permitted are in error.

One particular feature of strings that I personally have never liked is that to include a quote inside a quoted string, you escape the quote.  This means that the simple rule “characters between quotes make a string” doesn’t apply and any parsing engine must recognise the escape character.  As this is my language, I will choose that quotes cannot be escaped.  Instead Ocean will allow “\q” which will expand to whichever quote character starts and ends the string.

The escape sequences understood will initially be:

  • “\\” – the “backslash” escape character.
  • “\n” – newline
  • “\r” – return
  • “\t” – tab
  • “\b” – backspace
  • “\q” – quote
  • “\f” – formfeed
  • “\v” – vertical tab
  • “\a” – alert, bell.
  • “\NNN” – precisely 3 octal digits for an ASCII character (0-377)
  • “\xNN” – precisely 2 hexadecimal digits
  • “\uXXXX” – precisely 4 hexadecimal digits for a Unicode codepoint.
  • “\UXXXXXXXX” – precisely 8 hexadecimal digits for a Unicode codepoint.
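
A sketch of decoding the single-character escapes in the list above; the numeric forms (“\NNN”, “\xNN”, “\uXXXX”, “\UXXXXXXXX”) need a little more digit parsing and are omitted here.

/* 'ch' is the character after the backslash; 'quote' is whichever quote
 * character opened the string, so "\q" can expand to it.
 */
static int decode_escape(char ch, char quote)
{
    switch (ch) {
    case '\\': return '\\';
    case 'n':  return '\n';
    case 'r':  return '\r';
    case 't':  return '\t';
    case 'b':  return '\b';
    case 'q':  return quote;
    case 'f':  return '\f';
    case 'v':  return '\v';
    case 'a':  return '\a';
    default:   return -1;       /* unknown escapes are an error */
    }
}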

There are three quote characters that are sometimes used: single quote, double quote, and back quote.  Ocean will almost certainly allow all three.  The lexical analysis code will certainly recognise any of the three that haven’t been claimed as a special symbol.  Though I’m leaving this open at present, I expect that strings surrounded by backquotes will not have any escape processing done, while the other strings will.

If two quote symbols appear together, that is simply an empty string.  If three appear, then it is a multi-line string.  The next character must be a newline and the closing triple quote must appear on a line by itself.  The newline after the first triple quote is not included in the string, the newline before the second triple quote is.

Quoting is permitted inside multi-line strings in the same way as with single-line strings, so triple-backquote should be used to avoid all escape interpretation.  In a multi-line string that allows escapes, an escape character at the end of a line hides that newline character.  This is the only way to achieve a multi-line string that does not end with a newline character.

With multi-line strings comes the question of indenting.  It would be nice to be able to indent the whole string to line up with surrounding code, and to have the indent stripped when the parsed token is turned into a string constant.  But the key question is: how exactly to specify what to strip.

One option would be to strip the same indent as was on the line that contains the start of the token.  However it is nice for values to be more indented than – say – the assignment that uses them.

Another option would be to strip the same indent as was on the line that contains the end of the token.  This places the programmer in more control and probably makes lots of sense.  It would only be safe to strip the exact sequence of characters that precede the close quote.  So if there is a mix of tabs and spaces, it must be the same mix for every line.

I’m not 100% comfortable with this, but it is certainly worth a try, so it will be how the first  attempt works.
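
A sketch of that stripping step: “prefix” is the exact run of tabs and spaces that preceded the closing quote, and a line is only stripped if it begins with exactly that sequence.

#include <stdio.h>
#include <string.h>

/* Copy 'body' to 'out', removing 'prefix' from the start of any line
 * that begins with exactly that sequence of tabs and spaces.
 */
static void strip_indent(const char *body, const char *prefix, FILE *out)
{
    size_t plen = strlen(prefix);

    while (*body) {
        if (strncmp(body, prefix, plen) == 0)
            body += plen;                   /* exact prefix: strip it */
        while (*body && *body != '\n')
            fputc(*body++, out);
        if (*body == '\n')
            fputc(*body++, out);
    }
}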

Comments

There are three popular styles of comments:

  • /* Block comments from C and others */
  • // single-line comments from C++ and BCPL
  • # single-line comments from C-shell and Bourne shell.

While others exist, these seem to be most popular, and others don’t add any extra functionality.  For the block comment, it is important to note that nesting is not permitted.  That is: the character pair “/*” may not appear inside a comment.  Allowing it does not significantly improve expressivity, and excluding it guards against forgetting or mis-typing the closing sequence “*/”.

It seems best to simply allow all of these forms of comments.  The only negative impact this might have is that it means the 3 introductory notations could not be used elsewhere, and of these, the “#” is the only symbol with significant uses in existing languages.  Where it is used, it is often for extra-language functions such as macro pre-processing. Features provided by these extra-language notations are probably best included in the main language in some way.  So at this stage, the plan for Ocean is to treat all three of these forms as comments.

Line breaks and indents

The use of context-free grammars to describe languages led to line breaks being largely ignored as syntactic elements.  Rather, a program is seen as a linear sentence in the language, and spacing is just used to improve readability.

However this approach allows the possibility of the visual formatting becoming out-of-sync with the actual structure of the program.  Further it results in syntactic elements, particularly semicolons, which are required by the language but appear unnecessary to the human reader.

Various approaches to address this have been tried in different languages.  Of these I find the approach in Python the nicest.  It may not be perfect, but recognising both the presence of line breaks and the level of indents seems valuable.

We will explore this issue more deeply when we look at using a grammar to parse the token sequence, but for now we need to know what tokens to include, and what they mean.

The key observation needed at this stage is that when a block of text is indented, it is somehow subordinate to the preceding line that was not indented.  Thus that first line doesn’t completely end until all the indented text has been processed.

So every line break will generate a “newline” token, however if subsequent lines are indented, the “newline” token will be delayed until just before the next line which is not indented.  Together with this, “Indent” tokens will be generated at the start of each line which is indented more than the previous line, with matching “Undent” tokens when that indent level terminates.  In consequence of this, indenting can be used to bracket syntactic objects, and to whatever extent newlines terminate anything, they only do so after any indented continuation is complete.

So for example

if A
   and B {
       X
       Y
       while C do
            Z
}

would parse as

if A INDENT and B { INDENT X NEWLINE Y NEWLINE while C do INDENT Z
NEWLINE UNDENT NEWLINE UNDENT NEWLINE UNDENT NEWLINE } NEWLINE

Note that INDENT and UNDENT come in pairs, and the NEWLINE you would expect before an INDENT comes after the matching UNDENT.  This will allow a multiple UNDENT, such as the one following the “Z”, to terminate a number of syntactic units.
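
The token generation can be sketched as follows.  The emit() function and the indent measurements are stand-ins for whatever the real scanner does, and lines that return to an indent level that was never opened are not handled.  Driving line_break() with the indent pairs from the example – (0,3), (3,7), (7,7), (7,7), (7,12), (12,0) and finally (0,0) at end of input – reproduces the token sequence shown above, with the word tokens interleaved.

#include <stdio.h>

enum tok { TK_NEWLINE, TK_INDENT, TK_UNDENT };

static void emit(enum tok t)
{
    static const char *name[] = { "NEWLINE", "INDENT", "UNDENT" };
    printf("%s ", name[t]);
}

static int open_levels[64];     /* indents of lines whose NEWLINE is delayed */
static int depth;

/* Called at each line break: 'cur' is the indent of the line just ended,
 * 'next' is the indent of the line about to start (0 at end of input).
 */
static void line_break(int cur, int next)
{
    if (next > cur) {
        open_levels[depth++] = next;    /* delay this line's NEWLINE */
        emit(TK_INDENT);
        return;
    }
    emit(TK_NEWLINE);                   /* the line really ends here */
    while (depth && next < open_levels[depth - 1]) {
        depth--;
        emit(TK_UNDENT);                /* close the indented region ...  */
        emit(TK_NEWLINE);               /* ... delivering the delayed NEWLINE */
    }
}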

Interaction with literate programming

As previously discussed, an Ocean program is, at least potentially, presented as a literate program which may have several code blocks from various parts of the document.  The above discussion of lexical analysis has assumed a simple sequence of characters, while in actual fact we have a sequence of blocks of characters.  Does that make a difference?

As blocks are always terminated by newlines, and as indenting is carefully preserved across blocks, the only issues that could arise would be with lexical elements which contain newlines, but aren’t newlines themselves.  Of these there are two.

Firstly, block comments can contain newlines.  Therefore, is it acceptable for a comment that starts in one code block to be continued in another?  Almost certainly this would be a mistake and would lead to confusion.  It is possible that it might be useful to “comment out” a block of code, however if that is important, then the language should have a more focused means to achieve the same without confusion.  Our literate programming syntax makes it possible to mark a block as an “Example”, not to be included in the code, and a section reference line can easily be commented out with a single-line comment.  So comments must not span literate program code blocks.

Secondly, multi-line strings can contain newlines.  For much the same reasons as with comments, these strings will not be permitted to span literate program code blocks.  The fact that we strip the indent of the closing quote from all lines makes interaction with literate programming much cleaner than leaving all the indentation in.

Trying it out.

Having lots of ideas without any code isn’t much fun.  So to accompany this blog is my first release of a token scanner – written as a literate program of course.

It is somewhat configurable so that if I change my mind about spaces in numbers they are easy to remove.

This program is one of the reasons that this post took so long to arrive.  I had written a scanner prior to starting on the post, but wanted to re-write it as a literate program and wanted to incorporate ideas I had gained since writing it.  This took quite a while as I’m still new to the literate programming approach and my first draft had a rather confused structure.  However I’m really happy with the result and I think the code and design are better for the extra effort put in.

Literate programming?

I was going to start out by describing the lexical structure of Ocean, but as I thought more about it, I realised there was something that needed to come first.  And that something is literate programming.

Literate programming is the idea – popularised by Donald Knuth – of writing programs like pieces of literature.  They can be structured as a story or an exposition, and presented in a way that makes sense to the human rather than just to a computer.

The original implementation – known as “web”, with two tools, weave and tangle – took a source document and converted it either to TeX for elegant printing, or to Pascal for execution.  So the “web” input was not read directly by the compiler.

I first came across the idea of the language being prepared for literate programming with the language “Miranda”.  A Miranda program can come in a form where most text is treated as comments, and only suitably marked pieces of text (with a “>” at the start of the line) are treated as code.  I like this idea – the language should acknowledge the possibility of literate programming from the start.  I want Ocean to acknowledge it too.

Obviously a language cannot enforce a literate style, and I don’t think I would want it to try.  I am a big believer in “write one to throw away”.  The first time you write code, you probably don’t know what you are doing.  You should consider the first attempt to be a prototype or first draft, and be ready to throw it away.  There probably isn’t a lot of point trying to weave a prototype into a story.

But once you have working code and have got the algorithms and data structures sorted out in your mind, I think there is a lot of value in re-writing from scratch and presenting the whole more clearly – as a piece of literature.  So I see two sorts of programs: prototypes that just have occasional comments, and presentation code which is written in full literate style.  I find that having an audience to write to – even an imaginary one – motivates me to write coherently and close the gaps.  Presenting code as a piece of literature may equally motivate good quality.

Having two different forms at first suggested to me the idea of two different file suffixes to maintain a clear difference, but I no longer lean that way.  It might be nice to allow two different suffixes so a programmer could create “hello.ocd” – the “Ocean Draft” – first, and then create “hello.ocn”, the final program, as a separate file, but having the compiler impose different rules on the two files would likely make programmers grumpy, or at least confused.  So as long as there is a simple syntax marker at the top of the file which effectively says “this is all code”, that should be all that the compiler needs to care about.

The key elements of literate programming are to have a language for presenting the text, a way to extract the code from among that text, and some sort of cross-reference mechanism for the code.

This last is important because presenting code as a story doesn’t always follow a simple linear sequence – it is good to be able to jump around a bit.  We might want to tell the overall story of a function, then come back and explain some detail.  The extra detailed code should appear with the extra detailed explanation.  It may be possible to use language features to do this, but at this stage of the design I am far from certain.  So I want the literate-programming model to allow rearrangement of code through cross references.

While TeX has many advantages for presenting text and produced beautiful output, I do find it a bit clumsy to use.  It is maybe too powerful and a little bit arcane.  A popular language for simple markup currently is “markdown”.  The input is quite easy to read as plain text, and there are plenty of tools to convert it into HTML which, while not particularly elegant, is widely supported and can be tuned with style sheets.

“markdown” has the added advantage of having a concept of “code blocks” already built in, so it should be easy to extract code.  Using “markdown” for literate programming is not at all a new idea.  A quick search of the Internet finds quite a few projects for extracting executable code from a markdown document.  Several of them allow for cross references and non-linear code extraction, but unfortunately there is nothing that looks much like a standard.

This lack of a standard seems to be “par for the course” for markdown.  There is some base common functionality, but each implementation does things a bit differently.  One of these things is how code sections are recognised.

The common core is that a new paragraph (after a blank line) which is indented 4 or more spaces is code.  However if the previous paragraph was part of a list, then 4 spaces means a continuation of that list and 8 spaces are needed for code.  In the Perl implementation of markdown, lists can nest, so in a list-within-a-list, 12 spaces are needed for code.

Then there are other implementations like “github markdown” where a paragraph starting with ``` is code, and it continues to the next ```.  And pandoc is similar, but code paragraphs start with 3 or more ~~~ and end with at least as many again.

On the whole it is a mess.  The github and pandoc markings are probably unique enough that accepting them won’t conflict with other standards.  However while the Perl markdown recognises nested lists, the python markdown doesn’t.  So code extracted from markdown written for one could be different from markdown written for the other.

But if I want Ocean language tools to be able to work with markdown literate programs (and I do) then I need to make some determination on how to recognise code.  So:

  • A paragraph starting ``` or ~~~ starts code, which continues to a matching line
  • An indented paragraph after a list paragraph is also a list paragraph.
  • An indented (4 or more spaces) paragraph after a non-list paragraph is code

This means that code inside lists will be ignored.  I think this is safest – for now at least.  If experience suggests this is a problem, I can always change it.
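
A sketch of that classification: it looks only at the first line of each paragraph and at the kind of the previous paragraph; the list markers recognised here, and the lack of fence tracking, are simplifications rather than a description of what mdcode actually does.

#include <ctype.h>
#include <string.h>

enum para { PARA_TEXT, PARA_LIST, PARA_CODE };

static int is_list_start(const char *line)
{
    while (*line == ' ')
        line++;
    return *line == '*' || *line == '-' || *line == '+' ||
           (isdigit((unsigned char)*line) && line[1] == '.');
}

/* Classify a paragraph from its first line and the previous paragraph. */
static enum para classify(const char *first_line, enum para prev)
{
    if (strncmp(first_line, "```", 3) == 0 ||
        strncmp(first_line, "~~~", 3) == 0)
        return PARA_CODE;                       /* fenced code */
    if (strncmp(first_line, "    ", 4) == 0)    /* indented 4+ spaces */
        return prev == PARA_LIST ? PARA_LIST : PARA_CODE;
    return is_list_start(first_line) ? PARA_LIST : PARA_TEXT;
}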

This just leaves the need for cross-references.  Of the examples I’ve looked at, the one I like the best is to use section headings as keys.  markdown can have arbitrarily deep section headings, so a level 6 or 7 section heading could cheaply be placed on each piece of code if needed.  So I’ll decide that all code blocks in a section are concatenated and labelled for that section.

While cross references are very valuable, it can sometimes be easier to allow simple linear concatenation.  An obvious example is function declarations.  Each might go in a separate section, but then having some block that explicitly lists all of those sections would be boring.  If anything, such a “table of contents” block should be automatically generated, not required as input.  So it will be useful if two sections with the same, or similar, names cause the code in those two (or more) sections to be simply concatenated.

It is unclear at this stage whether it should be possible to simply reuse a section heading, or whether some sort of marker such as (cont.) should be required.  The latter seems elegant but might be unnecessary.  So in the first instance at least we will leave such a requirement out.  Section names for code can be freely reused.

Inside a code block we need a way to reference other code blocks.  My current thinking is that giving the section name preceded by two hash characters (##)  would be reasonable.  They are mnemonic of the section heading they point to, and are unlikely to be required at the start of a line in real code.

These references are clearly not new section headings themselves as they will be indented or otherwise clearly in a code block, whereas section headings are not indented and definitely not in code blocks.

So this is what literate Ocean code will look like: markdown with code blocks which can contain references to other code blocks by giving the section name preceded by a pair  of hashes.

A non-literate Ocean file is simply one which starts with ``` or ~~~ and never has a recurrence of that string.

To make this fully concrete, and as a first step towards working code in which to experiment with Ocean, I have implemented (yet another) program to extract code from markdown.  Following my own advice I wrote a prototype first (in C of course, because I can’t write things in Ocean yet) and then re-wrote it as a literate program.

I found that the process more than met my expectations.  Re-writing motivated me to clean up various loose ends and to create more meaningful data structures and in general produced a much better result than the first prototype.  Forcing myself to start again from scratch made it easier to discard things that were only good in exchange for things that were better.

So it seems likely that all the code I publish for this experiment will be in literate style using markdown.  It will require my “md2c” program to compile, until Ocean is actually usable (if ever), in which case it will just need the Ocean interpreter/compiler which will read the markdown code directly.

I find that my key design criteria of enhancing expression and reducing errors certainly argue in favour of the literate style, as the literate version of my first program is both easier to understand and more correct than the original.

This program can be read here or downloaded from my “Ocean” git repository at git://ocean-lang.org/ocean/.

The naming of a language

“The naming of cats is a difficult matter, It isn’t just one of your holiday games.”  The naming of programming languages is also important.  As with any project a name is needed to be able to refer to, and it inevitably will set expectations and flavour to some degree.

I’ve had a few different thoughts about names.  My first idea was “plato”.  Plato was a philosopher and is particularly known for drawing a distinction between the real and  the ideal.  All things in the real world, circles and squares and so forth, are just poor shadows of the perfect circles and the perfect squares that exist in the ideal, or “Platonic” plane.

I actually think Plato had this backwards (though I haven’t read his work so quite possibly misunderstand him and  misrepresent his ideas here).  To my mind the ideals that we think about are poor approximations to the real world which, after all, is the reality.  The process of thinking (which is what philosophy tries to understand) involves creating abstractions that map on to real world objects and events, and in trying to find the abstractions that are both general enough to be useful, and precise enough to be truthful.

I see the role of a programming language being to fill exactly this gap.  It needs to address real world problems and tasks, but does so by generalising and abstracting and treating them mathematically.  In a sense the program exists in the Platonic plane while the implementation exists in the real world, and the language has to ensure effective communication between the two.

So “plato” is not a bad choice, but it isn’t the one I’m going to use.  I actually think “plato” would be a great name for a “platform” – like “Android” or “Gnome” or whatever.  They both start “plat”…

My next thought was to name it “Knuth” after Donald Knuth, who has had some influence on my thinking as you will see in future articles.  The naming of languages after dead mathematicians has some history (with Pascal and Ada at least), but as Mr Knuth is still alive, using the name “Knuth” doesn’t fit that pattern.  And it would probably be a bit pretentious to try to use such a name for a little project such as this.  So that name is out.

While dwelling on my real motivation for this language, I realised that it really is quite strongly influenced by my personal experience of the last few decades of programming.  This should be no surprise, but it is worth acknowledging.  It is easy to pretend that one is being broad minded and considering all possibilities and creating a near-universal design, but that is only a pretence.  The reality is that our values are shaped largely by our past hurts, and these can only come from our past experience.  I must admit that I am escaping from something, and that something is primarily “C”.

I’ve used C quite happily since the mid ’80s and enjoyed it, but have always been aware of deficiencies and it is really these that I want to correct.  I’ve watched other languages appear and evolve and there have been good ideas, but I’ve not found any really convincing.  Python has a lot going for it and I tend to use it for GUI programming, but when I do it just reminds me how much I like static typing.

So this language is to be my escape from C (at least in my dreams) and should be named as such.

C is seen to be a successor of B, which in turn grew out of BCPL.  So the joke at one time was to ask whether the next language could be “P” or “D”.  Of course it turned out to be “C++”, a joke of a different kind.  And then “D” came along anyway.

What do I want to call my successor of “C”?  The answer is easily “Ocean”.  Oceans are critical to life in many ways, but dangerous too – they need to be understood  and tamed.  Oceans are big and wide with many unknown and unexpected inhabitants.  If I want an arbitrary name for something else related to “Ocean”, I can use “Pacific” or “Indian” or “Atlantic”.  And of course an “Ocean” is like a “C”, but more so.

Having admitted that Ocean will follow on from C in some ways, I should explore a little what that means.

Primarily it means that Ocean will be a compilable language.  I’m not at all against interpreting and JIT compiling, but I don’t like to require them.  The runtime support code should not need to include a language parser, unless explicitly requested.  This means for example that a function like “eval”, which can be given some program text, is completely out.  Similarly, interpolating variable names into strings with “…${var}…” is not an option.

Some degree of introspection is probably a good idea – I haven’t really decided yet – so it may be possible for a program to manipulate language objects.  But this must not be part of the core language and it should only exist as a library for programmers with particular needs who are willing to pay the cost.

It also means that the programmer should have a lot of control.  I’m not sure exactly what this means yet, but in general the programmer should feel fairly close to the hardware, and have an easy, clear idea of when runtime support will help out and when it will stay out of the way.  Certainly the programmer should have a fairly clear idea about how their constructs use memory and CPU.

Static typing is a “must have” for me.  This is essential for the compiler to be able to find bugs, and I trust compiler coverage a lot more than test coverage (though that is important too).  There is certainly room for runtime type flexibility such as variant records, or values which can be real or NULL.  These need to be available, but they should not be the default.

So that is what C means to me: static typing, compilable, and fine control.  And that is what “Ocean” must contain – at least.

Now to be fair I must address the question of whether these early design decisions fit with my philosophy stated earlier – particularly aiding clarity and minimising errors.

Static typing is almost entirely about minimising errors.  By having types declared that the compiler can check, fewer mistakes will make it to running code.  They equally enhance clarity by making clear to the reader what type is intended for each value.

“Fine control” is sufficiently vague that it could mean anything.  I justify it by saying that it allows clear expression of precise low-level intention.

“Compilability” really hinges on the lack of “eval”, though static typing is often related.  “eval” effectively permits self-modifying code, and this is extremely hard for the compiler to assert anything concrete about at all.  So I feel fairly comfortable asserting that “eval” is a great way to introduce hard-to-detect errors, so it should be avoided where possible.  If some limited form of “eval” turns out to be particularly valuable, that can certainly be revisited when the time comes.

So while my language has no content, it now has a name: Ocean, and even a website: http://ocean-lang.org/.  Anything could happen next… but it will probably be something lexical.

An exercise in Language Design

When I was doing my honours year in Computer Science (UNSW, 1986) I wanted to design a new programming language.  That would be a rather large project for an honours year and naturally it didn’t happen.  I have remained interested in languages, though for most of the time that interest has been idle.

I recently wrote some articles about languages for LWN and that has re-awoken my interest in language design.  While I had scribbled down (or typed out) various notes about different ideas in the past, this time I seem to have progressed much further than ever before.  It probably won’t ever amount to much but I’ve decided to try to continue with the project this time and create as concrete a design and implementation as I can … in my spare time.

As part of this effort I plan to write up some of my thoughts as blog entries, and publish some source code in a git tree somewhere.  This note is the first such entry and it presents the high level design philosophy that I bring.  Undoubtedly this philosophy will change somewhat as I progress, both in clarifying the ideas I present here and in distilling new ideas from all the reflection that will go into the design process.  I’ll probably come back and edit this article as that happens, but I’ll try to make such changes obvious.

Philosophy

I see two particular goals for a language.  The first is to allow the programmer to express their design and implementation ideas clearly and concisely.  So the language must be expressive.  The second is to prevent the programmer from expressing things that they didn’t mean to express, or which they have not thought through properly.  So the language must be safe.

There are a number of aspects to being expressive.  Firstly, useful abstractions must be supported so that the thinking of the programmer can be captured.  “Useful” here is clearly a subjective metric and different abstractions might be useful to different people, depending on what they are familiar with.  Some might like “go to”, some might like “while/do”, others might like functions applied to infinite sequences which are evaluated lazily.  The language I produce will undoubtedly match my own personal view of “useful”, however I will try to be open minded.  So we need clear, useful abstractions.

The “what they are familiar with” is an important point.  We all feel more comfortable with familiar things, so building on past history is important.  Doing something in a different way just to be different is not a good idea.  Doing it differently because you see an advantage needs to be strongly defended.   Only innovate where innovation is needed, and always defend innovation clearly.  When innovation is needed, try to embed it in familiar context and provide mnemonic help wherever possible.

Being expressive also means focussing on how the programmer thinks and what meets their needs.  The needs of the programmer are primary; the needs of the compiler are secondary.  Often it is easier to understand a program when the constructs it uses are easy to compile – as there is less guesswork for the programmer to understand what is really going on.  So the needs of the compiler often do not conflict with the needs of the programmer.  When they do, it is probably a sign of poor language design which should be addressed.  If no means can be found to improve the design so that it suits both programmer and compiler, then the needs of the programmer must come first.

A key element of simple design is uniformity.  If various features are provided uniformly then the programmer will not be forced to squeeze their design into a mismatched mould in order to use some feature – the feature will be available wherever it is needed.  The most obvious consequence of this is that built-in types should not have access to any functionality that user-defined types do not have access to.  It should be possible to implement any built-in type in the language rather than having to have it known directly to the compiler.

There are probably limits to this.  “Boolean” is such a fundamental type that some aspects of it might need to be baked in to the language.  However wherever that sort of dependency can be reasonably avoided, it should be.

The second goal is preventing mistakes, and there are many aspects to this too.  Mistakes can be simple typos, forgotten steps, or deep misunderstandings of the design and implementation.  Preventing all of these is impossible.  Preventing some of them is easy.  Maximising the number of preventable errors without unduly restricting expressiveness is the challenge.

An important part of reducing errors is making the code easy to read.  In any writing, the practice of writing a first draft and then reviewing and improving it is common.  This is (or should be) equally true for writing a computer program.  So when reading the program, the nature and purpose of the algorithm and data should stand out.  The compiler should be able to detect and reject anything that might look confusing or misleading.  When reading code that the compiler accepts, it should be easy to follow and understand.

This leads to rules like “Different things should look different” and “similar things should look the same”.  The latter is hopefully obvious and common.  The former could benefit from some explanation.

There seems to be a tendency among programmers and mathematicians to find simple models that effectively cover a wide range of cases.  In mathematics, group theory is a perfect example.  Many, many different mathematical structures can be described as “groups”.  This is very useful for drawing parallels and for understanding relationships and deep structure.  However when it is carried across from mathematics to language design it does not work out so well.

For me, the main take away from my article – linked above – “Go and Rust – objects without class”, is that “everything is an object” and the implied “inheritance is all you need” is a bad idea.  It blends together different concepts in a way that is ultimately unhelpful.  When a programmer reads code and sees inheritance being used, it may not be clear which of the several possible uses of inheritance is paramount.  Worse: when a programmer creates a design they might use inheritance and not have a clear idea of exactly “why” they are using it.  This can lead to muddy thinking and muddy code.

So:  if things are different, they should look different.  Occam’s razor suggests that “entities must not be multiplied beyond necessity”.   This is valuable guidance, but leaves open the interpretation of “necessity”.  I believe that in a programming language it is necessary to have sufficient entities that different terminology may be used to express different concepts. This ensures that the reader need not be left in doubt as to what is intended.

Finally, good error prevention requires even greater richness of abstractions than clarity of  expression requires.  For the language/compiler to be able to catch errors, it must have some degree of understanding as to what is going on.  This requires that the programmer be able to describe at a rich level what is intended.  And this requires rich concepts.  It also requires complete coverage.  If a programmer uses clear abstractions most of the time and drops into less clear expression occasionally, then it doesn’t greatly harm the ability of another programmer to read the code – they just need to concentrate a bit more on the vague bits.  However that does make it a lot harder for the compiler to check.  Those lapses from clarity, brief though they may be, are the most important parts to check.

Unfortunately complete coverage isn’t really a possibility.  That was one of the points in my “A Taste of Rust” article.  It is unrealistic to expect any formal language to be very expressive and still completely safe.  That isn’t an excuse not to try though.  While the language cannot be expected to “understand” everything, careful choices of rich abstractions should be able to cover many common cases.  There will still need to be times when the programmer escapes from strict language control and does “unsafe” things.  These need to be carefully documented, and need to be able to “tell” the language what they have done, so the language can still check the way that these “unsafe” features are used.  This refers back to the previous point about built-in types not being special and all features being available to user-defined types.  In the same way, safety features need to be available in such a way that the programmer can make safety assertions about unsafe code.

As the language design progresses, each decision will need to be measured against these two key principles:

  • Does it aid clarity of expressions?
  • Does it help minimise errors?

These encompass many things so extra guidance will help.  So far we have collected:

  • Are the abstractions clear and useful?
  • Are we using familiar constructs as much as possible?
  • Have we thoroughly and convincingly defended any novelty?
  • Does this benefit the programmer rather than the compiler?
  • Is this design uniform?  Can the idea apply everywhere?  Can we make it apply anywhere else?
  • Can this feature be used equally well by user-defined types and functions?
  • Does this enhance readability? Can the language enforce anything to make this more readable when correct?
  • Are we ensuring that similar things look similar?
  • Are there different aspects to this that should look different?
  • Can we help the compiler ‘understand’ what is going on in this construct?
  • Is this “safety check” feature directly available for the programmer to assert in “unsafe” code?

Not all of these guides will apply to each decision, but some will.  And the two over-riding principles really must be considered at every step.

So there is my philosophy.  I have some idea where it leads, but I fully expect that as I try to justify my design against the philosophy I’ll be surprised occasionally.  As for you, my dear reader, I’m afraid you’ll have to wait a little while until the next instalment.  Maybe a week or so.

 

RAID – not just smoke and mirrors

My final talk at Linux.conf.au 2013 was about “md” software RAID.

Slides are here and video is here (mp4).

One take away, mainly from conversations afterwards, is that – there is a perception that – it is not that uncommon for drives to fail in a way that causes them to return the wrong data without error.  Thus using checksum per block, or 3-drive RAID1 with voting, or RAID6 with P/Q checks on every read might actually be a good idea.  It is sad that such drives are not extremely uncommon, but it seems that it might be a reality.

What does one do when one finds such a drive?  Fixing the “error” and continuing quietly seems like a mistake.  Kicking the drive from the array is probably right, but might be too harsh. Stopping all IO and waiting for operator assistance is tempting…. but crazy.

I wonder…

 

Wiggles and Diffs at LCA

My second talk at LCA2013 – the first one accepted – was on “wiggle”, my tool for applying patches that don’t apply.  In the presentation I wanted to explain how “diff” works – as I then wanted to explain why one of the things that wiggle does is more complex than a simple “diff”.  For this I came up with a simple animation that I presented as a series of “impress” slides.  Some suggested I make them into an animated “gif”, so I did.  And here it is (click for a higher-res version):

Animation of Diff algorithm

See slides for explanation

Among the useful feedback I got about wiggle:

  • UTF-8 support would be good.  This only applies to the way it breaks strings into words.  Currently it only understands ASCII.
  • Detecting patterns of “replace A with B” and looking for unreplaced copies of “A” in the original might be useful.

The slides in LibreOffice format are here and the recording of the talk is here

Linux.conf.au – one down, two to go.

At linux.conf.au this week and as always it is proving to be a great conference. Bdale’s keynote on Monday was a really good opening keynote: very wide-ranging, very high level, very interesting and relevant, very pragmatic and  sensible.

One of his key points was that we should all just keep building the tools we want to use and making it easy for others to contribute.  The long tail of developers who submit just one patch to the Linux kernel makes a significant contribution but wouldn’t be there if it was hard to contribute, hard to get the source, or hard to build the source.  With Linux all of these are relatively easy and other projects could learn from that … particularly the “easy to build” bit.

So let’s not worry about beating MS or Apple, or about claiming the year of the Linux anything.  Let’s just do stuff we enjoy and make stuff we use and share our enthusiasm with others.  If that doesn’t lead to world domination, nothing will.

For myself, I managed to get 3 speaking slots this year … makes up for not speaking for some years I guess.  My first was yesterday about the OpenPhoenux project – follow-on from OpenMoko.  It was very well attended, I got really good responses and positive  feedback.  I even managed to finish very very nearly on time.  So overall, quite a success.  I hope the next two (both tomorrow, Wednesday)  go as well.

You can view the  slides if you like, but they aren’t as good without all the talking.  Hopefully the LCA organisers will upload the video at some stage.