I’ve decided to write an editor. Silly idea I know – there are already two out there: both vim and emacs are quite good. I’ve heard rumors that there might be others, but investigation always shows they are just toys, not real editors. They don’t support reading email, unless that is all that they do.
And I know that I’m supposed to be designing a new language: ocean. And I will… maybe. But how can one write code for a new language without a new editor (though one could equally wonder what language a new editor might be coded in if not a new language…). Ocean will have to wait.
So why a new editor? Well I really love emacs. Really. But I also hate it. I’ve tried programming in emacs and I just can’t manage it. It isn’t just the LISP (though that doesn’t thrill me), it is the low-level hacking at the attributes in the buffer to create the image that I want to display. It feels like assembly programming. It is probably a personal weakness on my part – other people seem to manage. But isn’t it always out of personal weakness that great genius emerges? I’m sure it is.
Now that I have my general parsing worked out and understand how I want to use indents and line breaks to give a two dimensional structure to my language, I need to think about the details of some of the elements of the language. The part of a language that we use the most is the part which does stuff: statements and expressions. So that is where I will start, though particularly with statements.
In my earlier note about LR parsing I observed that many simple grammars will only ever have at most one REDUCE action in any given state. This means that there is no need for an “action table” to list which of several productions to reduce based on different look-ahead symbols. I even went so far as to say:
I personally cannot see why you would ever want a grammar which had two completed items in the same state. It means that some sequence of input tokens could be treated as one thing or as another depending only on what comes next. That sounds like a design mistake to me. Maybe I’ll eat my words later, but for now this means I cannot find a useful example – sorry.
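The situation described above is exactly a reduce/reduce conflict that look-ahead resolves. As a sketch, consider the hypothetical grammar S → A 'a' | B 'b'; A → 'x'; B → 'x'. After shifting 'x', the LR state holds two completed items, A → x· and B → x·, and only the next token decides which reduction applies. The tiny hand-written parser below (names and grammar are my own illustration, not from the original notes) makes that decision explicit:

```python
# Toy grammar (hypothetical, for illustration only):
#   S -> A 'a' | B 'b'
#   A -> 'x'
#   B -> 'x'
# After shifting 'x', the state contains two completed items:
#   A -> x .   and   B -> x .
# The look-ahead token alone determines which REDUCE to perform.

def parse(tokens):
    """Parse the two-sentence language {x a, x b} with one-token look-ahead."""
    it = iter(tokens)
    if next(it, None) != 'x':
        raise SyntaxError("expected 'x'")
    la = next(it, None)                    # look-ahead picks the reduction
    if la == 'a':
        return ('S', ('A', 'x'), 'a')      # reduce x -> A, then shift 'a'
    if la == 'b':
        return ('S', ('B', 'x'), 'b')      # reduce x -> B, then shift 'b'
    raise SyntaxError("expected 'a' or 'b' after 'x'")
```

Whether such a grammar is a design mistake or not, an LR generator handles it without complaint as long as the two reductions never share a look-ahead symbol.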
I have since found some examples which shed valuable light on this issue and give me a chance to see how my words taste.
In two previous articles I explored an approach to enhancing an LR parser to work with indents and line breaks. While I discovered some useful ideas and produced some code that seemed to work, I’ve subsequently discovered some serious flaws in the reasoning.
For indents, my reasoning about exactly when to REDUCE in the face of an OUT token was flawed and didn’t properly address all cases. I’ve made a few updates to that article to highlight this failing. For linebreaks, I only talked about when they should be ignored and didn’t cover the other important question of their role in terminating things. I hadn’t at the time seen how important that was.
So now I want to rectify these problems and present a more complete solution. As I have explored around these problems I’ve seen other smaller issues and made a number of changes to my approach. The big picture is still much the same but some of the details are different in important ways. I also think I understand it all much better and so will try to explain things more clearly.
My previous note introduced the problem of parsing a two-dimensional language and the need to pay attention to indenting and line breaks and use them to guide the parsing of the language, both to enhance error detection, and to resolve ambiguity. The key intuitive insight which guides this investigation is that indents imply continuation while line breaks terminate things – sometimes.
That first note only looked in detail at indents. It provided rules by which a reduction in indent level can force the end of a syntactic unit by triggering a REDUCE operation in the LR parser. This note completes the solution by exploring and describing how line breaks can be interpreted when parsing a sentence in a two-dimensional language.
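To make the indent side of this concrete, here is a minimal sketch of a tokenizer front-end that turns changes in leading whitespace into the IN and OUT tokens these notes talk about. The function name and token spellings are my own; the real implementation tracks more detail, but the stack-of-depths idea is the core:

```python
def indent_tokens(lines):
    """Yield (kind, text) tokens, inserting an IN token when a line is
    indented more deeply than the previous one and an OUT token for each
    indent level that closes.  A minimal sketch of the idea only."""
    stack = [0]                            # open indent depths, outermost first
    for line in lines:
        stripped = line.lstrip(' ')
        if not stripped.strip():
            continue                       # blank lines carry no indent info
        depth = len(line) - len(stripped)
        if depth > stack[-1]:              # deeper indent: one new IN
            stack.append(depth)
            yield ('IN', '')
        while depth < stack[-1]:           # shallower: close each open indent
            stack.pop()
            yield ('OUT', '')
        yield ('LINE', stripped.rstrip('\n'))
    while len(stack) > 1:                  # close indents still open at EOF
        stack.pop()
        yield ('OUT', '')
```

Each OUT token is what can force a REDUCE in the parser: the syntactic unit that started at the matching IN must be complete before anything outside it continues.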
Many programming languages are essentially one dimensional. The parser treats them simply as a linear sequence of tokens. A program could all be written on a single line, or with each token on a separate line, and the parser or compiler wouldn’t notice the difference. This set of languages includes Algol, Pascal, C, Rust and many others.
Some languages are 2-dimensional in a bad way. FORTRAN is probably the best example, though BASIC is similar. These (at least in their early forms) had strict requirements as to what can go on one line and what needs to go on a separate line.
A few languages are exploring the middle ground. Go treats newlines like semicolons in certain cases, which can give a 2-dimensional feel, but brings with it some rather ad-hoc rules. Python probably makes the best attempt at 2-dimensional parsing of the languages that I have looked at. It allows newlines to terminate statements and also uses indents to indicate the grouping of some language elements, particularly statement groups.
While I find a lot to like in Python, it seems imperfect. I particularly dislike using a backslash to indicate line continuation. You can avoid this in Python by putting brackets around things as a newline inside brackets is ignored. But this feels like a weakness to me.
As I wrote in a recent article for lwn.net:
The recognition of a line-break as being distinct from other kinds of white space seems to be a clear recognition that the two dimensional appearance of the code has relevance for parsing it. It is therefore a little surprising that we don’t see the line indent playing a bigger role in interpretation of code.
This note is the first part of a report on my attempt to translate my intuition about parsing the two dimensional layout into some clear rules and concrete code. This note deals with indents. A subsequent note will look at non-indenting line breaks.
About 11 years ago I started writing “wiggle”. I have finally released version 1.0.
Wiggle is a tool for applying patches with conflicts. I do quite a bit of applying patches to a release different to the one they were created for. This often works, but it can also fail so completely that the task must be done entirely by hand.
It is time for another diversion, but it is leading towards the language design – honest.
LR grammars are a tool for specifying the grammar for a language, and it is fairly easy to automatically generate a parsing tool from a grammar. So they have often been used for this purpose.
There seems to be some suggestion that LR is no longer the tool of choice (see wikipedia), apparently because it is hard to do good error reporting. The gcc compilers converted from an LR grammar to a recursive descent LL grammar, apparently for this reason.
However I like LR grammars and I don’t see the problem. So I plan to start out with an LR grammar approach. Either it will work well, or I will start to see the problem, and either outcome is positive in my view. So I cannot lose.
It is traditional to divide the parsing of a program into two conceptually separate stages – the lexical analysis which extracts a stream of tokens from a stream of characters, and the grammar matching stage which gathers those tokens into grammatical units such as declarations, statements, and expressions. Recognising the value of tradition, Ocean will similarly have separate lexical and grammatical stages. This note will focus on the lexical stage.
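The character-stream-to-token-stream step can be sketched in a few lines. The token classes below (numbers, identifiers, a handful of operators) are my own placeholder assumptions, not Ocean’s actual lexical structure, which later notes pin down; the point is only the shape of the stage:

```python
import re

# A placeholder token set for illustration; Ocean's real token classes
# will differ once the language details are settled.
TOKEN_RE = re.compile(r"""
    (?P<NUMBER> \d+)
  | (?P<IDENT>  [A-Za-z_]\w*)
  | (?P<OP>     [+\-*/=<>():])
  | (?P<SPACE>  \s+)
""", re.VERBOSE)

def tokens(text):
    """Turn a stream of characters into a stream of (kind, text) tokens."""
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r}")
        pos = m.end()
        if m.lastgroup != 'SPACE':         # the grammar stage never sees spaces
            yield (m.lastgroup, m.group())
```

The grammar stage then consumes only the (kind, text) pairs, which is what makes the two stages separable in the first place.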
Future language design decisions will refine many details of the lexical structure of the language. However, there is a lot of commonality among current languages, and we can build on that commonality to establish a base approach to lexical analysis which can then be fine tuned when the details of the language are known.
I was going to start out by describing the lexical structure of Ocean, but as I thought more about it, I realised there was something that needed to come first. And that something is literate programming.
Literate programming is the idea – popularised by Donald Knuth – of writing programs like pieces of literature. They can be structured as a story or an exposition, and presented in a way that makes sense to the human rather than just to a computer.