edlib – render-lines, what a good idea.

I had a brain-wave last week.  I thought it would take me about a day to implement, but it took three to get it working reasonably well and the rest of the week to polish it and create some proper use-cases to exercise it.  But that is done now and I am quite happy with the result.  It relates to the rendering side of edlib.

Commonality among renderers

My original idea for the rendering side of an editor was that there would be a number of quite different renderers, each taking a document and a pane and drawing the contents of the one onto the other.  I even implemented several of these: render-text to handle a standard text document with line-wrap; render-dir to render either a directory or a document-list in an “ls -l” format – with the format being configurable based on attributes; and render-hex, which takes the same text document but displays the contents as hex, 16 bytes per line.

Having done this I noticed that there was quite a lot of commonality.  Details were different, but each renderer needed to keep track of which part of the document was currently displayed, and to be able to move that part around whenever the ‘point’ moved outside the displayed region.  It also needed to handle movement commands which are display-based, like ‘end of line’ or ‘down one page’.  All this duplication was rather boring and I didn’t relish writing it for the next renderer I needed, which would help with file-name completion.

Rendering Components: line at a time.

My brain wave was that I could avoid this duplication by abstracting out the rendering of “lines”.  Every display I was working with was essentially line oriented.  If I could get the underlying document to provide lines: lines of text, lines of hex characters, lines of directory contents, then a single renderer could draw them all, wrapping or truncating long lines, scrolling up or down to display the point, counting lines that fit so that ‘page down’ could mean something.

One stumbling block that probably consumed half the time is that lines don’t always exist in a “text” file.  For a “hex” file, a line is exactly 16 bytes; for a directory listing it is a single entry; but for a text file it is from one newline character to the next.  If a file has a million characters and no newlines, then it would appear to be just one line, and handling that line entirely in the renderer would be clumsy.  I first explored having the renderer tell the document “it is OK to give me just some of the line if it is long”, but the indeterminism there caused lots of problems.

So my current solution is to leave it to the document to manage.  When asked for a line it must always provide a line, but the definition of a “line” is somewhat flexible as long as it is well defined and stable.  For example the document might decide that a line cannot contain two characters that both sit on a 1KB boundary.  If that threatens, the second such character becomes the start of a new line.  This might lead to some strange rendering, particularly if line-wrapping is disabled, but it should at least be repeatable.
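To make that concrete, here is a minimal sketch of such a splitting rule.  This is not edlib code – the function and the exact policy are just an illustration – but it shows how a stable, repeatable line boundary can be computed even when a file contains no newlines:

#include <stdio.h>
#include <string.h>

/* Return the end of the "line" starting at pos: the next newline, or the
 * next 1KB boundary, so no line contains two boundary characters. */
static size_t line_end(const char *buf, size_t len, size_t pos)
{
        size_t next_boundary = (pos / 1024 + 1) * 1024;
        size_t i;

        for (i = pos; i < len; i++) {
                if (buf[i] == '\n')
                        return i + 1;   /* include the newline */
                if (i == next_boundary)
                        return i;       /* boundary char starts the next line */
        }
        return len;
}

int main(void)
{
        char buf[3000];
        size_t pos = 0;

        memset(buf, 'x', sizeof(buf));  /* 3000 characters, no newlines */
        while (pos < sizeof(buf)) {
                size_t end = line_end(buf, sizeof(buf), pos);
                printf("line from %zu to %zu\n", pos, end);
                pos = end;
        }
        return 0;
}

However the document chooses to split, the important property is that asking twice for the line at the same position gives the same answer.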

Rendering Components: stacking

The way I implemented the “give me a line” functionality was to send a command up the pane stack from the display pane until it reached a pane with a document.  The document would return a line.  It quickly became clear that I could insert a different pane in the stack which did something different.

So if a text document is asked for a line, it returns something between newline characters.  However if a “hex” pane is asked for a line, it asks the document for characters and returns 16 of them formatted as hex.  More complex pipelines should be possible.
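For example, the hex pane’s contribution might amount to little more than this.  It is a standalone sketch rather than edlib’s actual code:

#include <stdio.h>

/* Format up to 16 bytes as one hex "line", prefixed with an address. */
static void hex_line(const unsigned char *bytes, int n, unsigned long addr,
                     char *out, int outlen)
{
        int used = snprintf(out, outlen, "%08lx:", addr);
        int i;

        for (i = 0; i < n && used < outlen; i++)
                used += snprintf(out + used, outlen - used, " %02x", bytes[i]);
}

int main(void)
{
        unsigned char data[16] = "Hello, edlib!!!";
        char line[80];

        hex_line(data, 16, 0, line, sizeof(line));
        printf("%s\n", line);
        return 0;
}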

My plan for filename completion is to have a directory document with a “format attributes” pane which just presents the “name” attribute in each line.  Then on top of that I stack a “completion” pane which knows what the current prefix is and hides any lines that don’t start with it.  It also allows easy movement and selection of an entry.  On top of this pane is the render-lines pane which actually draws those lines (the ones which match the prefix) onto the pane (probably in a drop-down menu).

Having this sort of flexibility makes the task of writing a new renderer much less daunting so I’ll probably do more of it.  I suspect I’ll eventually need something other than “lines”.  I have in mind a “Column” renderer which expects short lines and stacks them into columns of a fixed height, and then scrolls left or right if there are too many lines for all the columns.  This is how some UIs allow selection from a large number of files.  And of course some displays might not be line-based at all – maybe a renderer that interprets SVG and draws pictures?

Details of render-lines

There is a little more sophistication needed in the document or underlying pane than just providing lines of text.  It also needs to be able to find the start of lines, and to map a cursor position to a position in the text.  It would also be nice to support attributes: bold, underline, text color etc.

So there are actually two commands that can be sent.  “render-line-prev” is given a mark and must move it to the beginning of the line, or the beginning of the previous line, depending on the repeat-count that is part of the command.  “render-line” has three modes.  In the standard mode (no repeat count given) it provides a text version of the line starting at the given mark, and moves the mark to the start of the next line.  In the ‘find-point’ mode (negative repeat count) it only renders until the mark reaches point, and then returns a partial line.  That allows render-lines to work out where in the pane the cursor should be.  Finally, a non-negative repeat count means to render until that many characters have been produced in the line, and then stop, returning with the mark moved only far enough to produce those characters.  That allows render-lines to convert a pane position into a document position, so that when you click somewhere the point moves accordingly.
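The three modes can be pictured with a small simulation.  This is not edlib code – in edlib the text comes from the document and the stopping points are marks – but it shows the three stopping rules:

#include <stdio.h>

#define NO_NUMERIC (-100)       /* hypothetical "no repeat count" marker */

static void render_line(const char *line, int numeric, int point)
{
        int i;

        if (numeric == NO_NUMERIC) {
                /* standard mode: produce the whole line */
                printf("%s\n", line);
        } else if (numeric < 0) {
                /* find-point mode: stop when the mark reaches point */
                for (i = 0; line[i] && i < point; i++)
                        putchar(line[i]);
                printf("|\n");          /* '|' marks where rendering stopped */
        } else {
                /* stop once "numeric" characters have been produced */
                for (i = 0; line[i] && i < numeric; i++)
                        putchar(line[i]);
                putchar('\n');
        }
}

int main(void)
{
        render_line("hello world", NO_NUMERIC, 0);      /* whole line */
        render_line("hello world", -1, 5);      /* partial line up to point */
        render_line("hello world", 3, 0);       /* first 3 characters only */
        return 0;
}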

To support attributes, render-lines allows the reported line to contain attributes enclosed in angle brackets: <bold,fg:red>sometext</> will display sometext in bold red characters.  This is probably a preliminary design and could well change.  To include a literal “<” in the displayed text, double it: “<<”.

There is room for more information flow from the text generator to the line renderer.  In hex mode it is nice to have the address at the start of the line, and it would be good if that didn’t shift to the left when a narrow pane forces the content to be scrolled sideways.  It would also be nice to allow a line to be rendered differently when ‘point’ is on it: currently lines are cached and not re-rendered until the document changes.  Treating the point-line specially would allow extra highlighting of the cursor, and a “reveal-codes” like approach to WYSIWYG editing where you only see the markup codes when the cursor is on them.

So there is a little way to go still before this brain-wave is fully realized, but I’m really quite happy with how it has progressed so far – if only it hadn’t taken so long.

Posted in Uncategorized

edlib – commands

It has been over 3 months – maybe you thought I had given up – but no.  I had some holidays and other distractions, but I’m back and will be working fairly consistently for a while.

The topic for today is “Commands”.  Commands are a standard interface for code in one module of the editor to call code in some other module.  Various objects in the editor have commands attached to them to perform various tasks specific to that object.  To some extent you can think of commands like object methods in an object-oriented system, but that is only part of the story.

In many cases, commands are associated with names in a key-map.  The key maps were hinted at in my previous note about managing input, but they are broader than that and details have changed since I last wrote.  So maybe that is a good place to start.

Key Maps

A key map maps arbitrary strings to commands.  Key maps are currently implemented as sorted arrays, allowing easy binary searching.
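As a sketch of the idea – not edlib’s actual code – a key map could be little more than a sorted array of string/command pairs searched with the standard bsearch():

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef int (*command_fn)(void);        /* stand-in for the real command type */

struct binding {
        const char *key;
        command_fn cmd;
};

static int do_insert(void) { printf("insert\n"); return 1; }
static int do_backspace(void) { printf("backspace\n"); return 1; }

/* must be kept sorted by key for bsearch() to work */
static struct binding map[] = {
        { "Chr-g",       do_insert },
        { "S-Backspace", do_backspace },
};

static int cmp(const void *key, const void *entry)
{
        return strcmp(key, ((const struct binding *)entry)->key);
}

static command_fn lookup(const char *key)
{
        struct binding *b = bsearch(key, map, sizeof(map) / sizeof(map[0]),
                                    sizeof(map[0]), cmp);
        return b ? b->cmd : NULL;
}

int main(void)
{
        command_fn c = lookup("Chr-g");

        if (c)
                c();
        return 0;
}

The key strings, such as “Chr-g”, are explained next.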

All input keystrokes and mouse actions are translated to strings, such as “Chr-g” for the character “g”, “S-Backspace” for the backspace key being pressed while Shift is down, or “Click-1” for a click on mouse-button-1.  The commands associated with some of these keys perform the action directly; others resubmit the input as an action, so “C-Chr-F” (control-F) might cause “Move-Char” to be submitted.  A different module will pick this up and handle it.

Key maps are associated with panes, so when an event happens, each pane from the leaf to the root is tried in order until one has a map which supports the event.  Documents can also have key maps.  Some panes are directly connected to a document and when that pane is being processed, the keymap for the document will also be tried.

This last possibility, keymaps on documents, is a fairly recent invention so I have not explored all the implications of it yet.  I might end up finding lots of uses for it, so that most actions that a document can perform are performed through commands attached to a key map.

Commands more generally

Currently there are a few places where commands are attached directly to objects rather than through a keymap, though this might change.

Specifically each pane has a separate command which is called for specific actions on that pane.  It is called when the pane is “Damaged” and needs to be redrawn.  It is called when the pane is being destroyed. And it is called if the pane needs to be cloned (to create a new pane viewing the same point in the same document).  Having one command for all of this is quite possible, but a little ugly.  So this might be combined into the per-pane keymap.  Or maybe a pane will end up with a list of keymaps.

Each document has a list of “views”.  These are often panes but can be anything else.  The word-count module creates a view on any document it wants to count.

When any change happens in the document, each view is told about the change.  A pane can note that it is damaged so a refresh will happen.  The word-count module can clear its cached counts so they are recalculated when needed.  This telling about changes also happens via a command.  It is less obvious that a keymap should be provided here rather than a single command, but further experience may suggest otherwise.

Ins and Outs of a Command

So I’ve talked about how a command is used, but what is it?  It is a function pointer with a specific interface.  It is passed a “struct cmd_info” and returns an integer.  The integer can mean slightly different things in different cases. For event handling commands, if the integer is zero, then the command didn’t handle the event and something else should be tried.  For “refresh” commands, non-zero means that children need to be refreshed even if they weren’t already damaged.

The “struct cmd_info” contains various details which can have slightly different meanings in different contexts, though much of it is fairly fixed.

  • key : the name of the event or action.  This allows a single command to handle multiple actions.
  • two panes: home and focus.  The “focus” pane is where the event should happen.  For a keystroke, it is the leaf-most pane that contains the cursor.  For a “Clone” command, it is the parent to attach a clone to.
    The “home” pane is the pane where the command was found.  Each pane contains module-specific data, and the commands associated with a pane may want to access that data.  It may not be in the “focus” pane, but it will be in the “home” pane.
  • two numbers: “numeric” and “extra”.  “numeric” is a count that suggests how many times the command should be repeated.  “extra” is arbitrary extra information that can be used as needed.  The “refresh” commands use it to communicate what sort of damage happened.  Some event commands use it to connect consecutive “change” commands that can then be grouped into a single “undo”.
  • two co-ordinates: x and y.  These are used for mouse events and any event that needs a position on the screen. They are relative to the “focus” pane.
  • An arbitrary string “str”.  This is a bit like “extra”, but for text.  The “Replace” function expects the string value that is to be stored into the document.
  • Finally a mark and a point.  The point is usually the location of the cursor in the event window.  The “mark” is somewhere else, if it is used.  It must be in the same document as “point”.  Together they can identify a range in the document for some action to happen over.  Once I work out how select and copy/paste work, the mark might get a more defined meaning.  A sketch of how all these fields might fit together follows this list.
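Based on the description above, the structure might look something like the following.  This is only a sketch to make the list concrete; the real definition in the edlib source will differ in detail:

struct pane;            /* incomplete types are enough for a sketch */
struct mark;
struct point;

struct cmd_info {
        const char *key;        /* name of the event or action */
        struct pane *home;      /* pane where the command was found */
        struct pane *focus;     /* pane where the event should happen */
        int numeric;            /* repeat count */
        int extra;              /* arbitrary extra information */
        int x, y;               /* co-ordinates, relative to "focus" */
        const char *str;        /* arbitrary string, e.g. replacement text */
        struct mark *mark;      /* secondary location, if used */
        struct point *point;    /* usually the cursor location */
};

/* a command is a function pointer with this interface */
typedef int (*command)(struct cmd_info *ci);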

Some of these fields may be used as outputs as well as inputs.  When a “refresh” function reports that children should be treated as damaged, “extra” reports the specific damage bits.  I suspect that a “choose other pane” function (used for C-x-4-?? emacs functions) will probably return the chosen pane in “focus”.

As you can see, this is a fairly restricted set of values, but it has some flexibility.  I really hope that I won’t find the need to extend it too much.

One way it can be extended is to attach attributes to the mark or point.  This is how the word-count module works.  It doesn’t exactly work as a command yet, but that is the plan.  It can be passed a “mark”, and it determines word and line counts and stores them in attributes attached to the mark. They can then be examined by the caller.


I suspect there will turn out to be more opportunities to use commands.  One obvious place is to handle user-typed commands, with “M-x” in emacs or “:” in vi.  The arguments to the command could be passed via the “str”.

One place where I will need some sort of call-out functionality is for customizing renderers.  I want to encourage purpose-built renderers for specific needs, but there is a place for a fairly generic renderer that is easily customizable for simple needs.  I imagine that if a location in a document has a “render” attribute attached, then that might lead to a command which takes on some role in part of the rendering.  So far this is just a vague idea, no details.

It seems likely that there will be real changes to the nature of a command before I am done, because it is still early days.  But I’m quite sure that the idea of a “command” as a ubiquitous tool for passing control between modules is a good one.

The Code

I’ve decided that the code has advanced enough that I have published it – partly to ensure an off-site backup.  git://neil.brown.name/edlib holds the code and http://git.neil.brown.name/?p=edlib.git lets you browse it.


Posted in edlib

edlib – managing input

Commands to an editor – at least the sort of editor that edlib is designed for – usually involve single keystrokes or key combinations.  Translating those keystrokes, as well as mouse actions, into commands is the topic of this note.  Many editors also support commands that are written out with words.  Such commands are often introduced with Meta-X in emacs or “colon” in vi.  For the purposes of edlib, such interactions can be managed by creating a simple document in a small window somewhere and accepting key-stroke commands to edit that document.  Some keystrokes will “complete” the command which will make the window disappear and will perform the required action.  So word-based commands are certainly possible, but they happen at a different level.

The complex part of managing keystrokes is in tracking how they can mean different things in different contexts.  There are a number of different sorts of context that all need to be managed.

  • In today’s world it makes sense to support both “vi” and “emacs” styles of key bindings.  These are global modes which potentially affect everything – at least everything on the one display.  They determine not only how keys are interpreted but also what subsidiary modes are available.  These modes stay in effect until they are explicitly changed to something else.
  • Alongside these there are long-lasting subsidiary modes.  “vi” has insert mode, and vim adds visual mode to that.  emacs has a variety of major modes, some of which only redefine a few keys, others redefine almost all of them.  These modes also stay in effect until canceled or explicitly changed, but they can be seen as subordinate to the global modes, not just as other global modes.
  • Then there are short-term modes which only last for one or two more keystrokes before the prior major mode is reverted to.  The vi commands “c” and “d” (change and delete) introduce such temporary modes.  In these modes a movement command is expected, and the text which would have been passed over by that movement is deleted.  In emacs, C-x (control-X) enters a temporary mode where most keystrokes have an alternate meaning.
  • Numeric prefixes can be seen as introducing new modes too.  Typing ’2′ in vi will switch to a mode where the next command will be executed twice where that is meaningful.

Search commands fit into the category of “word-based” commands though they might seem quite different.  A keystroke that initiates a search, such as ‘/’ in vi or ‘control-S’ in emacs, will create or enable a small window somewhere that becomes the focus for keystrokes.  As the search string is built up, incremental searching or possible-match highlighting might update the display of the target document.  When “enter” is pressed the window will close and the originating window will receive some synthetic event describing the search.  In vi you can type “d/string” and everything up to the start of “string” will be deleted.  So exactly what happens with that synthetic event from the search box will depend on the state of the originating window.

Finding an abstraction that encompasses all of the above (and probably more) requires finding effective answers to a few questions:

  • It is clear that some state needs to be stored to guide how each keystroke is handled.  How much state exactly?
  • State can be stored explicitly, such as a number to represent the repeat-count, or implicitly by selecting a different key-to-command mapping. To what extent should each of these be used?
  • How is state changed?  Should all changes be explicitly requested by a command, or should some be automatic, such as a prefix being cleared once it has been used?
  • Assuming that key-to-command mappings are used, should there be separate mappings for different states, or should the mappings be from state-plus-key to command, with some part of the state included in each lookup?
  • If there are multiple windows on multiple documents, all in the one display, then some elements of state will affect all keystrokes, and some will only affect key strokes in a particular window.  How is this distinction managed?

The choices that edlib makes are:

  • There are four elements of state: a ‘current’ mode, a ‘next’ mode, a numeric prefix, and 32 bits available for extra information.  The ‘mode’ is a small number which can represent both long-term and short-term modes.  Some modes might be vi-command, vi-insert, vi-change/delete, emacs-normal, emacs-C-x, emacs-escape.  To enter a long-term mode, both ‘current’ and ‘next’ mode are set.  To enter a transient mode, only ‘current’ mode need be set.
  • Each pane (which can be nested arbitrarily deeply) can identify a single mode-plus-key-to-command mapping.  When a keystroke is received it is combined with the current mode, and this serves as a lookup key for those mappings.  Lookup starts in the current focus window, which is often a leaf, and continues toward the root until a command is found for the mode+key and the command accepts it (a command can return a status to say “keep looking”).
  • The current mode, numeric prefix, and extra information are provided to the command, and those values are reset (current mode being set to ‘next’ mode) before running the command.  The command can then make arbitrary changes, such as restoring any of the state, possibly modified.  The sketch after this list illustrates this reset.
  • Commands can synthesize new requests which can then be re-submitted at the current focus pane.  These requests do not affect the current state, though the caller can modify the state before and after initiating the request.  This allows responsibility for some commands to be shared.  For example the keystroke to “move forward one word” might be translated into a virtual keystroke which always means “move forward one word”.  Different panes, which might contain different sorts of document, might then interpret “word” differently.  The emacs command alt-D (delete word forward) might record the current location, submit that “move forward one word” virtual keystroke, and then submit a “replace” virtual keystroke which tells the underlying document to replace everything from one mark to another with the empty string.
    When a search command, as discussed earlier, opens a search window, it might store the context that came with the search command, particularly the current mode.  When the search completes and a virtual ‘search’ keystroke is synthesized, it can include the saved search context so that, for example, a vi-like editor knows whether to move to, delete to, or change to the search destination.
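A sketch of that reset, with names that are mine rather than edlib’s:

#include <stdio.h>

#define NO_NUMERIC (-1)

struct input_state {
        int cur_mode;           /* mode used for the next keystroke */
        int next_mode;          /* mode reverted to after it */
        int numeric;            /* repeat count, NO_NUMERIC if none */
        unsigned int extra;     /* 32 bits of extra information */
};

static void handle_key(struct input_state *st, const char *key)
{
        int mode = st->cur_mode;
        int num = st->numeric;

        /* reset before running the command; the command may set new state */
        st->cur_mode = st->next_mode;
        st->numeric = NO_NUMERIC;
        st->extra = 0;

        printf("lookup mode=%d key=%s numeric=%d\n", mode, key, num);
        /* ... search each pane's mapping for (mode, key) and run it ... */
}

int main(void)
{
        struct input_state st = { 0, 0, NO_NUMERIC, 0 };

        st.numeric = 2;                 /* e.g. after typing '2' in vi */
        handle_key(&st, "Chr-w");       /* this command sees numeric == 2 */
        handle_key(&st, "Chr-w");       /* numeric has already been reset */
        return 0;
}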

Mouse events are handled much the same as keystroke events, except that they are directed to the leaf-most pane which covers the location of the mouse event.  State is passed in and changed in much the same way as for keystrokes.

Key-up events are not tracked or supported.  Mouse-button-up events will be, but the details are not yet clear.  Some sort of ‘grab’ will be required so that the mouse-up goes to the same pane as the mouse-down.  When an event is synthesized, it can be directed to a specific location so that, for example, drag-and-drop functionality could be implemented by the mouse-up handler sending a ‘drop’ virtual keystroke to the appropriate co-ordinates.

So far, this model seems sufficient for all of my needs.  If it turns out to be insufficient in some way, it will doubtlessly be revised.


Posted in edlib

edlib – displays and panes

Displaying the document being edited is, of course, very important for any editor.

I have memories of the first editor I used on Unix, which was em, rather than the well known “ed”.  It actually allowed a single line to be edited “directly”, rather than by using pattern substitution (s/pattern/replace/) commands or rewriting whole lines.  It also had a ‘v’ command to view lines of context around the current line.  That ability to see what was happening helped a lot.

We’ve come a long way of course and today we expect to be able to easily see and move around the context of the place we are editing … though that “context” is usually the very simple “nearby lines of text”.

edlib needs to make it easy to provide a display of context, without mandating what that context might look like. To achieve this it provides “displays” and “panes”.  These are used for directing input from the user to the documents as well as for displaying content from the document to the user, but for now just the latter will be discussed.

A “display” is simply a rectangle that the user can see which can contain text and probably other things.  So a text window provided by the likes of “xterm”, a full graphic window in X11 or wayland, or maybe even a canvas in an HTML browser could serve as a display.  A particular instance of an edlib editor can potentially have several displays of different sorts, each showing the same documents or different documents.

Each display provides a top-level “pane” which can have arbitrary children and more remote descendants.  Drawing performed in any pane is clipped to that pane and its ancestors.  Panes can be given a ‘z depth’, in which case they obscure any parts of other panes which have a lower ‘z’ and which overlap in x,y space.  This allows panes to provide pop-up dialog boxes, drop-down menus, or various other facilities.

A pane has no appearance by itself.  Something must render into it, whether that is to draw borders, a background, or content.  A typical arrangement might be that children of the root form tiles.  Each tile draws enough border to separate it from others (maybe just left-and-bottom) and then displays some document in the remainder of the pane.

A pane used to display a document might contain further child panes, maybe one per line of text, maybe two for parallel renderings of the same content in different styles.  The choice is up to the rendering code.

Each pane has a set of ‘damage’ flags which record if the pane might need to be redrawn at all.  It is expected that refresh will involve redrawing whole panes.  Which nesting level of pane needs to be redrawn will depend on the level of damage.  If just one pane is damaged and it can be redrawn without changing its size, then maybe only that pane needs to change.  Damage is normally detected when a document signals a viewer that the document has changed, and the viewer determines that this could affect some particular panes.
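The damage flags might be a simple bitmask, perhaps along these lines.  The names here are hypothetical; the real set in edlib may differ:

enum pane_damage {
        DAMAGED_CURSOR  = 1,    /* only the cursor needs redrawing */
        DAMAGED_CONTENT = 2,    /* content changed: redraw this pane */
        DAMAGED_SIZE    = 4,    /* size changed: re-layout and redraw,
                                 * and children are probably affected too */
};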

Implementation status

As yet the only display driver that I have implemented is an ncurses-based driver to run in a terminal window.  A tiling pane manager is available to divide a pane up horizontally or vertically, and a document can be viewed within any pane, complete with a (very simple) scroll bar.

Working with a pixel display will clearly provide challenges that are not met when working on a simple character-based display.  I suspect that I will make it possible to treat a pane in a pixel display just like a character-based display, and will allow panes displaying different parts of a document to use different font sizes while still being simply character based.  This could make it fairly easy to use large fonts for headings in mark-down, or smaller fonts for text that is further away from the cursor and so probably less interesting.  Lots of experimentation is needed.  Support for images and general curves is important too.

I imagine that it will be useful to be able to display HTML and I don’t want to write an HTML renderer.  The idea of allowing webkit to display in one edlib pane is very appealing.  It probably wouldn’t be very useful for editing a document, but for simple display (an important role of an editor) it could provide just what is needed.

Posted in edlib

edlib – because one more editor is never enough.

I’ve decided to write an editor.  Silly idea I know – there are already two out there: both vim and emacs are quite good.  I’ve heard rumors that there might be others, but investigation always shows they are just toys, not real editors. They don’t support reading email, unless that is all that they do.

And I know that I’m supposed to be designing a new language: ocean.  And I will… maybe.  But how can one write code for a new language without a new editor (though one could equally wonder what language a new editor might be coded in if not a new language…).  Ocean will have to wait.

So why a new editor?  Well I really love emacs.  Really.  But I also hate it.  I’ve tried programming in emacs and I just can’t manage it.  It isn’t just the LISP (though that doesn’t thrill me), it is the  low-level hacking at the attributes in the buffer to create the image that I want to display.  It feels like assembly programming.  It is probably a personal weakness on my part – other people seem to manage.  But isn’t it always out of personal weakness that great genius emerges?  I’m sure it is.

I’ve tried writing an editor before, as recently as last year and as long ago as my honours year at University (1986).  Each time I failed to produce anything even vaguely useful.  A big part of my problem is that I tend to let the perfect become the enemy of the good.  Whether that is the real reason I’m not sure, but I know the tendency is real, and I know that I consciously tried to fight it this time.  And I’ve made a lot more progress in the last few weeks than I ever did before.  So maybe I’ve overcome my personality flaw, or maybe I just stumbled onto a good idea.  Which part of the design that good idea might be I’m not sure.  So I’ll have to just write about all of it.

Let’s start with the name: edlib.  This very deliberately pays homage to emacs.  emacs is short for “editing macros” or something similar.  It describes the technology used to build the editor.  “edlib” is short for “editing library” – again it is the technology.  edlib will (hopefully) become a library of sensible interfaces and useful implementations which can be combined to build an editor.  Any editor.  But “edlib” is also the name for a specific editor that I will build with this library.  The interfaces are important.  Creating clean abstractions that are combined through good interfaces is what will make (or break) edlib.

The general structure of the library follows the Model-View-Controller (MVC) pattern.  This highlights what I really don’t like about emacs: the model is not well separated from the view.  In edlib it will be.  Today I will write mostly about the model.

Documents: the model of editable data.

Anything that is edited or viewed in edlib is stored as a ‘document’.  A document combines a number of ideas:

  • A list of characters.  These are Unicode characters, probably encoded in UTF-8, but each document can make independent implementation decisions there.
  • Attributes on the characters.  These are name=value pairs.  Attributes apply to individual characters, not ranges of characters.  The intention is that they record parse summaries or state information.  Code that interprets attributes may have to look around to find them and may have to interpret them in non-trivial ways.  For example apparent spelling mistakes might have an attribute attached to the first character in the word, listing some possible corrections.  A matching “view” module would notice that and highlight the whole word.  A related controller would provide a way to select from the possible corrections.
  • Marks identify locations in the document.  Each mark is between two characters, or at the start or end of the document.  Marks, like characters, can have attributes.  Marks are more heavy-weight than just attaching attributes, but marks are easy to find: they are linked together and can be stored in separate data structures.  Marks can identify temporary locations for communicating ranges between different modules.  They can identify the parts of a document being displayed in a particular view.  Or they can identify anything else.
  • Undo/redo support would be integrated into the document.  Each document keeps a list of recent changes, and can revert or re-impose them.  All changes are described as a “replace”.  Two marks (actually a point and a mark) identify a range to be deleted, and a string provides replacement.  Either the range or the string can be empty of course.
  • A “view” is a particular collection of marks together with a special mark called a “point”.  The marks in a view are in their own list (and also in a global list) so they can be found from each other easily.  When a change happens in the document, each view is notified and the preceding mark is mentioned in that notification.  This allows each view to easily determine if anything important has happened.
    To enable this, every change must happen at a “point”.  A point is a special mark, of which there is one per view, and which is linked into the point lists for all views.  A view which does not modify the document might not have a point.  Any view which can modify the document must have exactly one point.
    An obvious use of a view is to connect a document to a display window.  The marks might be the start and end of the displayed region, or they might be the start of each line that is displayed.  But there are other uses for views.
    A “word count” view might keep marks every 100 lines in the file, and record at each mark the counts of lines, words, and characters from there to the next mark.  Whenever there is a change, the counts at just one mark can be quickly updated.  Whenever the totals are needed, a sum over the relevant marks will accelerate the count.  (A sketch of this idea follows the list.)
    Another view might perform a spell-check on any changed text, and might set attributes to record the result.  Yet another view could parse some code and keep a symbol table reasonably up-to-date.
    Ultimately these different views could run in separate threads, particularly if part of the task can be time consuming.
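As promised, here is a sketch of the word-count idea.  The marks are reduced to a flat array of segments and the rescanning is faked, but it shows the important property: an edit invalidates a single segment, and the totals are a cheap sum:

#include <stdio.h>

#define SEGMENTS 4

struct segment {
        int lines, words;
        int valid;
};

static struct segment seg[SEGMENTS];

static void recount(int i)
{
        /* ... in reality, rescan the document between mark i and i+1 ... */
        seg[i].lines = 100;     /* stand-in values */
        seg[i].words = 800;
        seg[i].valid = 1;
}

static void on_change(int i)    /* called when segment i is edited */
{
        seg[i].valid = 0;
}

static void totals(int *lines, int *words)
{
        int i;

        *lines = *words = 0;
        for (i = 0; i < SEGMENTS; i++) {
                if (!seg[i].valid)
                        recount(i);     /* only stale segments are rescanned */
                *lines += seg[i].lines;
                *words += seg[i].words;
        }
}

int main(void)
{
        int l, w;

        totals(&l, &w);         /* first call counts everything */
        on_change(2);           /* an edit in the third segment */
        totals(&l, &w);         /* rescans only segment 2 */
        printf("%d lines, %d words\n", l, w);
        return 0;
}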

Together these ideas seem to form a powerful abstraction.  It is fairly obvious how this would apply to a text document read from a file, and that would be the most common document type.  Other possibilities include:

  • A file could be memory-mapped, and might only allow replacement where the deleted region matches the size of the added text.  This would allow even a block device to be mapped so that the editor could view raw data on a disk drive, and maybe even edit metadata that it has been taught the format of.  A hex-edit view would be particularly appropriate here.
  • Reading a directory into a document could involve creating a single character for each directory entry, and then attaching the name, owner, modes, type, date etc as attributes of that character.  This directory could then be displayed in various ways to allow selection of a file name to load, or to allow more general directory editing.
  • A “virtual” document might itself provide a view on one or more other documents.  This could be used to view a collection of files as just one, or could present a “diff” view of two documents, with common sections only displayed once between differing sections.
    One awkwardness with the described directory document is that while it is easy for different displays to select different attributes, it is not clear how a different sort-order could be achieved.  A “virtual” document on a directory could impose a separate sort order, using a list of marks with one mark for every directory entry.
  • The editor would almost certainly contain a list of active documents.  This list itself could be stored as a document, as could other internal data structures which could usefully be viewed.

There are of course lots of other possibilities, but just those are enough to allow a very versatile editor to be built.  A key property of these documents is that they are linear lists of characters.  They have a first and a last and probably some in between.  Marks are very linear too, being linked together in text-order.  This does limit the range of documents to some extent.  For example an image doesn’t really fit this model (though a video might).  So I don’t really expect an image editor to ever be built with edlib. A mail reader, though, is an absolute must.

My implementation so far has just the one document type: a text read from a file supporting arbitrary edits and indefinite undo.  Just recently I completed the clean separation of that from the rest of the code so that a second document type (probably directories) can now be implemented.

And no: the code isn’t available yet.  It will be, but not until I want other people to look at it.

Posted in edlib

Structured statements and expressions in Ocean

Now that I have my general parsing worked out and understand how I want to use indents and line breaks to give a two dimensional structure to my language, I need to think about the details of some of the elements of the language.  The part of a language that we use the most is the part which does stuff: statements and expressions.  So that is where I will start, though particularly with statements.

As has previously been hinted, blocks of statements will have  two different syntaxes that the programmer can choose between, one Python-like and one C-like.

Block -> : StatementList
       | { StatementList }

This is a simplified version of the grammar, but the idea is that a StatementList can follow a colon, or be enclosed in braces.  In the first case the StatementList must be terminated by a decrease in indent.  i.e. all of the StatementList must be on the same line as the colon, or indented with respect to the line containing the colon.  When we find a line with the same indent as the line with the colon, we know the StatementList is finished.
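To illustrate – with syntax that is only indicative, as the details of Ocean may well change – here is the same block written both ways:

if a > b:
    t = a
    a = b
    b = t

if a > b {
    t = a
    a = b
    b = t
}

In the first form the block ends when the indent returns to the level of the “if”; in the second it ends at the closing brace.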

In C, any expression can be used as a statement simply by appending a semicolon, and various things that you might think of as statements – such as assignment – are syntactically expressions.  In GNU C (i.e. C as implemented by GCC) you can even convert arbitrary statements to expressions by enclosing them in “({” and “})”.  This synergy between statements and expressions can be very powerful, but can also become ugly and confusing.  It works well for small statements placed where expressions are expected, but quickly breaks down as statements grow.

“Rust” takes this synergy to its logical conclusion and every statement is an expression – there is no difference between statements and expressions.  So you can write things like

a = if some_test {
        1
    } else {
        2
    };

I personally find this becomes clumsy.  Also as I wrote for lwn.net, it creates some oddities in how semicolons are handled.

“Go” deviates from C in the other direction and allows a much more limited interplay between statements and expressions.  Assignments and increments become true statements, but in a few places where you would expect an expression, a simple statement is allowed as well.  So:

while line, ok = readline(file); ok {
         do something with line;
}

allows a simple statement before the condition in a while. This is certainly useful, but seems a bit artificial.  It solves many cases, but isn’t really a general solution to the problem.

I like allowing general statements in the head of a while and similar places, but I don’t want to make them look like expressions – I still want a clear distinction.

So a while statement will have two forms

WhileStatement -> while Expression Block
                | while Block do Block

The first form is the one we are familiar with, though it is more like “Go” than “C” as the expression does not need to be in parentheses.  The second form allows for more complex calculation of the loop test.

Naturally the first Block in the second form must return a value somehow to guide the progress of the loop.  This is somewhat like the “return” statement in “C” which can return a value from any point in a statement block.  Using “return” in this “while” Block would be confusing as the normal expectation would suggest the meaning was to leave the containing function.  Instead I’m planning to use “use” as a keyword.  That is certainly open to change after experimentation, but it is close to what I want.  So:

while:
    line = readline(file)
    use not line.matches("END")
do:
    print line

is a contrived example which shows how the “use” keyword might be used.

A noteworthy aspect of this example is that the “condition” part is larger than the “body” part.  One could go even further and imagine a body part which was simply “pass”.  This would result in a loop much like C’s “do {} while()” loop.  It may not be quite as good, as the terminating condition doesn’t syntactically stand out, but careful placing of the “use” statement can make it stand out well enough.

This generalises to having largish “condition” and “body” parts, thus allowing loops which exit in the middle and answering Dijkstra’s “n and a half loop” problem, as discussed particularly by Knuth in his Structured Programming with go to Statements.

In that article, Knuth also talks about “event indicators”, which he credits to C. T. Zahn.  These are an attractive idea and fit quite well if we allow these expression statements in a switch construct.

switch:
    some code which can "use" multiple different values
case value1:
    handle value1
case value2:
    handle value2

Naturally the type of each of the values listed with “case” clauses must be the same as the types of values given in “use” statements.  For “event indicators” Knuth suggests that the list of allowed events might be given at the top of the statement.  This might make life easier for the compiler, but seems unnecessary.  We could allow that if all of the “use” statements give currently undefined words, and if all of those words appear after a “case” tag, then the words are taken to be values in a new anonymous enumerated type.

These enumerated values would behave a lot like “goto” labels, in that they must appear once as a target and must be used, but otherwise do not need to be declared.  Indeed, there is a lot of similarity between these “event indicators” and goto targets, and I would not consider adding a general “goto” until experimentation with the event indicators showed they weren’t enough.

In Knuth’s paper, the event indicators can be used in loops as well as non-looping statements.  In “C”, a switch doesn’t loop, so we clearly haven’t attained complete functionality yet.  Without detailing all the steps, the above thoughts lead to the idea of a general structured statement with several components:

for  BLOCK or simple statement
[if | while | switch] BLOCK or condition
then BLOCK
do BLOCK
case VALUE : BLOCK
else BLOCK
  • “for” provides preliminary code which is only run once.  It may declare variables which remain for the entire statement, but it contains no “use” statements.
  • “if”, “switch” and “while” are much as you would expect, and if they have BLOCKs, those BLOCKs contain “use” statements, the values of which guide the execution of the other components.
  • “do” is only present with “while” and just runs code providing the condition doesn’t fail.
  • “then” and “else” are run if the condition or BLOCK returns “True” or “False” respectively.  As such, they are somewhat equivalent to “case True” and “case False”.
  • the “case” stanzas allow the condition to return a variety of values.  If none of the “case” values match, then the “else” clause is used if present.

If the values returned by the condition in a “while” are not Boolean, then it isn’t immediately clear which values cause the “do” block to run and the statement to loop.  My current thinking is that if the condition does not return a value at all, then that is equivalent to “True”, in that the “do” block is run and the statement loops.  If the condition returns any value other than “True”, then the loop aborts.

If a (“while“) loop starts with a “for” clause, then there can also be a “then” clause after the “for” and before the “while“.  Despite its positioning it is run after the “do” part if the condition didn’t fail.  This allows an effect similar to the C for loop:

for a = 0
then a += 1
while a < 10
do sum += a

There are aspects of this that feel clumsy, though I suspect they can be resolved.

  1. Requiring the ‘then’ keyword isn’t good. If the condition is a block then it makes sense, but if it is just a condition, then it isn’t wanted. So:
    if condition:
        use condition
  2. A large switch statement has no obvious end.  If a new line at the base indent level starts ‘case’ or ‘else’ then the switch statement continues, otherwise it is terminated.  This fact has some precedent in that languages with an “elseif” keyword can have a list of clauses each beginning with “elseif” which is only really terminated by a line that starts with something else.  That doesn’t seem to cause confusion, so maybe “case” with no termination won’t either.
  3. The word “then” isn’t quite perfect for the “increment” part of a “for” loop.  It isn’t too bad as one could say “while this do that and then the other” so “then” suggests something to be done a bit later.  It might just take some getting used to.
  4. I’m not even sure about “use“.  I like the concept.  I don’t much like the word.  I really don’t want a bare expression where you might expect a statement and that could be syntactically confusing.  I wonder if “select” would work, as it selects the case to exit through.

At this stage I am not planning on including “break” or “continue“.  “break” can sometimes be achieved with “use false“.  In a few contexts, “use true” will do the same as “continue“.  There might be times where these directives are still wanted in the “body” of a loop, but I hope not so I will wait until I find some convincing evidence.

This certainly isn’t the final word on loops.  At the very least I expect to add a “foreach” loop to work with some sort of “iterator” type.  However I cannot really explore that until I have a type system.

For now, I have  put together a simple interpreter which allows some experimentation with the loops and conditionals discussed here.  Now I need to think of some interesting loops and selection code that I can try it out with.

The interpreter is in my ocean git tree (git://ocean-lang.org/ocean), or you can read the literate source code.

Posted in Language Design

A case for multiple completed items in LR parser states.

In my earlier note about LR parsing I observed that  many simple grammars will only ever have at most one REDUCE action in any given state. This means that there is no need for an “action table” to list which of several productions to reduce based on different look-ahead symbols.  I even went so far as to say:

I personally cannot see why you would ever want a grammar which had two completed items in the same state.  It means that some sequence of input tokens could be treated as one thing or as another depending only on what comes next.  That sounds like a design mistake to me.  Maybe I’ll eat my words later, but for now this means I cannot find a useful example – sorry.

I have since found some examples which shed valuable light on this issue and give me a chance to see how my words taste.

In the first case, the “sequence of input tokens” that could have varying interpretations is empty.  It hadn’t even occurred to me to consider such a case, but empty sequences can be important.  They can be particularly important for optional elements – in this case, optional prefixes.

Consider the grammar:

Number -> Sign Digits
Sign -> -
      | +
      |

This says that a “Sign” is a “+” or a “-” or nothing, and that a “Number” is a “Sign” followed by “Digits” (whatever they are).  In this case, if the next symbol is whatever “Digits” starts with, then we need to REDUCE the empty sequence to “Sign” before we can SHIFT that symbol.  By itself, this doesn’t cause a problem.  If we see a “+” or “-” we SHIFT, otherwise we REDUCE.

However suppose that variable names can be prefixed with “$” or “@” with some  unspecified meaning, and suppose we have:

Value -> Number
       | Variable
Variable -> Sigil Name
Sigil -> $
       | @
       |

Now if we want a Value and see a ‘$’ or an ‘@’ we assume a “Variable” is coming, while if we see a “+” or a “-” we assume a “Number” is coming.  But what if we see something else?  If we see something that starts a “Name”, we should reduce the empty sequence to “Sigil” while if we see something that starts “Digits” we should reduce the empty sequence to “Sign”.  So here is a case where “some sequence of input tokens could be treated as one thing or as another depending only on what comes next“.  The “sequence” is empty, the “one thing” is a “Sigil” and “another” is a “Sign”.

This situation would be even more interesting if there was some symbol that could be either a ‘Sign’ or a ‘Sigil’.  A grammar which included that could still be parsed with an LR(1) parser (providing Names and Digits could be distinguished in their first token).  However that is something that I would consider a design mistake.  I don’t need to be able to handle that, but I would like to be able to handle distinct optional prefixes.

A similar situation came up as I was trying to design the “Newline” handling for my “ocean” parser.  After an “if Expression Statementlist” I want either 1-or-more newlines, or 0-or-more newlines and an ‘else’.

CondStatement -> IfPart IfSuffix
IfSuffix -> Newlines
          | OptNL else Block
Newlines -> NEWLINE
          | Newlines NEWLINE
OptNL -> Newlines

After parsing the “IfPart”, a “Newline” should always be SHIFTed and then reduced to “Newlines”.  But otherwise we can either REDUCE that “Newlines” to “IfSuffix”, or to “OptNL”, the latter if “else” is the next symbol.  So again the one sequence can be treated as two different things (Newlines or OptNL) depending on what comes next.

These can both be resolved without changing the parser by spelling things out a bit more:

Number -> + Digits
        | - Digits
        | Digits
Variable -> $ Name
          | @ Name
          | Name
IfSuffix -> Newlines
          | Newlines else Block
          | else Block

This requires “Digits” and “Name” and “else Block” to appear multiple times, but otherwise is reasonably simple and has exactly the desired effect.

So what to do?  One option would be to manually apply transformation like the above when the problem arises.  That is simplest and what I have done so far, but it isn’t entirely satisfactory.  Repetition, as it requires, feels clumsy and is error prone.

Another option is to bite the bullet and change to a full LALR grammar with a more complete “action” table.  This would be awkward because it still means that some states allow multiple different REDUCE steps.  The algorithm I developed for handling indents and newlines requires that in some circumstances we force a reduction.  That algorithm would need to be refined to be able to identify which reduction should be applied.  It isn’t immediately obvious how that would work.

We could teach the parser generator to apply the same transformation that I applied above.  Whenever any production contains a nullable symbol it is duplicated to have a second production which doesn’t contain that symbol.  Then any production that causes the symbol to derive to the empty string gets disabled.  This could probably be made to work, but it seems like it could make the grammar much more complex, and so make it harder to understand when things go wrong.

I think what I would like to do is find a more minimal transformation of the grammar: one which still results in at most one REDUCE action per state, but which is as close as possible to the original.

If we consider the second example with the newlines, then the problem occurs in a state with the following itemsets (as reported by my parsergen program with --LALR enabled).

    Newlines -> Newlines . NEWLINE [14]
        LOOK AHEAD(6): NEWLINE, $eof, else
    OptNL -> Newlines . [15]
        LOOK AHEAD(5): else
    IfSuffix -> Newlines . [18]
        LOOK AHEAD(0): $eof

If we consider the middle item, effecting the REDUCE operation that it implies would pop the “Newlines” off the stack and replace it with “OptNL” with an updated ‘state’.  This state would allow ‘else’ to be SHIFTed, but otherwise would be ignored.  As “OptNL” and “Newlines” are very similar symbols, neither of which carry any extra ‘AST’ information, there would be no harm in just leaving the “Newlines” on the stack (as we noted previously, the symbols on the stack are hardly used) and directly SHIFTing in the ‘else’.

So if we add to the ‘goto’ table for this state an entry indicating that ‘else’ can be shifted, leading to whichever state ‘else’ ultimately leads to, and otherwise ignore this item so that the only reducible item (or “completed item”) is “IfSuffix -> Newlines .”, then the parser will be able to correctly parse the language without the multiple-REDUCE problem.

The other example is a little different and not so simple.  The problem item there contains, among many other items:

    Sigil -> . [8]
        LOOK AHEAD(1): Name
    Sign -> . [12]
        LOOK AHEAD(2): Digits

There are several differences here.  Firstly, reducing these productions does change the stack – it pushes a single frame on.  Secondly, both productions need to be treated the same.  In the first case we could say “OptNL” is similar to “Newlines” while “IfSuffix” is very different.  That distinction doesn’t apply here.

We cannot convert either of these into simple SHIFT operations like we could previously, but a “REDUCE+SHIFT” operation where the shift is very simple might be an option.  The “REDUCE” part would not pop anything off the stack and just push on a very minimal frame.  This would require that there was no code associated with either production.  The symbol either has type “void”, or has some small type (say: 8 bytes at most) which is left as zeros.

Pulling this all together, I imagine two changes.

Firstly, an extra bit is added as a flag to each entry in the “go to” tables.  We can probably steal a bit from the “state” value and so limit our grammars to 32767 states.  If the parser chooses to shift a symbol which leads to a ‘go to’ entry with this bit set, it first shifts in an “empty” stack frame with no symbol, no state, and an ‘asn’ (Abstract Syntax Node) with 8 bytes of zeros.  This is the only change made to the parser.

The parser generator is changed to translate some ‘completed items’ into SHIFT actions instead of REDUCE actions.  This only applies when the production in the item has no code fragment attached and the structure attached to the head of the item is no larger than 8 bytes.  It also only applies if all symbols in the item’s look-ahead set are not present in other terminal items.

If the body of the item is empty, then a ‘go to’ entry may be created with the new flag set.  If the body contains a single symbol of the same type as the head of the item, a regular ‘go to’ table entry might be created.

There are various other details about how to determine the target state and how to decide if the transformation is legal.  A particular situation that would make the transformation impossible is if the state in the synthesized frame might ever be used.  These details would need some careful thought and analysis which I don’t feel is particularly justified just now.  After I have had more experience, and possibly found more interesting cases, it might then be time to explore extensions to the parser generator in greater detail.

So there is no code to show this time, just an admission that I still have things to learn…

Posted in Language Design

Line breaks – a reprise.

In two previous articles I explored an approach to enhancing an LR parser to work with indents and line breaks.  While I discovered some useful ideas and produced some code that seemed to work, I’ve subsequently discovered some serious flaws in the reasoning.

For indents, my reasoning about exactly when to REDUCE in the face of an OUT token was flawed and didn’t properly address all cases.  I’ve made a few updates to that article to highlight this failing.  For linebreaks, I only talked about when they should be ignored and didn’t cover the other important question of their role in terminating things.  I hadn’t at the time seen how important that was.

So now I want to rectify these problems and present a more complete solution.  As I have explored around these problems I’ve seen other smaller issues and made a number of changes to my approach.  The big picture is still much the same but some of the details are different in important ways.  I also think I understand it all much better and so will try to explain things more clearly.

The Tokens

The tokenization of indents and linebreaks is unchanged.  Every change in indent results in an IN token or an OUT token, which are paired.  Every line break results in a NEWLINE token.  However the NEWLINE token that would be expected before an IN is moved  to after the matching OUT.  So every OUT is both preceded and followed by a NEWLINE.

Recursive Symbols.

Understanding recursive symbols is very important for managing indents and linebreaks.  A key property of both NEWLINE and OUT tokens is that they might “close” something. i.e. they sometimes  indicate that everything since the start of the line, or since the matching IN, forms some unit which cannot be further extended.  This “unit” is some (non-terminal) symbol (A) and has some production for this particular instance:

A -> B C D

If the top few symbols on the parse stack are “B C D” and the context is such that an ‘A’ was expected, then we might assume that we are at the end of the production and it cannot be further extended.  However if any of B, C, or D are recursive, then that isn’t the case.

A “recursive” symbol here is one with a production which starts with the same symbol. e.g.

D -> D E

If such a production existed then having “B C D” on the stack does not mean we are at the end of the “A”, as an “E” could come along producing “B C D E” which could be reduced to “B C D” and so the “A” has moved further into the input stream.

Obviously if we found “B C D” on the stack when we saw an OUT or a NEWLINE we could reduce that to “A” and be sure that “A” was closed.  However if “A” is recursive that isn’t enough.  If “A -> A C” were a production, then a “C” could come along and extend the A.

The conclusion is that recursive symbols require special care.  We cannot simply disallow them as they are very useful for any “list” like construct.  So we need to make sure we can always remove them from the stack.

As suggested in a previous article, the best approach is to create an intermediate symbol “$A” for each recursive symbol “A”, add the production “$A -> A”, and then wherever “A” appears in a non-recursive position in the body of a production, replace it with “$A”.  So in the final grammar a recursive symbol will only appear in the body of a production when it is also in the head of that production, or in the synthesized “$A -> A” production.  Then when we have ‘A’ on the top of the stack and need to ensure it is closed, we simply reduce it to “$A”.
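
To make this concrete, here is a small made-up grammar fragment (not taken from any real grammar) showing the transformation.  Before:

List -> List Statement
      | Statement
Block -> { List }

and after:

$List -> List
List -> List Statement
      | Statement
Block -> { $List }

“List” is recursive, so its non-recursive use in “Block” is replaced by “$List”; reducing “List” to “$List” is then exactly the step that closes the list.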

Understanding the Stack

In my original implementation I constructed the stack in a less than optimal way which made some of the code quite clumsy.  An important property about the stack is that it contains symbols and states, and these alternate.

At the start, state “zero” is pushed onto the stack.  Then the first token is pushed on as a terminal symbol.  The symbol is looked up in the “go to” table for the state below it, and the new state is pushed onto the stack.  Then we either REDUCE or SHIFT but in either case we will add a symbol, and then use the “go to” table to find the next state.

My original implementation combined the state with the following symbol into the one frame.  This seemed to make sense as a state was on the bottom of the stack, so a state should go on first.  However another way to look at it is that each state is intimately connected with the preceding symbol, as the symbol leads directly and immediately (through a “go to” table) to the state.  The code turns out to be much nicer if a state is combined with that preceding symbol to form a frame, though maybe it is better to say, a symbol together with the subsequent state.

The first frame on the stack then contains state zero and no symbol.  This feels a little bit untidy, but makes other things much cleaner.  As the state never changes from zero we could possibly do away with that frame altogether, but that isn’t really a net win.

With this clear understanding of the stack, as “symbol -> state” pairs, we need to ask how pending indents are recorded.  As we determined before, indents often appear inside symbols and keeping a count together with the symbol is important and unchanged.  The interesting question is how to store the indent that might be at the start of a symbol.  To put this another way, an IN token appears between two tokens.  Should it be counted with the one before, or with the one after?

We could argue that the very first token might be an IN and it would have no previous token to store the count with.  However we have already seen that the first frame on the stack has an absent token.  If the first parsed token was an IN, it could be counted with this absent token in the first stack frame.  We could even pretend that there was a “start-of-file” token to match the “end-of-file” token which terminates the parse.  However this is just book keeping and doesn’t really provide a firm answer.

In my previous analysis I counted each IN with the following token and found the need to distinguish the case of an indent being at the start of the symbol, against all the indents being within the symbol.  This tells us that the indent immediately before a symbol doesn’t entirely fit with the indents within the symbol.  So maybe it is best to count the IN with the preceding symbol.

We are getting a bit ahead of ourselves here, but one of the important places where indents are needed is determining when to ignore NEWLINEs.  This determination is based on the states that are in the stack.  Some states identify as “starts_line” and if there is such a state on the stack with no subsequent indent, then NEWLINEs are not ignored.  If there is an indent since the last “starts_line” state, then NEWLINEs are ignored.  However an indent between the same two symbols as the state is between does not cause newlines to be ignored.

This suggests that if an IN and a state are between the same pair of symbols, the IN should be “before” the state.  This is most easily handled if the IN is counted with the preceding symbol.  So that is what we will do. It removes all the need to track whether an indent was at the “start” of a symbol, as it never will be.

Which states expect a NEWLINE

One of the important properties of indents is that they cause newlines to be ignored – sometimes.  An indent signals continuation, the newline that was before the indent has already been moved to after the indent to reflect this continuation, but what about newlines between the IN and the OUT?  If the syntax elements in there don’t expect newlines to terminate or separate them, then any newlines found need to be ignored.  But how do we know?

I talked previously about “line-like” symbols in the syntax but wasn’t entirely happy with the definition.  They were states that were followed by things that at other times followed newlines, and it isn’t really clear what that means.

Both an OUT and a NEWLINE will trigger REDUCE actions to close out pending productions which could otherwise extend beyond the OUT or NEWLINE.  They need to eventually get one symbol (non-recursive when an OUT is present) on the stack which represents the indented section (and possibly what led to it), or the line (or lines).  This symbol needs to be found before the NEWLINE is SHIFTed.  So a “line-like” symbol must be followed by a newline.

The important difference here is that I was previously imagining that a line-like symbol would end with a newline, but that didn’t seem to work out.  It seems to work better if NEWLINEs are treated as separators rather than terminators, so the NEWLINE is not part of the line-like symbol, but part of something bigger (block-like?) which contains multiple lines.

Following this observation, the parser generator needs to determine which symbols can start with a NEWLINE (previously it determined which could end with a NEWLINE), and then any symbol which is followed by a starts-with-newline symbol is deemed to be a “line-like” symbol.  Finally any state which is followed by a “line-like” symbol is “starts_line” and must expect a NEWLINE.

This seems to make sense.  In a state which expects a symbol that is normally followed by a newline, we need to not be ignoring newlines.  In any other state it is safe to ignore newlines if there has been an indent.
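
For concreteness, here is a minimal sketch of that “can start with NEWLINE” computation, using the same fixed-point iteration as the “nullable” computation.  The data structures are illustrative assumptions, not the actual parser generator’s types:

#include <stdbool.h>

struct production {
    int head;      /* the non-terminal this production reduces to */
    int len;       /* number of symbols in the body */
    int body[10];  /* the body symbols */
};

/* Mark every symbol that can begin with a NEWLINE.  The caller seeds
 * starts_with_nl[NEWLINE] = true; nullable[] is already computed. */
void find_starts_with_nl(const struct production *p, int nprod,
                         const bool *nullable, bool *starts_with_nl)
{
    bool changed = true;

    while (changed) {
        changed = false;
        for (int i = 0; i < nprod; i++)
            for (int j = 0; j < p[i].len; j++) {
                int s = p[i].body[j];

                if (starts_with_nl[s] && !starts_with_nl[p[i].head]) {
                    starts_with_nl[p[i].head] = true;
                    changed = true;
                }
                /* only a nullable symbol lets the next one
                 * still be "at the start" */
                if (!nullable[s])
                    break;
            }
    }
}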

A core idea here, important when designing the grammar, is that when NEWLINEs are not ignored, each line will be reduced down to a single non-terminal symbol, which is followed by the NEWLINE.  A NEWLINE should never be placed, in the grammar, after something that is just part of a line, e.g. `Variable = Expression NEWLINE`.  It must be after a line-like thing: `Assignment NEWLINE`.

What, exactly, to reduce when we see a NEWLINE or OUT.

When we see a NEWLINE in a context where a NEWLINE is expected (not ignored) we want to reduce everything back to the start of the line.  This means that we need to record where lines started in much the same way that we need to record where indents were found.  We don’t need to count how many there were, just whether there was one in, or following, a symbol or not.

This became particularly apparent to me when considering the parse of

if Condition : simple-statement

The grammar is (deliberately) ambiguous in that a NEWLINE at the end of this line could separate the “simple-statement” from a subsequent statement within the body of the “if”, or it could separate the whole “if” statement from a subsequent statement at a higher level.  Both the “if” statement and the “simple-statement” are line-like.  Clearly (I hope) the whole needs to be reduced (to a “statement”), right back to the start of the line.

So when we process (SHIFT or DISCARD) a NEWLINE or IN (both of which signal the start of a line), we record a line-start in the topmost stack frame (just after the symbol in that frame, and before the state).  We also keep track of how many stack frames do not contain a line-start.  This is easily done by storing zero when a line-start is present and storing one more than the previous frame when there is no line-start.

Then when we see a (non-ignored) NEWLINE and it cannot be SHIFTed we must obviously REDUCE or trigger an error.  If we can SHIFT the NEWLINE we may still want to REDUCE first.  Specifically, if the line isn’t already reduced to a single symbol, and if we can perform a REDUCE with the number of symbols being popped not more than the distance to the nearest start-of-line, then we perform the REDUCE.  If there is only one symbol in the line, if we cannot REDUCE, or if the reduction would pop more off the stack, then we know that when we SHIFT the NEWLINE it will serve to separate the whole line from whatever comes next.
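
In code this decision might look something like the following sketch; the helper names (can_reduce(), reduce_size(), can_shift()) and the counter are placeholders of mine, not the real parser’s API:

#include <stdbool.h>

extern bool can_reduce(int state);   /* is a REDUCE available here? */
extern int  reduce_size(int state);  /* symbols that REDUCE would pop */
extern bool can_shift(int state, int tok);
extern void reduce(void), shift(int tok), error(void);

enum { NEWLINE = 1 };                /* placeholder token number */

/* Called with a non-ignored NEWLINE in the look-ahead; since_sol is
 * the frames-since-start-of-line count from the top stack frame. */
void handle_newline(int state, int since_sol)
{
    if (can_reduce(state) && reduce_size(state) <= since_sol &&
        since_sol > 1)
        reduce();        /* pull the line together first; the NEWLINE
                            stays in the look-ahead and is reconsidered */
    else if (can_shift(state, NEWLINE))
        shift(NEWLINE);  /* it now separates the whole line from
                            whatever comes next */
    else if (can_reduce(state))
        reduce();
    else
        error();
}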

When we see an OUT, we similarly want to REDUCE until we find the matching IN on the stack.  In the original design I would reduce until the top-of-stack contained an IN, and then would cancel that.  That is still true but only if it is possible.  We need to understand what to do when it isn’t possible to reduce but the top stack frame doesn’t contain an indent. Consider a general line-like symbol with a production:

A -> B C D E

This could be found with an IN after B, then an OUT and a lesser IN after D, as in

B
    C D
  E

This would tokenize as “B IN C D NL OUT IN E NL OUT NL”.  The first two NL would be ignored as they are indented.  The first OUT should reduce back to D, but no further – even though the most recent IN is still two frames away. The second OUT would reduce to E with the nearest IN 1 frame away, but it requires the final NL to reduce the whole thing to A.

If we can reduce so that the top of stack contains an indent it is definitely safe to cancel the OUT.  If we cannot reduce that far then sometimes it is safe, as above, but sometimes it will not be – we would instead need to enter error handling because some production is still open that needs to be closed but is not yet complete.  To be able to tell if some production is “open” we need to know details about the pending productions but a state doesn’t really carry that information.  It is all present in the “itemsets” that are used to build the states.  It seems we need to preserve some more information from the parser generator into the parser.

The itemset which is the basis for each state contains a number of productions, each with DOT at some point in the body.  Those productions where DOT is at the start are not really open yet.  There is no sense at all in which they are committed to and no contents have been found, so there is no need to close them.  The remainder must all start before the indent that is being cancelled, much as in the example above the “A -> B C D E” production started before the first indent.  If a production in the state starts at or after the recent indent then cancelling the OUT would be premature.  For the parser to be able to detect this case it needs to know the least non-zero location of DOT in any production in the itemset which leads to each state.  We will call this the least prefix.

This number is trivial to calculate and pass to the parser.  With it the parser can easily make the decision to REDUCE, ERROR, or CANCEL when it sees an OUT, as the sketch after the following list illustrates.

  1. If the current state can be reduced and the length of the reduction is not more than the distance from the most recent indent, then we reduce.  This means that of all the stack frames popped as part of the reduction, only the first can contain an indent.  As a consequence any reduction of length zero or one will happen, and so any recursive symbol will be reduced away.
  2. Otherwise if the least prefix of the state contains an indent, then the top most indent is cancelled and the OUT is discarded.
  3. Otherwise we cannot close something that is wholly within the indented section and must raise an ERROR.  This will allow us to remove things from the stack until the OUT can be cancelled.
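
As a sketch in code (min_prefix comes from the parser generator as just described; the other names are placeholders, shared with the NEWLINE sketch above):

#include <stdbool.h>

extern bool can_reduce(int state);
extern int  reduce_size(int state);
extern void reduce(void), cancel_indent(void), error(void);
extern int  min_prefix[];          /* least prefix, per state */

/* Called with an OUT in the look-ahead; since_in is the
 * frames-since-last-IN count from the top stack frame. */
void handle_out(int state, int since_in)
{
    if (can_reduce(state) && reduce_size(state) <= since_in)
        reduce();          /* case 1: the OUT stays in the look-ahead
                              and is reconsidered after the REDUCE */
    else if (since_in <= min_prefix[state])
        cancel_indent();   /* case 2: cancel the topmost indent and
                              discard the OUT */
    else
        error();           /* case 3 */
}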

So there are some similarities between reducing on NEWLINE and OUT, and some differences.

The similarity is that we need to track where the “other end” (start of line, or IN) is and force a REDUCE within that range, even if a SHIFT is possible.

One difference is that for OUT, the range includes the symbol containing the IN and we reduce anything in that range.  For NEWLINE, the range excludes that symbol and we only reduce if there are at least 2 symbols in the range.  This is related to the fact that recursive symbols are problematic for IN/OUT management but not for NEWLINEs.  With newlines, the NEWLINE itself will ultimately be shifted, which will stop a recursive symbol from being extended inappropriately.  With IN/OUT, the OUT is not shifted, so we must completely close any recursive symbols first.

The other difference is that an inappropriate NEWLINE causes a normal error because it cannot be shifted. An inappropriate OUT causes an error because the state says it cannot be cancelled.

Summary of details

Now that we know how it can all work, we need to make sure we have the details straight:

  • The parser generator needs to prevent recursive symbols (with productions “A -> A …”) from appearing other than at the start of the recursive production, or as the sole body element in another production.  If one does, a new symbol “$A” is created and a single production “$A -> A” is added.  This new symbol is used in place of the recursive symbol where it isn’t allowed.
  • The parser generator needs to record which states expect a newline.  These are precisely those that can be followed by a line-like symbol.  These, in turn, are those which can be followed in a production by a NEWLINE or any symbol which starts with a NEWLINE.
  • The parser generator needs to record the minimum non-zero distance from the start of each item to DOT in each itemset.  This “least prefix” is recorded for each state.
  • The state stack needs to contain (a sketch of a matching structure follows this list):
    • A symbol (possibly ‘$start_of_file’) with associated parse information (e.g. abstract syntax tree fragment),
    • the state which follows that symbol (state zero for the first frame on the stack),
    • the number of indents within or immediately after the symbol.  These are considered to be before the state,
    • a count of the number of stack frames since the start of a line.  Zero implies an IN or a NEWLINE is within the symbol, or that this is the first frame.  A larger number is one more than the number in the previous frame and implies that the symbol contains no NEWLINEs or unpaired IN tokens.
    • a count of the number of stack frames since the last unmatched IN.  As all frames with a non-zero count will have zero for the number of indents, this value can be stored in the same field as the indent count, but with a different sign.
    • A flag “expect_nl” indicating whether an indent or a “starts_line” state is closest to the top of the stack. It is true if “starts_line” is closest.  If the state in a frame is “starts_line”, then the flag is “true” no matter whether there are indents.  Otherwise the flag is true only if it is true in the previous frame, and there are no indents in this frame.
  • When an IN token appears in the look-ahead, we increment the indent counter in the top stack frame, set the frames-since-start-of-line counter to zero, and set the frames-since-last-IN counter to zero.
  • When an OUT token appears in the look-ahead we REDUCE if the current reduction uses no more symbols than the frames-since-last-IN count.  If the state cannot be reduced or has a longer reduction pending, then we cancel the OUT against the top-most indent if the frames-since-last-IN counter is at most the least prefix of the current state, and otherwise initiate error handling.  Cancelling decrements an indent counter; if that becomes zero we need to increase the frames-since-last-IN counters in the upper regions of the stack.
  • When a NEWLINE token appears in the look-ahead, if “expect_nl” in the top frame is false, then the NEWLINE is discarded.
  • When a NEWLINE token appears and “expect_nl” is true, then if the current state can REDUCE and the reduction length is no more symbols than the frames-since-start-of-line count, and that count exceeds 1, then we REDUCE.  Otherwise we SHIFT if we can, REDUCE if we can’t, and ERROR if we cannot even REDUCE. In any case we set the (new) topmost frame to have zero frames-since-start-of-line.
  • When we reduce a series of frames,
    • The number of indents in all the frames are summed and the result saved in the new frame
    • If any frames-since-start-of-line value is zero, the new frame gets a zero, else it gets the value from the first frame
    • If there is no indent, then the frames-since-last-IN is set to one more than the previous frame.
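
Pulling those requirements together, a stack frame might be declared something like this sketch; the field names and types are mine, not from the actual implementation:

#include <stdbool.h>

struct frame {
    short sym;          /* the symbol; absent in the first frame */
    short state;        /* reached from sym via "go to";
                           zero in the first frame */
    short indents;      /* INs within or just after sym,
                           considered to be before the state */
    short since_sol;    /* 0 if a line starts in this frame, else
                           one more than the frame below */
    short since_in;     /* frames since the last unmatched IN; in
                           practice shares storage with indents,
                           using the sign */
    bool  expect_nl;    /* a "starts_line" state is nearer the top
                           of the stack than any indent */
    void *ast;          /* parse information attached to sym */
};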

Implications for grammar design.

The most important implication this has on grammar design is that any symbol which appears immediately before a NEWLINE (or before any symbol that starts with a NEWLINE) is considered to be a line-like symbol.  As such, NEWLINEs found while collecting the symbol will not be ignored, even if the symbol is on an indented line.  For example,


Statement -> Variable = Expression NEWLINE

would cause “Expression” to be treated as line-like, so in

a =
  b + c +

The NEWLINE at the end of the second line (after the trailing ‘+’) would not be ignored and would cause an error.

Consequently the NEWLINE should be placed after a Statement or similar, possibly like

Statement -> Variable = Expression
StatementList -> StatementList NEWLINE Statement
               | StatementList ; Statement
               | Statement

This will expect NEWLINEs wherever a StatementList is expected, but not after a subsequent indent.

Another effect to be careful of is the handling of optional NEWLINEs.  A construct like

OptionalNewline -> NEWLINE
                 |

will sometimes work, but often not.  In particular, if this occurs at the end of a section that can be indented, an OUT NEWLINE sequence could appear where OptionalNewline is expected.  The OUT will likely force a reduction of the empty alternative to OptionalNewline, where the intent was to capture the following NEWLINE.

One way to handle this is to tie the NEWLINE to the following symbol.  So if you wanted a newline before a ‘{‘ to be optional, then instead of “OptionalNewline {” you need

Open -> NEWLINE {
       | {

and then just use “Open”.  In this case, the OUT cannot force a reduction.  The reduction won’t happen until the ‘{‘ has been seen.

Error Handling

In a couple of places in the preceding text I’ve mentioned error handling without any details.  Some details are in previous posts but they are imperfect.  In particular, where it was suggested that a symbol to SHIFT might be synthesized, that is unworkable.  While the symbol itself could be synthesized automatically, the abstract-syntax-tree fragment for it cannot be.

Full details on error handling will have to wait for another post.  One observation I can make now though is that symbol synthesis can probably be managed with productions like:

Expression -> ERROR

The reduction code attached to this production can create some appropriate AST element.  Exactly how this will be integrated remains to be seen.
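
In a yacc-like notation (purely a sketch – ocean’s parser generator may well end up expressing this differently), with mk_error_expr() as a hypothetical helper that builds a placeholder AST node, the idea would be:

Expression -> ERROR { $$ = mk_error_expr(); }

so that parsing can continue with something structurally valid standing in for the missing expression.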

Indents in action.

This article is being published months after I wrote most of it.  This is largely because it took too long to get the code working. There were a few little bugs in my design and I found it hard to find the time to concentrate and find them and fix them.  This is part of the cost (and benefit) of wanting to have working code for all my ideas…

The current parser generator implements almost all of these ideas and can demonstrate working code.  It doesn’t currently hide recursive symbols automatically; you have to do that step by hand.  Accompanying this I have a trivial grammar for testing indent and newline handling.  This will read in something that looks like code, with non-line-like expressions and line-like statements, and will print out the parsed structure.

If you clone git://ocean-lang.org/ocean and run “make ; make tests” in “csrc” you will see it run.
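
Spelled out (assuming the clone lands in a directory called “ocean”):

git clone git://ocean-lang.org/ocean
cd ocean/csrc
make
make tests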


Posted in Language Design

Linebreaks in a two-dimensional language

My previous note introduced the problem of parsing a two-dimensional language and the need to pay attention to indenting and line breaks and use them to guide the parsing of the language, both to enhance error detection, and to resolve ambiguity. The key intuitive insight which guides this investigation is that indents imply continuation while line breaks terminate things – sometimes.

That first note only looked in detail at indents.  It provided rules by which a reduction in indent level can force the end of a syntactic unit by triggering a REDUCE operation in the LR parser.  This note completes the solution by exploring and describing how line breaks can be interpreted when parsing a sentence in a two-dimensional language.

To start getting a feel for how line breaks work it will be useful to have an example to work with:

1: if condition:
2:    a = b +
3:        c +
4:        d
5:    print a
6: else:
7:    print "message"
8: return

Here the numbers at the start of the lines are just for our reference and not part of the program fragment.

The handling of indents described in the previous note will ensure that, for example, the expression that starts on line 2 with “b +” will be terminated before line 5, as line 5 has a lesser indent.  However the indents by themselves do not tell us that the assignment starting on line 2 is a separate statement from that on line 5.  To determine that, we need to process the line breaks.

Some of the line breaks have little meaning.  Those at the end of lines 2 and 3 could be moved back or forward over at least one token with no important effect.  Also the line breaks at the end of lines 1 and 6 are only really important in that they introduce an indent.  The fact of the line break is otherwise uninteresting.

On the other hand the line break at the end of line 7 closes the “print” statement on that line, as well as the “if” statement which started on line 1.  An alternate reading is that it closes both the “print” statement and the “else” clause that started on line 6.  With the else clause closed, the “if” statement must be closed.

So it appears that sometimes we need to discard line breaks and other times we need to multiply them.  How can this work?

The apparent need to multiply line breaks can be handled by moving some line breaks that are otherwise uninteresting.  As noted the line breaks that immediately precede an increase in indent are uninteresting beyond that indent. So we can move them to after the matching decrease in indent.

To make this a little clearer, consider the following, which has the same two-dimensional structure as the previous example but with simplified content.

A
   B
      C
      D
   E
F
   G
H

When we tokenize this we create “IN” and “OUT” tokens to record the indents, and add a “NL” token (newline) to record the line breaks.  Any NL that would be before an IN is moved to after the matching OUT.  This reflects the fact that the indented content is in some sense a continuation of the starting line.

So the tokens produced by the above are:

A IN B IN C NL D NL OUT NL E NL OUT NL F IN G NL OUT NL H NL

So there are 8 “NL” tokens, as you would expect for 8 lines, and each OUT token is both preceded and followed by an “NL”, allowing both the inner and outer structures to possibly be terminated.

With this parsing of newlines there is no longer any desire to multiply newlines, only to sometimes ignore them.

Taking a new example – similar to the earlier one but with two distinct halves:

1: if condition:
2:    print message1
3:    print message2
4: a = b +
5:     c +
6:     d

Here lines 1-3 have the same two dimensional structure as lines 4-6.  However the newline token at the end of line 2 must be meaningful while the newline at the end of line 5 should be ignored.

This distinction tells us that we cannot handle newline tokens purely automatically like we do with IN and OUT tokens.  Rather the grammar for the language must tell us where newlines are expected, such as at the end of a statement.

So when the parser sees a NL on input it has a number of options.

  1. It can SHIFT the NL.  This is only legal if the current state has a ‘go to’ entry for NL.  If that is the case, then SHIFTing is most likely a sensible option, though possibly we should wait until a later NL after an OUT.
  2. It can REDUCE the production which is reducible at this state, if there is one.  This only makes sense if this could lead to a state where NL can be shifted.
  3. It can discard the newline.  This should happen for example at the end of line 5 in the example above.  Or
  4. It can trigger error handling.  This would be appropriate at the end of line 4 above if line 5 were not indented.

The important choice here is whether to DISCARD the NL or not.  If we choose not to, then choosing between SHIFT, REDUCE, or ERROR is done in the same way we would for any other token.

This choice of discarding newline tokens must be closely related to the level of indent.  The intuition is that indents lead to continuation.  A block of text that is indented is a continuation of the line that preceded the indented block.  Sometimes we ignore line breaks in indented text, other times we don’t.  We need to determine how to tell the difference.

Falling back on my intuition again, I think that there are line-like things and word-like things.  These may also be referred to as vertical or horizontal objects, to borrow some terminology from TeX.  Statements and things composed of them are line-like.  Expressions and things they are comprised of are word-like.

My intuition suggests that we would expect a line break to terminate a line-like thing, but not a word-like thing, unless in doing so it also terminates a line-like thing.  In the example above, lines 2 and 3 are line-like things, so we expect the line break to terminate them.  Lines 5 and 6 aren’t.  Lines 4, 5 and 6 together form a line-like thing, and so the newline that was moved to after the OUT (when the indent was reduced) will terminate that whole line-like thing.

Returning to the first example, the assignment starting on line 2 is separate from the statement on line 5 because statements are line-like and there is a linebreak separating them.

Were we using a recursive descent parser these ideas would seem quite easy to implement.  We would first deduce which non-terminals were line-like as being precisely those that could ever expand to contain a newline token.  Then, as a recursive descent parser always knows what non-terminal it is trying to recognise, it could ignore newlines precisely if there had been an indent since the deepest line-like non-terminal started.

As we are using an LR grammar, and so don’t always know what non-terminal we are trying to recognise, we need to take a slightly different approach.  With LR, we do always know what symbol preceded a particular state.  This is because each item in the core (i.e. uncompleted) itemset for a state always has the same symbol before “dot”.  This is exactly the symbol which leads to the state through the “go to” table.

If we know what the previous symbol is at any state, and if we know which symbols can end in a newline token, then we know which states mark the start of a line-like thing.  A symbol which can end with a newline may not always end with a newline as a statement could, for example, end with either newline or a semicolon.  However any location where a newline could appear can reasonably be seen as a boundary between two line-like things.

So when we see a newline token in the parser’s look-ahead buffer, we can examine the current state stack looking at states which start line-like things and looking at internal indents.  If the line-starting state closest to the top is above the top most internal indent, then we cannot ignore newline tokens: there is no indent to protect them.  If, on the other hand, an internal indent is closer to the top of the stack than a line-starting state, then we are currently in an indented continuation of a line-like thing and so newlines should be ignored.

Rather than searching the stack every time we find a newline, we can track whether newlines should be ignored or not.

  • Whenever we shift a state which is line-starting, the new stack frame is flagged as permitting newlines.  When other states are shifted, we inherit the “newlines_permitted” flag from the previous stack frame.
  • When we reduce some states, then if the net indent in all the reduced frames is positive, we record the new top-of-stack frame as ignoring newlines (i.e. newlines not permitted).
  • When we process an OUT token by decreasing the indent count on the top-of-stack frame, we must update the “newlines_permitted” flag depending on the state and remaining indent level.  This may require copying the flag from the previous level of the stack.

Once we track the “newlines_permitted” state this way it becomes trivial to handle newline tokens.  If one is seen when “newlines_permitted” on the top of the stack is false, we simply discard the newline.  If “newlines_permitted” is true, we process the newline just like any other token.
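
As a sketch, with field and helper names invented for illustration (the third rule, recomputing after an OUT cancels an indent, is omitted here):

#include <stdbool.h>

extern bool starts_line[];      /* per-state, from the generator */

struct frame {
    int  state;
    int  indents;               /* INs inside this frame's symbol */
    bool newlines_permitted;
};

/* Rule 1: on SHIFTing a new state. */
void on_shift(struct frame *new, const struct frame *prev)
{
    new->newlines_permitted = starts_line[new->state]
        ? true : prev->newlines_permitted;
}

/* Rule 2: after a REDUCE whose frames held a net indent. */
void on_reduce(struct frame *top, int net_indent)
{
    if (net_indent > 0)
        top->newlines_permitted = false;
}

/* Using the flag when NL is in the look-ahead. */
bool should_discard_nl(const struct frame *top)
{
    return !top->newlines_permitted;
}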

Updating the grammar processing to track which symbols can end with a newline token, and which states can immediately follow such a token, is fairly straightforward.  The “ends with newline” flag can be computed when the “nullable” flag is computed, with a similar repetitive process.  If the body of any production contains an “ends with newline” symbol followed only by nullable symbols, then the head of that production is flagged as “ends with newline”.  Tagging the affected states is even more straightforward: if the symbol which led to a new state “ends with newline”, the new state can start a line.  If not and the symbol is nullable, the new state starts a line if the previous state did.  Otherwise the new state does not start a line.
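
That last rule translates almost directly into code; again the names are illustrative rather than taken from the real parser generator:

#include <stdbool.h>

struct state { bool starts_line; /* plus whatever else a state needs */ };

extern bool ends_with_nl[];     /* per-symbol, computed like nullable */
extern bool nullable[];

/* Tag a new state, reached from prev via the symbol sym. */
void tag_starts_line(struct state *new_state,
                     const struct state *prev, int sym)
{
    if (ends_with_nl[sym])
        new_state->starts_line = true;
    else if (nullable[sym])
        new_state->starts_line = prev->starts_line;
    else
        new_state->starts_line = false;
}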

My parser generator, modified to handle line breaks as described here, can be found in `git://neil.brown.name/ocean/` or in this version.

One last issue seems worth mentioning before we depart this topic.  That relates to what should happen if a line-like object does not start at the beginning of a line.  For example:

a = b ; x = y 
   + z
   + q

This would be perfectly legal with the parser as currently described, but looks wrong and is possibly not what is intended.  It seems hard to justify raising an error or changing the obvious parse when this happens, but it is almost certainly cause for a warning, and probably a very stern warning at that.

We would need to track which symbols started at the real start of a line (i.e. after NL or IN).  If we see an indent and the previous “starts line” state was not at the start of a physical line, then a warning should be raised.  I haven’t implemented this yet, but I probably will.


Posted in Language Design

Parsing a two-dimensional language.

Many programming languages are essentially one dimensional.  The parser treats them simply as a linear sequence of tokens.  A program could all be written on a single line, or with each token on a separate line and the parser or compiler wouldn’t notice the difference.  This set of languages includes Algol, Pascal, C, Rust and many others.

Some languages are 2-dimensional in a bad way. FORTRAN is probably the best example, though BASIC is similar.  These (at least in their early forms) had strict requirements as to what can go on one line and what needs to go on a separate line.

A few languages are exploring the middle ground.  Go will treat newlines like semi-colons in certain cases, which can result in a 2-dimensional feel, but brings with it some rather ad-hoc rules.  Python probably makes the best attempt at 2-dimensional parsing of the languages that I have looked at.  It allows newlines to terminate statements and also uses indents to indicate the grouping of some language elements, particularly statement groups.

While I find a lot to like in Python, it seems imperfect.  I particularly dislike using a backslash to indicate line continuation.  You can avoid this in Python by putting brackets around things as a newline inside brackets is ignored.  But this feels like a weakness to me.

As I wrote in a recent article for lwn.net:

The recognition of a line-break as being distinct from other kinds of white space seems to be a clear recognition that the two dimensional appearance of the code has relevance for parsing it. It is therefore a little surprising that we don’t see the line indent playing a bigger role in interpretation of code.

This note is the first part of a report on my attempt to translate my intuition about parsing the two dimensional layout into some clear rules and concrete code.  This note deals with indents.  A subsequent note will look at non-indenting line breaks.

Indenting means continuation, newline means completion.

The two elements that give a section of code a two dimensional appearance are line breaks and indents.  Without the line break there wouldn’t be any extra dimension at all.  Without indents the code would just look like a block of text and we would be in much the same situation as FORTRAN and BASIC.

Each of these has a fairly well understood meaning.  A line break often marks the end of something.  This isn’t always true, and exactly what the “something” is might be open to interpretation, but this meaning is fairly well understood and line breaks rarely mean anything else.

An indent means some sort of continuation.  The line that is indented with respect to a previous line is in some sense part of that previous line.

To help visualise this, consider:

A B C
   D E
   F
G H

Here “D E” and “F” are clearly part of the thing which starts “A B C”.  “G H” may or may not be, much as “F” may or may not be part of “D E”.  However if “D E” or even “B C” started something, then it definitely finished by “F”.  “G” might be part of something that started at “A”, but if some subordinate thing started at “B” or later, it does not continue into “G”.

As a first step to formalising this we need to be able to encode this two dimensional structure into the stream of tokens which the parser can then process.

Generating a “NEWLINE” token at the end of each line, and maybe a “SPACE” token for each space at the start of a line could capture it but not in a very helpful way.  It is not really the presence of an indent that is important but rather the change in indent.  So between “C” and “D” we need to know that the indent level increased, while between “F” and “G” it decreased.

Python uses the terms “INDENT” and “DEDENT”.  The first is technically inaccurate as every line is indented, not just the first.  The second is a neologism presumably derived by analogy with “increase” and “decrease” or similar.  Other options might be “UNDENT” or “OUTDENT”.  I’m not sure I really like any of these.  However “in/out” seems the best fit for what is happening so I’ll just call the tokens which report indent changes “IN” and “OUT”.

Using these two tokens, and ignoring the non-indenting line breaks for now, the above structure can be tokenized as:

A B C IN D E F OUT G H

This is very simple as there is only one indented block. A more complex example is:

A B C
  D E
   F G
 H

This would be tokenized as:

A B C IN D E IN F G OUT OUT IN H OUT

Note the sequence “OUT OUT IN” – we close two indented sections and open another one all at one line break.  This is because “H” has a lesser indent than “D”, but a greater indent than “A”.

This tokenization captures the structure very effectively and is what I will be using.  Unsurprisingly, these are exactly the tokens that can be generated by the scanner I previously described, though the token names have been changed.
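
For reference, the usual indent-stack technique produces exactly this stream.  Here is a minimal sketch of it (mine, not the scanner from the earlier post):

#include <stdio.h>

/* One indent level per entry; level 0 is pre-seeded so an unindented
 * line emits nothing.  At end of input, one further OUT would be
 * emitted for every level still stacked. */
static int levels[64] = { 0 };
static int depth = 1;

/* Called with the indent column of each new line. */
static void emit_indent_tokens(int col)
{
    while (depth > 1 && col < levels[depth - 1]) {
        depth--;
        printf("OUT ");
    }
    if (col > levels[depth - 1]) {
        levels[depth++] = col;
        printf("IN ");
    }
}

Feeding it the columns 0, 2, 3 and 1 from the example above yields the “OUT OUT IN” run before “H”.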

What to do with indents?

Now that we have these IN and OUT tokens we need to understand what to do with them.  Their significant value is that they are very strong syntactic markers.  IN and OUT are strongly paired: it is very obvious which OUT matches which IN.  This contrasts with bracketing markers like “{ }” or BEGIN and END or DO and OD.  The latter can be accidentally deleted or forgotten.  If OUT is forgotten, that omission will be obvious on every line.

I’ve recently been writing C code using the “text” mode of my editor (emacs) rather than the C mode.  This is because I’ve been experimenting with literate programming and writing the code inside `markdown` markup.

I have really noticed the lack of help with auto-indenting.  Each line is auto-indented to the same level as the previous line, which is certainly useful.  But typing `}` doesn’t automatically drop back one level of indent.  So sometimes I get it wrong.

When I do get it wrong, the compiler complains (as it should) but it is usually a cryptic complaint that is way too late.  I would much rather it know there is a problem as soon as the indenting looks wrong.  So error detection will be the first focus for exploring the use of indents.

To discuss how this works you really need a working knowledge of grammars and parsing.  If you don’t have that yet, let me suggest my earlier note on LR grammars.  If you don’t have time for that now, then press on anyway, it shouldn’t be too hard.

Unlike other tokens used in an LR grammar, IN and OUT cannot simply appear in the grammar wherever they are expected.  This is because there are really too many places that an IN can appear.  An OUT must be at the end of something, but the matching IN can often be anywhere in the middle.  So the parser will need to have direct knowledge of IN and OUT tokens.  It will not “SHIFT” them onto the stack as it does with other tokens.  It has to handle them directly.

As an OUT enforces the end of something the natural thing to do when an OUT token is seen in the look-ahead is to perform REDUCE actions until the OUT is balanced by some IN that was previously seen.  This is clearly a very imprecise rule.  We need to explore some examples before we can make the rule suitably precise.

We can start looking at examples supposing a grammar containing:

Statementlist -> Statementlist Statement
Statement -> if Condition : Statementlist
           | print String

A program fragment like:

if a == b :
   print "Hello"
   print "World"
print "End"

would then result in the following LR stack just as the OUT token appears in the look ahead:

if Condition : Statementlist print String   [OUT]

The IN appeared just before the Statementlist and the only possible reduction is to reduce “Statement -> print String”, so we would do that, which results in

if Condition : Statementlist Statement    [OUT]

Again we can reduce and still stay after the indent so now we reduce “Statementlist -> Statementlist Statement” and get to:

if Condition : Statementlist   [OUT]

Now the symbol on the top of the stack is where the indent started, so we need to stop reducing.  It might seem appropriate to reduce this fully to a “Statement” but we cannot do that based on just the indent.  Consider a different language with a fragment like:

if ( a == b ) {
    print "Hello";
    print "World";

In this case it is clear that the OUT itself is not enough to close the whole statement.  Resolving this will have to wait until we explore how line-structure works.

So the approach for handling IN and OUT seems to be that when we see an IN, we mark the next symbol SHIFTed as being indented.  Then when we see OUT we stop SHIFTing and REDUCE until the symbol on the top of the stack is marked as indented.  We can then clear that marking, discard the OUT, and continue.

This is close to a working solution but there are a few complications that need to be resolved first.  One relates directly to the fact that “Statementlist” is recursive.

Statementlist -> Statementlist Statement

This means that reducing until we just have “Statementlist” on the top of the stack doesn’t stop more things being added to that Statementlist.  The idea was that the OUT should completely close the Statementlist but because of its recursive nature that isn’t really possible.

This is easily fixed by introducing a second “dummy” nonterminal.

Statementlist -> _Statementlist
_Statementlist -> _Statementlist Statement

This set of productions will parse exactly the same language but will introduce an extra state.  If we see an OUT token we can reduce back to a “_Statementlist” as we did before, and then go one more step and reduce to the new “Statementlist”.  Once there we really have closed the list.  Another statement cannot extend it like it could in the simpler grammar.

These extra “dummy” non-terminals could easily be added by the parser generator itself.  However they solve one problem by introducing another.  Previously we could simply REDUCE until the top of stack carries an IN, and then cancel the IN.  Now we might need to REDUCE one more step – the “Statementlist -> _Statementlist” production above.  How can we know when to stop?

The answer to this requires that we store one more bit (literally) of information in the state stack.  We currently have a count of the number of indents which appear “within” (the full expansion of) the symbol in the stack entry.  By that we mean the number of terminals which were reduced into this symbol and were immediately preceded by an IN. To that we must add a flag to record whether the first terminal was preceded by an IN.  i.e. is the whole symbol indented or just part of it.

Now the rule will be that if we have an OUT token in the lookahead buffer and the top-of-stack token contains an IN that was not at the start then we can simply cancel the IN and discard the OUT.  Further, if the top-of-stack contains only one IN and it was at the start then we need to look at the production that we can reduce.  If that production contains more than one symbol in its body then again we can cancel the IN and discard the OUT.  In this case, that IN would be not at the start of the symbol we would reduce to, so it is somewhat similar to the first case.

UPDATE: The reasoning in the above paragraph is flawed.  It talks about looking at “the production that we can reduce”.  That isn’t well defined (there may not be anything to reduce at some states) and it isn’t really the right thing to be talking about.

The important production is the one that will ultimately be reduced to consume the symbol on the top of the stack.  If that production starts earlier than the top-of-stack symbol, then it is perfectly OK to cancel the IN with the OUT in the look-ahead, as the top-of-stack symbol is exactly the text that was indented.  If the production consists of precisely the top-of-stack symbol (i.e. the body of the production only has one symbol), then now is clearly the time to reduce that production, and we need to ask this question again.  If the production which will ultimately consume the top-of-stack symbol starts with that symbol but needs other symbols too, then we have a problem.  Those other symbols will be at a lower indent level than the start of the top-of-stack symbol, and that isn’t allowed.  In that case we need to raise an error and find a way to reduce that production early.

A serious problem with the above is that we don’t necessarily know which production that ultimate one will be.  LR shift/reduce parsing doesn’t identify a production until it is time to reduce it.  So we need a different approach.  If we ensure that all recursive non-terminals have a dummy parent non-terminal as described above, that approach is fairly easy, but does require extra information in each state.  We need to know which states have any item with ‘DOT’ after the first symbol.  If we find ourselves in such a state and it cannot reduce that first symbol, then we know that there is a real possibility of the indent being interpreted wrongly, and we need to raise an error.


Another way to look at this is to say that when we see an OUT token, we reduce until the next reduction would contain an internal (i.e. not at the start) IN.  Once we find this internal IN we cancel it and discard the OUT.

Indenting affects error handling in two ways.  Firstly we need to know what happens if an OUT implies a reduction is needed, but the current state does not allow a reduction.  Secondly we need to know how to handle IN and OUT tokens if they are found while discarding input tokens during normal error recovery.

In the first case, the fact that we want to reduce suggests that we must have part of the right hand side of a production but not all of it.  It would be nice to keep what we have, so to complete it we need to synthesize the extra tokens needed to make a complete production.  If there is no reducible production at this state then there must be symbols that can be shifted.  So we can synthesise one of these – the one that gets us as close to the end of a production as possible.  This may result in an ‘unknown variable’ or similar which would need to be carefully suppressed, but would allow any valid units within the production so far to be properly analysed.

UPDATE: Synthesizing a symbol might sound easy, but it isn’t.  As well as the symbol, we need to synthesize the data (typically part of an Abstract Syntax Tree) that needs to accompany it.  Synthesizing that automatically is not possible.  A possible approach is to use productions like “SomeSymbol -> ERROR” to synthesize symbols.  The reduction code with that production can synthesize the necessary AST data.  This may introduce other complications.  END-UPDATE.

For the second case we can simply count the INs and matching OUTs until we find a SHIFTable symbol or find more OUTs than INs.  If we find a SHIFTable symbol, we set the IN count for that symbol to the net number of INs.  If we find an extra OUT it means that we must have come to the end of whatever high level non-terminal was in error.  Then we just leave the ERROR token on the stack without shifting anything else and move on to normal handling for an OUT token.  This will either reduce whatever is on the stack, or will synthesize some token to allow parsing to continue.

It should be noted here that I haven’t experimented a lot with this error handling process.  It seems to make sense, but often things which seem to make sense actually don’t.  So this might need to be revised.

This then is the complete description for how to handle IN and OUT tokens.  In summary:

  1. Each frame in the LR state stack must contain a count of the number of IN tokens it contains, and a flag to record if one of these is at the start.
  2. When we see an IN token we record the fact and discard it.  When we SHIFT in a new terminal we check if we just saw an IN.  If so, the new entry on the stack records that one IN is present, and that it is at the start.
  3. When we reduce a series of symbols to a single non-terminal we add up the number of included IN tokens in each and use that as the count for the new token.  The “one at the start” flag from the first symbol is inherited by the new symbol.
  4. When we see an OUT token we check if the top-of-stack contains a “not at the start” IN token, or if it contains an “at the start” IN but the next reduction would make it “not at the start”.  If either of these are true we decrement the IN counter and discard the OUT.  Otherwise we REDUCE a production if we can, and synthesize some suitable symbol to SHIFT if we cannot.
  5. When we discard stack states during error handling, we count the number of indents and add them to the “ERROR” state that we ultimately SHIFT.
  6. When we discard look-ahead tokens during error handling we keep count of any IN and OUT tokens.  If we see an OUT token when the count is zero, we assume a sane state has been found and return to where step 4 above was handled.  Otherwise the IN count of the token which ultimately resolves the error is set to the balance of INs and OUTs.

The net result of this is that an OUT always closes anything that was started at or after the matching IN, but nothing that was started before the matching IN.

This adjunct to LR parsing can be used in two different but related ways.  Firstly as already noted it can improve error reporting.  If a ‘closing’ token like “}” is missing then that will be noticed immediately.  The syntactic unit that it is meant to close (e.g. Statementlist) will already have been reduced by an OUT and if the “}” doesn’t follow a regular LR parsing error will occur.

Secondly, and more interestingly for language design, an ambiguous grammar can be cleanly resolved by treating indentation as meaningful.

The classic “dangling else” case can be easily resolved by the indentation of the ‘else’.

if (COND1)
    if (COND2)
        print "Hello"
else
    print "World"

is no longer ambiguous.  The “else” clearly relates to the first “if”, not the second.

Ocean will make use of this to have a syntax for “blocks” of statements which does not require a closing token.  They will be closed by a reduction in indenting.  As explicit closing can sometimes be preferred, Ocean will also allow C-style blocks.  Possibly that should be “Go-style” blocks, as a block must not be a single statement.

The syntax I’m experimenting with at the moment includes:

Block -> : Statementlist
       | { Statementlist }
Statement -> if Expression Block
           | if Expression Block ElsePart
           | ...
ElsePart -> else Block
          | else if Expression Block
          | else if Expression Block ElsePart

This allows structured statements that look like Python or like Go, and will use indentation to resolve ambiguities in the Python style, and report errors in the Go style.

My intention is that several other structures will allow either “{}” or “:..” formats.  “struct” type declarations is an obvious one.  Others might occur as design proceeds.  However before we can proceed with those details we need to finish the two-dimensional parsing which is only half done.  This note dealt with indents; the next note deals with end-of-line markers.

Meanwhile code that manages indents as described here can be found in `git://neil.brown.name/ocean` or this version of the parser generator.

Posted in Language Design