Literate programming?

I was going to start out by describing the lexical structure of Ocean, but as I thought more about it, I realised there was something that needed to come first.  And that something is literate programming.

Literate programming is the idea – popularised by Donald Knuth – of writing programs like pieces of literature.  They can be structured as a story or an exposition, and presented in a way that makes sense to the human rather than just to a computer.

The original implementation – known as “web” with two tools, weave and tangle – took a source document and converted either to TeX for elegant printing, or to Pascal for execution.  So the “web” input was not read directly by the compiler.

I first came across the idea of the language being prepared for literate programming with the language “Miranda”.  A Miranda program can come in a form where most text is treated as comments, and only suitably marked pieces of text (with a “>” at the start of the line) is treated as code.  I like this idea – the language should acknowledge the possibility of literate programming from the start.  I want Ocean to acknowledge it too.

Obviously a language cannot enforce a literate style, and I don’t think I would want it to try.  I am a big believer in “write one to throw away”.  The first time you write code, you probably don’t know what you are doing.  You should consider the first attempt to be a prototype or first draft, and be ready to throw it away.  There probably isn’t a lot of point trying to weave a prototype into a story.

But once you have working code and have got the algorithms and data structures sorted out in your mind, I think there is a lot of value in re-writing from scratch and presenting the whole more clearly – as a piece of literature.  So I see two sorts of programs: prototypes that just have occasional comments, and presentation code which is written in full literate style.  I find that having an audience to write to – even an imaginary one – motivates me to write coherently and close the gaps.  Presenting code as a piece of literature may equally motivate good quality.

Having two different forms at first suggested to me the idea of two different file suffixes to maintain a clear different, but I no longer lean that way. It might be nice to allow two different suffixes so a programmer could create “hello.ocd” – the “OCrean Draft” first, and then create “hello.ocn” the final program as a separate file, but having the compiler impose different rules on the two files would likely make programmers grumpy, or at least confused.  So as long as there is a simple syntax marker at the top of the file which effectively says “this is all code”, that should be all that the compiler needs to care about.

The  key elements of literate programming are to have a language for presenting the text, a way to extract the code from among that text, and some sort of cross-reference mechanism for the code.

This last is important because presenting code as a story doesn’t always follow a simple linear sequence – it is good to be able to jump around a bit.  We might want to tell the overall story of a function, then come back and explain some detail.  The extra detailed code should appear with the extra detailed explanation.  It may be possible to use language features to do this, but at this stage of the design I am far from certain.  So I want the literate-programming model to allow rearrangement of code through cross references.

While TeX has many advantages for presenting text and produced beautiful output, I do find it a bit clumsy to use.  It is maybe too powerful and a little bit arcane.  A popular language for simple markup currently is “markdown”.  The input is quite easy to read as plain text, and there are plenty of tools to convert it into HTML which, while not particularly elegant, is widely supported and can be tuned with style sheets.

“markdown” has the added advantage of have a concept of “code blocks” already built in so it should be easy to extract code.  Using “markdown” for literate programming is not at all a new idea.  A quick search of the Internet finds quite a few projects for extracting executable code from a markdown document.  Several of them allow for cross references and non-linear code extraction, but unfortunately there is nothing that looks much like a standard.

This lack of a standard seems to be “par for the course” for markdown.  There is some base common functionality, but each implementation does things a bit differently.  One of these things is how code sections are recognised.

The common core is that a new paragraph (after a blank line)  which is indented 4 or more spaces is code.  However if the previous paragraph was part of a list, then 4-spaces means a continuation of that list and 8 spaces are needed for code.   In the Perl implementation of markdown, lists can nest so in a list-with-a-list, 12 spaces are needed for code.

Then there are other implementations like “github markdown” where a paragraph starting with ``` is code, and it continues to the next ```.  And pandoc is similar, but code paragraphs start with 3 or more ~~~ and end with at least as many again.

On the whole it is a mess.  The github and pandoc markings are probably unique enough that accepting them won’t conflict with other standards.  However while the Perl markdown recognises nested lists, the python markdown doesn’t.  So code extracted from markdown written for one could be different from markdown written for the other.

But if I want Ocean language tools to be able to work with markdown literate programs (and I do) then I need to make some determination on how to recognise code.  So:

  • A paragraph starting ``` or ~~~ starts code, which continues to a matching line
  • An indented paragraph after a list paragraph is also a list paragraph.
  • An indented (4 or more spaces) paragraph after a non-list paragraph is code

This means that code inside lists will be ignored.  I think this is safest – for now at least.  If experience suggests this is a problem, I can always change it.

This just leaves the need for cross-references.  Of the examples I’ve looked at, the one I like that best is to use section headings as keys.  markdown can have arbitrarily deep section headings so a level 6 or 7 section heading could cheaply be placed on each piece of code if needed.  So I’ll decide that all code blocks in a section are concatenated and labelled for that section.

While cross references are very valuable, it can sometimes be easier to allow simple linear concatenation.  An obvious example is function declarations.  Each might go in a separate section, but then having some block that explicitly lists all of those sections would be boring.  If anything, such a “table of contents” block should be automatically generated, not required as input.  So it will be useful if two sections with the same, or similar, names cause the code in those two (or more) sections to be simply concatenated.

It is unclear at this stage whether it should be possible to simply reuse and section heading, or whether some sort of marker such as (cont.) should be required.  The later seems elegant but might be unnecessary.  So in the first instance at least we will leave such a requirement out.  Section names for code can be freely reused.

Inside a code block we need a way to reference other code blocks.  My current thinking is that giving the section name preceded by two hash characters (##)  would be reasonable.  They are mnemonic of the section heading they point to, and are unlikely to be required at the start of a line in real code.

These references are clearly not new section headings themselves as they will be indented or otherwise clearly in a code block, where as section headings are not indented and definitely not in code blocks.

So this is what literate Ocean code will look like: markdown with code blocks which can contain references to other code blocks by giving the section name preceded by a pair  of hashes.

A non-literate Ocean file is simply one which starts with ``` or ~~~ and never has a recurrence of that string.

To make this full concrete, and as a first step towards working code in which to experiment with Ocean, I have implemented (yet another) program to extract code from markdown.  Following my own advice I wrote a prototype first (in C of course, because I can’t write things in Ocean yet) and then re-wrote it as a literate program.

I found that the process more than met my expectations.  Re-writing motivated me to clean up various loose ends and to create more meaningful data structures and in general produced a much better result than the first prototype.  Forcing myself to start again from scratch made it easier to discard things that were only good in exchange for things that were better.

So it seems likely that all the code I publish for this experiment with be in literate style using markdown.  It will require my “md2c” program to compile, until Ocean is actually usable (if ever) in which case it will just need to Ocean interpreter/compiler which will read the markdown code directly.

I find that my key design criteria of enhancing expression and reducing errors certainly argue in favour of the literate style, as the literate version of my first program is both easier to understand and more correct than the original.

This program can be read here or downloaded from my “Ocean” git repository at git://ocean-lang.org/ocean/.

This entry was posted in Language Design. Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *


+ five = 14


*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>