Lexical Structure

It is traditional to divide the parsing of a program into two conceptually separate stages – the lexical analysis which extracts a stream of tokens from a stream of characters, and the grammar matching stage which gathers those tokens into grammatical units such a declarations, statements, and expressions. Recognising the value of tradition, Ocean will similarly have separate lexical and grammatical stages. This note will focus on the lexical stage.

Future language design decisions will refine may details of the lexical structure of the language, however there is a lot of commonality among current languages and we can build on that commonality to establish a base approach to lexical analysis which can then be fine tuned when the details of the language are known.

This commonality suggests that the grammars of programming languages are comprised of identifiers, reserved words, special symbols, numeric literals, and string literals. Comments need to be recognised but don’t form tokens for the grammar. White space is mostly treated like comments – recognised but not tokenised. However line breaks and leading indents sometimes are significant in different ways, so we need to decide whether and how they might be significant in Ocean.

Identifiers

Identifiers are mostly strings of letters and digits, though they don’t start with a digit. Sometimes other characters are allowed. This pattern is most easily summarised as a set of characters that can start an identifier and another set of characters that can continue an identifier. Fortunately Unicode provides well defined character classes “ID_Start” and “ID_Continue” which are suitable for exactly this purpose, providing we base the language on Unicode.

While ASCII was the old standard and is still very useful, Unicode does seem to be the way of the future and it would be foolish to ignore that in the design of a new language. So an Ocean program will be written in Unicode, probably in the UTF-8 encoding (though that is an implementation detail) and will use ID_Start and ID_Continue as the basis for identifiers.

Beyond that it may be desirable to add to these character sets. ID_Start does not contain the underscore which many languages do allow. Neither contain the dollar sign which, for example, Javascript allows in identifiers.

At this stage it seems best to decide that the lexical analysis engine should take a list of “start” characters and a list of “continue” characters to be added to ID_Start and ID_Continue, which can then be used to create identifiers. “start” will contain at least the underscore. The fate of ‘$’ and any other character will have to await future language decisions.

Reserved Words

Reserved words are a specific subset of the class of words that can be used as identifiers. They are reserved for grammatical use, such as “if” and “while” and “else”. We cannot know at this stage what the list will be, just that there will be a list which will need to be provided to the lexical tokenizer. These words must only contain valid identifier characters.

Special Symbols

These are like reserved words, but they are made up from characters other than those used in identifiers (or those which mark numbers and strings which come later). One could imagine having tokens which combine identifer-characters and non-identifier characters, such as “+A”, but that seems like a bad idea. It is more natural to see that as two tokens “+” and “A”, and to let the grammar join them together as needed.

There is an important difference between reserved words and special symbols. Words that aren’t reserved can still be valid identifiers, while symbols that aren’t special are probably illegal. So if “while” is a reserved word but “whilenot” isn’t, then the later is simply an unrelated identifier. However if “=” is a special symbol and “=/” isn’t, then the later is not valid as a single token at all, and we must parse it as two tokens “=” and “/”.

So for identifiers and reserved words, we take the longest string of characters that is valid for an identifier and then decide if it is actually a reserved word. For special symbols we take the longest string of characters which is a valid symbol and stop the current token as soon as we see a character which would not result in a valid token. Thus any prefix of a special symbol must also be a special symbol. This should not be a hardship as longer symbols are only chosen when the length is needed to distinguish from something shorter.

So together with the list of reserved words, the lexical engine will need a list of special symbols. These will probably be ASCII rather than full Unicode. While permitting Unicode is important, requiring it would probably be inconvenient.

Numeric literals

There are two questions we need to resolve for numeric literals: what character strings can be used, and what do they mean. We only need to answer the first here, but it might be useful to dwell a little on the second too.

I really like the idea in the Go language that numeric literals don’t carry strong type information. They are primarily a value and a value can be used where ever the value fits. So “1e9” and “4.0” might be integers, and “34” might be a float. I intend to copy this idea in Ocean and possibly carry it even further so that “34” and “34.0” are completely indistinguishable to the rest of the language (Go can see a difference in a some cases).

Given my principle that any feature available to built-in types should also be available to user-defined types, it follows that if a numeric literal might be useful as an integer or a float, it should also be usable for any other type. An important example is the “complex” type which might benefit from literal constants. To express these fully we need to distinguish the “real” from the “imaginary” components and this is typically done using a trailing ‘i’ or ‘j’.

Generalising this suggests that a numeric constant should allow a short suffix which can be used by user-defined types in arbitrary ways. The exact details of this will have to be left for a later stage of the language design. For now, numeric constants will be allowed 1 or 2 following letters, which as yet have an unspecific meaning.

There is plenty of precedent to follow for the numeric part:

“0x” means that the following digits are is hexadecimal.
“0b” means following digits are binary.
A leading “0” implies octal, unless there is a decimal pointer or exponent.
An exponent is “e” or “E” followed by a decimal number which multiplies the preceding number by ten to the given power.
A “decimal point” can be use to specify a floating point number.

Not all of this precedent is necessarily useful though. In particular the leading “0” meaning “octal” is an irregularity in the design as it is really “a leading 0 followed only by digits”. This usage is also error prone as regular mathematics ignores leading zeros. So Ocean will follow Python 3 and other languages and use “0o” for octal, making a leading zero only legal when followed by an ‘x’, ‘o’, b’, or ‘.’.

Other possibilities don’t have such strong precedents but are still useful. For example, “0x” can also introduce a floating pointer number if a hexadecimal point, or a “pNN” suffix gives a power of 2. There is also the question of whether a decimal point can start a number, or whether a digit is required.

Another possibility to add to the mix is that as long strings of digits can be hard to read it is sometimes permitted to insert underscores to aid readability, much as spaces are often used when writing large numbers in natural-language text. Some languages allow the underscores anywhere, others impose stricter requirements. For example in Ceylon a decimal number can only have them every 3 digits, and in hexadecimal or binary they can be every 2 or ever 4 digits.

Finally, some languages allow negative literals, whether by including a leading “-” in the number token, or by allowing a leading “_” to mean a negative literal.

To meet the goal of being expressive, Ocean should allow as many different ways to encode numbers as could be useful. To meet the goal of minimising errors, it should exclude possibilities that could be confusing and don’t actually add expressivity.

So Ocean will allow “0x” hexadecimal numbers, “0o” octal numbers, and “0b” binary numbers, with or without a “decimal” point and a “p” exponent, and decimal numbers with or without a decimal point and an “e” exponent. A decimal can only start with a zero if that is immediately followed by a decimal point and in that case the zero is required: a number cannot start with a point. The “p” or “e” exponent marker can also be “P” or “E”, and the number following it is in decimal, with no leading zero, but possibly with a leading + or -.

As this is an experiment as much as a language I will try allowing the use of a comma for the decimal marker as well as a period. This means that commas used in lists might need to be followed by a space. Requiring that is a bit ugly, but actually doing it is always a good idea. If this turns out to be problematic it can be changed.

All of these can have single underscores appearing between any two digits (including hexadecimal digits). As another experiment, spaces will be allowed to separate digits. Only a single space (as with underscores) and only between two digits. Though this might seem radical it is less of a problem than commas as numbers are almost always followed by punctuation in a grammar.

All numeric literals are non-negative – the grammar will allow a prefix negation operator which will be evaluated at compile time.

This numeric part will be evaluated precisely such as can be achieved with the GNU “libgmp” bignum library. Conversion to machine format will only happen at the last possible point in language processing.

String Literals

String literal are simply arbitrary sequences of characters surrounded by some quote character. Or at least, almost arbitrary.

It can improve language safety to not allow literal newline characters inside string literals. Thus a missing close quote is easily recognised. When multi-line string constants are required, one effective approach is the triple-quote used by Python and other languages. This is common and well understood, so Ocean will borrow the idea.

Inside a string, escapes may or may not be permitted. These are typically a back slash (“\”) followed by one or more characters. The minimum requires would be “\n” for a newline, “\t” for a tab, and “\ooo” with octal digits for an arbitrary ASCII character. We could go to extremes and allow “\U{Name of Unicode Character}” which allows the use of long Unicode character names.

This is an area where extension might be useful so it is important that any construct that might be useful in the future should be illegal at first. So all escapes not explicitly permitted are in error.

One particular feature of strings that I personally have never liked is that to include a quote inside a quoted string, you escape the quote. This means that the simple rule “characters between quotes make a string” doesn’t apply and any parsing engine must recognise the escape character. As this is my language I will chose that quotes cannot be escaped. Instead Ocean will allow “\q” which will expand to whichever quote character starts and ends the string.

The escape sequences understood will initially be:

“\\” – the “backslash” escape character.
“\n” – newline
“\r” – return
“\t” – tab
“\b” – backspace
“\q” – quote
“\f” – formfeed
“\v” – vertical tab
“\a” – alert, bell.
“\NNN” – precisely 3 octal digits for an ASCII character (0-377)
“\xNN” – precisely 2 hexadecimal digits
“\uXXXX” – precisely 4 hexadecimal digits for a Unicode codepoint.
“\UXXXXXXXX”- precisely 8 hexadecimal digits for a Unicode codepoint.

There are three quote characters that are sometimes used: single quote, double quote, and back quote. Ocean will almost certainly allow all three. The lexical analysis code will certainly recognise any of the three that haven’t been claimed as a special symbol. Though I’m leaving this open at present, I expect that strings surrounded by backquotes will not have any escape processing done, while the other strings will.

If two quote symbols appear together, that is simply an empty string. If three appear, then it is a multi-line string. The next character must be a newline and the closing triple quote must appear on a line by itself. The newline after the first triple quote is not included in the string, the newline before the second triple quote is.

Quoting is permitted inside multi-line strings in the same way as with single-line strings, so triple-backquote should be used to avoid all escape interpretation. In a multi-line string that allow escapes, an escape character at the end of the line hides that newline character. This is the only way to achieve a multiline string that does not end with a newline character.

With multi-line strings comes the question of indenting. It would be nice to be able to indent the whole string to line up with surrounding code, and to have the indent stripped when the parsed token is turned in to a string constant. But the key question is : how exactly to specify what to strip.

One option would be to strip the same indent as was on the line that contains the start of the token. However it is nice for values to be more indented than – say – the assignment that uses them.

Another option would be to strip the same indent as was on the line that contains the end of the token. This places the programmer in more control and probably makes lots of sense. It would only be safe to strip the exact sequence of characters that precede the close quote. So if there is a mix of tabs and spaces, it must be the same mix for every line.

I’m not 100% comfortable with this, but it is certainly worth a try, so it will be how the first attempt works.

Comments

There are three popular styles of comments:

/* Block comments from C and others */
// single-line comments from C++ and BCPL
# single-line comments from C-shell and Bourne shell.

While others exist, these seem to be most popular, and others don’t add any extra functionality. For the block comment, it is important to note that nesting is not permitted. That is: the character pair “/*” may not appear inside a comment. Allowing it does not significantly improve expressivity, and excluding it guards against forgetting or mis-typing the closing sequence “*/”.

It seems best to simply allow all of these forms of comments. The only negative impact this might have is that it means the 3 introductory notations could not be used elsewhere, and of these, the “#” is the only symbol with significant uses in existing languages. Where it is used, it is often for extra-language functions such as macro pre-processing. Features provided by these extra-language notations are probably best included in the main language in some way. So at this stage, the plan for Ocean is to treat all three of these forms as comments.

Line breaks and indents

The use of context-free grammars to describe languages led to line breaks being largely ignore as syntactic elements. Rather a program is seen as a linear sentence in the language, and spacing it just used to improve readability.

However this approach allows the possibility of the visual formatting becoming out-of-sync with the actual structure of the program. Further it results in syntactic elements, particularly semicolons, which are required by the language but appear unnecessary to the human reader.

Various approaches to address this have been tried in different languages. Of these I find that in Python the nicest. It may not be perfect, but recognising both the presence of line breaks and the level of indents seems valuable.

We will explore this issue more deeply when we look at using a grammar to parse the token sequence, but for now we need to know what tokens to include, and what they mean.

The key observation needed at this stage is that when a block of text is indented, it is somehow subordinate to the preceding line that was not indented. Thus that first line doesn’t completely end until all the indented text has been processed.

So every line break will generate a “newline” token, however if subsequent lines are indented, the “newline” token will be delayed until just before the next line which is not indented. Together with this, “Indent” tokens will be generated at the start of each line which is indented more than the previous line, with matching “Undent” tokens when that indent level terminates. In consequence of this, indenting can be used to bracket syntactic objects, and to whatever extent newlines terminate anything, they only do so after any indented continuation is complete.

So for example

if A
   and B {
       X
       Y
       while C do
            Z
}

would parse as

if A INDENT and B { INDENT X NEWLINE Y NEWLINE while C do INDENT Z
NEWLINE UNDENT NEWLINE UNDENT NEWLINE UNDENT NEWLINE } NEWLINE

Note that INDENT and UNDENT come in pairs, and the NEWLINE you would expect before an INDENT comes after the matching UNDENT. This will allow a multiple UNDENT, such as that following the “Z” to terminate a number of syntactic units.

Interaction with literate programming

As previously discussed an Ocean program is, at least potentially, presented as a literate program which may have several code blocks from various parts of the document. The above discussion of lexical analysis has assumed a simple sequence of characters, while in actual fact will have a sequence of blocks of characters. Does that make a difference?

As blocks are always terminated by newlines, and as indenting is carefully preserved across blocks the only issues that could arise would be with lexical elements which are contain newlines, but aren’t newlines themselves. Of these there are two.

Firstly, block comments can contain newlines. Therefore, is it acceptable for a comment that starts in one code block to be continued in another? Almost certainly this would be a mistake and would lead to confusion. It is possible that it might be useful to “comment out” a block of code, however if that is important, then the language should have more focused means to achieve the same without confusion. Our literate programming syntax makes it possible to mark a block as an “Example” to not be included in the code, and a section reference line can easily be commented out with a single-line comment. So comments must not span literate program code blocks.

Secondly, multi-line strings can contain newlines. For much the same reasons as with comments, these strings will not be permitted to span literate program code block. The fact that we strip the indent off the closing quote from all lines makes interaction with literate programming much cleaner than leaving all the indentation in.

Trying it out.

Having lots of ideas without any code isn’t much fun. So to accompany this blog is my first release of a token scanner – written as a literate program of course.

It is somewhat configurable so that if I change my mind about spaces in numbers they are easy to remove.

This program is one of the reasons that this post took so long to arrive. I had written a scanner prior to staring in the post, but wanted to re-write it as a literate program and wanted to incorporate ideas I had gained since writing it. This took quite a while as I’m still new the the literate programming approach and my first draft had a rather confused structure. However I’m really happy with the result and I think the code and design are better for the extra effort put in.

Identifiers

Reserved Words

Special Symbols

Numeric literals

String Literals

Comments

Line breaks and indents

Interaction with literate programming

Trying it out.

Leave a Reply Cancel reply

Recent Posts

Recent Comments

Archives

Categories

Meta

OpenPhoenux