Local variables and scope

While I was writing test code for my first toy language I particularly noticed the lack of interesting data structures:  to write interesting loops you can do a lot more if you can build data structures while doing it.  So my second toy will begin to introduce data types. It won’t be until a later iteration before we get structures though.

The first step to this is knowing  how to declare variables, or local bindings.  Then I’ll have something to attach types of data structures too.  So this installment is about local variables and their scope.

I started writing this 3 years ago and am only publishing it now.  Sometimes life works like that.  I had to wait until the code worked and I managed to lose interest for a while and get distracted by other things.  At the recent linux.conf.au I went to a talk about the language “Pony“.  While I’m not thrilled with pony (I don’t currently think that expressions and statements are interchangeable), the talk inspired me to get back to ocean…  So what can we say about scopes?

Why we need scope

Several languages don’t bother too much with the “scope” of variables.  Such languages usually have just two scopes – local to a function, and global.  In Python and TCL, variable assignments only create a local binding unless the name is declared to be global, in others such as the “bash” shell, all assignments have global scope unless explicitly declared to be local.  In these language, accessing a binding will prefer a local binding if there is one, and otherwise will make use of a global binding.

While this makes it easy to write code (no need to pre-declare everything), it also makes it easy to write bugs and harder to read the code.  Typos in variable names can be hard to detect, and the intent is not always clear.

Some of the problems can be reduced by static analysis which requires that a variable always be set before it is used, and possibly always be used after it is set. This helps reduce the bugs that are written, but it doesn’t make the code much easier to read.

Having clearly defined, low overhead scoping rules should make it easy to write code, easy to read code, and harder to write bugs.  That is my goal.

Variable declarations

The variable declaration style of C has a certain elegance, but it does not scale well.

int anumber;
char *string;

is quite readable and easy to parse.  But once you introduce typedefs, the type name (“int” or “char” here) becomes just another identifier so you need to know whether an identifier refers to a type or something else to be able to parse correctly.

Being able  to declare a list of integer variables as:

int first, second;

is nice enough.  For more complex types it becomes much less elegant.  A list of pointers to functions returning char pointers  is

char *(*fun1)(), *(*fun2)(), *(*fun3)();

So C isn’t a good model to follow.  The other model that is familiar to me comes from Pascal

name : type-expression

In Pascal, these declarations must come in a special “var” section, but as C has shown, that isn’t really necessary.  For Ocean, I will combine this with assignment to create a simple statement that can declare bindings and give them an initial value.

name : type = value
name :: type = value

The first form declares the name to have the given type, and binds it to “value“.  The second form creates an immutable binding: the name cannot be assigned a new value or otherwise modified.

In each case the “type” can be left out in which case it will be deduced from context, possibly just from the given value, possibly also from how it is used.

found := True
pi ::= 3.14159
radius := 12

found” is clearly Boolean, “pi” is probably float (or double), but “radius” could be an integer or floating point or even complex, depending on what else is done with it.

It is probable that I need to allow the value to be absent too, so “foo : myrecord” might be a legal declaration, as providing a complete initial value may not be practical. For the moment I will assume a value is always required but will revisit that assumption when structs/records and arrays are introduced.

Here the “:=” or “: type =” should be enough to help the reader (and the compiler) know that this is a new variable  (or binding).  If there is no such declaration near by, the name must be global (assuming we permit global names).  If there is, then the name is local.

Typos in name usage will easily be caught because there will (most likely) be no declaration for that name.  Typos in name declarations will likely be caught because a name will be used without having been defined.  There are more subtleties that need to be considered though, firstly the issue of “hole in scope”.

No hole-in-scope

A “hole-in-scope” happens when a binding is declared for a name that is already bound.  As long as the new binding is active, the original binding is hidden and so there is a hole in the scope of that outer binding.  This is quite permissible in C, where a local declaration can use the same name as the global declaration, or a local declaration in a containing statement block. So the following is perfectly legal.

int foo;
func() {
   char *foo;
   if (foo) {
       float foo = 1.0;
   }
}

Such behaviour is, I believe, unsafe.  It makes it too easy to use one binding when you meant to use another.  If you are very familiar with the code such mistakes are unlikely to happen, but if the code was written by someone else, or by yourself some time ago, then mistakes are easy.

Disallowing holes-in-scope sounds like a good idea, but it has an important implication.  It means that the set of names in the global context is never allowed to change.  If it does, a working program might break.  This would happen if the previously working program bound a name which was subsequently introduced into the global scope.  The internal binding would create a hole in the scope of the global name, which cannot be allowed.

This issue extends beyond fixed global names – which should be kept to a minimum anyway.  If we had a python-like mechanism for importing modules, such as

from MyModule import *

then it would become impossible  to add any exported  names to “MyModule”, as they would enter the global scope and could  cause conflicts.  This could be fixed by requiring explicit listing of names:

from MyModule import this, that, the_other

but that is clumsy and  verbose – it doesn’t make life easy for the programmer.

Instead, each module will have some mechanism for identifying a group of symbols to be exported.  These might reasonably reflect different versions of the module interface.  Using this, the import statement might become

from MyModule import v1.1

Which would be a well defined set of names that would never change.  If the module needed new interfaces, they would be included in other versions, but not this one.  When the client program wanted to use those interfaces it could be changed to import “v1.2″, and at the same time any binding that now caused a conflict could be resolved.

It is too early in the design process to finalise the module import mechanism, so this is just indicative.  What it indicates is that avoiding a hole-in-scope seems possible and also advisable.  So that is what my language will do.

What, then, is the scope

Removing holes from the scope of bindings means we only need to determine the start and the end of a scope to provide a complete determination.  The start should be fairly straight forward – the scope of a binding starts exactly where that variable is bound.

This isn’t quite as straight forward as it sounds, as is shown by examples like:

name := name + 42

Is the use of “name” on the right-hand-side a use of the new “name” – which would have to be illegal – or of some old “name” – which would at best be confusing.

I think it is safest to rule this out.  It is tempting to suggest that the scope of a binding extends backwards to the start of the containing block, thus preventing the name from having any other meaning in earlier in the block.  We will almost require this, but not quite as will be seen.

The other end of the scope presents a different challenge.  The most obvious and common answer is that the scope ends at the end of the statement-block which contains the start of the scope, so in

if Expression :
   name := whatever
   statements
more-statements

the scope of “name” would extend through “statements” but would have ended by “more-statements”.  This seems natural but sometimes it isn’t what I want.

if Expression :
   name := something
   other statements
else:
   name := other-thing

might be a useful construct which declares “name” as one of two different values.  I might then want to use “name” later in the code.  Using the common rules described above we would need to declare the binding for name before the “if” and not give it a value there.

name:
if Expression :
   name = something
else:
   name = other-thing

I find that I don’t really like declaring a binding without actually binding anything to the name.  There may well be times when this is best and I’m not sure I would disallow it completely.  But I like to make it as optional as possible.  In the example with two bindings, a subsequent usage of “name” would imply no ambiguity and so should be allowed.

I am as yet unsure about a construct like:

if Expression:
  name := value
some statements
if Expression:
  Some statement using "name"

If the two Expressions are identical and depend only on values that haven’t changed, then “name” in the second statement block can reasonably be the same name that was bound in the first.

I am thinking that the scope of a binding extends:

  • At least to the end of the block which declares the binding. This ensures that a name cannot be bound twice in the same block.
  • At most to the end of the function which declares the binding
  • Conditionally to the remainder of  a block which encloses a block that ends with the name still in scope. The inner block must be guaranteed to declare the name precisely once on every path through the block.  The outer block must not declare the name at all, and may not use the name before the inner block.

This conditional scope extension seems to capture what I want, but feels a little bit dangerous – having any uncertainty over semantics is best avoided.  I like it enough that I think it is worth experimenting with.  If it turns out to be clumsy, something else will have to be found.

One consequence of this rule is that a name that is bound inside a looping construct (particularly in a “do” block) can never be used outside that loop.  In part, this is what the “for” part of a loop is for – it is where you declare bindings that are used in or created by a loop.

Another consequence is that if a name is bound in one “case” clause, but not all, then it must remain local to that “case” clause.

This approach to scope seem to match what I want, but there are some thorny issues related to the type of the binding, particular when the type is inferred from usage.

If a name is declared in both branches of an ‘if/else’, do we allow the two declarations to be of different types?  If we don’t it might seem to be overly restrictive – surely each declaration of a name should be independent.  If we do, then the name clearly cannot be used after the ‘if/else’ as it would not have a well-defined type.

This becomes more interesting if the usage in each branch does not completely define the type.  Maybe in one branch all operations are consistent with an integer, while in the other some operations imply a floating point number.  If the name is used after the if/else then it must be the same type everywhere and  so much be floating point.  If not, it could remain as an integer in one branch, which may result in subtly different behaviour.  This ambiguity is unlikely to be a good thing.

I think the best way to resolve the ambiguity is to determine that the type of a binding must be fully determined when the block containing the declaration completes.  If two branches complete with a particular name bound with two different types, then the binding does not extend beyond the branches, and a subsequent usage is illegal.

When the type system becomes rich enough that strict type/subtype relationship become meaningful, the type beyond the branches may be the super-type of the types within the branches, but for now it must be the same or nothing.

Implementing these scopes.

While parsing, we need to know which names are currently in scope and we need to track when two or more scopes for a name from different branches can be merged.

Each time a name declaration is found a new “variable” instance is created and is possibly linked to the previous instance if there is some chance that they can be merged. Each instance tracks the minimum nesting level at which it is valid.  This starts off at the nesting level where it is declared, but can be reduced when scopes in different branches are merged.

When a variable declaration is found, the current instance with that name is located.  If its nesting level is at most the current nesting level there we have a clearly illegal redeclaration.  If it is greater then any merge of the current instance is disabled and a new instance is created at this level.  Similarly if no current instance is found a new instance is created.

When a variable name is used and the current instance has a nesting level at least the current level, that instance is used.  If there is no instance we clearly have an error.  If the current instance has a greater nesting level and can be merged, the merge is finalized and the nesting level of the combined instance is set to the current level.

Each instance for a given name is linked to the previous instance which has been merged, and also the previous which hasn’t.  When a branch statement completes the list of unmerged instances is examined to see if they might be merged.

What made it to “Stoney Creek”

Many of the above ideas were written before I had finished and implementation.  Now (February 2018) I have my “Stoney Creek” version of the pre-language and it doesn’t quite match – though mostly it does.

The “name : type = value” option has not yet been implemented, only “name := value” and “name ::= value”, so types are still implicit.  Until I have interesting types, that doesn’t lose much.  The merging after an if/else of switch/case happens precisely when a name what was declared in multiple parallel branches (e.g.  then and else) is used afterwards.  The type is only considered afterwards and if a consistent type cannot be resolved, an error occurs.  This means that a given name can easily have different types in different branches if it is no subsequently used.  I don’t currently see that as a problem, but I might change my mind.

I’ve also introduced an undeclared type – the label.  If a name is used without being declared, then it is assumed to be a label with an arbitrary, unique, constant value.  This is only permitted in a “use” statement and in a “case” expression.  I don’t currently check that every label “used” has a “case”, or vice-versa, but that is a natural step.  I think I like this, but it is early days yet.

The other key part of the implementation is error messages.  A large part of the value of scopes is that it enables common errors to be detected, so of course I don’t realize the value without error messages.  Unlike the Falls Creek language, Stoney Creek produces error messages about inconsistent types, about multiply declared variables, and even about syntax errors, though that last is very limit as yet.

My next steps are to improve the syntax errors a little, then start adding complex types.  I have lots of ideas for types.  It will important to introduce them gradually, else it will be another 3 years before I post anything.

As before, the interpreter is in my ocean git tree (git://ocean-lang.org/ocean), and can be read as literate source code.

This entry was posted in Language Design. Bookmark the permalink.