ubik / 02a5277 docs / type-infer.txt

Tree @02a5277 (Download .tar.gz)

type-infer.txt @02a5277raw · history · blame

Compile-time Type Inference ============================================
Haldean Brown                                      First draft: Jun 2016
                                                  Last updated: Jul 2016

Status: Draft


A key part of the Ubik compilation process is type checking and type
inference. Both of these two steps happen in the same AST pass, called
"inference"; despite it's name, it's also responsible for type checking.

Inference happens module-by-module; since all top-level definitions have
types specified on them, no cross-module inference must occur. Ubik
only infers types "forwards", meaning that, at the point of declaration,
the tightest-possible type constraint on a name must be known. In
general, this shouldn't be a hindrance to a Ubik programmer; the biggest
"downside" of this is that top-level declarations must be given a type
by the programmer. The upside of this is a simplified type system that
can give better errors than a full Hindley-Milner-style type inferencer
while still providing most of the convenience that such a type engine

As the inferencer runs over an AST, it builds a set of "deductions" and
a set of "assertions". "Deductions" are things we know about the type of
an object; these are set at the time of the object's instantiation. On
the other hand, "assertions" are things that we hope to be true but
aren't sure of yet. At the end of AST processing, we ensure that every
assertion is satisfied by the deductions we made. For type checking, all
that is required is that the assertions generated are sufficient to
disallow any nonconformant Ubik program. For type inference, we need a
set of deduction rules such that we can propogate types forward

Before we go any further, I'm going to quickly define some sets so that
I can be terser throughout the rest of this:
    - E is the set of all valid Ubik expressions
    - T is the set of all valid Ubik types
    - D is the set of all possible derivations. D is equal to E x T: a
      pair (e, t) means that expression e has type t.
    - A is the set of all possible assertions. A is also equal to
      E x T, but its elements have a different semantic meaning; a pair
      (e, t) means that the type of "e" must be assignable to a type

AST inference is expression-based, and each expression has its own set
of deduction and assertion rules. Essentially what we're doing is making
a pair of functions over source expressions; if D is the set of all
derivations, E is the set of all expressions, and A is the set of all
assertions, we're defining:

    d : E -> 2^D
    a : E -> 2^A

These functions map an expression onto a set of derivations and
assertions, respectively. The set of derivations for an entire program
is the union of the sets of all derivations for each expression in the
program, and likewise with the assertions. Once these sets are built,
we can determine whether the program is valid through constraint
satisfaction (which I'll get to later).

First let's start by defining d and a. Since everything is
expression-based, and each expression works a little differently, let's
break it down by expression type.

Atomic expressions

Atoms can be pretty easy; source literals have the type of whatever
sort of atom they are. Names are a little trickier. Because of that,
let's break this down a little further into literals and names.

Source literals
As mentioned before, these are easy. Depending on the type of atom, a
source literal has what we'll call a "literal type" L. Literal integers
have a literal type of Word, literal strings have type String, literal
decimal numbers have type Number. The derivation rules for literals are

    d(e) : { (e, L) }

The assertion rules are empty; we know these types precisely:

    a(e) : { }

This is trickier, but not by much. The name that we're dealing with must
have been defined somewhere else (if that weren't the case, then we
would have been met with an error in an earlier pass). We go look up
where that name was defined and find out what type it was, and then say
the type of this expression is the same as that one. Thus, if e
has the value of the name n:

    d(e) : { (e, resolve_type(n)) }
    a(e) : { }

Function application

Alright, let's do something cooler than just copying stuff. This time,
we have an apply expression; an apply expression has a head ("the
function", "h") and a tail ("the argument", "t"), and applies the tail
to the head. The derivation rules for this are pretty simple: we know
that the derived value will have the type of whatever the function
returns, and we know that the argument should have the type of the first
argument to the head's type.

Let's define two partial functions over types before we give the formal
definitions for application. Both functions are only defined on
application expressions; for any other expression they are undefined
(and in the actual type checker, would cause a type error).

    apply (a -> b) = b
    head (a -> b) = a

With those functions defined, we can easily define d and a on
application expressions. As we said earlier, the return type of the
function is what we derive our expression's type to be:

    d(e) : { (e, apply(type(h))) }

And we assert the tail has the type of the head's argument:

    a(e) : { (t, head(type(h))) }


THIS DOESN'T WORK. we need some backwards inference if we're going to
make lambdas work without explicit types on all of them, because you
have to know how the arguments are going to be used.
For \a -> b:

    d(e) : { (e, type(a) -> type(b)) }
    a(e) : { (e, type(a) -> type(b)) }


For { : a = b ! x }:

    d(e) : { (e, type(x)) (a, type(b)) }
    a(e) : { }


For ? x { . p1 => t1 . p2 => t2 }:

    d(e) : { (e, root(t1, t2)) }
    a(e) : { (x, type(p1)) (x, type(p2)) }