Church-State developments

Since the last release of Church-State I started investigating whether it might be possible to make the runtime multi-threaded and to have the compiler execute on multiple threads.

Multi-threaded runtime

This turned out to be quite a lot work. For every global variable used by different threads in the compiler and parsers I had to introduce locking mechanisms. In the runtime I had to ensure that there was way to allocate dynamically generated machine code into a thread-local buffer. Care has to be taken with inline caches and self-modifying code. Any output to shared streams (such as stdout and stderr) need to be synchronized in some way. The garbage collector needs to ensure that only one thread can allocate at a time and collection needs to be coordinated so that mutating threads don’t run during collection. My approach was to use signal handlers to interrupt threads, each interrupted thread records the extent of its stack and then waits for garbage collection to complete.

Copying garbage collector

I also started experimenting with a copying collector instead of the existing mark-sweep implementation. Copying collectors are harder to implement because whenever you move an object you have to ensure that all the references to the old object (including on the call stack) are updated to point to the new one. Hashtables that hash object pointers need to be made aware of a completed garbage collection (and rehash if necessary). I never finished the new garbage collector because I felt that State (the low level language used for implementing the runtime) was a limited by a lack of language features. State does not have something like structs in C or pointer arithmetic. It doesn’t have a real looping mechanism (I use a lisp-style DO macro which I find ugly without pointer arithmetic). It doesn’t have syntax for array/pointer dereferencing or low level operations like bit shifting and masking which are common in runtime code.

Baste

I decided it would better to experiment with a richer low-level language and so I started creating a compiler for a language code-named ‘baste’. Baste has a syntax derived from Church (with Python/Haskell like indentation) and a 64-bit backend (Church-State is only implemented for 32-bit systems). My current experiments involve implementing a new register allocator that calculates live intervals for each variable and then allocates registers using a linear-scan register allocation algorithm. This new work is included in the Church-State source repository.

New release of Church-State

I have made a new release of Church-State.

http://subvert-the-dominant-paradigm.net/~jewel/church/church-alpha-8.tar.gz

A new garbage collector

This is a new release of Church-State

http://subvert-the-dominant-paradigm.net/~jewel/church/church-alpha-7.tar.gz

New hashtable

In this release I have added a new hashtable implemented using the State language. Previously I was using a hashtable implemented in C from the kazlib library.

The new hashtable uses open addressing with double hashing to resolve probe collisions. Besides the standard hashtable which maps one key to one value, I have also implemented a two-key two-value hashtable used for the memoization of the OMeta parser results. For each grammar rule (key1) and the input position (key2) we store the result (value1) and the end position of the cached result (value 2). Previously I used two levels of hashtables for this.

You can see the hashtable here:

http://subvert-the-dominant-paradigm.net/repos/hgwebdir.cgi/bootstrap/file/f32bc9d22355/church/runtime/hashtable.state

New garbage collector

I have also added a new garbage collector to replace the Boehm collector used so far. Like the Boehm collector this is a conservative mark and sweep collector.

The implementation is fairly simple, we create two heaps. The first is divided in pagelets which are 1024 bytes in size. Each of these pagelets can store objects which are 8,16,32,64,128,512 or 1024 bytes in size. We maintain bitfields to remember which objects are allocated within each page. When all the objects in a pagelet are freed during garbage collection that pagelet is returned to a free pool and may be used to store objects of a different size. As part of this work I have implemented a guard page and an associated signal handler to help detect when new free pagelets should be assigned to a size class.

The other heap is for objects larger than 1024 bytes. We use a simple first-fit strategy. To help mitigate the effect of the fragmentation of free spaces, we coalesce adjacent free spaces into larger blocks during the sweep phase.

My measurements show that the new allocator is about twice as slow as the Boehm allocator when allocating cons cells. For the entire bootstrap process the new allocator adds about 2 seconds of overhead, taking the bootstrap time from 23 to 25 seconds.

The new garbage collector is in state/runtime/state_runtime.state .

New OMeta interpreter for Church-State

This is a new release of Church-State

http://subvert-the-dominant-paradigm.net/~jewel/church/church-alpha-6.tar.gz

This incorporates a new OMeta interpreter that I have written to replace the parser generator used in previous versions.

Implementing the Church parser with an interpreter rather than with a parser generator has two main benefits:

  • There is no longer a requirement to have the church compiler available to generate a new parser. This caused some bootstrapping issues and made it harder to provide user-extensible parsing.
  • The new approach is faster than before. Usually interpreters have more overhead than compiled equivalents but in this case the compilation overhead was very large for the generated church parser. Overall it’s faster to parse the church grammar file into an intermediate structure and interpret it than it is to generate and compile a very large generated parser.

The new system bootstraps in 28 seconds now (versus 34 seconds previously).

Grammar actions

The most novel part of this new interpreter implementation is the way that I handle “grammar actions”.

Previously I implemented both a parser generator and an interpreter which used runtime code compilation to turn grammar actions into executable closures.

Take this example of the grammar rule for an assignment:

assignment = assignable-expression:lhs ws* "=" ws* expression(nil):rhs -> <<`[_assign ,lhs ,rhs]>>

which would parse the following Church code:


a = 1

In some parsing frameworks it is convenient to automatically construct a parse tree from a given grammar. In the case of scannerless parsers (which is what I use in Church) we would like to elide the nodes that would result from parsing whitespace. So in the example above the parts that we want to keep are labeled lhs and rhs. The section at the end describes a quasiquoted list template that has the values of lhs and rhs substituted into it.

When evaluated, this action will return something like this:
[_assign [_var "a"] [_number 1]]

Closures

To implement this one can create a closure with the following code:

fn arg1 arg2 -- `[_assign ,arg1 ,arg2]

which is called at parse time with the values for each labeled part.

This requires having the entire compiler framework available at parse time, including the parser which you are implementing!

To avoid this kind of bootstrapping issue and to simplify the parser, I decided to implement grammar actions with a simple quasiquoting syntax.

Quasiquoted actions

In the grammar actions implemented in the latest Church-State parser it is only possible to quote a list or symbol, unquote a variable or a list and specify a literal.

For example the following grammar rule:

pattern = cname:name ws* (ws* pattern-argument)*:args -> <<`(_pattern ,name ,@args)>>

creates a list starting with the symbol _pattern followed by the pattern name and then containing all the elements of the args list.

With closures it is possible to use list functions like cons, first and rest or to call routines to convert strings of digits into integers. This is not possible with quasiquoted actions and they require a later pass of the parse tree to do the necessary conversions. I have added routines to perform these transformations so that the resulting parse tree is exactly the same as that returned by the previous parser implementations.

The interpreter can be seen here:

http://subvert-the-dominant-paradigm.net/repos/hgwebdir.cgi/bootstrap/file/cbea15e41381/genesis/ometa/ometa-interpreter.church

along with the grammar for parsing ometa grammars:

http://subvert-the-dominant-paradigm.net/repos/hgwebdir.cgi/bootstrap/file/d6a2c4d7c783/genesis/ometa/ometa-parser.g

Church alpha 5

This is a new release of Church-State.

http://subvert-the-dominant-paradigm.net/~jewel/church/church-alpha-5.tar.gz

To do a bootstrap, type make bootstrap1 and make bootstrap2.

To test the new ometa interpreter, run make ometa-test-parser and run it with ./bin/ometa-test-parser test.g test.txt

This release contains the following new items:

  • A new OMeta interpreter (Although this new interpreter is capable of parsing the full Church grammar, I’m still keeping the old OMeta parser generator for bootstrapping reasons)
  • Optimized dispatch structures for faster dispatch when inline caching is not effective
  • Improved inline caches

Loop macros

I have released the latest version of Church-State here:
http://subvert-the-dominant-paradigm.net/~jewel/church/church-alpha-4.tar.gz

You can also get the source code via mercurial:

hg clone http://subvert-the-dominant-paradigm.net/repos/hgwebdir.cgi/bootstrap/

or browse it.

In the previous version of Church loop constructs were expanded in the Church compiler directly to the State “tagbody” form. I have removed this code and implemented it as a Church macro which expands to a new Church “tagbody” form.

The aim behind this is to make it easier to do optimization and flow analysis on Church code in the future.

Rude repl

I have added eval and a repl (read eval print loop) to Church.

> + 3 4
7
> length "foobar"
6
> eval "* 19 3"
57
> load "church/test/hello.church"
true
> (main)
"hello"
true

Up to now I have only been using Church as a command-line compiler which produces executable files. Yet I have always preferred interactive language environments (lisp, smalltalk, python etc) to stop-compile-run languages.

All the machinery for writing a repl has been in Church for a while, including

  • Church parser available as a library
  • Church compiler available at runtime
  • Machine code generator and dynamic linker available at runtime
  • The ability to modify the dispatch table at runtime

This makes the repl easy to implement:

repl
        loop
                do
                        rstr = (read-line)
                        if (null? rstr)
                                return-from repl nil
                        else
                                print (eval rstr)

Eval is a little bit more messy because the compiler is designed to compile a whole method at a time and doesn’t know what to do with variables that are not either a global or local variable.

To implement eval I wrapped the eval string in a lambda that I return from a method which gets run after compilation.

set-eval-compiled-function
   eval-compiled-function = (fn -- eval_str) 

This works for certain language expressions, but does not presently provide the ability to assign to “repl” variables.

Another drawback is that multiline statements are currently not possible (unlike python and lisp).

In the future I hope to experiment with a SLIME-like extension to emacs which communicates to Church across a socket, allowing interactive evaluation and compilation of source code from within the editor.

Church alpha 3

This release fixes some bugs:

http://subvert-the-dominant-paradigm.net/~jewel/church/church-alpha-3.tar.gz

  • When I first wrote some of the church compiler passes and the sexp parser I used global variables to store state specific to that pass. I knew that only one thread would be running and that these passes didn’t have to be reentrant. Later I added functionality to compile macros on the fly and to compile dispatch matchers. When I did this I tried to rewrite all the modules that used global state, but I missed the sexp parser and this caused some strange bugs later on.
  • When allocating memory for dynamically allocated code, I was storing the address of the memory block in a fixnum. This worked fine until malloc started returning addresses from higher memory which used the most significant two bits in a word. These would get shifted out when tagging a fixnum and yield an invalid address. To work around this I now box the address by storing it in an array

Compiling dispatch matchers at runtime

After implementing inline slot caches I tried various approaches for caching method lookup. At first I thought I could store a global hash table which was keyed by combining the hash values of the types of the arguments passed to a function. Then by comparing the arguments types with the types stored in the table entries I could find the correct method to dispatch to. This turned out to be slower than my current implementation which simply walks the cons lists that describe each method’s type pattern and calling “type?” to test whether the argument is a subtype of the target type.

My eventual solution was to turn this lookup process into straight-line code by dynamically creating machine code to execute these tests.

Consider the following Church code:


length s:string
   ......

length n:nil
   0
                                                                                                                                                         
length l:cons
   ......                                                                                                                                                 
                                                                                                                                                                        
length a:array
   ......


Here we check the type of the argument to distinguish between nil, strings, cons-lists and arrays. For nil and cons types we can generate an efficient check against the object tag, for strings and arrays we call out to “type?”. The following state code is generated for these tests:


 (DEFINE DISPATCH-MATCHER31
  (LAMBDA (ARG1 ARG2 ARG3 ARG4 ARG5 ARG6 ARG7)
    (LET ((ARG-COUNT (LOAD-ARGUMENT-COUNT)))
      (IF (= ARG-COUNT 1)
          (BEGIN
           (IF
            (AND (= (BAND ARG1 LOWTAG_BITS) TAG_REF)
                 (CHURCH-IF (TYPE? ARG1 137075987) 1 0))
            (RETURN (CALL-C 134993347 1 (LOAD-CLOSURE-POINTER) ARG1)))
           (IF
            (AND (= (BAND ARG1 LOWTAG_BITS) TAG_REF)
                 (CHURCH-IF (TYPE? ARG1 137075507) 1 0))
            (RETURN (CALL-C 134968518 1 (LOAD-CLOSURE-POINTER) ARG1)))
           (IF (= (BAND ARG1 LOWTAG_BITS) TAG_CONS)
               (RETURN (CALL-C 134968094 1 (LOAD-CLOSURE-POINTER) ARG1)))
           (IF (= ARG1 TAG_NIL)
               (RETURN (CALL-C 134968046 1 (LOAD-CLOSURE-POINTER) ARG1))))))
    (CHURCH-DISPATCH-MATCHER-FAILED 1))))

The first two tests check for the TAG_REF tag which is used for normal objects. If the tag matches we call “type?” with the literal address representing the class of the target type. If the test matches we can call the associated method directly.

The other two tests only compare the tag, making it quite efficient for primitive types like cons, fixnums, true and nil.

After generating this code the calling function is patched to jump to this matcher method directly.

Together with the inline slot caches these modifications speed the system up by about a factor of two.

Future work involves determining more precisely when to compile these dispatchers (at the moment I do it after a symbol has been dispatched through 5000 times) and optimizing the generated tests. (Even in this example there is redundant check for TAG_REF). It is probably also possible to be more efficient when overlapping types are involved.

Inline slot caches

As part of my performance work I have implemented inline slot caches for slot reads and writes in Church. These are implemented by patching the program code at runtime with assembly code that checks the type of the target object and loads the correct offset for slot access.

In this example we see the code that prepares a call to ‘church-fixup-initial-slot-access’. The first two arguments on the stack (%esp) and 0x4(%esp) are the argument-count and closure pointer used in the State calling convention. The next two arguments 0x8(%esp) and 0xc(%esp) are the object being accessed and the symbol representing the name of the slot to be accessed.


0x080ce733 mov    %ebx,0xc(%esp)
0x080ce737 mov    %edx,0x8(%esp)
0x080ce73b mov    %ecx,0x4(%esp)
0x080ce73f mov    %eax,(%esp)
0x080ce742 call   0x80ab3b4 
0x080ce747 mov    %eax,%eax
0x080ce749 nop    
0x080ce74a nop    
0x080ce74b nop    
0x080ce74c nop    


The fixup routine gets the type of the object and examines all the parent classes to determine the correct offset for accessing the slot in this object. It then generates x86 machine code and patches the calling function. At the moment I do this by directly emitting a byte sequence for each instruction, this is quite crude and error-prone but manageable when such a small amount of code is being generated.


              (write-byte! patch-start #x90)
              (write-byte! patch-start #x90)
              (write-byte! patch-start #x90)
              (write-byte! patch-start #x90)
              (write-byte! patch-start #x90)

              (write-word! patch-start #x08244c8b)
              (write-byte! patch-start #x81)
              (write-byte! patch-start #x39)
              (write-word! patch-start obj-type)
                                        ;je                                                                                                                              
              (write-byte! patch-start #x74)
              (write-byte! patch-start #x7)
...

First the old call is overwritten with nops and then we emit some comparison and jump instructions. The final output looks like this:


0x080ce733 mov    %ebx,0xc(%esp)
0x080ce737 mov    %edx,0x8(%esp)
0x080ce73b mov    %ecx,0x4(%esp)
0x080ce73f mov    %eax,(%esp)
0x080ce742 nop    
0x080ce743 nop    
0x080ce744 nop    
0x080ce745 nop    
0x080ce746 nop    
0x080ce747 mov    0x8(%esp),%ecx
0x080ce74b cmpl   $0x8302993,(%ecx)
0x080ce751 je     0x80ce75a 
0x080ce753 call   0x80ab8ba 
0x080ce758 jmp    0x80ce760 
0x080ce75a mov    0x4(%ecx),%eax
0x080ce760 mov    %eax,%eax
0x080ce762 nop    
0x080ce763 nop    
0x080ce764 nop    
0x080ce765 nop    

The untagged object pointer is moved into %ecx and the first word (which points to the class wrapper for this object) is compared with the literal address of the class wrapper seen the first time. If it is the same, we simply load the slot at the precomputed offset (0x4) and store it in %eax. If not we jump to a runtime function which does a conventional (but much slower) lookup.