Macros and dynamic compilation

I have released the next version of my Church-State system at:

http://subvert-the-dominant-paradigm.net/~jewel/church/church-alpha-1.tar.gz

This release adds proper Church macros and dynamic compilation to the system. Here is an example: a macro for moving an immediate value into a register:


macro MOVLir imm reg
        <<
                _Or_L 0xb8 (_r4 \reg) \imm
        >>

You can see quoted code wrapped in angle brackets, with backslashes used to unquote values that must be inserted into the template.
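For readers unfamiliar with x86 encodings, the template above ORs the register number into the 0xb8 opcode byte and appends the immediate. A rough Python model of that encoding (my own sketch, not part of Church):

```python
import struct

def movl_ir(imm, reg):
    """Encode x86 MOV r32, imm32: opcode byte 0xB8 OR'd with the register
    number, followed by the 32-bit immediate in little-endian order."""
    assert 0 <= reg <= 7
    return bytes([0xB8 | reg]) + struct.pack("<I", imm & 0xFFFFFFFF)

# mov eax, 0xe  ->  b8 0e 00 00 00
assert movl_ir(0xE, 0).hex() == "b80e000000"
```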

The macro syntax is not final (like most of Church's syntax) and it should look cleaner as it matures.

To handle macros like these, the system is now capable of taking the Church source code parsed out of the macro body, compiling it to low-level State code, and then linking the resulting machine code into the running image.

This involves resolving low-level symbols (I use dlsym) and modifying the Church dispatch tables and macro rewriters to use the new macros. I also have to run an initialization function which sets up the "code vector" with constant values (interned symbols etc.).

Now that I have dynamic compilation working, it should be fairly easy to add a REPL to the system.

Performance

As part of this release I have also disabled a number of optimizations that I had worked on before, including the inline caches for method dispatch and slot access. I disabled them because they cost too much in space compared to the speed benefit they provide.

I'm now pursuing a new approach which uses "class wrappers" marked with random seeds. The idea is that these seeds can be used to hash into lookup tables which memoize the effective method for dispatch. I hope to be able to incorporate these ideas plus others from contemporary implementations (of JavaScript VMs etc.) to make the system substantially faster.
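To make the idea concrete, here is a rough Python sketch of such a seeded lookup table (all names and details are mine, not the actual design):

```python
import random

class ClassWrapper:
    """Pairs a class with a random seed used to index dispatch caches."""
    def __init__(self, cls):
        self.cls = cls
        self.seed = random.getrandbits(32)

CACHE_SIZE = 64
dispatch_cache = [None] * CACHE_SIZE   # entries: (selector, wrapper, method)

def send(selector, wrapper, full_lookup):
    """Find the effective method, memoized in a table hashed on the seed."""
    slot = (wrapper.seed ^ hash(selector)) % CACHE_SIZE
    entry = dispatch_cache[slot]
    if entry is not None and entry[0] == selector and entry[1] is wrapper:
        return entry[2]                             # hit: reuse memoized method
    method = full_lookup(selector, wrapper.cls)     # miss: full dispatch
    dispatch_cache[slot] = (selector, wrapper, method)
    return method
```

The random seed decorrelates cache slots across classes, so two unrelated classes are unlikely to keep evicting each other's entries.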

Church release

I’m proud to have reached the stage where my Church-State system can compile itself (i.e. the compiler is bootstrapped).

I have made the first alpha release available at:

http://subvert-the-dominant-paradigm.net/~jewel/church/church-alpha-0.tar.gz

To try it out you’ll need a 32-bit x86 Linux system with “ld” installed. (Usually ld will be installed if you’ve installed something like gcc.)

There are two simple test files mentioned in the README and there are also instructions for bootstrapping the system.

One thing missing from the release is a compiler that compiles the output from the OMeta parser generator to Church files. That means it’s not possible to change the grammars just yet.

Another incomplete feature is that Church and State macros are hard-coded into the compiler. If you look at church-pass1.church and state-pass1.church you’ll see the various hard-coded macros (some of which are quite complex). To be able to include these macros in the source files where they are used I need to be able to dynamically compile and load church code. I’ve completed the first step of this process, see state-dynamic.church and church-test-dynamic-alloc.church for working code that can compile a church file down to native code, allocate memory for it and link it into the running image.

Once I have Church macros working, I plan to rewrite a lot of assembler-i386.church to use macros instead of functions for emitting machine instructions. I think that this will dramatically improve compilation times. While preparing for this release I did a lot of work on performance, even removing array bounds checking and some other safety checks to make it faster. Currently the system bootstraps in 90 seconds on my laptop, but my goal is to be 2 or 3 times as fast.

Boehm GC

Up until now I have managed to make do without a garbage collector. My development machine has 4 gigabytes of RAM, which allowed me to compile all the test cases, but was not large enough for bootstrapping. I thought it would be fairly easy to plug in the Boehm garbage collector since my system is so simple, but I still encountered a few problems:

  • The data objects in the ELF objects emitted by my compiler were not word-aligned. I had set the addralign field for the .data segment to 4, but the objects weren’t aligned. This might have caused the Boehm garbage collector to miss pointers in these objects.
  • Previously I was using malloc, which on my system returns objects aligned to an 8-byte boundary. Church relies on this because it uses the lower 3 bits of a word as a tag. To get the Boehm GC to align allocated memory meant compiling it with the -DALIGN_DOUBLE flag. I couldn’t figure out how to do this with the latest 7.1 release, so I used the 6.8 release instead.
  • For some reason, if small objects are allocated they aren’t aligned either, so I have to request a minimum of 8 bytes per allocation.
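The alignment requirement exists because of the tagging scheme: with 8-byte-aligned pointers the low 3 bits are always zero and can carry a type tag. A sketch of this low-tagging idea (tag values here are illustrative, not Church's actual assignments):

```python
TAG_BITS = 3
TAG_MASK = (1 << TAG_BITS) - 1   # 0b111

TAG_POINTER = 0b000   # illustrative tag values, not Church's actual ones
TAG_FIXNUM  = 0b001

def tag_pointer(addr):
    """Tag a heap address; only works if the allocator 8-byte-aligns it."""
    assert addr % 8 == 0, "allocator must return 8-byte-aligned memory"
    return addr | TAG_POINTER

def tag_of(word):
    return word & TAG_MASK

def untag_pointer(word):
    return word & ~TAG_MASK
```

A misaligned allocation would leave garbage in the low bits, which is why the Boehm GC's alignment behaviour mattered.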

Bootstrap progress

I have completed the port of the State compiler and the ELF writer. Still remaining is most of the Church compiler and some x86 machine instructions.

The ported system is in the “genesis” subdirectory (click on “files” to see the directory tree):

http://subvert-the-dominant-paradigm.net/repos/hgwebdir.cgi/bootstrap/

Update (Jan 15th): I have ported the whole system such that it can compile the Church runtime code and generate executables. The work remaining is to debug the generated code until it is able to compile and run the whole Church/State system.

Update (Jan 20th): All the Church test cases are working with the ported codebase.

Bootstrapping the State compiler

I have ported a large part of the State compiler and the burg code-generator to be able to compile and assemble this simple test case:


(define |main| (lambda () 1)) 

It may seem trivial, but compiling this example requires the implementation of:

  • The parser framework (ometa) in Church
  • The sexp parser for reading State source files
  • The ometa-based macro expander for expanding State macros
  • The AST conversion pass which constructs lexical scope information and tracks free variables
  • The final compilation phase which creates a tree of pseudo-instruction objects

And that’s just the State compiler; the code generator has the following parts:

  • A general framework for tracking liveness, allocating registers, and matching pseudo-instructions to machine instructions
  • An implementation of the pseudo-instructions and rules for using them
  • An x86-specific description of the machine registers
  • A backend that produces actual machine code according to opcode formats (handling different addressing modes etc.)
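Of these pieces, the sexp parser is the easiest to illustrate. A minimal reader in Python (a toy stand-in for the OMeta-based parser, not the actual implementation):

```python
def tokenize(src):
    """Split a State source string into parens and atoms."""
    return src.replace("(", " ( ").replace(")", " ) ").split()

def read_sexp(tokens):
    """Recursively read one s-expression from a token list."""
    tok = tokens.pop(0)
    if tok == "(":
        items = []
        while tokens[0] != ")":
            items.append(read_sexp(tokens))
        tokens.pop(0)                      # drop the closing ")"
        return items
    # Numbers become ints; everything else stays a symbol string.
    return int(tok) if tok.lstrip("-").isdigit() else tok
```

Running it on the test case above yields the nested list `["define", "|main|", ["lambda", [], 1]]`.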

I have managed to generate the correct machine code bytes for this test case (though I don’t generate an ELF object file yet). Here is the disassembled output:


cmalu% ndisasm -u burg_asm.output
00000000  55                push ebp
00000001  89E5              mov ebp,esp
00000003  68BBBB7707        push dword 0x777bbbb
00000008  68FFFF0000        push dword 0xffff
0000000D  B80E000000        mov eax,0xe
00000012  8D4DF8            lea ecx,[ebp-0x8]
00000015  8901              mov [ecx],eax
00000017  B801000000        mov eax,0x1
0000001C  89C0              mov eax,eax
0000001E  59                pop ecx
0000001F  59                pop ecx
00000020  5D                pop ebp
00000021  C3                ret


This matches the output produced by the Lisp implementation of the system (except for a change in the stack marker constant):


00000027:
  27: 55                push %ebp
  28: 89 e5             mov %esp,%ebp
  2a: 68 bb bb aa aa    push $0xaaaabbbb
  2f: 68 ff ff 00 00    push $0xffff
  34: b8 0e 00 00 00    mov $0xe,%eax
  39: 8d 4d f8          lea -0x8(%ebp),%ecx
  3c: 89 01             mov %eax,(%ecx)
  3e: b8 01 00 00 00    mov $0x1,%eax
  43: 89 c0             mov %eax,%eax
  45: 59                pop %ecx
  46: 59                pop %ecx
  47: 5d                pop %ebp
  48: c3                ret

Further performance improvements

Since my last post on performance improvements I have achieved more speedups.

My primary test case is running an OMeta parser against the base.church runtime file. A critical factor in the performance of this type of parser is the memoization of previous parse results. (When the parser backtracks it may apply the same rule to the same input several times; if the results are memoized they can be reused instead of being recomputed.)
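This is the standard packrat-parsing idea: memoize each result on the (rule, input position) pair. A minimal Python sketch (names are mine):

```python
memo = {}       # (rule name, input position) -> (value, next position) or FAIL
FAIL = object() # sentinel for a failed match

def apply_rule(rule, rule_name, text, pos):
    """Apply a parse rule, reusing a memoized result when the parser has
    already tried this rule at this position (e.g. after backtracking)."""
    key = (rule_name, pos)
    if key not in memo:
        memo[key] = rule(text, pos)   # (value, next position) or FAIL
    return memo[key]

def digit(text, pos):
    """A sample rule: match a single decimal digit."""
    if pos < len(text) and text[pos].isdigit():
        return (text[pos], pos + 1)
    return FAIL
```

Note that even failures are memoized, so a backtracking parser never re-attempts a dead end.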

Previously I had implemented a high-level hashtable in Church for the memo table. Due to the overhead of dynamic dispatch in the hashing and array access, this was quite slow. I replaced it with a low-level hashtable from kazlib. As a rule I have tried to minimize external dependencies for this project, but at this point I would rather reuse this external library than rewrite it in State.

Next I implemented inline caching for dynamic dispatch. By storing the argument types and the code pointer for the previous call in the “code vector” associated with a function, it’s possible to avoid the expensive lookup operation most of the time.
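A monomorphic inline cache of this kind can be sketched in Python as follows (a simplified model, not Church's actual code):

```python
class InlineCache:
    """Per-call-site cache of the last argument types and code pointer,
    playing the role of the cache slots in a function's code vector."""
    def __init__(self):
        self.types = None
        self.target = None

def call(site, selector, args, full_lookup):
    """Skip the expensive dispatch lookup when the argument types match
    the previous call at this site."""
    types = tuple(type(a) for a in args)
    if site.types == types:
        return site.target(*args)            # fast path: cached code pointer
    target = full_lookup(selector, types)    # slow path: full method lookup
    site.types, site.target = types, target
    return target(*args)
```

As long as a call site keeps seeing the same argument types, each call costs one tuple comparison instead of a table search.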

Lastly I also implemented inline caching for slot lookup. Slot lookup basically calculates the offset of the field within an object by counting all the slots in the class hierarchy. We cache this offset based on the argument type and the slot name.
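In Python terms, the slot-offset cache might look like this (an illustrative model with hypothetical names):

```python
class ClassInfo:
    """A class with its own slots and an optional parent class."""
    def __init__(self, name, slots, parent=None):
        self.name, self.slots, self.parent = name, slots, parent

offset_cache = {}   # (class, slot name) -> slot index

def slot_offset(cls, slot_name):
    """A slot's offset is the number of slots before it, counting
    inherited slots first; memoized on (class, slot name)."""
    key = (cls, slot_name)
    if key in offset_cache:
        return offset_cache[key]
    chain = []
    c = cls
    while c is not None:            # collect the class hierarchy
        chain.append(c)
        c = c.parent
    index = 0
    for c in reversed(chain):       # base class slots come first
        for name in c.slots:
            if name == slot_name:
                offset_cache[key] = index
                return index
            index += 1
    raise AttributeError(slot_name)
```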

All these changes yield a three-fold performance improvement. The test case now runs in 1 second versus 0.5 seconds for the Lisp implementation. The instruction read count is down from 13 billion before these changes to 2.7 billion.

This is a screenshot of the latest callgrind output. From this call graph for “church-make-closure” we can see that “church-alloc-object” is a candidate for optimization.

[Screenshot: KCachegrind output for the callgrind tool]

New calling convention for State

I have implemented a new calling convention for State. Functions now take:


<argument count> <closure pointer> <arg1> <arg2> ...

There are two special forms (load-argument-count) and (load-closure-pointer) for accessing these hidden arguments.

Adding an argument count allows the implementation of inline caches (see next post) and of a “rest parameter” which collects all extra parameters into a list.

To implement the rest behaviour in the Church compiler I modified the Church grammar to tag a rest parameter as a “:rest-var”:


main args
     foo 1 2 3 4 5

foo a *rest
    print a
    print rest

This will print:

1
[2 3 4 5]

The compiler then arranges for the generated code to call out to runtime code that constructs the list:


(define |church-setup-rest-var| (lambda (rest-var-pointer arg-count fixed-param-count)
                                  (let ((rest-temp TAG_NIL)
                                        (remaining-temp (- arg-count fixed-param-count))
                                        (offset-temp 0))
                                    (tagbody
                                     check
;                                      (call-c-extern |printf| "remaining-temp %lu\n" remaining-temp)
                                       (if (= remaining-temp 0)
                                           (begin
                                            (set! (deref rest-var-pointer) (|church-reverse!| rest-temp))
                                            (go end)))
                                       (push (deref (+ rest-var-pointer offset-temp)) rest-temp)
                                       (set! offset-temp (+ offset-temp 4))
                                       (set! remaining-temp (- remaining-temp 1))
                                       (go check)
                                     end))))
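For clarity, the same routine can be expressed in Python (a behavioural sketch of the State code above, not a translation of its pointer arithmetic):

```python
def setup_rest_var(args, fixed_param_count):
    """Collect the arguments beyond the fixed parameters into a list,
    mirroring church-setup-rest-var's push-then-reverse loop."""
    rest = []                            # stands in for TAG_NIL
    for arg in args[fixed_param_count:]:
        rest.insert(0, arg)              # push onto the front, like the loop
    rest.reverse()                       # church-reverse!
    return rest

# foo a *rest called as (foo 1 2 3 4 5): one fixed parameter, rest = [2 3 4 5]
assert setup_rest_var([1, 2, 3, 4, 5], 1) == [2, 3, 4, 5]
```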

Improving dispatch performance in Church

Over the last week I have been working on performance optimizations for Church code. My basic test case is to run the OMeta parser (implemented in Church) on the “base.church” file in the Church runtime library. This file implements most of the Church runtime code. Parsing this file with the Common Lisp implementation of OMeta takes 0.5 seconds on my laptop. Initial runs of the Church code took well over a minute.

The first major change was to cache symbol lookups. Previously I was storing all symbol objects in an a-list (a linked list) and every time a symbol was used the table would be searched linearly using string comparisons. To avoid this lookup I modified the State compiler to take a (load-constant-value <raw-value>) form. When the state compiler compiles this form it creates a cell in the data segment from which to load a constant value. It also generates some initialization code which will call out to the runtime to initialize the constant.

By collecting all these initialization sequences and running them on startup we can intern all the symbols used by the program before it starts running. Each symbol reference thereafter is a simple memory load.
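A Python model of this interning scheme (illustrative, with a hypothetical symbol name; not the actual runtime):

```python
symbols = {}   # name -> unique interned Symbol

class Symbol:
    def __init__(self, name):
        self.name = name

def intern(name):
    """Return the unique symbol for a name, creating it on first use."""
    if name not in symbols:
        symbols[name] = Symbol(name)
    return symbols[name]

# What (load-constant-value ...) amounts to: each compiled unit gets a
# vector of constant cells, filled in once by initialization code at
# startup; every later symbol reference is a plain memory load.
code_vector = [None]

def initialize_constants():
    code_vector[0] = intern("some-symbol")   # hypothetical symbol name

initialize_constants()
```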

Using these cached symbols makes class lookup a lot easier too. Previously I stored all classes in a table keyed by symbol and searched it linearly when looking up classes. In the new implementation I store class objects in a “class” slot associated with the symbol representing the class’s name.

Similarly, the dispatch table was stored as a global list and searched first by symbol (selector) and then compared according to argument types. I moved the dispatch rules to the symbol objects, making the search much shorter.

These changes brought the instruction execution count down to about 40 billion instructions for Church (versus about 650 million for Lisp).

Further optimizations involved removing all cons’ing from the dynamic dispatch routines, inlining as many of those calls as possible to avoid function call overhead and rewriting recursive routines as tagbody loops.

I also reordered the tests in the “class-of” code to check the most common cases first.

All these optimizations brought the time down to about 4.5 seconds and the execution count to 13 billion. This is still about a factor of 10 slower than the Lisp implementation. Eventually I hope to bring it to about a factor of 2; possible future optimizations are inline caches for method lookup and optimized cons and closure allocation. Eventually I might also look at more sophisticated approaches, such as type inference, an optimizing compiler pass and runtime profiling.

To profile the code I tried two tools, Intel’s VTune and the callgrind module of valgrind. VTune was quite disappointing: besides requiring serial numbers to be pasted into an obscure installation utility (which failed the first few times), the download was 500 megabytes. After installation the sampling driver failed to work, but I managed to run the call tracing module.

valgrind provides similar information to VTune, but the KCachegrind visualizer is much better: the call graph is very easy to work with, and it is also possible to see hot loops at the assembly level.

Bootstrapping Church

As part of my work so far on the Church/State language, I have implemented a parser generator, two compilers, a machine code generator and a library to write ELF files. All of these were implemented in Common Lisp. In addition to this I have implemented runtime code for State (100 LOC) and for Church (about 2000 LOC).

The long term goal for this project is to bootstrap, achieved by rewriting all the components mentioned above in Church itself such that the system can eventually compile itself without relying on an external system such as Common Lisp or a C compiler.

So far I have ported OMeta (with the new JS style syntax) to Church and written a Church grammar that produces Church expressions as actions. I have also completed the basic framework of the code generator (enough to assemble code that adds two constant integers together).

Here is an outline of the remaining work:

  • Port the Church compiler – The tricky part here is implementing macros. Currently I use ‘eval’ to compile macro bodies in Lisp; doing this in Church requires dynamic compilation of Church code too.
  • Port the State compiler – This includes writing a sexp parser (a Lisp reader). Hopefully this will be fairly easy to do with the current OMeta implementation. I will probably also have to improve the quasiquote and splice operators in Church to implement this. Currently State macros are implemented in Lisp; I think the best approach will be to try to rewrite them in State.
  • Port the rest of the code generator – This should be a fairly straightforward matter of implementing the remaining AST instructions and machine instructions.
  • Port the ELF library – Since I used the binary-types library in the Lisp implementation, I’m not sure what approach will be necessary in Church. I might end up implementing some kind of MOP system which allows special metaclasses and slot attributes that can be used to control the binary output of class instances.

Along the way I’m sure I will have to add more runtime functionality, and will probably also have to create an operator similar to destructuring-bind for deconstructing lists.

I hope to be able to avoid adding dynamically scoped variables, multiple value return and a “rest” parameter, all of which were used in the Lisp implementation.

First-class closures

I have implemented first-class closures in my experimental language Church/State. Until recently I was using partial closures (also called “downward funargs”) because I thought full closures would not be necessary to bootstrap the system. I found, however, that my implementation of OMeta in Church required full closures. Unlike partial closures which were stack allocated and could not escape the dynamic extent of the call that created them, full closures are heap allocated.

One of the consequences of this is that there is no longer such a clear separation between Church and State. (I named these languages as I did because I thought I could maintain a clear division between them.) Previously State was supposed to implement a “static” language with very few runtime features. Now, however, because closure creation requires heap allocation, the State compiler has to call out to an allocation routine and create a runtime structure called a “closure-record”.

Since I wanted closure objects to look like other objects in Church, each closure record has a class-pointer and slots that are compatible with the Church object system. To implement this the State compiler essentially has to call out to Church runtime code, which makes the two systems mutually dependent, instead of having only Church rely on State.

Below is the Church code declaring the closure-related classes. Each closure-record has a parent pointer to create a chain of closure-records. These closure-records are passed as a “hidden” argument to all State functions, meaning that the State calling convention is no longer compatible with the C calling convention. To call C routines from State, I have provided the “call-c” operator.

class closure-record
        with-slots
                parent
                slot-count

class closure
        with-slots
                closure-record
                function
        closure?
                true
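To illustrate what heap-allocated closure records buy: a Python model of a closure whose captured state outlives the call that created it (the names mirror the classes above, but the code is my sketch, not the actual runtime):

```python
class ClosureRecord:
    """Heap-allocated frame of captured variables; the parent link forms
    the chain of enclosing scopes."""
    def __init__(self, parent, slots):
        self.parent = parent
        self.slots = slots        # captured variable values

class Closure:
    """Pairs a closure-record with a function, like the class above."""
    def __init__(self, closure_record, function):
        self.closure_record = closure_record
        self.function = function

def make_counter():
    # The captured variable lives on the heap, so it survives this call;
    # a stack-allocated "downward funarg" could not do this.
    record = ClosureRecord(parent=None, slots=[0])
    def step(rec):
        rec.slots[0] += 1
        return rec.slots[0]
    return Closure(record, step)

counter = make_counter()
assert counter.function(counter.closure_record) == 1
assert counter.function(counter.closure_record) == 2
```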