March 11th, 2009
After implementing inline slot caches I tried various approaches for caching method lookup. At first I thought I could store a global hash table which was keyed by combining the hash values of the types of the arguments passed to a function. Then by comparing the arguments types with the types stored in the table entries I could find the correct method to dispatch to. This turned out to be slower than my current implementation which simply walks the cons lists that describe each method’s type pattern and calling “type?” to test whether the argument is a subtype of the target type.
My eventual solution was to turn this lookup process into straight-line code by dynamically creating machine code to execute these tests.
Consider the following Church code:
length s:string
......
length n:nil
0
length l:cons
......
length a:array
......
Here we check the type of the argument to distinguish between nil, strings, cons-lists and arrays. For nil and cons types we can generate an efficient check against the object tag, for strings and arrays we call out to “type?”. The following state code is generated for these tests:
(DEFINE DISPATCH-MATCHER31
(LAMBDA (ARG1 ARG2 ARG3 ARG4 ARG5 ARG6 ARG7)
(LET ((ARG-COUNT (LOAD-ARGUMENT-COUNT)))
(IF (= ARG-COUNT 1)
(BEGIN
(IF
(AND (= (BAND ARG1 LOWTAG_BITS) TAG_REF)
(CHURCH-IF (TYPE? ARG1 137075987) 1 0))
(RETURN (CALL-C 134993347 1 (LOAD-CLOSURE-POINTER) ARG1)))
(IF
(AND (= (BAND ARG1 LOWTAG_BITS) TAG_REF)
(CHURCH-IF (TYPE? ARG1 137075507) 1 0))
(RETURN (CALL-C 134968518 1 (LOAD-CLOSURE-POINTER) ARG1)))
(IF (= (BAND ARG1 LOWTAG_BITS) TAG_CONS)
(RETURN (CALL-C 134968094 1 (LOAD-CLOSURE-POINTER) ARG1)))
(IF (= ARG1 TAG_NIL)
(RETURN (CALL-C 134968046 1 (LOAD-CLOSURE-POINTER) ARG1))))))
(CHURCH-DISPATCH-MATCHER-FAILED 1))))
The first two tests check for the TAG_REF tag which is used for normal objects. If the tag matches we call “type?” with the literal address representing the class of the target type. If the test matches we can call the associated method directly.
The other two tests only compare the tag, making it quite efficient for primitive types like cons, fixnums, true and nil.
After generating this code the calling function is patched to jump to this matcher method directly.
Together with the inline slot caches these modifications speed the system up by about a factor of two.
Future work involves determining more precisely when to compile these dispatchers (at the moment I do it after a symbol has been dispatched through 5000 times) and optimizing the generated tests. (Even in this example there is redundant check for TAG_REF). It is probably also possible to be more efficient when overlapping types are involved.
Posted in Church, State | No Comments »
March 11th, 2009
As part of my performance work I have implemented inline slot caches for slot reads and writes in Church. These are implemented by patching the program code at runtime with assembly code that checks the type of the target object and loads the correct offset for slot access.
In this example we see the code that prepares a call to ‘church-fixup-initial-slot-access’. The first two arguments on the stack (%esp) and 0×4(%esp) are the argument-count and closure pointer used in the State calling convention. The next two arguments 0×8(%esp) and 0xc(%esp) are the object being accessed and the symbol representing the name of the slot to be accessed.
0x080ce733 mov %ebx,0xc(%esp)
0x080ce737 mov %edx,0x8(%esp)
0x080ce73b mov %ecx,0x4(%esp)
0x080ce73f mov %eax,(%esp)
0x080ce742 call 0x80ab3b4
0x080ce747 mov %eax,%eax
0x080ce749 nop
0x080ce74a nop
0x080ce74b nop
0x080ce74c nop
The fixup routine gets the type of the object and examines all the parent classes to determine the correct offset for accessing the slot in this object. It then generates x86 machine code and patches the calling function. At the moment I do this by directly emitting a byte sequence for each instruction, this is quite crude and error-prone but manageable when such a small amount of code is being generated.
(write-byte! patch-start #x90)
(write-byte! patch-start #x90)
(write-byte! patch-start #x90)
(write-byte! patch-start #x90)
(write-byte! patch-start #x90)
(write-word! patch-start #x08244c8b)
(write-byte! patch-start #x81)
(write-byte! patch-start #x39)
(write-word! patch-start obj-type)
;je
(write-byte! patch-start #x74)
(write-byte! patch-start #x7)
...
First the old call is overwritten with nops and then we emit some comparison and jump instructions. The final output looks like this:
0x080ce733 mov %ebx,0xc(%esp)
0x080ce737 mov %edx,0x8(%esp)
0x080ce73b mov %ecx,0x4(%esp)
0x080ce73f mov %eax,(%esp)
0x080ce742 nop
0x080ce743 nop
0x080ce744 nop
0x080ce745 nop
0x080ce746 nop
0x080ce747 mov 0x8(%esp),%ecx
0x080ce74b cmpl $0x8302993,(%ecx)
0x080ce751 je 0x80ce75a
0x080ce753 call 0x80ab8ba
0x080ce758 jmp 0x80ce760
0x080ce75a mov 0x4(%ecx),%eax
0x080ce760 mov %eax,%eax
0x080ce762 nop
0x080ce763 nop
0x080ce764 nop
0x080ce765 nop
The untagged object pointer is moved into %ecx and the first word (which points to the class wrapper for this object) is compared with the literal address of the class wrapper seen the first time. If it is the same, we simply load the slot at the precomputed offset (0×4) and store it in %eax. If not we jump to a runtime function which does a conventional (but much slower) lookup.
Posted in Church, State | No Comments »
February 17th, 2009
I have released the next version of my Church-State system at:
http://subvert-the-dominant-paradigm.net/~jewel/church/church-alpha-1.tar.gz.
This release adds proper Church macros and dynamic compilation to the system. In this example, a macro for moving an immediate value to a register
macro MOVLir imm reg
<<
_Or_L 0xb8 (_r4 \reg) \imm
>>
you can see quoted code wrapped in angular brackets and backslashes used to unquote values that must be inserted into the template.
The macro syntax is not final (like most of Church's syntax) and it should look cleaner as it matures.
To handle macros like these the system is now capable of taking Church source code parsed out of the macro body, compiling it to low-level State code and then compiling and linking the resulting machine code into the running image.
This involves resolving low level symbols (I use dlsym) and modifying the Church dispatch tables and macro rewriters to use the new macros. I also have to run an initialization function which sets up the "code vector" with constant values (interned symbols etc).
Now that I have dynamic compilation working, it should be fairly easy to add a REPL to the system.
Performance
As part of this release I have also disabled a lot of optimizations that I had worked on before. These include inline caches for method dispatch and slot access. The reason I have disabled these optimizations is that they cost too much in terms of space compared to the benefit they provide in improved speed.
I'm now pursuing a new approach which uses "class wrappers" marked with random seeds. The idea is that these seeds can be used to hash into lookup tables which memoize the effective method for dispatch. I hope to be able to incorporate these ideas plus others from contemporary implementations (of javascript vms etc) to make the system substantially faster.
Posted in Church, Open Source, State | No Comments »
February 6th, 2009
I’m proud to have reached the stage where my Church-State system can compile itself (ie the compiler is bootstrapped).
I have made the first alpha release available at:
http://subvert-the-dominant-paradigm.net/~jewel/church/church-alpha-0.tar.gz
To try it out you’ll need a 32-bit x86 linux system with “ld” installed. (Usually ld will be installed if you’ve installed something like gcc).
There are two simple test files mentioned in the README and there are also instructions for bootstrapping the system.
One thing missing from the release is a compiler that compiles the output from the OMeta parser generator to Church files. That means it’s not possible to change the grammars just yet.
Another incomplete feature is that Church and State macros are hard-coded into the compiler. If you look at church-pass1.church and state-pass1.church you’ll see the various hard-coded macros (some of which are quite complex). To be able to include these macros in the source files where they are used I need to be able to dynamically compile and load church code. I’ve completed the first step of this process, see state-dynamic.church and church-test-dynamic-alloc.church for working code that can compile a church file down to native code, allocate memory for it and link it into the running image.
Once I have Church macros working, I plan to rewrite a lot of assembler-i386.church to use macros instead of functions for emitting machine instructions. I think that this will dramatically improve compilation times. While preparing for this release I did a lot of work on performance, even removing array bounds checking and some other safety checks to make it faster. Currently the system bootstraps in 90 seconds on my laptop, but my goal is to be 2 or 3 times as fast.
Posted in Church, Open Source, Programming, State | No Comments »
January 23rd, 2009
Up until now I have managed to make do without a garbage collector. My development machine has 4 gigabytes of RAM, which allowed me to compile all the test cases, but was not large enough for bootstrapping. I thought it would be fairly easy to plug in the Boehm garbage collector since my system is so simple, but I still encountered a few problems:
- The data objects in the ELF objects emitted by my compiler were not word-aligned. I had set the addralign field for the .data segment to 4, but the objects weren’t aligned. This might have caused the Boehm garbage collector to miss pointers in these objects.
- Previously I was using malloc, which on my system returns objects aligned to an 8-byte boundary. Church relies on this because it uses the lower 3 bits of a word as a tag. To get the Boehm gc to align allocated memory meant compiling it with the -DALIGN_DOUBLE flag. I couldn’t figure out how to do this with the latest 7.1 release, so I used the 6.8 release instead
- For some reason, if small objects are allocated they aren’t aligned either, so I have to request a minimum of 8 bytes per allocation
Posted in Church, State | No Comments »
January 16th, 2009
Since my last post about my implementation of OMeta in Common Lisp I have updated it to use a newer syntax. This implementation is called “ometa2″ for lack of a better name.
I have been using it to generate the Church parser for my Church and State project.
The previous version used to pass some arguments on the “OMeta stack” and in the new version I rewrote it to avoid this overhead. This meant creating an “ometa-input-stream” and a “list-input-stream” which allows reading objects from a list as if it was a stream.
The new version also uses a hash-table for memoization, which improves performance.
There are still however some performance concerns, the generated parsers contain too many function/lambda applications and some kind of optimization to reduce these will make the parsers much faster. For now the performance is sufficient for my work and reasonable considering the simple implementation.
You can get the code here:
http://subvert-the-dominant-paradigm.net/repos/hgwebdir.cgi/ometa2
Posted in Lisp | No Comments »
January 3rd, 2009
I have completed the port of the State compiler and the ELF writer. Still remaining is most of the Church compiler and some x86 machine instructions.
The ported system is in the “genesis” subdirectory (click on “files” to see the directory tree)
http://subvert-the-dominant-paradigm.net/repos/hgwebdir.cgi/bootstrap/
Update (Jan 15th): I have ported the whole system such that it can compile the Church runtime code and generate executables. The work remaining is to debug the generated code until it is able to compile and run the whole Church/State system.
Update (Jan 20th): All the Church test cases are working with the ported codebase.
Posted in Church, State | No Comments »
December 9th, 2008
I have ported a large part of the State compiler and the burg code-generator to be able to compile and assemble this simple test case:
(define |main| (lambda () 1))
It may seem trivial but to compile this example requires the implementation of:
- The parser framework (ometa) in Church
- The sexp parser for reading State source files
- The ometa-based macro expander for expanding State macros
- The AST conversion pass which constructs lexical scope information and tracks free variables
- The final compilation phase which creates a tree of pseudo-instruction objects
and that’s just the State compiler, the code generator has the following parts
- general framework for tracking liveness, register allocation and matching pseudo-instructions to machine instructions
- implementation of pseudo-instructions and rules for using them
- x86-specific description of the machine registers
- backend that produces actual machine code according to opcode formats (handling different addressing modes etc)
I have managed to generate the correct machine code bytes for this test case (but I don’t generate an ELF object file yet), here is the disassembled output:
cmalu% ndisasm -u burg_asm.output
00000000 55 push ebp
00000001 89E5 mov ebp,esp
00000003 68BBBB7707 push dword 0x777bbbb
00000008 68FFFF0000 push dword 0xffff
0000000D B80E000000 mov eax,0xe
00000012 8D4DF8 lea ecx,[ebp-0x8]
00000015 8901 mov [ecx],eax
00000017 B801000000 mov eax,0x1
0000001C 89C0 mov eax,eax
0000001E 59 pop ecx
0000001F 59 pop ecx
00000020 5D pop ebp
00000021 C3 ret
Which looks like the output produced by the lisp implementation of the system (except for a change in the stack marker constant):
00000027 :
27: 55 push %ebp
28: 89 e5 mov %esp,%ebp
2a: 68 bb bb aa aa push $0xaaaabbbb
2f: 68 ff ff 00 00 push $0xffff
34: b8 0e 00 00 00 mov $0xe,%eax
39: 8d 4d f8 lea -0x8(%ebp),%ecx
3c: 89 01 mov %eax,(%ecx)
3e: b8 01 00 00 00 mov $0x1,%eax
43: 89 c0 mov %eax,%eax
45: 59 pop %ecx
46: 59 pop %ecx
47: 5d pop %ebp
48: c3 ret
Posted in Church, State | No Comments »
November 18th, 2008
Since my last post on performance improvements I have achieved more speedups.
My primary test case is running an OMeta parser against the base.church runtime file. A critical factor in the performance of this type of parser is the memoization of previous parse results. (When the parser backtracks it may apply the same rule for the same input several times, if the results are memoized they can be reused instead of being recomputed).
Previously I had implemented a high-level hashtable in Church for the memo table. Due to the overhead of dynamic dispatch in the hashing and array access, this was quite slow. I replaced it with a low-level hashtable from kazlib. As a rule I have tried to minimize external dependencies for this project, but at this point I would rather reuse this external library than rewrite it in State.
Next I implemented inline caching for dynamic dispatch. By storing the argument types and the code pointer for the previous call in the “code vector” associated with a function, it’s possible to avoid the expensive lookup operation most of the time.
Lastly I also implemented inline caching for slot lookup. Slot lookup basically calculates the offset of the field within an object by counting all the slots in the class hierarchy. We cache this offset based on the argument type and the slot name.
All these changes yield a three-fold performance improvement. The test case now runs in 1 second versus 0.5 seconds for the Lisp implementation. The instruction read count is down from 13 billion before these changes to 2.7 billion.
This is a screenshot of the latest callgrind output. From this call graph for “church-make-closure” we can see that “church-alloc-object” is a candidate for optimization.

Posted in Church, Programming, State | No Comments »
November 18th, 2008
I have implemented a new calling convention for State. Functions now take:
<argument count> <closure pointer> <arg1> <arg2> ...
There are two special forms (load-argument-count) and (load-closure-pointer) for accessing these hidden arguments.
Adding an argument count allows implementation of inline caches (see next post) and a “rest parameter” which collects all extra parameters into a list
To implement the rest behaviour in the Church compiler I modified the church grammar to tag a rest parameter as a “:rest-var”
main args
foo 1 2 3 4 5
foo a *rest
print a
print rest
this will print
1
[2 3 4 5]
and then arrange for the code to call out to some runtime code to construct the list:
(define |church-setup-rest-var| (lambda (rest-var-pointer arg-count fixed-param-count)
(let ((rest-temp TAG_NIL)
(remaining-temp (- arg-count fixed-param-count))
(offset-temp 0))
(tagbody
check
; (call-c-extern |printf| "remaining-temp %lu
;" remaining-temp)
(if (= remaining-temp 0)
(begin
(set! (deref rest-var-pointer) (|church-reverse!| rest-temp))
(go end)))
(push (deref (+ rest-var-pointer offset-temp)) rest-temp)
(set! offset-temp (+ offset-temp 4))
(set! remaining-temp (- remaining-temp 1))
(go check)
end))))
Posted in Church, Programming, State | No Comments »