Over the last week I have been working on performance optimizations for Church code. My basic test case is to run the OMeta parser (implemented in Church) on the “base.church” file in the Church runtime library. This file implements most of the Church runtime code. Parsing this file with the Common Lisp implementation of OMeta takes 0.5 seconds on my laptop. Initial runs of the Church code took well over a minute.
The first major change was to cache symbol lookups. Previously I was storing all symbol objects in an a-list (a linked list of pairs), and every time a symbol was used the table was searched linearly using string comparisons. To avoid this lookup I modified the State compiler to accept a (load-constant-value <raw-value>) form. When the State compiler compiles this form it creates a cell in the data segment from which to load the constant value, and it also generates initialization code that calls out to the runtime to initialize the constant.
By collecting all these initialization sequences and running them on startup we can intern all the symbols used by the program before it starts running. Each symbol reference thereafter is a simple memory load.
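The scheme above can be sketched roughly as follows. This is an illustrative Python model, not the actual Church compiler API; the names (`load_constant_value`, `init_thunks`, and so on) are my own inventions for the sketch.

```python
# Hypothetical sketch of the symbol-caching scheme described above.

class Symbol:
    def __init__(self, name):
        self.name = name

# Old scheme: every symbol reference searches an a-list linearly.
symbol_alist = []  # list of (name, Symbol) pairs

def intern_slow(name):
    for key, sym in symbol_alist:
        if key == name:               # string comparison on every probe
            return sym
    sym = Symbol(name)
    symbol_alist.append((name, sym))
    return sym

# New scheme: the compiler emits one constant cell per symbol literal
# and an init thunk that interns the symbol once at startup.
constant_cells = {}                   # stands in for data-segment cells
init_thunks = []

def load_constant_value(name):
    """Compile-time: reserve a cell and register its initializer."""
    def init():
        constant_cells[name] = intern_slow(name)
    init_thunks.append(init)
    return lambda: constant_cells[name]   # runtime: a plain load

ref = load_constant_value("foo")      # compiled reference to the symbol foo
for thunk in init_thunks:             # run all initializers once at startup
    thunk()
# ...after which each reference is a single load, not a linear search.
assert ref() is intern_slow("foo")
```

The key property is that the linear string search happens exactly once per symbol, at startup, instead of once per reference.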
Using these cached symbols makes class lookup a lot easier too. Previously I stored all classes in a table keyed by symbol and searched it linearly when looking up classes. In the new implementation I store class objects in a “class” slot associated with the symbol representing the class’s name.
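In Python terms, the change looks something like this; again the class and slot names are illustrative, not Church's actual ones.

```python
# Hypothetical sketch: storing a class on its name symbol's "class" slot.

class Symbol:
    def __init__(self, name):
        self.name = name
        self.church_class = None      # the "class" slot; set when defined

class ChurchClass:
    def __init__(self, name_sym):
        self.name = name_sym
        name_sym.church_class = self  # register the class on its symbol

point_sym = Symbol("Point")
point_class = ChurchClass(point_sym)

# Old: linear search through a global class table keyed by symbol.
# New: class lookup is a single slot read on the interned symbol.
def lookup_class(sym):
    return sym.church_class

assert lookup_class(point_sym) is point_class
```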
Similarly, the dispatch table was stored as a global list and searched first by symbol (selector) and then compared according to argument types. I moved the dispatch rules to the symbol objects, making the search much shorter.
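The same idea in sketch form: instead of one global method list filtered first by selector and then by argument types, each selector symbol carries only its own rules. This is a minimal Python model, not Church's dispatch code.

```python
# Hypothetical sketch: per-selector dispatch rules instead of one global list.

class Selector:
    def __init__(self, name):
        self.name = name
        self.rules = []               # dispatch rules for this selector only

    def add_rule(self, arg_types, fn):
        self.rules.append((arg_types, fn))

    def dispatch(self, *args):
        # Only this selector's rules are scanned, instead of every
        # method in the program.
        for arg_types, fn in self.rules:
            if len(args) == len(arg_types) and all(
                    isinstance(a, t) for a, t in zip(args, arg_types)):
                return fn(*args)
        raise TypeError("no applicable method for " + self.name)

plus = Selector("+")
plus.add_rule((int, int), lambda a, b: a + b)
plus.add_rule((str, str), lambda a, b: a + b)
assert plus.dispatch(1, 2) == 3
assert plus.dispatch("a", "b") == "ab"
```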
These changes brought the instruction count down to about 40 billion for Church (versus about 650 million for Lisp).
Further optimizations involved removing all cons'ing from the dynamic dispatch routines, inlining as many of those calls as possible to avoid function-call overhead, and rewriting recursive routines as tagbody loops.
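The recursion-to-loop rewrite is done with tagbody in the Common Lisp runtime; the same transformation looks like this in Python (the a-list search function is an illustrative example, not the actual Church code).

```python
# Recursive version: each step pays function-call overhead, and the
# tail slice conses a fresh list on every iteration.
def assq_rec(key, alist):
    if not alist:
        return None
    head = alist[0]
    if head[0] == key:
        return head
    return assq_rec(key, alist[1:])

# Loop version (what the tagbody rewrite achieves in Common Lisp):
# no calls, no intermediate allocation, just a jump back to the top.
def assq_loop(key, alist):
    i = 0
    while i < len(alist):
        if alist[i][0] == key:
            return alist[i]
        i += 1
    return None

pairs = [("a", 1), ("b", 2)]
assert assq_loop("b", pairs) == assq_rec("b", pairs) == ("b", 2)
```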
I also reordered the tests in the “class-of” code to check the most common cases first.
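The effect of the reordering can be illustrated like this; the particular types and their frequencies are made up for the sketch, not taken from Church's actual "class-of".

```python
# Sketch of reordering type tests so the hottest cases come first.

def class_of_before(obj):
    # Rare cases tested first: common objects fall through many checks.
    if isinstance(obj, bytes):  return "ByteArray"
    if isinstance(obj, float):  return "Float"
    if isinstance(obj, bool):   return "Boolean"
    if isinstance(obj, int):    return "Integer"
    if isinstance(obj, str):    return "String"
    return "Object"

def class_of_after(obj):
    # Most frequent cases first: the common path exits on an early check.
    if isinstance(obj, bool):   return "Boolean"  # bool before int: bool is an int subtype
    if isinstance(obj, int):    return "Integer"
    if isinstance(obj, str):    return "String"
    if isinstance(obj, float):  return "Float"
    if isinstance(obj, bytes):  return "ByteArray"
    return "Object"

# Same answers, but the common cases now take one or two tests
# instead of four or five.
assert class_of_after(42) == class_of_before(42) == "Integer"
assert class_of_after("x") == class_of_before("x") == "String"
```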
All these optimizations brought the time down to about 4.5 seconds and the instruction count to 13 billion. This is still about a factor of 10 slower than the Lisp implementation; eventually I hope to bring it to within a factor of 2. Possible future optimizations include inline caches for method lookup and optimized cons and closure allocation. Later I might also look at more sophisticated approaches, such as type inference, an optimizing compiler pass, and runtime profiling.
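For the inline-cache idea, a minimal monomorphic cache looks something like this. This is a generic sketch of the technique, not a design for Church; the names are hypothetical.

```python
# Hypothetical monomorphic inline cache for method lookup: each call
# site remembers the last receiver class and the method it resolved to,
# so repeated calls on the same class skip the full dispatch search.

def full_lookup(cls, selector):
    # Stand-in for the slow dispatch search.
    return getattr(cls, selector)

class CallSite:
    def __init__(self, selector):
        self.selector = selector
        self.cached_class = None
        self.cached_method = None

    def call(self, receiver, *args):
        cls = type(receiver)
        if cls is not self.cached_class:      # miss: do the full lookup
            self.cached_class = cls
            self.cached_method = full_lookup(cls, self.selector)
        return self.cached_method(receiver, *args)

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y
    def norm2(self):
        return self.x * self.x + self.y * self.y

site = CallSite("norm2")
assert site.call(Point(3, 4)) == 25   # first call: miss, fills the cache
assert site.call(Point(1, 0)) == 1    # same class: hit, no full lookup
```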
To profile the code I tried two tools: Intel's VTune and the callgrind tool of Valgrind. VTune was quite disappointing: besides having to paste serial numbers into an obscure installation utility (which failed the first few times), the download was 500 megabytes. After installation the sampling driver failed to work, but I managed to run the call-tracing module.
Valgrind provides information similar to VTune's, but the KCachegrind visualizer is much better: the call graph is very easy to work with, and it is also possible to see hot loops at the assembly level.