Since my last post on performance improvements I have achieved more speedups.
My primary test case is running an OMeta parser against the base.church runtime file. A critical factor in the performance of this type of parser is the memoization of previous parse results. (When the parser backtracks it may apply the same rule for the same input several times, if the results are memoized they can be reused instead of being recomputed).
Previously I had implemented a high-level hashtable in Church for the memo table. Due to the overhead of dynamic dispatch in the hashing and array access, this was quite slow. I replaced it with a low-level hashtable from kazlib. As a rule I have tried to minimize external dependencies for this project, but at this point I would rather reuse this external library than rewrite it in State.
Next I implemented inline caching for dynamic dispatch. By storing the argument types and the code pointer for the previous call in the “code vector” associated with a function, it’s possible to avoid the expensive lookup operation most of the time.
Lastly I also implemented inline caching for slot lookup. Slot lookup basically calculates the offset of the field within an object by counting all the slots in the class hierarchy. We cache this offset based on the argument type and the slot name.
All these changes yield a three-fold performance improvement. The test case now runs in 1 second versus 0.5 seconds for the Lisp implementation. The instruction read count is down from 13 billion before these changes to 2.7 billion.
This is a screenshot of the latest callgrind output. From this call graph for “church-make-closure” we can see that “church-alloc-object” is a candidate for optimization.
