As part of my performance work I have implemented inline slot caches for slot reads and writes in Church. These are implemented by patching the program code at runtime with assembly code that checks the type of the target object and loads the correct offset for slot access.
In this example we see the code that prepares a call to ‘church-fixup-initial-slot-access’. The first two arguments on the stack, (%esp) and 0x4(%esp), are the argument count and closure pointer used in the State calling convention. The next two, 0x8(%esp) and 0xc(%esp), are the object being accessed and the symbol naming the slot to be accessed.
    0x080ce733 mov %ebx,0xc(%esp)
    0x080ce737 mov %edx,0x8(%esp)
    0x080ce73b mov %ecx,0x4(%esp)
    0x080ce73f mov %eax,(%esp)
    0x080ce742 call 0x80ab3b4
    0x080ce747 mov %eax,%eax
    0x080ce749 nop
    0x080ce74a nop
    0x080ce74b nop
    0x080ce74c nop
The fixup routine gets the type of the object and examines all the parent classes to determine the correct offset for accessing the slot in this object. It then generates x86 machine code and patches the calling function. At the moment I do this by directly emitting a byte sequence for each instruction. This is quite crude and error-prone, but manageable when such a small amount of code is being generated.
    ; overwrite the 5-byte call with nops
    (write-byte! patch-start #x90)
    (write-byte! patch-start #x90)
    (write-byte! patch-start #x90)
    (write-byte! patch-start #x90)
    (write-byte! patch-start #x90)
    ; mov 0x8(%esp),%ecx (stored little-endian as 8b 4c 24 08)
    (write-word! patch-start #x08244c8b)
    ; cmpl $obj-type,(%ecx)
    (write-byte! patch-start #x81)
    (write-byte! patch-start #x39)
    (write-word! patch-start obj-type)
    ; je
    (write-byte! patch-start #x74)
    (write-byte! patch-start #x7)
    ...
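As a cross-check on the byte sequence above, here is a rough Python sketch that assembles the same first 17 bytes and confirms that the `je` displacement lands on the fast-path load. It is not the State code itself: `emit_type_check` and `PATCH_ADDR` are names invented for this illustration, and the class-wrapper address is simply the literal that appears in the patched disassembly.

```python
import struct

NOP = b"\x90"
PATCH_ADDR = 0x080CE742  # hypothetical name: address of the call being overwritten


def emit_type_check(obj_type: int, skip: int = 7) -> bytes:
    """Assemble the start of the patch: nops, load, compare, conditional jump."""
    code = bytearray()
    code += NOP * 5                        # overwrite the 5-byte call
    code += struct.pack("<I", 0x08244C8B)  # 8b 4c 24 08: mov 0x8(%esp),%ecx
    code += b"\x81\x39"                    # cmpl $imm32,(%ecx)
    code += struct.pack("<I", obj_type)    # imm32: expected class wrapper
    code += b"\x74" + bytes([skip])        # je rel8
    return bytes(code)


patch = emit_type_check(0x08302993)

# rel8 is measured from the end of the je instruction.
je_target = PATCH_ADDR + len(patch) + patch[-1]

print(patch.hex(" "))  # 90 90 90 90 90 8b 4c 24 08 81 39 93 29 30 08 74 07
print(hex(je_target))  # 0x80ce75a
```

Computing the jump target from the emitted length, rather than trusting a hard-coded 7, is a cheap way to catch the off-by-one mistakes that hand-written byte sequences invite.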
First the old call is overwritten with nops and then we emit some comparison and jump instructions. The final output looks like this:
    0x080ce733 mov %ebx,0xc(%esp)
    0x080ce737 mov %edx,0x8(%esp)
    0x080ce73b mov %ecx,0x4(%esp)
    0x080ce73f mov %eax,(%esp)
    0x080ce742 nop
    0x080ce743 nop
    0x080ce744 nop
    0x080ce745 nop
    0x080ce746 nop
    0x080ce747 mov 0x8(%esp),%ecx
    0x080ce74b cmpl $0x8302993,(%ecx)
    0x080ce751 je 0x80ce75a
    0x080ce753 call 0x80ab8ba
    0x080ce758 jmp 0x80ce760
    0x080ce75a mov 0x4(%ecx),%eax
    0x080ce760 mov %eax,%eax
    0x080ce762 nop
    0x080ce763 nop
    0x080ce764 nop
    0x080ce765 nop
The untagged object pointer is moved into %ecx, and its first word (which points to the class wrapper for this object) is compared with the literal address of the class wrapper seen the first time. If they match, we jump to the fast path, which simply loads the slot at the precomputed offset (0x4) into %eax. If not, we fall through to a call to a runtime function that does a conventional (but much slower) lookup, and then jump past the fast-path load.
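The control flow of the patched site may be easier to follow as a small model. The Python sketch below mimics one access site that remembers a single class and offset. Everything in it is an assumption made for illustration: the `Klass`, `Obj` and `SlotCache` names, and the layout rule (inherited slots first, then the class's own) are not taken from Church's actual object representation.

```python
class Klass:
    """Hypothetical class wrapper: own slot names plus an optional parent."""

    def __init__(self, name, slot_names, parent=None):
        self.name = name
        self.slot_names = list(slot_names)
        self.parent = parent

    def layout(self):
        # Assumed layout: inherited slots first, then this class's own slots.
        inherited = self.parent.layout() if self.parent else []
        return inherited + self.slot_names


class Obj:
    def __init__(self, cls, values):
        self.cls = cls
        self.slots = list(values)


def slow_lookup_offset(cls, slot_name):
    """Conventional lookup: walk the class chain to find the slot's offset."""
    return cls.layout().index(slot_name)


class SlotCache:
    """One cache per access site, fixed to the first class it sees."""

    def __init__(self, slot_name):
        self.slot_name = slot_name
        self.cached_class = None
        self.cached_offset = None
        self.hits = 0
        self.misses = 0

    def read(self, obj):
        if self.cached_class is None:
            # First access: the "fixup" step records the class and offset.
            self.cached_class = obj.cls
            self.cached_offset = slow_lookup_offset(obj.cls, self.slot_name)
            return obj.slots[self.cached_offset]
        if obj.cls is self.cached_class:
            # Type check passed: load from the precomputed offset.
            self.hits += 1
            return obj.slots[self.cached_offset]
        # Type check failed: fall back to the slow lookup.
        self.misses += 1
        return obj.slots[slow_lookup_offset(obj.cls, self.slot_name)]


point = Klass("point", ["x", "y"])
point3d = Klass("point3d", ["z"], parent=point)

site = SlotCache("y")
results = [
    site.read(Obj(point, [1, 2])),       # fixup
    site.read(Obj(point, [10, 20])),     # hit
    site.read(Obj(point3d, [7, 8, 9])),  # miss, slow path
]
print(results, site.hits, site.misses)  # [2, 20, 8] 1 1
```

In this model a miss does not replace the cached class, which mirrors the single comparison against the class wrapper seen the first time.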