Exception handling

In my previous post I mentioned work on a new register allocator. Something that was missing from my Church-State project was error recovery. Most languages today use some form of try-catch exception handling (although the Go language uses deferred statements to perform cleanup).

The biggest challenge in implementing exception handling is that functions that use “callee-saved” registers may save these register values on the stack and then use them to store private values. Under normal control flow these values will be loaded from the stack and restored when the function returns. In the case of an exception, the runtime has to inspect each stack frame on the call stack between the point where the exception was thrown and the point where it will be caught and restore the values of these callee saved registers.

The implementation I chose was to emit a label for the function ‘epilogue’, the final sequence of statements before the return instruction, and store this in the object file. At exception handling time I have a routine implemented in assembly which will walk the stack and arrange to invoke this epilogue stub for each function (adjusting the stack pointer appropriately so that the function can find the saved values). Finally it jumps to the exception handler in the context of the function with the try-catch handler.

One complication arises in that it is not possible for variables that are used in the catch block to be stored temporarily in registers in the try block. The risk is that control is transferred out of the try block to the catch block and the catch block won’t know which register holds the current value. This can be avoided by treating every single statement that could raise an exception as having an extra control flow edge to the catch block. I implemented a simpler solution which was to mark all variables used in the catch block for stack allocation only. These variables are effectively permanently spilled to the stack.

The assembly routine to handle exceptions and restore callee saved values from the stack is pretty hairy but I’m quite happy with the solution.


     7 	;; // Thread local storage has the following layout
     8 	;; // (some other thread's heap)	   <------------ limit-pointer
     9 	;; // heap
    10 	;; // heap
    11 	;; // heap
    12 	;; // heap		   		   <------------ allocation-pointer
    13 	;; // heap
    14 	;; // heap
    15 	;; // heap   start of heap                			              %r15
    16 	;; // limit pointer  							-0x8 (%r15)
    17 	;; // allocation pointer  (initially points to start of heap)		-0x10(%r15)
    18 	;; // saved exception object 						-0x18(%r15)
    19 	;; // saved return address when calling unwind stub			-0x20(%r15)
    20 	;; // saved frame pointer for exception trace				-0x28(%r15)
    21 	;; // working stack pointer to be used during stack trace traversal 	-0x30(%r15)
    22 	;; // saved traversal stack pointer 					-0x38(%r15)

...

;; //--------------------------------------------------------------------------------
    57 ;; // Exception handling
    58 ;; //--------------------------------------------------------------------------------
    59 
    60 	
    61 ;; //transfer to exception handler
    62 	.globl __l_throw_exception
    63 
    64 
    65 .macro switch_to_original_stack
    66 	mov %rsp, -0x30(%r15)
    67 	mov -0x38(%r15), %rsp
    68 .endm
    69 
    70 .macro switch_to_working_stack
    71 	mov %rsp, -0x38(%r15)
    72 	mov -0x30(%r15), %rsp
    73 .endm
    74 	
    75 __l_throw_exception:
    76 ;; // first argument is the exception object
    77 ;; // save it in thread local memory
    78 	mov %rdi, -0x18(%r15)
    79 	;; // also save the frame pointer so that we can generate a stack trace later if necessary
    80 	mov %rbp, -0x28(%r15)
    81 	;; // and keep a pointer to a 'working stack' so that we call helper routines without destroying the original stack frames
    82 	mov %rsp, -0x30(%r15)
    83 __find_exception_handler:
    84 	;; // get the return address from the stack
    85 	;; // and check if there is an exception handler installed around that address
    86 	mov (%rsp), %rdi
    87 	switch_to_working_stack
    88 	call __l_find_exception_handler
    89 	switch_to_original_stack
    90 	test %rax,%rax
    91 	jnz found_exception_handler
    92 	;; // if there is no exception handler, unwind this frame by finding the unwind stub for this function
    93 	mov (%rsp), %rdi
    94 	switch_to_working_stack
    95 	call __l_find_unwind_stub
    96 	switch_to_original_stack
    97 	test %rax,%rax
    98 	jz no_unwind
    99 	;; // pop off the return address (restore the stack to the calling function's expected value)
   100 	pop %rcx
   101 	;; // get the return address of the previous frame and save it to thread local memory
   102 	mov 0x8(%rbp), %rdx
   103 	mov %rdx, -0x20(%r15)
   104 	;; // do some magic to get the address of 'after_unwind_stub'
   105 	lea 0x6(%rip), %rcx
   106 	;; // save it in place of the normal return address
   107 	mov %rcx, 0x8(%rbp)
   108 	;; // branch to the unwind stub
   109 	jmp *%rax
   110 after_unwind_stub:
   111 	;; // now restore the real return address that we stored in thread local memory
   112 	mov -0x20(%r15), %rax
   113 	push %rax
   114 	;; // repeat from the top until we find an exception handler
   115 	jmp __find_exception_handler
   116 
   117 found_exception_handler:
   118 	;; // load the exception object that we stored in thread local memory
   119 	mov -0x18(%r15), %rdi
   120 	;; // pop old return address
   121 	pop %rcx
   122 	;; // branch to exception handler
   123 	jmp *%rax
   124 	
   125 no_unwind:	
   126 no_exception_handler:
   127 	mov -0x28(%r15), %rdi
   128 	mov -0x18(%r15), %rsi
   129 	switch_to_working_stack
   130 	call __l_top_level_exception_handler
   131 	;;  // no return?
   132 	int $3
   133 

Skype on 64-bit debian

Skype for linux 4.2.0

This debian package also depends on a multiarch setup.

wget http://download.skype.com/linux/skype-debian_4.2.0.11-1_i386.deb
sudo dpkg -i skype-debian_4.2.0.11-1_i386.deb


Skype for linux 4.0.0

Installing skype on a 64-bit linux machine is tricky. There is no 64-bit binary.

The best instructions I found are here.

I had to install the following 32-bit libraries:

apt-get install libpcre3:i386 libglib2.0-0:i386 libasound2:i386 libc6:i386 libgcc1:i386 libqt4-dbus:i386 libqt4-network:i386 libqt4-xml:i386 libqtcore4:i386 libqtgui4:i386 libqtwebkit4:i386 libstdc++6:i386 libx11-6:i386 libxss1:i386 libxv1:i386 libssl1.0.0:i386

I also ran into a bug with the internal microphone on my IdeaPad U300s

[Lenovo IdeaPad U300s, Conexant CX20590, Mic, Internal] Background noise or low volume (phase inversion)

Loop macros

I have released the latest version of Church-State here:
http://subvert-the-dominant-paradigm.net/~jewel/church/church-alpha-4.tar.gz

You can also get the source code via mercurial:

hg clone http://subvert-the-dominant-paradigm.net/repos/hgwebdir.cgi/bootstrap/

or browse it.

In the previous version of Church loop constructs were expanded in the Church compiler directly to the State “tagbody” form. I have removed this code and implemented it as a Church macro which expands to a new Church “tagbody” form.

The aim behind this is to make it easier to do optimization and flow analysis on Church code in the future.

User extensible parsing

I have been experimenting with user-extensible parsers in Church, even though I don’t have a use for them yet.

I added a new language construct called “extend-grammar”, ie:


extend-grammar ometa testg {
test-rule = "test" ws+ cname:a -> << 42 >>
}

This test-rule will match the string “test”, followed by a name. The rule will ignore its input and return the value 42.

This grammar is processed during parse time, converted to church code and dynamically compiled and linked into the running process.

At present this happens after the whole file is parsed, so it’s currently not possible to add a new grammar rule and use it in the same source file.

To activate the rule I added another construct, “eval-when”, ie:


eval-when compile load
      church-add-parser-extension 'test-rule     
                                                                                    

What this does is to execute this code when this file is compiled (load doesn’t work yet).

In this example we add ‘test-rule to the list of ometa functions to be called by a special new grammar rule called ‘user-form.

The new version of ‘user-form is compiled and linked into the process, immediately making it available to the parser.

For this test case I then load the following file:


dotest
        print "in dotest"
        print (test notanumber)

which prints 42 because the parser has intercepted what would normally be a method call.

Rude repl

I have added eval and a repl (read eval print loop) to Church.

> + 3 4
7
> length "foobar"
6
> eval "* 19 3"
57
> load "church/test/hello.church"
true
> (main)
"hello"
true

Up to now I have only been using Church as a command-line compiler which produces executable files. Yet I have always preferred interactive language environments (lisp, smalltalk, python etc) to stop-compile-run languages.

All the machinery for writing a repl has been in Church for a while, including

  • Church parser available as a library
  • Church compiler available at runtime
  • Machine code generator and dynamic linker available at runtime
  • The ability to modify the dispatch table at runtime

This makes the repl easy to implement:

repl
        loop
                do
                        rstr = (read-line)
                        if (null? rstr)
                                return-from repl nil
                        else
                                print (eval rstr)

Eval is a little bit more messy because the compiler is designed to compile a whole method at a time and doesn’t know what to do with variables that are not either a global or local variable.

To implement eval I wrapped the eval string in a lambda that I return from a method which gets run after compilation.

set-eval-compiled-function
   eval-compiled-function = (fn -- eval_str) 

This works for certain language expressions, but does not presently provide the ability to assign to “repl” variables.

Another drawback is that multiline statements are currently not possible (unlike python and lisp).

In the future I hope to experiment with a SLIME-like extension to emacs which communicates to Church across a socket, allowing interactive evaluation and compilation of source code from within the editor.

jonesforth64

I have ported jonesforth to 64-bit x86 code.

jonesforth is a tutorial-style implementation of forth which explains in detail how the compiler and runtime is implemented. Porting the code to a slightly different assembly language helped me to think carefully about what each primitive does and about how they are used in the runtime code.

As noted in the jonesforth comments, the original advantage of using direct-threaded code on a 16-bit machine is that calling each word can be encoded in two bytes instead of three. That’s a savings of 33%. On 32-bit x86, it’s four bytes versus five, saving 20%. In my 64-bit implementation I chose to extend the 4-byte addresses to 8-byte words. This actually results in wasting space rather than saving it because on x86-64 calls and branches are usually encoded with a 5-byte instruction using relative displacement.

The port was fairly straightforward, I mostly just replaced 32-bit registers (eax, esp, esi etc) with the 64-bit equivalents (rax, rsp, rsi etc) and changed every reference to the word-size from 4 to 8.

The biggest difference is that syscalls use different registers on 64-bit linux and these registers can be clobbered during the call.

You can get the code from the mercurial repository (or browse it here):

hg clone http://subvert-the-dominant-paradigm.net/repos/jonesforth64/

To compile it:


gcc -m64 -nostdlib -static -Wl,-Ttext,0 -Wl,--build-id=none -o jonesforth64 jonesforth64.S

To run it:


cat jonesforth64.f - | ./jonesforth64
JONESFORTH VERSION 45
6393 CELLS REMAINING
OK
1 2 3 4
.S
4 3 2 1
ROT
.S
3 2 4 1

I’ve tested most of the code in the .f file, but I haven’t yet implemented C strings, file-io or the built-in assembler.

I’ve tried to keep the comments intact, but haven’t updated them to reflect different word sizes or registers etc.

Macros and dynamic compilation

I have released the next version of my Church-State system at:

http://subvert-the-dominant-paradigm.net/~jewel/church/church-alpha-1.tar.gz.

This release adds proper Church macros and dynamic compilation to the system. In this example, a macro for moving an immediate value to a register


macro MOVLir imm reg
        <<
                _Or_L 0xb8 (_r4 \reg) \imm

>>

you can see quoted code wrapped in angular brackets and backslashes used to unquote values that must be inserted into the template.

The macro syntax is not final (like most of Church's syntax) and it should look cleaner as it matures.

To handle macros like these the system is now capable of taking Church source code parsed out of the macro body, compiling it to low-level State code and then compiling and linking the resulting machine code into the running image.

This involves resolving low level symbols (I use dlsym) and modifying the Church dispatch tables and macro rewriters to use the new macros. I also have to run an initialization function which sets up the "code vector" with constant values (interned symbols etc).

Now that I have dynamic compilation working, it should be fairly easy to add a REPL to the system.

Performance

As part of this release I have also disabled a lot of optimizations that I had worked on before. These include inline caches for method dispatch and slot access. The reason I have disabled these optimizations is that they cost too much in terms of space compared to the benefit they provide in improved speed.

I'm now pursuing a new approach which uses "class wrappers" marked with random seeds. The idea is that these seeds can be used to hash into lookup tables which memoize the effective method for dispatch. I hope to be able to incorporate these ideas plus others from contemporary implementations (of javascript vms etc) to make the system substantially faster.

Church release

I’m proud to have reached the stage where my Church-State system can compile itself (ie the compiler is bootstrapped).

I have made the first alpha release available at:

http://subvert-the-dominant-paradigm.net/~jewel/church/church-alpha-0.tar.gz

To try it out you’ll need a 32-bit x86 linux system with “ld” installed. (Usually ld will be installed if you’ve installed something like gcc).

There are two simple test files mentioned in the README and there are also instructions for bootstrapping the system.

One thing missing from the release is a compiler that compiles the output from the OMeta parser generator to Church files. That means it’s not possible to change the grammars just yet.

Another incomplete feature is that Church and State macros are hard-coded into the compiler. If you look at church-pass1.church and state-pass1.church you’ll see the various hard-coded macros (some of which are quite complex). To be able to include these macros in the source files where they are used I need to be able to dynamically compile and load church code. I’ve completed the first step of this process, see state-dynamic.church and church-test-dynamic-alloc.church for working code that can compile a church file down to native code, allocate memory for it and link it into the running image.

Once I have Church macros working, I plan to rewrite a lot of assembler-i386.church to use macros instead of functions for emitting machine instructions. I think that this will dramatically improve compilation times. While preparing for this release I did a lot of work on performance, even removing array bounds checking and some other safety checks to make it faster. Currently the system bootstraps in 90 seconds on my laptop, but my goal is to be 2 or 3 times as fast.

Improving dispatch performance in Church

Over the last week I have been working on performance optimizations for Church code. My basic test case is to run the OMeta parser (implemented in Church) on the “base.church” file in the Church runtime library. This file implements most of the Church runtime code. Parsing this file with the Common Lisp implementation of OMeta takes 0.5 seconds on my laptop. Initial runs of the Church code took well over a minute.

The first major change was to cache symbol lookups. Previously I was storing all symbol objects in an a-list (a linked list) and every time a symbol was used the table would be searched linearly using string comparisons. To avoid this lookup I modified the State compiler to take a (load-constant-value <raw-value>) form. When the state compiler compiles this form it creates a cell in the data segment from which to load a constant value. It also generates some initialization code which will call out to the runtime to initialize the constant.

By collecting all these initialization sequences and running them on startup we can intern all the symbols used by the program before it starts running. Each symbol reference thereafter is a simple memory load.

Using these cached symbols makes class lookup a lot easier too. Previously I stored all classes in a table keyed by symbol and searched it linearly when looking up classes. In the new implementation I store class objects in a “class” slot associated with the symbol representing the class’s name.

Similarly, the dispatch table was stored as a global list and searched first by symbol (selector) and then compared according to argument types. I moved the dispatch rules to the symbol objects, making the search much shorter.

These changes brought the instruction execution count down to about 40 billion instructions for Church (versus about 650 million for Lisp).

Further optimizations involved removing all cons’ing from the dynamic dispatch routines, inlining as many of those calls as possible to avoid function call overhead and rewriting recursive routines as tagbody loops.

I also reordered the tests in the “class-of” code to check the most common cases first.

All these optimizations brought the time down to about 4.5 seconds and the execution count to 13 billion. This is still about a factor of 10 slower than the lisp implementation. Eventually I hope to bring it to about a factor of 2, possible future optimizations are inline-caches for method lookup and optimized cons and closure allocation. Eventually I might also look at more sophisticated approaches, such as type inference, an optimizing compiler pass and runtime profiling.

To profile the code I tried two tools, Intel’s VTune and the callgrind module of valgrind. VTune was quite disappointing, besides having to paste serial numbers into an obscure installation utility (which failed the first few times) the download was 500 megabytes. After installation the sampling driver failed to work but I managed to run the call tracing module.

valgrind provides similar information to VTune, but the kcachegrind visualizer is much better, the call graph is very easy to work with and is also possible to see hot loops at the assembly level.

jolt-burg

In April I posted to the fonc mailing list about my port of Ian Piumarta’s “jolt-burg”.

This is a combination of a simple lisp-like language (“jolt”) and a compiler from a tree of “instruction” objects to native machine code.

My implementation was in Common Lisp and my compiler targeted relocatable ELF object files.

From my post:

Currently I have a compiler that can read sexp input, eg:

(define |main| (lambda ()
        ((call-c |dlsym| 0 "puts") "hello, dynamic world!")))

and then compile it down to a tree of blocks and instruction objects (see instruction.lisp or Instruction.st) 

BLOCK(--- NIL ---)                                                                                                                                                         
| | 0(--- VOID BLOCK --- 369                                                                                                                                               
| | | | | 0(+++ REG CALLI4 ---                                                                                                                                             
| | | | | | 0(+++ REG CNSTP4 --- reloc ---)                                                                                                                                
| | | | | | 0(+++ REG CNSTI4 --- 0)                                                                                                                                        
| | | | | | 0(+++ REG CNSTP4 --- reloc ---))                                                                                                                               
| | | | | 0(+++ REG CNSTP4 --- reloc ---))                                                                                                                                 
| | 0(--- VOID BLOCK --- 369                                                                                                                                               
| | | | | 0(+++ REG CALLI4 EBX                                                                                                                                             
| | | | | | 0(+++ REG CNSTP4 EBX reloc ---)                                                                                                                                
| | | | | | 0(+++ REG CNSTI4 EAX 0)                                                                                                                                        
| | | | | | 0(+++ REG CNSTP4 ECX reloc ---))                                                                                                                               
| | | | | 0(+++ REG CNSTP4 EAX reloc ---))                       

This tree is processed by the burg compiler to emit x86 machine code. (in the diagram we see registers being assigned to the tree nodes)

The last stage is when I collect the machine code generated by the burg compiler and combine it with data and relocation entries to generate an ELF object file.

I can then pass this to the unix linker (ld) to produce an executable.

ld --dynamic-linker=/lib/ld-linux.so.2 state/start.o state/hello.state.o -ldl  -lc -o myhello

Here you can see the disassembled elf object file with relocation entries for the "dlsym" symbol and the "puts" and "hello world" strings.

objdump -d state/hello.state.o -r

Disassembly of section .text:

00000000 
: 0: 55 push %ebp 1: 89 e5 mov %esp,%ebp 3: 53 push %ebx 4: 83 ec 14 sub $0x14,%esp 7: bb 00 00 00 00 mov $0x0,%ebx 8: R_386_32 dlsym c: b8 00 00 00 00 mov $0x0,%eax 11: b9 00 00 00 00 mov $0x0,%ecx 12: R_386_32 _data_286 16: 89 4c 24 04 mov %ecx,0x4(%esp) 1a: 89 04 24 mov %eax,(%esp) 1d: ff d3 call *%ebx 1f: 89 c3 mov %eax,%ebx 21: b8 00 00 00 00 mov $0x0,%eax 22: R_386_32 _data_287 26: 89 04 24 mov %eax,(%esp) 29: ff d3 call *%ebx 2b: 89 c0 mov %eax,%eax 2d: 89 c0 mov %eax,%eax 2f: 83 c4 14 add $0x14,%esp 32: 5b pop %ebx 33: 5d pop %ebp 34: c3 ret

You can view the code at:

http://subvert-the-dominant-paradigm.net/repos/hgwebdir.cgi/bootstrap/

(see the burg directory)

(click files to see the files, or run ‘hg clone <url>’ to get a local copy)