If you’re flashing a BIOS for an MSI motherboard (mine is a B85M-E33) using the M-FLASH utility available from BIOS setup, make sure that the partition on your USB flash drive is set to type 6 (FAT16) using the ‘t’ command in fdisk.
In my previous post I mentioned work on a new register allocator. Something that was missing from my Church-State project was error recovery. Most languages today use some form of try-catch exception handling (although the Go language uses deferred statements to perform cleanup).
The biggest challenge in implementing exception handling is that functions that use “callee-saved” registers may save these register values on the stack and then use them to store private values. Under normal control flow these values will be loaded from the stack and restored when the function returns. In the case of an exception, the runtime has to inspect each stack frame on the call stack between the point where the exception was thrown and the point where it will be caught and restore the values of these callee saved registers.
The implementation I chose was to emit a label for the function ‘epilogue’, the final sequence of statements before the return instruction, and store this in the object file. At exception handling time I have a routine implemented in assembly which will walk the stack and arrange to invoke this epilogue stub for each function (adjusting the stack pointer appropriately so that the function can find the saved values). Finally it jumps to the exception handler in the context of the function with the try-catch handler.
One complication arises in that it is not possible for variables that are used in the catch block to be stored temporarily in registers in the try block. The risk is that control is transferred out of the try block to the catch block and the catch block won’t know which register holds the current value. This can be avoided by treating every single statement that could raise an exception as having an extra control flow edge to the catch block. I implemented a simpler solution which was to mark all variables used in the catch block for stack allocation only. These variables are effectively permanently spilled to the stack.
The assembly routine to handle exceptions and restore callee saved values from the stack is pretty hairy but I’m quite happy with the solution.
7 ;; // Thread local storage has the following layout 8 ;; // (some other thread's heap) <------------ limit-pointer 9 ;; // heap 10 ;; // heap 11 ;; // heap 12 ;; // heap <------------ allocation-pointer 13 ;; // heap 14 ;; // heap 15 ;; // heap start of heap %r15 16 ;; // limit pointer -0x8 (%r15) 17 ;; // allocation pointer (initially points to start of heap) -0x10(%r15) 18 ;; // saved exception object -0x18(%r15) 19 ;; // saved return address when calling unwind stub -0x20(%r15) 20 ;; // saved frame pointer for exception trace -0x28(%r15) 21 ;; // working stack pointer to be used during stack trace traversal -0x30(%r15) 22 ;; // saved traversal stack pointer -0x38(%r15) ... ;; //-------------------------------------------------------------------------------- 57 ;; // Exception handling 58 ;; //-------------------------------------------------------------------------------- 59 60 61 ;; //transfer to exception handler 62 .globl __l_throw_exception 63 64 65 .macro switch_to_original_stack 66 mov %rsp, -0x30(%r15) 67 mov -0x38(%r15), %rsp 68 .endm 69 70 .macro switch_to_working_stack 71 mov %rsp, -0x38(%r15) 72 mov -0x30(%r15), %rsp 73 .endm 74 75 __l_throw_exception: 76 ;; // first argument is the exception object 77 ;; // save it in thread local memory 78 mov %rdi, -0x18(%r15) 79 ;; // also save the frame pointer so that we can generate a stack trace later if necessary 80 mov %rbp, -0x28(%r15) 81 ;; // and keep a pointer to a 'working stack' so that we call helper routines without destroying the original stack frames 82 mov %rsp, -0x30(%r15) 83 __find_exception_handler: 84 ;; // get the return address from the stack 85 ;; // and check if there is an exception handler installed around that address 86 mov (%rsp), %rdi 87 switch_to_working_stack 88 call __l_find_exception_handler 89 switch_to_original_stack 90 test %rax,%rax 91 jnz found_exception_handler 92 ;; // if there is no exception handler, unwind this frame by finding the unwind stub for this function 93 mov (%rsp), %rdi 94 switch_to_working_stack 95 call __l_find_unwind_stub 96 switch_to_original_stack 97 test %rax,%rax 98 jz no_unwind 99 ;; // pop off the return address (restore the stack to the calling function's expected value) 100 pop %rcx 101 ;; // get the return address of the previous frame and save it to thread local memory 102 mov 0x8(%rbp), %rdx 103 mov %rdx, -0x20(%r15) 104 ;; // do some magic to get the address of 'after_unwind_stub' 105 lea 0x6(%rip), %rcx 106 ;; // save it in place of the normal return address 107 mov %rcx, 0x8(%rbp) 108 ;; // branch to the unwind stub 109 jmp *%rax 110 after_unwind_stub: 111 ;; // now restore the real return address that we stored in thread local memory 112 mov -0x20(%r15), %rax 113 push %rax 114 ;; // repeat from the top until we find an exception handler 115 jmp __find_exception_handler 116 117 found_exception_handler: 118 ;; // load the exception object that we stored in thread local memory 119 mov -0x18(%r15), %rdi 120 ;; // pop old return address 121 pop %rcx 122 ;; // branch to exception handler 123 jmp *%rax 124 125 no_unwind: 126 no_exception_handler: 127 mov -0x28(%r15), %rdi 128 mov -0x18(%r15), %rsi 129 switch_to_working_stack 130 call __l_top_level_exception_handler 131 ;; // no return? 132 int $3 133
I’ve been reading a lot about register allocation algorithms, register allocation is probably one of the most complex parts of a compiler.
A SURVEY OF REGISTER ALLOCATION TECHNIQUES by Jonathan Protzenko is a great overview of what algorithms are suited to a particular compilation scenario.
I focused on the family of linear-scan register allocators first proposed by Poletto and Sarkar. The `c compiler has a really clean implementation of linear scan in C but I discovered later that the machine architectures that they targeted are all “register-rich” architectures which can easily accommodate a few registers reserved as “scratch registers”. Since I think that the “register poor” 32-bit x86 code is still important I researched more complex algorithms such as Linear scan with Interval Splitting and Extended Linear Scan.
Initially I was hoping for a really lightweight algorithm that could be used for generating code on the fly with the potential for an optimizing compiler to kick in later when necessary. As I implemented some of the interval splitting logic I realised that the implementation was getting too complex and I decided to fall back to a simple linear scan algorithm that uses an SCC (Strongly Connected Components) analysis to determine interval lifetimes. The implementation (on 64-bit x86) requires three scratch registers to be reserved and should probably be reconsidered for a 32-bit implementation.
The SCC analysis works by tracing out the control flow of the basic blocks and determines sets where each node can reach every other node. These are loops. Identifying loops is important for the lifetime of a variable because you need to keep it alive up until the end of the loop and at the top of the loop until the first use.
The intervals calculated by SCC are very coarse and as a result the allocator generates a lot of spills that could be avoided with a more fine-grained analysis or through interval splitting.
I implemented some rough visualizations of the interval lifetimes and allocations, unfortunately it doesn’t display very well here. Each range has a hex number indicating which register is has been allocated or “S” if it is spilled. On x86-64 we have to be careful to preserve function arguments which are passed via register (and also used in calls). I do this with “fixed intervals” which guarantee that a specific register is available for these special cases.
Int. M Spill Reg Ranges v26(f) 7777777777777777777777777 [0-24] v25(e) 8888888888888888888888888 [0-24] v24(d) 9999999999999999999999999 [0-24] v23(c) AAAAAAAAAAAAAAAAAAAAAAAAA [0-24] v22(b) BBBBBBBBBBBBBBBBBBBBBBBBB [0-24] v21(a) CCCCCCCCCCCCCCCCCCCCCCCCC [0-24] v5 5555555555555555555555555 [0-24] v4 44444444444444444444444 [0-22] v3 333333333333333333333 [0-20] v2 2222222222222222222 [0-18] v1 11111111111111111 [0-16] v0 000000000000000 [0-14] v15(x) 88888888888888888888888 [26-48] v27 777777 [26-31] v31 77777777777777777 [32-48] v30 99999999999999999 [32-48] v29 AAAAAAAAAAAAAAAAA [32-48] v28 BBBBBBBBBBBBBBBBB [32-48] v32 CCCC [48-51] v16(y) 88888888888888888888888 [52-74] v33 CCCCCC [52-57] v37 CCCCCCCCCCCCCCCCC [58-74] v36 77777777777777777 [58-74] v35 99999999999999999 [58-74] v34 AAAAAAAAAAAAAAAAA [58-74] v38 BBBB [74-77] v17(z) 88888888888888888888888 [78-100] v39 BBBBBB [78-83] v43 BBBBBBBBBBBBBBBBB [84-100] v42 CCCCCCCCCCCCCCCCC [84-100] v41 77777777777777777 [84-100] v40 99999999999999999 [84-100] v44 AAAA [100-103] v18(x1) 88888888888888888888888 [104-126] v45 AAAAAA [104-109] v49 AAAAAAAAAAAAAAAAA [110-126] v48 BBBBBBBBBBBBBBBBB [110-126] v47 CCCCCCCCCCCCCCCCC [110-126] v46 77777777777777777 [110-126] v50 9999 [126-129] v19(x2) 88888888888888888888888 [130-152] v51 999999 [130-135] v55 99999999999999999 [136-152] v54 AAAAAAAAAAAAAAAAA [136-152] v53 BBBBBBBBBBBBBBBBB [136-152] v52 CCCCCCCCCCCCCCCCC [136-152] v56 7777 [152-155] v20(x3) 88888888888888888888888 [156-178] v57 777777 [156-161] v61 77777777777777777 [162-178] v60 99999999999999999 [162-178] v59 AAAAAAAAAAAAAAAAA [162-178] v58 BBBBBBBBBBBBBBBBB [162-178] v62 CCCC [178-181] v63 CCCCCCC [182-188]
Since the last release of Church-State I started investigating whether it might be possible to make the runtime multi-threaded and to have the compiler execute on multiple threads.
This turned out to be quite a lot work. For every global variable used by different threads in the compiler and parsers I had to introduce locking mechanisms. In the runtime I had to ensure that there was way to allocate dynamically generated machine code into a thread-local buffer. Care has to be taken with inline caches and self-modifying code. Any output to shared streams (such as stdout and stderr) need to be synchronized in some way. The garbage collector needs to ensure that only one thread can allocate at a time and collection needs to be coordinated so that mutating threads don’t run during collection. My approach was to use signal handlers to interrupt threads, each interrupted thread records the extent of its stack and then waits for garbage collection to complete.
Copying garbage collector
I also started experimenting with a copying collector instead of the existing mark-sweep implementation. Copying collectors are harder to implement because whenever you move an object you have to ensure that all the references to the old object (including on the call stack) are updated to point to the new one. Hashtables that hash object pointers need to be made aware of a completed garbage collection (and rehash if necessary). I never finished the new garbage collector because I felt that State (the low level language used for implementing the runtime) was a limited by a lack of language features. State does not have something like structs in C or pointer arithmetic. It doesn’t have a real looping mechanism (I use a lisp-style DO macro which I find ugly without pointer arithmetic). It doesn’t have syntax for array/pointer dereferencing or low level operations like bit shifting and masking which are common in runtime code.
I decided it would better to experiment with a richer low-level language and so I started creating a compiler for a language code-named ‘baste’. Baste has a syntax derived from Church (with Python/Haskell like indentation) and a 64-bit backend (Church-State is only implemented for 32-bit systems). My current experiments involve implementing a new register allocator that calculates live intervals for each variable and then allocates registers using a linear-scan register allocation algorithm. This new work is included in the Church-State source repository.
New release of Church-State
I have made a new release of Church-State.
Skype for linux 4.2.0
This debian package also depends on a multiarch setup.
sudo dpkg -i skype-debian_188.8.131.52-1_i386.deb
Skype for linux 4.0.0
Installing skype on a 64-bit linux machine is tricky. There is no 64-bit binary.
The best instructions I found are here.
I had to install the following 32-bit libraries:
apt-get install libpcre3:i386 libglib2.0-0:i386 libasound2:i386 libc6:i386 libgcc1:i386 libqt4-dbus:i386 libqt4-network:i386 libqt4-xml:i386 libqtcore4:i386 libqtgui4:i386 libqtwebkit4:i386 libstdc++6:i386 libx11-6:i386 libxss1:i386 libxv1:i386 libssl1.0.0:i386
I also ran into a bug with the internal microphone on my IdeaPad U300s
In January I added some optimizations to my OMeta implementation to compute FIRST and FOLLOW sets for certain (hard-coded) grammar rules to try and predict which parsing rule to apply when faced with many options. This needs to be extended to automatically select rules that will benefit from this optimization.
I was recently looking at the GLL parser described by Elizabeth Scott and Adrian Johnstone. Does anyone know how easy it is to implement a parser like this? How does it compare (in implementation effort) with something like OMeta?
That reminded me of Matthew Might’s post on parsing with derivatives. I had a look at it again and I am very much compelled by the elegance of the implementation. However, looking at it closely reminds me of how much I suffer from Scheme Envy. The 500 line Racket implementation uses streams (lazy lists) and makes heavy use of hygienic macros. It uses dynamic binding (make-parameter / parameterize), delay/force, the pattern matching facility and weak hash tables. How many potential implementation targets provide facilities like these?
A parsing issue that worries me a lot is how to parse languages that rely on indentation (eg python, haskell) and still maintain some kind of composability for user defined grammars.
This is a new release of Church-State
In this release I have added a new hashtable implemented using the State language. Previously I was using a hashtable implemented in C from the kazlib library.
The new hashtable uses open addressing with double hashing to resolve probe collisions. Besides the standard hashtable which maps one key to one value, I have also implemented a two-key two-value hashtable used for the memoization of the OMeta parser results. For each grammar rule (key1) and the input position (key2) we store the result (value1) and the end position of the cached result (value 2). Previously I used two levels of hashtables for this.
You can see the hashtable here:
New garbage collector
I have also added a new garbage collector to replace the Boehm collector used so far. Like the Boehm collector this is a conservative mark and sweep collector.
The implementation is fairly simple, we create two heaps. The first is divided in pagelets which are 1024 bytes in size. Each of these pagelets can store objects which are 8,16,32,64,128,512 or 1024 bytes in size. We maintain bitfields to remember which objects are allocated within each page. When all the objects in a pagelet are freed during garbage collection that pagelet is returned to a free pool and may be used to store objects of a different size. As part of this work I have implemented a guard page and an associated signal handler to help detect when new free pagelets should be assigned to a size class.
The other heap is for objects larger than 1024 bytes. We use a simple first-fit strategy. To help mitigate the effect of the fragmentation of free spaces, we coalesce adjacent free spaces into larger blocks during the sweep phase.
My measurements show that the new allocator is about twice as slow as the Boehm allocator when allocating cons cells. For the entire bootstrap process the new allocator adds about 2 seconds of overhead, taking the bootstrap time from 23 to 25 seconds.
The new garbage collector is in state/runtime/state_runtime.state .
Recently I came across a reference to a paper describing early work on query plan generation for relational databases.
Unfortunately the original paper was a rather poor scan and I decided to recreate the original text and diagrams.
This turned out to be more work than I had anticipated, but I persevered and you can see the result here:
This is a new release of Church-State
This incorporates a new OMeta interpreter that I have written to replace the parser generator used in previous versions.
Implementing the Church parser with an interpreter rather than with a parser generator has two main benefits:
- There is no longer a requirement to have the church compiler available to generate a new parser. This caused some bootstrapping issues and made it harder to provide user-extensible parsing.
- The new approach is faster than before. Usually interpreters have more overhead than compiled equivalents but in this case the compilation overhead was very large for the generated church parser. Overall it’s faster to parse the church grammar file into an intermediate structure and interpret it than it is to generate and compile a very large generated parser.
The new system bootstraps in 28 seconds now (versus 34 seconds previously).
The most novel part of this new interpreter implementation is the way that I handle “grammar actions”.
Previously I implemented both a parser generator and an interpreter which used runtime code compilation to turn grammar actions into executable closures.
Take this example of the grammar rule for an assignment:
assignment = assignable-expression:lhs ws* "=" ws* expression(nil):rhs -> <<`[_assign ,lhs ,rhs]>>
which would parse the following Church code:
a = 1
In some parsing frameworks it is convenient to automatically construct a parse tree from a given grammar. In the case of scannerless parsers (which is what I use in Church) we would like to elide the nodes that would result from parsing whitespace. So in the example above the parts that we want to keep are labeled lhs and rhs. The section at the end describes a quasiquoted list template that has the values of lhs and rhs substituted into it.
When evaluated, this action will return something like this:
[_assign [_var "a"] [_number 1]]
To implement this one can create a closure with the following code:
fn arg1 arg2 -- `[_assign ,arg1 ,arg2]
which is called at parse time with the values for each labeled part.
This requires having the entire compiler framework available at parse time, including the parser which you are implementing!
To avoid this kind of bootstrapping issue and to simplify the parser, I decided to implement grammar actions with a simple quasiquoting syntax.
In the grammar actions implemented in the latest Church-State parser it is only possible to quote a list or symbol, unquote a variable or a list and specify a literal.
For example the following grammar rule:
pattern = cname:name ws* (ws* pattern-argument)*:args -> <<`(_pattern ,name ,@args)>>
creates a list starting with the symbol _pattern followed by the pattern name and then containing all the elements of the args list.
With closures it is possible to use list functions like cons, first and rest or to call routines to convert strings of digits into integers. This is not possible with quasiquoted actions and they require a later pass of the parse tree to do the necessary conversions. I have added routines to perform these transformations so that the resulting parse tree is exactly the same as that returned by the previous parser implementations.
The interpreter can be seen here: