Parsing

April 2nd, 2012

In January I added some optimizations to my OMeta implementation to compute FIRST and FOLLOW sets for certain (hard-coded) grammar rules to try and predict which parsing rule to apply when faced with many options. This needs to be extended to automatically select rules that will benefit from this optimization.

I was recently looking at the GLL parser described by Elizabeth Scott and Adrian Johnstone. Does anyone know how easy it is to implement a parser like this? How does it compare (in implementation effort) with something like OMeta?

That reminded me of Matthew Might’s post on parsing with derivatives. I had a look at it again and I am very much compelled by the elegance of the implementation. However, looking at it closely reminds me of how much I suffer from Scheme Envy. The 500 line Racket implementation uses streams (lazy lists) and makes heavy use of hygienic macros. It uses dynamic binding (make-parameter / parameterize), delay/force, the pattern matching facility and weak hash tables. How many potential implementation targets provide facilities like these?

A parsing issue that worries me a lot is how to parse languages that rely on indentation (eg python, haskell) and still maintain some kind of composability for user defined grammars.

A new garbage collector

October 1st, 2011

This is a new release of Church-State

http://subvert-the-dominant-paradigm.net/~jewel/church/church-alpha-7.tar.gz

New hashtable

In this release I have added a new hashtable implemented using the State language. Previously I was using a hashtable implemented in C from the kazlib library.

The new hashtable uses open addressing with double hashing to resolve probe collisions. Besides the standard hashtable which maps one key to one value, I have also implemented a two-key two-value hashtable used for the memoization of the OMeta parser results. For each grammar rule (key1) and the input position (key2) we store the result (value1) and the end position of the cached result (value 2). Previously I used two levels of hashtables for this.

You can see the hashtable here:

http://subvert-the-dominant-paradigm.net/repos/hgwebdir.cgi/bootstrap/file/f32bc9d22355/church/runtime/hashtable.state

New garbage collector

I have also added a new garbage collector to replace the Boehm collector used so far. Like the Boehm collector this is a conservative mark and sweep collector.

The implementation is fairly simple, we create two heaps. The first is divided in pagelets which are 1024 bytes in size. Each of these pagelets can store objects which are 8,16,32,64,128,512 or 1024 bytes in size. We maintain bitfields to remember which objects are allocated within each page. When all the objects in a pagelet are freed during garbage collection that pagelet is returned to a free pool and may be used to store objects of a different size. As part of this work I have implemented a guard page and an associated signal handler to help detect when new free pagelets should be assigned to a size class.

The other heap is for objects larger than 1024 bytes. We use a simple first-fit strategy. To help mitigate the effect of the fragmentation of free spaces, we coalesce adjacent free spaces into larger blocks during the sweep phase.

My measurements show that the new allocator is about twice as slow as the Boehm allocator when allocating cons cells. For the entire bootstrap process the new allocator adds about 2 seconds of overhead, taking the bootstrap time from 23 to 25 seconds.

The new garbage collector is in state/runtime/state_runtime.state .

System R

May 7th, 2011

Recently I came across a reference to a paper describing early work on query plan generation for relational databases.

Unfortunately the original paper was a rather poor scan and I decided to recreate the original text and diagrams.

This turned out to be more work than I had anticipated, but I persevered and you can see the result here:

Access Path Selection in a Relational Database Management System.

New OMeta interpreter for Church-State

March 28th, 2011

This is a new release of Church-State

http://subvert-the-dominant-paradigm.net/~jewel/church/church-alpha-6.tar.gz

This incorporates a new OMeta interpreter that I have written to replace the parser generator used in previous versions.

Implementing the Church parser with an interpreter rather than with a parser generator has two main benefits:

  • There is no longer a requirement to have the church compiler available to generate a new parser. This caused some bootstrapping issues and made it harder to provide user-extensible parsing.
  • The new approach is faster than before. Usually interpreters have more overhead than compiled equivalents but in this case the compilation overhead was very large for the generated church parser. Overall it’s faster to parse the church grammar file into an intermediate structure and interpret it than it is to generate and compile a very large generated parser.

The new system bootstraps in 28 seconds now (versus 34 seconds previously).

Grammar actions

The most novel part of this new interpreter implementation is the way that I handle “grammar actions”.

Previously I implemented both a parser generator and an interpreter which used runtime code compilation to turn grammar actions into executable closures.

Take this example of the grammar rule for an assignment:

assignment = assignable-expression:lhs ws* "=" ws* expression(nil):rhs -> <<`[_assign ,lhs ,rhs]>>

which would parse the following Church code:


a = 1

In some parsing frameworks it is convenient to automatically construct a parse tree from a given grammar. In the case of scannerless parsers (which is what I use in Church) we would like to elide the nodes that would result from parsing whitespace. So in the example above the parts that we want to keep are labeled lhs and rhs. The section at the end describes a quasiquoted list template that has the values of lhs and rhs substituted into it.

When evaluated, this action will return something like this:
[_assign [_var "a"] [_number 1]]

Closures

To implement this one can create a closure with the following code:

fn arg1 arg2 -- `[_assign ,arg1 ,arg2]

which is called at parse time with the values for each labeled part.

This requires having the entire compiler framework available at parse time, including the parser which you are implementing!

To avoid this kind of bootstrapping issue and to simplify the parser, I decided to implement grammar actions with a simple quasiquoting syntax.

Quasiquoted actions

In the grammar actions implemented in the latest Church-State parser it is only possible to quote a list or symbol, unquote a variable or a list and specify a literal.

For example the following grammar rule:

pattern = cname:name ws* (ws* pattern-argument)*:args -> <<`(_pattern ,name ,@args)>>

creates a list starting with the symbol _pattern followed by the pattern name and then containing all the elements of the args list.

With closures it is possible to use list functions like cons, first and rest or to call routines to convert strings of digits into integers. This is not possible with quasiquoted actions and they require a later pass of the parse tree to do the necessary conversions. I have added routines to perform these transformations so that the resulting parse tree is exactly the same as that returned by the previous parser implementations.

The interpreter can be seen here:

http://subvert-the-dominant-paradigm.net/repos/hgwebdir.cgi/bootstrap/file/cbea15e41381/genesis/ometa/ometa-interpreter.church

along with the grammar for parsing ometa grammars:

http://subvert-the-dominant-paradigm.net/repos/hgwebdir.cgi/bootstrap/file/d6a2c4d7c783/genesis/ometa/ometa-parser.g

Church alpha 5

February 4th, 2011

This is a new release of Church-State.

http://subvert-the-dominant-paradigm.net/~jewel/church/church-alpha-5.tar.gz

To do a bootstrap, type make bootstrap1 and make bootstrap2.

To test the new ometa interpreter, run make ometa-test-parser and run it with ./bin/ometa-test-parser test.g test.txt

This release contains the following new items:

  • A new OMeta interpreter (Although this new interpreter is capable of parsing the full Church grammar, I’m still keeping the old OMeta parser generator for bootstrapping reasons)
  • Optimized dispatch structures for faster dispatch when inline caching is not effective
  • Improved inline caches

Climbing trees

August 9th, 2010

Here is a solution to an ACM programming problem requiring one to analyse trees describing parental relationships.

I wrote this in Maude even though the classic tool for this kind of problem is Prolog.

The solution
The output

We start by importing integers and quoted identifiers (symbols):

mod CLIMBING is
protecting INT .
protecting QID .

Next we define the sorts used in the program. Rel means any kind of relationship, NamePair is a just pair of names. The sorts Parent, Sibling and Cousing are all relationship types.


sort Rel .

sorts NamePair .
sorts Parent Sibling Cousin RelType .
subsorts Parent Sibling Cousin < RelType .

The following operators (or constructors) are used to construct various relationship facts.


op Parent : Nat -> Parent .
op Cousin : Nat Nat -> Cousin .
op Sibling : -> Sibling .

op pair : Qid Qid -> NamePair .
op rel : NamePair RelType -> Rel .
op parent : Qid Qid -> Rel .

op Empty : -> Rel [ ctor ] .
op __ : Rel Rel -> Rel [ ctor assoc comm id: Empty ] .

The problem set is initialized with the following set of parent-child facts

op init : -> Rel .
eq init =
parent('alonzo.church, 'oswald.veblen)
parent('stephen.kleene, 'alonzo.church)
parent('dana.scott, 'alonzo.church)
parent('martin.davis, 'alonzo.church)
parent('pat.fischer, 'hartley.rogers)
parent('mike.paterson, 'david.park)
parent('dennis.ritchie, 'pat.fischer)
parent('hartley.rogers, 'alonzo.church)
parent('les.valiant, 'mike.paterson)
parent('bob.constable, 'stephen.kleene)
parent('david.park, 'hartley.rogers) .

We use the following variables to name various objects in our rules:


vars A B : Qid .
vars X Y Z : Qid .

vars N M O P : Nat .
vars R : RelType .
vars REST : Rel .

These four rules describe the different relationship types:


---- Note that the input is in (child, parent) format, but we want to output (parent, child)

rl [ parent ] : parent(A, B) => rel( pair(B, A), Parent(0)) .

rl [ grandparent ] :
rel( pair(X, Y), Parent(0))
rel( pair(Y, Z), Parent(N)) => rel( pair(X, Z), Parent(N + 1)) .

---- sibling is reflexive

rl [ sibling ] :
rel( pair(X, Y), Parent(0))
rel( pair(X, Z), Parent(0)) => rel( pair(Y, Z), Sibling) rel( pair(Z, Y), Sibling) .

---- check least ancestor
rl [ cousin ] :
rel( pair(X, Y), Parent(N))
rel( pair(X, Z), Parent(M)) => rel( pair(Y, Z), Cousin(min(N,M), abs(N - M))) .
endm

These search expressions describe the input queries (we could make an effort to parse these properly instead)


search [1] in CLIMBING : init =>+ rel (pair ('stephen.kleene, 'bob.constable), R) REST .
search [1] in CLIMBING : init =>+ rel (pair ('hartley.rogers, 'stephen.kleene), R) REST .

----- search [1] in CLIMBING : init =>+ rel (pair ('les.valiant, 'alonzo.church), R) REST .
----- swap this query to search for father instead of child

search [1] in CLIMBING : init =>+ rel (pair ('alonzo.church, 'les.valiant), R) REST .

search [1] in CLIMBING : init =>+ rel (pair ('les.valiant, 'dennis.ritchie), R) REST .
search [1] in CLIMBING : init =>+ rel (pair ('dennis.ritchie, 'les.valiant), R) REST .
search [1] in CLIMBING : init =>+ rel (pair ('pat.fischer, 'michael.rabin), R) REST .

quit

Loop macros

June 4th, 2010

I have released the latest version of Church-State here:
http://subvert-the-dominant-paradigm.net/~jewel/church/church-alpha-4.tar.gz

You can also get the source code via mercurial:

hg clone http://subvert-the-dominant-paradigm.net/repos/hgwebdir.cgi/bootstrap/

or browse it.

In the previous version of Church loop constructs were expanded in the Church compiler directly to the State “tagbody” form. I have removed this code and implemented it as a Church macro which expands to a new Church “tagbody” form.

The aim behind this is to make it easier to do optimization and flow analysis on Church code in the future.

User extensible parsing

March 29th, 2010

I have been experimenting with user-extensible parsers in Church, even though I don’t have a use for them yet.

I added a new language construct called “extend-grammar”, ie:


extend-grammar ometa testg {
test-rule = "test" ws+ cname:a -> << 42 >>
}

This test-rule will match the string “test”, followed by a name. The rule will ignore its input and return the value 42.

This grammar is processed during parse time, converted to church code and dynamically compiled and linked into the running process.

At present this happens after the whole file is parsed, so it’s currently not possible to add a new grammar rule and use it in the same source file.

To activate the rule I added another construct, “eval-when”, ie:


eval-when compile load
      church-add-parser-extension 'test-rule
                                                                                    

What this does is to execute this code when this file is compiled (load doesn’t work yet).

In this example we add ‘test-rule to the list of ometa functions to be called by a special new grammar rule called ‘user-form.

The new version of ‘user-form is compiled and linked into the process, immediately making it available to the parser.

For this test case I then load the following file:


dotest
        print "in dotest"
        print (test notanumber)

which prints 42 because the parser has intercepted what would normally be a method call.

Rude repl

March 14th, 2010

I have added eval and a repl (read eval print loop) to Church.

> + 3 4
7
> length "foobar"
6
> eval "* 19 3"
57
> load "church/test/hello.church"
true
> (main)
"hello"
true

Up to now I have only been using Church as a command-line compiler which produces executable files. Yet I have always preferred interactive language environments (lisp, smalltalk, python etc) to stop-compile-run languages.

All the machinery for writing a repl has been in Church for a while, including

  • Church parser available as a library
  • Church compiler available at runtime
  • Machine code generator and dynamic linker available at runtime
  • The ability to modify the dispatch table at runtime

This makes the repl easy to implement:

repl
        loop
                do
                        rstr = (read-line)
                        if (null? rstr)
                                return-from repl nil
                        else
                                print (eval rstr)

Eval is a little bit more messy because the compiler is designed to compile a whole method at a time and doesn’t know what to do with variables that are not either a global or local variable.

To implement eval I wrapped the eval string in a lambda that I return from a method which gets run after compilation.

set-eval-compiled-function
   eval-compiled-function = (fn -- eval_str)

This works for certain language expressions, but does not presently provide the ability to assign to “repl” variables.

Another drawback is that multiline statements are currently not possible (unlike python and lisp).

In the future I hope to experiment with a SLIME-like extension to emacs which communicates to Church across a socket, allowing interactive evaluation and compilation of source code from within the editor.

Prolog in Church

November 24th, 2009

As a step towards implementing an experimental type inferencing or type checking system I have built a prolog interpreter in Church. The interpreter is based on the example in Peter Norvig’s book, “Paradigms of Artificial Intelligence Programming”.

As a first step I wrote the grammar for parsing prolog programs. Here is a sample ‘parent’ program describing parent and grandparent relationships between Debian derivatives:


parent(ubuntu,kubuntu).
parent(ubuntu,edubuntu).

parent(debian, ubuntu).

grandparent(X,Z) :- parent(X,Y), parent(Y,Z).

and here is the OMeta grammar for parsing prolog programs:


ometa prolog <: ometa {

comment = "/*" (~cnewline anything)* cnewline -> << '_comment >>,

ws = $  | $	,
wsnl = ws | cnewline,

prolog-top-levels = (comment | prolog-rule)*,

prolog-rule = wsnl* prolog-clause:rhead ws* (":-" ws* prolog-or:body -> << body >>)* ws* $. wsnl*  -> <<`[rule ,rhead ,body]>>,

prolog-or = listof("prolog-impl", ";"):e  wsnl* -> << `[or ,e] >>,
prolog-impl = wsnl* prolog-clause:a wsnl* "->" wsnl* prolog-or:b wsnl* -> << `[implies ,a ,b] >> | prolog-and,
prolog-and = wsnl* listof("prolog-expr", ","):e wsnl* -> << `[and ,e] >>,

prolog-clause = wsnl* prolog-name:name wsnl* prolog-arg-list:args wsnl* -> <<`[,name | ,args]>>,
prolog-arg-list = $( listof("prolog-expr", ","):args wsnl* $) wsnl* -> << args >>,

prolog-expr = wsnl* (
	         $[ wsnl* $] wsnl* -> << `[]>> |
	      	 $[ wsnl* listof("prolog-expr", ","):list-head ws* ($| wsnl* prolog-expr)*:list-tail wsnl* $] wsnl* -> << (append! list-head (if list-tail (first list-tail) nil)) >> |
	      	 $( wsnl* prolog-or:e wsnl* $) wsnl* -> << e >> |
		 prolog-clause:c wsnl* -> << c >> |
                 prolog-number |
                 prolog-variable:v wsnl* -> << v >>),

prolog-number = "-"*:sign digit+:d -> << (convert-number sign d)>>,

prolog-variable = (letter | digit | $_)+:l -> << (intern (coerce l 'string)) >>,

prolog-name = (letter | digit | $_)+:l -> << (intern (coerce l 'string)) >>  }

The core of prolog lies in the unification and backtracking algorithms.

Unification will try to match its two inputs or else bind a variable to the corresponding value:


unify x y bindings
	cond
		(eq? bindings prolog-fail) prolog-fail
		(eq? x y) bindings
		(eq? x '_) bindings
		(eq? y '_) bindings
		(variable-symbol? x) (unify-variable x y bindings)
		(variable-symbol? y) (unify-variable y x bindings)
		(and (cons? x) (cons? y)) (unify (rest x) (rest y) (unify (first x) (first y) bindings))
		true prolog-fail

unify-variable var x bindings
	cond
		(get-binding var bindings) (unify (lookup var bindings) x bindings)
		(and (variable-symbol? x) (get-binding x bindings)) (unify var (lookup x bindings) bindings)
		(and occurs-check? (occurs-check var x bindings)) prolog-fail
		true (extend-bindings var x bindings)

Backtracking is achieved by allowing each clause to provide multiple solutions and trying all of these possibilities:


prove-all pi:prolog-interpreter goals bindings
	cond
		(eq? bindings prolog-fail) nil
		(null? goals) (list bindings)
		true
			mapcan (fn goal1-solution
				prove-all pi (rest goals) goal1-solution
) (prove pi (first goals) bindings)

As a test I used this map coloring program from a prolog tutorial:


member(X,[X|_]).
member(X,[_|List]) :- member(X,List).

adjacent(X,Y,Map) :-  member([X,Y],Map) ; member([Y,X],Map). 

find_regions([],R,R).
find_regions([[X,Y]|S], R,A) :-
 (member(X,R) ->
  (member(Y,R) -> find_regions(S,R,A)     ; find_regions(S,[Y|R],A))  ;
  (member(Y,R) -> find_regions(S,[X|R],A) ; find_regions(S,[X,Y|R],A) )). 

color(Map,Colors,Coloring) :-
        find_regions(Map,[],Regions),
        color_all(Regions,Colors,Coloring),
        not(conflict(Map,Coloring)).
color_all([R|Rs],Colors,[[R,C]|A]) :-
        member(C,Colors),
        color_all(Rs,Colors,A).
color_all([],_,[]). 

conflict(Map,Coloring) :- member([R1,C],Coloring),
        member([R2,C],Coloring),
        adjacent(R1,R2,Map). 

map1([[1,2],[1,3],[1,4],[1,5],[2,3],[2,4],[3,4],[4,5]]).

which yields the following colourings:


"query"
[[map1 M] [color M [red green blue yellow] Coloring]]

M = [[1 2] [1 3] [1 4] [1 5] [2 3] [2 4] [3 4] [4 5]]
Coloring = [[5 red] [4 green] [3 red] [1 blue] [2 yellow]]
;
M = [[1 2] [1 3] [1 4] [1 5] [2 3] [2 4] [3 4] [4 5]]
Coloring = [[5 red] [4 green] [3 red] [1 yellow] [2 blue]]
...

You can browse the source files for the prolog interpreter here.