My passion is data: data which represents facts, and data which is derived from other data by some form of processing. Computers compute, of course, but more importantly: they manage data.

I spent decades of my life in search of solutions to better deal with data. As a programmer, that search has consisted mostly of trying to develop new software. First there was "DABA", a C library which implements a multi-user database and adds set operations to construct selections. Then came "Metakit", a spin-off from a scientific data visualization tool called "MetaChrom". Metakit has a hybrid relational / hierarchical data model, with a column-wise internal storage format. Metakit rocks. It's simple, small, and extremely efficient when used appropriately.
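For readers unfamiliar with the idea, here is a minimal sketch of what a column-wise layout with nested sub-views can look like. It is not Metakit's code or API: the View class, its methods, and the sample data are hypothetical names chosen purely for illustration.

```python
# Hypothetical sketch of column-wise storage with nested sub-views.
# None of these names come from Metakit's actual API.

class View:
    """A table stored as one list per column instead of one record per row."""

    def __init__(self, *column_names):
        self.columns = {name: [] for name in column_names}

    def append(self, **values):
        # Each value goes to the end of its own column.
        for name, column in self.columns.items():
            column.append(values[name])

    def __len__(self):
        # All columns have the same length, so the first one gives the row count.
        return len(next(iter(self.columns.values()), []))

    def row(self, index):
        # A row is reassembled on demand from the same position in every column.
        return {name: column[index] for name, column in self.columns.items()}


# The relational part: a flat view of people.
people = View("name", "phones")

# The hierarchical part: each "phones" cell holds a sub-view of its own.
for name, numbers in [("Ann", ["555-0100", "555-0101"]), ("Bob", ["555-0199"])]:
    phones = View("number")
    for number in numbers:
        phones.append(number=number)
    people.append(name=name, phones=phones)

# Scanning a single column touches only that column's data.
names = people.columns["name"]
phone_counts = [len(sub) for sub in people.columns["phones"]]
```

The point of the layout is that an operation on one property, such as listing all names, reads one contiguous column rather than walking every record, while a sub-view keeps related rows attached to their parent row without a separate join table.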

Metakit has been an ongoing project for nearly a decade by now. It went open source, and it was optimized in several ways to offer dramatic improvements in speed and scalability. It got a range of view operators along the way, making it possible to build an SQL engine on top. In these past years, I have learned a huge amount about software performance, flexibility, and portability.

Metakit has been a great success. It is in use as the data storage foundation for a number of commercial projects. It manages all AddressBook data on every Macintosh computer running OS X. It is the basis of a widely used packaging system for Tcl-based applications, called "Starkits". Metakit has a small but dedicated community of developers who like its data model, speed, and stability.

Yet at the same time, Metakit is slowly driving me insane. It is so full of untapped potential that I can only see the limitations and design flaws. By now, I have an immense list of things I'd do differently. If Metakit is so great, wait until you see what I'm really up to!

Ah... but there lies the rub.

I have spent several years by now trying to come up with a better-designed core engine, able to support a far larger and cleaner set of features. Some things stand out as working so well that they can be kept as is: the file format and the view / sub-view data model are fine. Other things really could use a major overhaul: view operators are built on top of too many different (and incongruent) basic mechanisms. There is a start with dynamic change propagation across views, but it is too limited. The commit extend / commit aside code is unfinished.

As I ponder incessantly on the way to design the next implementation, the choice of implementation language itself becomes an issue: C++ has turned out to be overkill and far too hard to port and embed. At the same time, the goal of having a powerful database should not be limited to any one language. Whatever the next version is written in, it must be usable from a range of languages. This set should at least include: C, C++, C#, Java, and several scripting languages.

So how does one write for embeddability? For now, the choice is no doubt to code in plain C. But C is awfully primitive and tedious. Scripting languages are so much more flexible. Even on the performance side, C requires trade-offs. Machine language in a few places can have a huge impact, such as for hardware-assisted vector operations.

Another option is Forth. But Forth is even lower level. Memory management is a nightmare. Forth was invented well before automatic garbage collection became commonplace.

So here I am, implementing a new virtual machine (in C for now), and bootstrapping myself into a range of languages on top of it, from Forth-like to C-like. It's working out well, except for one detail: the task is monumental. It may never be done.

Don't get me wrong. The past year or two has consisted of very clear progress. The virtual machine is small, fast, powerful, and more or less done. It implements key concepts and so far I can only get more excited as each new challenge turns out to have a clean, simple, effective solution. The same holds for a low-level Forth-like language. That too is coming together very nicely.

But with so much designed from scratch comes the opportunity to change things on every imaginable level. I am now considering switching to a 1-based indexing scheme, which has a few implications for some of the lower-level code and the VM. This means that everything I have written so far needs to be reviewed (and essentially: re-debugged). Such changes are extremely disruptive: they break everything written so far. There is always a small chance that the changes turn out to be bad, but even if not: until I get it all running again, I am basically making no measurable progress at all.

That lack of tangible, "showable" progress is starting to get to me. My number one risk is that I end up tweaking more and more things which are not truly essential for reaching the bigger goal of a new Metakit implementation.

Perfection is not only elusive, it also has its limits.

For now, all I can do is try to stay aware of this issue and restrain myself - where possible - from chasing phantoms. The challenge is to define progress in such a way that the opportunities of building from scratch are explored far enough, without getting bogged down in more and more perfection.

Chuck Moore, the inventor of Forth, designed a language. But he didn't like the chips it ran on, so he designed new chips. But he didn't like the CAD software used to design chips, so he built his own CAD system. Did he drown in perfection? Who knows. It's a sad fate - I don't want it to be mine...

© October 2004