Heterogenous tree data structures

From: Renzo Tomaselli <[email protected]> - 19 Jun 1998

 > [...] a hierarchical system such as the Windows Registry; e.g.
 > a system of nested folders containing named values, where value
 > types belong to a predefined, closed set. Unlike Catfish, folders
 > have a heterogeneous structure, so topology cannot be separated from
 > contents (no normalization). So I would implement it using a view per
 > folder, with a single row: each property would be the named value;
 > this is easied by the dynamic behaviour of Metakit in
 > adding/removing properties on the fly. This structure could nest
 > deeply. Since this model would underuse Metakit, do you think the
 > overhead of deeply nested folder/one-row-views would be
 > acceptable from Metakit implementation point of view, in terms of
 > access time for retrieving/updating and overall file size ? (one
 > value type might be a true table, so the game there becomes very
 > interesting ...)

The shortest answer: if the number of values is no more than - say - a thousand, I think everything will work fine in any approach you choose.

But the underlying issues are much deeper, and extremely important. In Metakit 1.8, the use of such data structures is tedious at best. Which reduces its effectiveness for registries, XML data, and such.

There are several ways out (this is pretty advanced stuff, I'm afraid):

 1.  Move all properties into a subview.  I.e. instead of a row:
                 data[name:S,address:S,city:S]
     use a subview of the form:
                 data[fields[type:S,value:S]]
     Then add three rows: (name,...), (address,...), and (city,...)
     This is quite workable, not too inefficient, and can be adapted
     to deal with different datatypes (though not elegantly).
 2.  Implement subclassing, if there are only a few variant formats,
     or the more general case: allowing any combination of properties
     in any row.  This would be possible, by extending Metakit at a
     very low level, but might throw away many performance advantages.
     I don't currently consider this a good solution.
 3.  Use (abuse?) rows and subviews in the way you describe.  The main
     drawback is that it might become unwieldy, and perhaps inefficient.
 4.  Parallel views: split common properties and sets of subclasses into
     different view, joining them side by side when needed.  This is
     not easy to explain quickly: assume one master view with all common
     properties.  Add a single property, sort of a "variant" code.  Then
     depending on this code, pair this row with another row in the
     appropriate row of another view with additional properties.  Each
     row can have different variants.  In the most extreme case, put
     each property in a separate view, and use 0/1 flags to include 'em.
     There are some neat tricks to avoid storing reow indexes across
     such related views.  This is the scheme I have been working on,
     and which will be included in a future release.

For now, my suggestion would be to use a form of "indirection", a scheme which resembles approach 4 somewhat, and is relatively easy to implement on top of Metakit: add a subview for each variant. Allocate rows as needed, and store the name of that view and the index in the master view. In other words, put all common properties in one view (might be none). Then add two new fields to it: "subName:S,subIndex:I". When you have a new type of variant, create a subview, give it some unique name, and allocate a free row in it. Then store those in the two properties added to the main view.

In Metakit 1.8, it is possible to create a c4_CustomView derivative, which knows how to find variant properties. But such a view has to be read-only for now. Making changes requires new calls which explicitly put everything in the right place, i.e. an extra layer around Metakit.

Whether variants contain string, ints, or subview makes no difference. In other words, it should be possinle to add tables just as easily as scalar data values.

-- JC

August 1999, good news: a new design has been prepared, and even tried as prototype recently (thanks to Christian Tismer), which add the ability to handle heterogenous data structures with excellent locality of reference. Some additions to MK are expected to be ready and fully tested well before the end of 1999.

Februari 2000, not so good news: the above comment is correct, but misleading. The changes I mentioned have not been fully implemented yet. The idea is to store entire Metakit datastructures as memo fields inside another Metakit file. Right now, the only practical way to do this is by serializing and de-serializing the Metakit data (using the storage-stream interface) into a binary object and then storing that binary data as memo field. This works, but is not maximally efficient. There is a very nice example of this approach in Python in the source code distribution (python/millions.py). I had started to extend MK to allow direct access to such nested datastructures (with the full power of memory-mapping and commit security), but haven't so far been able to fully complete this work. There was a complication, which can definitely be solved, but at that point I admit that I got distracted again... I dare no longer say when this will be completed, since my track record for vaporware seems to be getting worse.

September 2001, good news again: use e4Graph, which can be found at http://e4graph.sourceforge.net. (JYL)

May 2002, Some more good news: e4Graph is now in its sixth Alpha release, which will be its last Alpha release. It now has a C++ and Tcl binding, the next release will likely have a Java binding too. See http://e4graph.sourceforge.net. (JYL)