How can I store sorted views

From: Steven M. Kearns <[email protected]> - 16 Feb 1999

I have a bag of keyword->id mappings that I need to store, and retrieve by keyword. Note that this is a bag, in that there are multiple ids for one keyword.

Ok. If there are *many* ID's, a subview might be more suitable, i.e.:

         items[keyword:S,ids[value:I]]

This stores keywords only once, but I wouldn't do it if there are just a few ID's per keyword.

--> I think you would suggest making a View with properties "keyword" and "id" and then creating a sorted view on "keyword".

Yes :)

My question is: I want to be able to store the sorted view on disk, so that it does not have to be recomputed every time. But I still want it to be updated when I delete or add items to the underlying view.

Well... derived views cannot persist. Two comments:

  - If you are worried by startup time for setting up the sorted view,
    all I can say is: try it.  You might be surprised, it might be ok.
  - The other way to do this is to keep this sorted yourself.  In that
    case, Metakit becomes a persistent vector manager for you, and you
    would have to do the same as you would with any other C++ container.

In addition, I need to iterate through the sorted derived view, starting with the first item >= k. However, it appears that all of the view member functions that I saw require an exact match to k, instead of the first item >= k.

By "k" I assume you mean a keyword value, not an integer row index.

There's no function for that, you're right. The derived "select" view will deal with ranges such as [k..max], but it's not efficient to set it up each time, i.e. if "k" is different every time.

I suppose I could do a binary search on the sorted derived view....

Have a look at c4_View::Search - it does binary search for you. It returns the closest match (the index of the first key >= k, I think). It works by *assuming* the view is properly sorted (derived or not).

Finally, can you give any performance guidelines, a la "a commit requires O(log N)*M steps, where M is the number of inserts or deletes you did, and N is the total number of items in the view."

Performance is very, very hard to predict in Metakit. Anything from jaw-dropping to total despair is possible...

Commit is currently related to the *total* number of subviews and the number and size of those views which have been modified. The number of inserts/deletes in a single view has almost no influence - this is caused by the columnar storage format used inside Metakit. It is a fact that doing a commit after each row change is rather expensive. On the other hand, incrementing all IDs by one and committing would be fast.

I can only repeat: performance in Metakit is often counter-intuitive. A test (two years ago) comparing sort performance with MS Jet placed Metakit at 30 times as fast in one case. There are enough tricks inside (such as using a roving gap to greatly reduce data movement in changes affecting adjacent rows), to make it nearly impossible to give general guidelines. I understand that this is not a very useful answer, but it's the best I can up with. You'll have to do a few timing tests.

The only other comment I can add is: be sure to try simple, brute force, approaches before starting to come up with smart solutions. On a Pentium II / 400, linear string scans can exceed 500,000 compares /sec.

-- JC