Metakit for Python

The structured database which fits in the palm of your hand

[ Terminology | Installation | Getting started | Mk4py Reference ]

Buzzwords - Metakit is an embeddable database which runs on Unix, Windows, Macintosh, and other platforms. It lets you build applications which store their data efficiently, in a portable way, and which will not need a complex runtime installation. In terms of the data model, Metakit takes the middle ground between RDBMS, OODBMS, and flat-file databases - yet it is quite different from each of them.

Technology - Everything is stored variable-sized yet with efficient positional row access. Changing an existing datafile structure is as simple as re-opening it with that new structure. All changes are transacted. You can mix and match software written in C++, Python, and Tcl. Things can't get much more flexible...

Python - The extension for Python is called "Mk4py". It provides a lower-level API to the Metakit C++ core than an earlier version of this interface did, and uses SCXX by Gordon McMillan as its C++ glue interface.

Mk4py 2.4.9.6 - is a final/production release. The homepage points to a download area with pre-compiled shared libraries for Unix, Windows, and Macintosh. The Metakit source distribution includes this documentation, the Mk4py C++ source code, a "MkMemoIO.py" class which provides efficient and fail-safe I/O (therefore also pickling) using Metakit memo fields, and a few more goodies.

License and support - Metakit 2 and up are distributed under the liberal X/MIT-style open source license. Commercial support is available through an Enterprise License. See the license page for details.

Credits - Are due to Gordon McMillan for not stopping at the original Mk4py and coming up with a more Pythonic interface, and to Christian Tismer for pushing Mk4py way beyond its design goals. Also to GvR and the Python community for taking scripting to such fascinating heights...

Updates - The latest version of this document is at http://www.equi4.com/metakit/python.html


Terminology

There are several ways to say the same thing, depending on where you're coming from. For example, the terms table, list, collection, array, sequence, and vector all denote a more or less similar concept. To help avoid confusion, Metakit uses a simple (but hopefully precise) terminology.

The terms adopted by Metakit can be summarized as follows:

A few more comments about the semantics of Metakit:


Installation

  1. Download the latest version from http://www.equi4.com/pub/mk/
  2. On Unix, rename the appropriate compiled extension to "Mk4py.so" (on Win/Mac, use the corresponding file)
  3. Place the Mk4py extension as well as the "metakit.py" wrapper somewhere on Python's module search path,
    such as in the site-packages directory (or just leave it in ".")
  4. Do a small test, by running "demo.py". If all is well, you should get some self-explanatory output


Getting started

Create a database:
     import metakit
     db = metakit.storage("datafile.mk",1)
Create a view (this is the Metakit term for "table"):
     vw = db.getas("people[first:S,last:S,shoesize:I]")
Add two rows (this is the Metakit term for "record"):
     vw.append(first='John',last='Lennon',shoesize=44)
     vw.append(first='Flash',last='Gordon',shoesize=42)
Commit the changes to file:
     db.commit()
Show a list of all people:
     for r in vw: print r.first, r.last, r.shoesize
Show a list of all people, sorted by last name:
     for r in vw.sort(vw.last): print r.first, r.last, r.shoesize
Show a list of all people with first name 'John':
     for r in vw.select(first='John'): print r.first, r.last, r.shoesize


Mk4py Reference

  1. Module functions
  2. Storage objects
  3. View objects
  4. Derived views
  5. View operations
  6. Mapping views
  7. Rowref objects
  8. Property objects

1. Module functions

These functions live at the module level. You can use them as described below after executing the following preamble:
     import metakit
     print metakit.version

SYNOPSIS

db = metakit.storage()
Create an in-memory database (can't use commit/rollback)
db = metakit.storage(file)
Use a specified file object to build the storage on
db = metakit.storage(name, mode)
Open the named file. Open read-only if mode is 0, r/w if mode is 1 (cannot be shared), or as commit-extend if mode is 2. In modes 1 and 2, the file will be created if it does not exist.
vw = metakit.view()
Create a standalone view; not in any storage object
pr = metakit.property(type, name)
Create a property (a column, when associated to a view)
vw = metakit.wrap(sequence, proplist, byPos=0)
Wraps a Python sequence as a view
ADDITIONAL DETAILS
storage - When given a single argument, the file object must be a real stdio file, not a class implementing the file r/w protocol. When the storage object is destroyed (such as with 'db = None'), the associated datafile will be closed. Be sure to keep a reference to it around as long as you use it.

wrap - This call can be used to wrap any Python sequence. It assumes that each item is either a dictionary or an object with attribute names corresponding to the property names. Alternatively, if byPos is nonzero, each item can be a list or tuple, and its fields will then be accessed by position instead. Views created in this way can be used in joins and any other view operations.

2. Storage objects

SYNOPSIS
vw = storage.getas(description)
Locate, define, or re-define a view stored in a storage object
vw = storage.view(viewname)
The normal way to retrieve an existing view
storage.rollback(full=0)
Revert data and structure to the state last committed to disk. In commit-aside mode, a "full" rollback reverts to the state of the original file and forgets about the aside file.
After a rollback, your view objects are invalid (use the view or getas methods on your storage object to get them back). Furthermore, after a full rollback, the aside storage is detached from the main storage. Use the aside method on your main storage object to reattach it. If you do not reattach it, further commits will (try to) write to the main storage.
storage.commit(full=0)
Permanently commit data and structure changes to disk. In commit-aside mode, a "full" commit saves the latest state in the original file and clears the aside datafile.
ds = storage.description(viewname='')
The description string is described under getas
vw = storage.contents()
Returns the View which holds the meta data for the Storage.
storage.autocommit()
Commit changes automatically when the storage object goes away
storage.load(fileobj)
Replace storage contents with data from file (or any other object supporting read)
storage.save(fileobj)
Serialize storage contents to file (or any other object supporting write)
ADDITIONAL DETAILS
description - A description of the entire storage is returned if no viewname is specified, otherwise just that of the specified top-level view.

getas - Side-effects: the structure of the view is changed.
Notes: Normally used to create a new View, or alter the structure of an existing one.
A description string looks like:
     "people[name:S,addr:S,city:S,state:S,zip:S]"
That is "<viewname>[<propertyname>:<propertytype>...]"
Where the property type is one of:
     I   adaptive integer (becomes Python int)
     L   64-bit integer (becomes Python long)
     F   C float (becomes Python float)
     D   C double (is a Python float)
     S   C null terminated string (becomes Python string)
     B   C array of bytes (becomes Python string)
Careful: do not include white space in the description string.

In the Python binding, the difference between S and B types is not as important as in C/C++, where S is used for zero-terminated text strings. In Python, the main distinctions are that B properties must be used if the data can contain zero bytes, and that sort order of S (stricmp) and B (memcmp) differ. At some point, Unicode/UTF-8 will also play a role for S properties, so it's best to use S for text.
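
The description format above can be illustrated with a small stand-alone parser. This is only a sketch and not part of Mk4py: the helper name parse_description is hypothetical, it handles flat views only (no nested subviews), and it insists on an explicit type letter for every property.

```python
# Illustrative only: a stand-alone parser for flat description strings.
# parse_description is a hypothetical helper, not a Metakit function.
VALID_TYPES = set("ILFDSB")

def parse_description(desc):
    """Split "name[prop:T,...]" into (viewname, [(propname, type), ...])."""
    if any(ch.isspace() for ch in desc):
        raise ValueError("description strings must not contain white space")
    if not desc.endswith("]") or "[" not in desc:
        raise ValueError("expected <viewname>[<prop>:<type>,...]")
    viewname, body = desc[:-1].split("[", 1)
    props = []
    for field in body.split(","):
        name, _, ptype = field.partition(":")
        if not name or ptype not in VALID_TYPES:
            raise ValueError("bad property: %r" % field)
        props.append((name, ptype))
    return viewname, props

name, props = parse_description("people[first:S,last:S,shoesize:I]")
```

Here name is the view name and props is a list of (property name, type letter) pairs in column order. A string containing white space is rejected, mirroring the warning above.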

3. View objects

Views implement the sequence (list) methods, including slicing, concatenation etc. They behave as a sequence of "rows", which in turn have "properties". Indexing (getitem) returns a reference to a row, not a copy.
     r = view[0]
     r.name = 'Julius Caesar'
     view[0].name # will yield 'Julius Caesar'
A slice returns a modifiable view which is tied to the underlying view. As a special case, however, you can create a fresh empty view with the same structure as another view with:
     v2 = v[0:0]
Setting a slice changes the view:
     v[:] = [] # empties the view
View supports getattr, which returns a Property (e.g. view.shoesize can be used to refer to the shoesize column). Views can be obtained from Storage objects (view = db.view('inventory')) or from other views (see select, sort, flatten, join, project...). Empty, columnless views can also be created with vw = metakit.view().

SYNOPSIS

view.insert(index, obj)
Coerce object to a Row and insert at index in View
ix = view.append(obj)
Object is coerced to Row and added to end of View
view.delete(index)
Row at index removed from View
lp = view.structure()
Return a list of property objects
cn = view.addproperty(prop)
Define a new property, return its column position
str = view.access(byteprop, rownum, offset, length=0)
Get (partial) byte property contents
view.modify(byteprop, rownum, string, offset, diff=0)
Store (partial) byte property contents. A non-zero value of diff removes (<0) or inserts (>0) bytes.
n = view.itemsize(prop, rownum=0)
Return size of item (rownum only needed for S/B types). With integer fields, a result of -1/-2/-4 means 1/2/4 bits per value, respectively.
view.map(func, subset=None)
Apply func to each row of view, or (if subset is specified) to each row in view that is also in subset. Func must have the signature "func(row)", and may mutate row. Subset must be a subset of view: e.g. "customers.map(func, customers.select(...))".
rview = view.filter(func)
Return a view containing the indices of those rows satisfying func. Func must have signature "func(row)" and must return a false value to omit the row.
obj = view.reduce(func, start=0)
Return the result of applying func(row, lastresult) to each row in view.
view.remove(indices)
Remove from view all rows whose positions are listed in indices. Not the same as minus, because unique is not required, and view is not reordered.
rview = view.indices(subset)
Returns a view containing the indices in view of the rows in subset.
rview = view.copy()
Returns a copy of the view.
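
The semantics of filter, reduce and remove can be modelled in plain Python, with a list of dicts standing in for a view. The helper names and the sample rows below are made up for illustration, and the code does not call Mk4py.

```python
# Plain-Python model of the semantics described above; rows are dicts in a
# list. These helpers are illustrative stand-ins, not Mk4py calls.
def filter_indices(rows, func):
    """Like view.filter: indices of the rows for which func(row) is true."""
    return [i for i, row in enumerate(rows) if func(row)]

def reduce_rows(rows, func, start=0):
    """Like view.reduce: fold func(row, lastresult) over all rows."""
    result = start
    for row in rows:
        result = func(row, result)
    return result

def remove_indices(rows, indices):
    """Like view.remove: drop the listed positions, keeping row order."""
    doomed = set(indices)          # duplicates in indices are harmless
    rows[:] = [row for i, row in enumerate(rows) if i not in doomed]

# Sample rows, invented for this example.
people = [
    {"first": "John", "shoesize": 44},
    {"first": "Flash", "shoesize": 42},
    {"first": "Ming", "shoesize": 46},
]
big = filter_indices(people, lambda r: r["shoesize"] >= 44)
total = reduce_rows(people, lambda r, acc: acc + r["shoesize"])
remove_indices(people, big)
```

Note the argument order of the reduce callback: the row comes first and the running result second, which is the reverse of Python's own reduce.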
ADDITIONAL DETAILS
addproperty - This adds properties which do not persist when committed. To make them persist, you should use storage.getas(...) when defining (or restructuring) the view.

append - Also supports keyword args (colname=value...).

insert - coercion to a Row is driven by the View's columns, and works for:
     dictionaries (column name -> key)
     instances (column name -> attribute name)
     lists (column number -> list index) - watch out!
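
The three coercion rules can be sketched as follows. The function coerce_row and the class Person are hypothetical and only model the mapping; silently skipping columns that the object does not provide is an assumption of this sketch.

```python
# Illustrative model of the coercion rules above; coerce_row is a
# hypothetical helper and "columns" stands in for the view's property names.
def coerce_row(obj, columns):
    """Return a dict with one entry per column that obj provides."""
    if isinstance(obj, dict):
        # dictionaries: column name -> key
        return {c: obj[c] for c in columns if c in obj}
    if isinstance(obj, (list, tuple)):
        # lists: column number -> list index (order matters, watch out!)
        return {c: v for c, v in zip(columns, obj)}
    # instances: column name -> attribute name
    return {c: getattr(obj, c) for c in columns if hasattr(obj, c)}

class Person(object):
    def __init__(self, first, last, shoesize):
        self.first, self.last, self.shoesize = first, last, shoesize

columns = ["first", "last", "shoesize"]
from_dict = coerce_row({"first": "John", "last": "Lennon", "shoesize": 44}, columns)
from_obj = coerce_row(Person("John", "Lennon", 44), columns)
from_list = coerce_row(["John", "Lennon", 44], columns)
swapped = coerce_row(["Lennon", "John", 44], columns)   # wrong order, no error
```

The last call shows why lists deserve the "watch out": values are matched to columns purely by position, so a list in the wrong order is stored without complaint.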

4. Derived views

SYNOPSIS
vw = view.select(criteria...)
Return a view with the rows whose fields match the given criteria
vw = view.select(low, high)
Return a view with rows in the specified range (inclusive)
vw = view.sort()
Sort view in "native" order, i.e. the definition order of its keys
vw = view.sort(property...)
Sort view in the specified order
vw = view.sortrev((propall...), (proprev...))
Sort view in specified order, with optionally some properties in reverse
vw = view.project(property...)
Returns a derived view with only the named columns
ADDITIONAL DETAILS
select - Example selections, returning the corresponding subsets:
     result = inventory.select(shoesize=44)
     result = inventory.select({'shoesize':40},{'shoesize':43})
     result = inventory.select({},{'shoesize':43})
The derived view is "connected" to the base view. Modifications of rows in the derived view are reflected in the base view.
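
The three selection forms above can be modelled on a list of dicts. The helper select_rows is hypothetical; reading an empty low dictionary as "no lower bound" is inferred from the third example rather than stated by the reference.

```python
# Plain-Python model of the select forms shown above; select_rows is a
# hypothetical helper operating on a list of dicts.
def select_rows(rows, low=None, high=None, **criteria):
    """Exact match on keyword criteria, or inclusive low/high range match."""
    low, high = low or {}, high or {}
    def keep(row):
        return (all(row[k] == v for k, v in criteria.items())
                and all(row[k] >= v for k, v in low.items())
                and all(row[k] <= v for k, v in high.items()))
    return [row for row in rows if keep(row)]

# Sample data, invented for this example.
inventory = [{"shoesize": s} for s in (39, 40, 42, 43, 44)]
exact = select_rows(inventory, shoesize=44)
ranged = select_rows(inventory, {"shoesize": 40}, {"shoesize": 43})
upto = select_rows(inventory, {}, {"shoesize": 43})
```

Both bounds are inclusive, so ranged holds sizes 40, 42 and 43. The result lists contain the same dict objects as inventory, which loosely mirrors the way a derived view stays connected to its base view.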

sort - Example, returning the sorted permutation:
     result = inventory.sort(inventory.shoesize)
See notes for select concerning changes to the sorted view

5. View operations

SYNOPSIS
vw = view.flatten(subprop, outer=0)
Produces one 'flat' view from a nested view
vw = view.join(view, property...,outer=0)
Both views must have a property (column) of that name and type
ix = view.find(criteria..., start=0)
Returns the index of the found row, or -1
ix = view.search(criteria...)
Binary search (native view order), returns match or insertion point
ix, cnt = view.locate(criteria...)
Binary search, returns position and count as tuple (count can be zero)
vw = view.unique()
Returns a new view without duplicate rows (a set)
vw = view.union(view2)
Returns a new view which is the set union of view and view2
vw = view.intersect(view2)
Returns a new view which is the set intersection of view and view2
vw = view.different(view2)
Returns a new view which is the set XOR of view and view2
vw = view.minus(view2)
Returns a new view which is (in set terms) view - view.intersect(view2)
vw = view.remapwith(view2)
Remap rows according to the first (int) property in view2
vw = view.pair(view2)
Concatenate rows pairwise, side by side
vw = view.rename('oldname', 'newname')
Returns a derived view with one property renamed
vw = view.product(view)
Returns the cartesian product of both views
vw = view.groupby(property..., 'subname')
Groups on specified properties, with subviews to hold groups
vw = view.counts(property..., 'name')
Groups on specified properties, replacing rest with a count field
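
The set-style operations (unique, union, intersect, different, minus) can be modelled on lists of tuples, one tuple per row. These helpers are illustrative stand-ins: they remove duplicates from their inputs first, and the row order Metakit itself returns may differ.

```python
# Set-style operations modelled on lists of tuples (one tuple per row).
# Illustrative helpers only; result ordering in Metakit itself may differ.
def unique(rows):
    seen, out = set(), []
    for row in rows:
        if row not in seen:
            seen.add(row)
            out.append(row)
    return out

def union(a, b):
    return unique(a + b)

def intersect(a, b):
    in_b = set(b)
    return [row for row in unique(a) if row in in_b]

def minus(a, b):
    in_b = set(b)
    return [row for row in unique(a) if row not in in_b]

def different(a, b):
    # set XOR: rows that appear in exactly one of the two inputs
    return minus(a, b) + minus(b, a)

# Sample rows, invented for this example.
a = [("John", 44), ("Flash", 42)]
b = [("Flash", 42), ("Ming", 46)]
```

With these inputs, minus(a, b) gives the same rows as minus(a, intersect(a, b)), which is the definition given for minus above.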
ADDITIONAL DETAILS
find - view[view.find(firstname='Joe')] is the same as view.select(firstname='Joe')[0] but much faster. Subsequent finds use the "start" keyword: view.find(firstname='Joe', start=3)
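
The difference between find, search and locate can be sketched with Python's bisect module on a plain list of keys. The three functions are hypothetical models, not Mk4py calls; returning the first of several equal keys from the binary searches is an assumption of this sketch.

```python
import bisect

# Model of find, search and locate on a list of keys; the three functions
# are illustrative stand-ins. search_key and locate_key need sorted input.
def find_key(keys, key, start=0):
    """Linear scan from start: index of the first match, or -1."""
    for i in range(start, len(keys)):
        if keys[i] == key:
            return i
    return -1

def search_key(keys, key):
    """Binary search: position of the first match, else the insertion point."""
    return bisect.bisect_left(keys, key)

def locate_key(keys, key):
    """Binary search: (position, count); count is 0 when key is absent."""
    lo = bisect.bisect_left(keys, key)
    hi = bisect.bisect_right(keys, key)
    return lo, hi - lo

names = ["Joe", "Ann", "Joe", "Bob"]
first_joe = find_key(names, "Joe")
next_joe = find_key(names, "Joe", start=first_joe + 1)
ordered = sorted(names)            # ['Ann', 'Bob', 'Joe', 'Joe']
```

find works on any row order and resumes from start, while the two binary searches only give meaningful answers when the keys are already sorted.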

6. Mapping views

SYNOPSIS
vw = view.hash(mapview, numkeys=1)
Construct a hash mapping based on the first N fields.
vw = view.blocked(blockview)
Construct a "blocked" view, which acts as if all segments together form a single large view.
vw = view.ordered(numkeys=1)
Define a view which assumes and maintains sort order, based on the first N fields. When layered on top of a blocked view, this implements a 2-level btree.
ADDITIONAL DETAILS
blocked - This view acts like a large flat view, even though the actual rows are stored in blocks, which are rebalanced automatically to maintain a good trade-off between block size and number of blocks.
The underlying view must be defined with a single view property, with the structure of the subview being as needed.

hash - This view creates and manages a special hash map view, to implement a fast find on the key. The key is defined to consist of the first numkeys properties of the underlying view.
The mapview must be empty the first time this hash view is used, so that Metakit can fill it based on whatever rows are already present in the underlying view. After that, neither the underlying view nor the map view may be modified other than through this hash mapping layer. The defined structure of the map view must be "_H:I,_R:I".
This view is modifiable. Insertions and changes to key field properties can cause rows to be repositioned to maintain hash uniqueness. Careful: when a row is changed in such a way that its key is the same as in another row, that other row will be deleted from the view.

ordered - This is an identity view, whose only purpose is to inform Metakit that the underlying view can be considered to be sorted on its first numkeys properties. The effect is that view.find() will try to use binary search when the search includes key properties (results will be identical to unordered views, the find will just be more efficient).
This view is modifiable. Insertions and changes to key field properties can cause rows to be repositioned to maintain the sort order. Careful: when a row is changed in such a way that its key is the same as in another row, that other row will be deleted from the view.
This view can be combined with view.blocked(), to create a 2-level btree structure.
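
The repositioning and key-collision behaviour described for ordered views can be illustrated with a toy class keyed on the first column. OrderedRows is hypothetical and says nothing about how Metakit implements ordered views; treating an insert with an existing key as a replacement is an assumption of this sketch, and the sample rows are made up.

```python
import bisect

# Toy model of an ordered view keyed on its first column. OrderedRows is a
# hypothetical class written for illustration; it is not how Metakit
# implements ordered views.
class OrderedRows(object):
    def __init__(self):
        self.rows = []                      # tuples, kept sorted by row[0]

    def _pos(self, key):
        keys = [row[0] for row in self.rows]
        return bisect.bisect_left(keys, key)

    def insert(self, row):
        """Place row at its sorted position, replacing any row with its key."""
        i = self._pos(row[0])
        if i < len(self.rows) and self.rows[i][0] == row[0]:
            self.rows[i] = row              # same key: the other row is lost
        else:
            self.rows.insert(i, row)

    def set_key(self, index, new_key):
        """Change a row's key; the row is repositioned to keep the order."""
        row = self.rows.pop(index)
        self.insert((new_key,) + row[1:])

v = OrderedRows()
for row in [("Lennon", 44), ("Gordon", 42), ("Caesar", 41)]:
    v.insert(row)
v.set_key(0, "Zappa")      # the former Caesar row moves to the end
v.set_key(0, "Lennon")     # the Gordon row takes the key; the old Lennon row is dropped
```

After the last call only two rows remain, which is the "that other row will be deleted" case from the warning above.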

7. Rowref objects

RowRef allows setting and getting of attributes (columns)
RowRef encapsulates a (view, ndx) tuple.
Normally obtained from a view: rowref = view[33]
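
The idea of a row reference as a (view, ndx) pair can be sketched with two toy classes. ToyView and ToyRowRef are hypothetical and the sample names are invented; the point is only that attribute access is forwarded to the view, so no row data is copied.

```python
# Toy model of a row reference as a (view, ndx) pair. ToyView and ToyRowRef
# are hypothetical classes for illustration, not the Mk4py types.
class ToyRowRef(object):
    def __init__(self, view, ndx):
        # bypass our own __setattr__ while storing the pair
        object.__setattr__(self, "_view", view)
        object.__setattr__(self, "_ndx", ndx)

    def __getattr__(self, name):
        # only called for names not found normally, i.e. column names
        try:
            return self._view.columns[name][self._ndx]
        except KeyError:
            raise AttributeError(name)

    def __setattr__(self, name, value):
        self._view.columns[name][self._ndx] = value

class ToyView(object):
    def __init__(self, **columns):
        self.columns = columns              # column name -> list of values

    def __getitem__(self, ndx):
        return ToyRowRef(self, ndx)         # a reference, not a copy

view = ToyView(name=["Brutus", "Cassius"])
r = view[0]
r.name = "Julius Caesar"
```

Because r only remembers the view and an index, a later view[0].name sees the new value, just as in the r = view[0] example in the View objects section.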

8. Property objects

Property has attributes name, id and type. Example: p = metakit.property('I', 'shoesize')
Note that a property is used to describe a column, but it is NOT the same as a column. In a given storage, the property Property('I', 'shoesize') is unique: no matter how many instances you create, they will all have the same property.id. That one property can, however, describe any number of columns, each one in a different view. This is how joins are done, and why "view.sort(view.firstname)" is the same as "view.sort(metakit.property('S','firstname'))".
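
The "many instances, one id" behaviour can be modelled with a small registry. ToyProperty is hypothetical and is not metakit.property; keying the registry on the (type, name) pair is an assumption of this sketch.

```python
# Toy model of property identity: one id per (type, name) pair, however many
# instances are created. ToyProperty is hypothetical, not metakit.property.
class ToyProperty(object):
    _ids = {}                               # (type, name) -> id

    def __init__(self, type, name):
        self.type, self.name = type, name
        self.id = ToyProperty._ids.setdefault((type, name), len(ToyProperty._ids))

p1 = ToyProperty("I", "shoesize")
p2 = ToyProperty("I", "shoesize")
p3 = ToyProperty("S", "firstname")
```

Here p1 and p2 are distinct objects that share one id, while p3 gets a different id. Two views that each have a shoesize column can therefore be matched up through that shared id.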


© 2005 Jean-Claude Wippler <jcw@equi4.com>