An overview of the MetaKit data format

Coen Siegerink
Equi4 Software
December 1999

INTRODUCTION

MetaKit is an efficient, small-footprint, robust, and platform-independent database library. The name "MetaKit" is used as name for the file format, as well as for the implementation (coded in C++, with interfaces to Python and Tcl). MetaKit 2.0 is freely available as open source, with optional commercial support through my company, Equi4 Software.

This document describes some of the main elements of the design of the file format used in release 2.0 of MetaKit - which reads all prior formats back to the 1.0 release of March 1996.

For general information about MetaKit, you are kindly referred to information on the world wide web - the MetaKit home page is at https://www.equi4.com/metakit/. For the purpose of this document, here is a summary with the capabilities of MetaKit which most affect the design of the file format.

Data is stored in native format, reversed endian-ness is detected and dealt with on other platforms
Transaction/commit uses "Stable Storage" techniques, with automatic rollback after catastrophic failure
Data can be loaded on demand, or serialized for transport over any communication channel
Integers use "adaptive sizing" to store each value in 1..32 bits, automatically adjusted to match the range
Other formats include: 0-byte terminated strings, binary data (2 formats), and 4/8-byte IEEE floating point
Multiple relational ("flat") and hierarchically structured datasets can be stored in a single file
Datafiles are self-descriptive, generic utilities can be used to inspect and manipulate any MetaKit datafile
MetaKit supports instant on-the-fly datafile restructuring, with full commit/rollback capability

MetaKit is being used by hundreds of developers, in products ranging from custom-made client-server applications, software utilities, website databases, numerical/statistical data-storage, groupware, development tools, to financial applications such as POS and home banking. It can be embedded into the application, requiring no installation - or used as a shared library (DLL).

BASIC DATA STRUCTURE

To describe the format used for the datafiles, it is necessary to introduce some terminology and to describe how data is organized. What MetaKit does, is present a more or less relational perspective on data storage.

Data is stored in tabular format, called a view.
Views consist of zero or more rows of information.
Each row consists of one or more properties containing one data item.
Properties can be strings, integers, floating point, binary data, but also subviews.

The first three of these are equivalent to the relational table, record, and attribute. The fourth makes it possible to store hierarchically structured data (subviews are an advanced feature and will not be further described here).

But the fundamental concept of MetaKit, is that all data is stored column-wise. A view (table) with 10,000 rows (records), consisting of 25 properties (attributes), is stored on file as a data structure pointing to 25 columns of bytes. Each of these columns may well have a different format - depending on their datatype, and each will contain 10,000 entries. Each column is stored as a single contiguous area on file, which can be reallocated (i.e. moved) to accommodate size changes later on.

This column-based data organization has far-reaching consequences for the implementation. For one, the implementation fully hides the column-wise structure and present a normal container-like row-wise API, which matches the conceptual / relational model. Furthermore, a range of techniques has been implemented to offer high performance.

FILE FORMAT OVERVIEW

Once you switch your mind to this column-wise approach, the format of MetaKit datafiles becomes quite simple to describe:

An 8-byte header
A structural definition of all views stored in this file (this is not necessarily right after the header)
For each view: a row count and a number of <offset,size> pairs, each defining a column in the file
And - scattered all across the file - the actual columns of data

For a view of N rows by M columns, the row count of N plus the M column definitions fully describe its contents - just like a row count of N with N row definitions would in a traditional row-by-row database.

Note that the amount of information needed to define all views and columns can remain very small. The number of rows in each view affects only the size of each column, not the number of them. This effect is used while opening a datafile: MetaKit quickly reads the header, description, and all column position/size information. From then on, access to data is basically instant - with seeks doing the rest (actually, most of it is memory-mapped file access). To use the previous example: a view with 10,000 rows of 25 properties ends up being managed as 25 columns on file - which is a very simple task.

The row counts and the <offset,size> tuples defining all columns are stored in an tagged-byte format, with as many bytes as need to represent each integer value. This format is space efficient for small files, and is able to adapt to file sizes (and offsets) which are more than 32 bits. All values are stored consecutively and are loaded very quickly (this includes the more complex case when subviews are part of the data structure).

STRUCTURE DEFINITION

The description stored in each datafile is a string which precisely describes the name of all views and subviews, and the name and type of all properties in these subviews. An example of such a definition string (as stored on file) will make it clear how the information is represented:

companies[name:S,offices[department:S,address:S,employees[name:S,salary:I,photo:M],phone:S],revenue:D]

Datatypes are currently defined as one of the following:

S - string (null-byte terminated)
I - integer (1..32 bits, adapting at run time)
F - 32-bit floating point
D - 64-bit floating point
B - untyped binary data
M - "memo" binary data (see below)

The above example describes a datafiles with information about companies, each having a variable number of offices, which in turn have a varying number of employees. Other information can be added in the same file, it is not limited to a single view (companies in this case).

This string, as well as all strings stored as properties of rows, is stored in UTF-8 format, which allow storing Unicode strings while maintaining compatibility of datafiles across all platforms.

STORING BINARY DATA

The file format described so far is useful for small to medium-sized datasets (see the section on "performance" below), but as column sizes grow, performance tends to suffer in a number of ways. To meet requests to store larger amounts of data efficiently, "memo" properties were added in March 1998. These add a single level of indirection, and allow storing binary data as individually allocated objects in the datafile. This mechanism works well when rows hold data items which are each of up to 1 Mb or so. The current limitation is that each such memo-item is loaded and saved in one step. A future release will allow partial reading and writing of such items, and will in fact allow very efficient insertion and deletion at any position within such items. This will allow access, and storage of concurrent independently emitted data streams (it's mostly a matter of designing a proper API to get at the functionality - which is already present).

FILE FORMAT DETAILS

With all this conceptual information out of the way, it is now possible to describe the file format is somewhat more detail.

Header: The 8-byte header contains: a 2-byte magic number, which is also used to detect endian-ness, 2 bytes reserved for future use, and a 4-byte file offset pointing to the structural information.
Structural information: The offset in the header points to a block of data consisting of:; The counts, offsets, and sizes are all stored using a varying number of bytes, most significant bits first: each byte contains a "final" bit (bit 7), which is set only on the last byte, and 7 "payload" bits from the value itself (bits 0..6). This format is compact and platform-neutral.
Columns: Data columns contain all items of the corresponding property, concatenated into a single byte vector. The format and size of these columns is determined by their datatype:; Each column contains items of a single type, and can be considered as highly optimized linear arrays of data.

Note: the 4-byte file offset in the header will be turned into a more general "extended header" in the next file format version to drop the current 32-bit file size restriction. A number of changes are planned for the way structural information is stored, to make it possible to open datafiles more lazily and perform efficient partial commits.

FREE SPACE MANAGEMENT AND RECOVERY

The datafile format of MetaKit contains no other information than what has been described so far. Specifically, there is no "free space" pool. This is possible, because free space can be deduced from what is present in the file - it consists of those areas of the file which are not occupied by the information described so far (header, structure def, column defs, columns).

This lack of redundancy improves robustness (the potential for inconsistencies is reduced), is more efficient (less information to carry around), and reduces file size.

Recovery is based on "stable storage" - a very effective technique which makes the datafile resilient against catastrophic failure. It all works by never writing changes to a datafile over current data. All changes are allocated and written to currently unused areas of the file (extending it if necessary). Only changed data is rewritten. On commit, all changes are flushed, and when this has been completed, a single offset in the file header is overwritten (this is atomic) - pointing to the new state. Once the commit is complete, free space in the file is redefined to include old and now obsolete areas of the file, and to exclude the newly allocated data.

To date, with thousands of users known to be using software based on MetaKit in production-grade commercial products, no case has ever been reported of a corrupted data file. MetaKit does not contain any recovery/repair code - there is no need.

The trade-off is that datafiles which get modified can grow to roughly twice their optimum size. Despite this, MetaKit datafile sizes are often smaller than with other approaches. The reasons for this are that MetaKit supports variable-sized data with no overhead, that integers tend to be packed very densely, that there is no overhead per row, and that techniques such as B-trees with partially-filled nodes are not used.

It goes without saying that all unused data areas get re-used when possible - file size usually increases during the first few commits, and remains more or less stable after that. For those cases where slack is unacceptable, a simple full re-save can be used to produce a fresh, optimally-packed datafile.

ON-THE-FLY RESTRUCTURING

The column-wise datafile format makes it possible to restructure datafiles - i.e. add or delete properties - very efficiently. All it takes, is reading the structural data (which happens during open anyway), creating or deleting one or more columns, and storing the adjusted structural data. Since no rows are accessed at all, this is virtually instant.

This capability has recently been used to implement a generic database server - it starts out empty, with no idea what data it will store, and it adapts whenever clients connect. Although this is in itself an interesting technical feature, the value lies in long-lived data, since it is now possible to store information in whatever format seems appropriate right now, and to write software which evolves to work with more complex structures overtime. Whenever a datafile is opened, its format can be inspected, applying the transformations which are required to bring it up to date - instantly, during open. Even if the data is read-only, such on-the-fly adaptations can be used for the duration of the run (simply by never committing).

The restructuring capabiltity, and the fact that it can be done almost instantly, is very useful for long-lived data. It simplifies the task of writing software which can handle all file formats used in previous releases (backward compatibility). Adding or dropping information as new releases of a product are issued no longer leads to compatibility problems or lengthy conversions. Furthermore, techniques are being developed to let older software deal with files created by newer releases (forward compatibility). This solves a common problem in today's software, which forces users to continuously upgrade their software to be able to easily exchange files with others.

PERFORMANCE

The performance of this data format is extremely difficult to assess. During the evolution from version 1.0 to the current 2.0 release, improvements have been made of some two orders of magnitude. This is probably not very informative, since it might simply reflect how bad the original design was.

But there are several factors which can help illustrate why this is probably not the case:

Performance matters only in those parts of the code which are frequently executed: i.e. inside loops.
Locality of reference can have a dramatic impact on performance with today's CPU caches.
Minimizing I/O requests has become meaningless, when memory-mapped files and paging is used.
Many repetitive loops over data tend to process a large number or rows, but to access only a few columns.

It turns out that MetaKit performance varies widely, depending on the design and use of the data structures. Note that there are virtually no database indexes. This is not a lack of functionality, but a deliberate design choice. Experience shows that MetaKit frequently outperforms other approaches by mere brute-force. A recent test on a modern Pentium II / 400 MHz CPU, shows that search speeds exceed 500,000 string scans per second and 1,000,000 integer scans per second. There is a definite trade-off between indexing (with techniques such as B-trees and extensible hashing) and the overhead it imposes for updates - and mere scanning.

Evidently, at some point, it will be better to index by key values than to scan millions of strings for a match. This is where subviews in MetaKit come in handy. By structuring data in a "slightly hierarchical" manner, searches again often end up being over just thousands or tens of thousands of entries. Also, keeping data sorted (in some sense) and then doing binary search can often be an effective option (this is part of the Metakit API).

Sequential searching is well suited for keyword searches and for regex pattern-matching, and by its very nature well suited for ad-hoc queries: all properties are searchable without the need to decide what key indexes to define up front.

Some further explanations why performance is high in MetaKit:

No indexes means smaller datafiles, this increases memory residence and reduces random seeks
Inner loops over a single column are very effective: data is contiguous on file and caching works optimally
The code is simple, and therefore small - it is more likely to fit in the CPU cache in its entirety
Row width does not affect search/access performance, nor does the number or size of extra properties.

In conclusion: compact data representations work increasingly well with today's fast CPU's and large RAM sizes. By turning a dataset which is 100 Mb with traditional means, into a few columns requiring 20 Mb, databases not only end up being completely memory-resident, the column-wise structure for each property also makes CPU caches perform at their very best.

LIMITATIONS

There are some limitations in the current MetaKit implementation, these are currently being addressed:

When column sizes grow, update performance suffers (the simplest solution is to introduce some subviews)
Concurrent file updates are currently impossible due to cache coherence issues (solution: client/server or diff's)
Data items straddling a 4 Kb page boundary get read in, instead of taking advantage of memory-mapped files

There are also some issues which affect the file format design (to be addressed in the next file format revision):

Open/commit performance is proportional to structural complexity (need a way to do partial opens/commits)
There should be a way to adjust alignment of data stored on file, so memory-mapped files work better
Additional safeguards are sometimes needed to detect currupted datafiles (such as file-transfer errors).

Work is currently under way to address each of these issues. This will allow the efficient use of MetaKit datafiles when they are in the Gb range. The changes will be made in such a way that backward compatibility can be maintained (i.e. new releases will retain the ability to read 2.0 format datafiles, and will transparently perform adjustments when writing to them).

CONCLUSION

In the end, the proof is in the pudding. As it turns out, three years of MetaKit use and evolution have shown that a lot of performance can be achieved as a result of reducing file format and implementation complexity. Robustness is achieved through minimizing redundancy, while fail-safety is achieved through a form of controlled duplication.

Both of these have led to a format which is very simple, flexible enough to evolve (half of the datatypes were added after the fact), and which has the capability of being instantly restructured to adapt to new data structures. This latter property of the MetaKit datafile format is what makes it future-proof for application developers - in a very practical sense.

There are some drawbacks to using column-wise storage, but the current implementation shows that despite this, a number of technical "tricks" can lead to an implementation (consisting of 16,000 lines of C++) which shows excellent performance in databases scaling to 10,000 views/subviews or 100,000 rows. Future versions will aim to make 10,000 views/subviews times 100,000 rows efficient.