metakit-fileformat - Metakit File Format
How the metakit library uses files in the specified format is outside
of the scope of this document, although in some sections, hints for
specific uses might be given.
To ensure an unambiguous use of all terms inside of this document and
when discussing its contents a glossary was created. See section
GLOSSARY at the end.
It was decided to specify the format in the form of a grammar in
Extended Backus-Naur Form (EBNF) augmented by free-form text to
capture the context-sensitive parts of the language. It is assumed
that the reader of this document is familiar with EBNF.
In contrast to most other systems which handle their data row-wise it
manages the data in a column-oriented way, i.e. all data for a single
column is handled together. This characteristic is reflected in the
file format too.
What would be called tables in other relational database systems are
known as views in Metakit. Views consist of columns,
which store specific pieces of data in cells. The cells at the
same row-index in all columns of a view are called a row.
Another difference is metakit's ability to define subview
columns, which are columns where the data in each cell is a complete
view in its own right, although they are sharing the same structural
definition.
This section describes only the basic mapping required to create the
table of contents (See TableOfContents), and none of the
secondary itemvectors indirectly reachable through the
primary itemvectors of a column listed in the table of
contents.
This situation changes when data of varying and arbitrary length is
involved, be it strings or just binary data (types S and
B). For them the simple method of storing all the data in one
itemvector and the sizes of the items in a second scales badly as even
minuscule changes cause the copying of large amount of data.
To evade this trap the file format uses a slightly more complex
method. Instead of only two itemvectors it employs three. The first
two are the same ones as for the simple method, with a small
change. While the first itemvector contains the sizes for all items,
the second itemvector contains only the data for the items with a size
> 0. Items whose size is recored as zero are not stored in the second
itemvector, but are indirectly reachable through the third
itemvector, a catalog. Each of the third itemvector's items records
the location of another itemvector in the file on the one hand, and
information determining to which row in the column the item belongs
to. In other words, how to interleave the items reachable through the
catalog with the items in the first two itemvectors to reconstruct
their proper order at the logical level of the column.
With the above structure in place any writer of a database is now free
in his decision where to actually place the variable sized data of a
cell when writing to the file. Namely either directly into the second
itemvector, or into a block of its own with the location of that block
recorded in the catalog vector. By storing smaller data directly and
larger data indirectly the performance impact of the large data is
reduced considerably, because now only the itemvector containing the
catalog has to be copied for changes, whereas the large data blocks
often can be left in the location initially given to them.
The relevant symbols of the grammar are IVecCatalogData and
VariableMapping. See section FORMAT GRAMMAR for
their definition.
DESCRIPTION
This document specifies the file format used by the metakit database
library for persistent storage of its databases. The same format is
also used for serialization and subsequent transfer of a database over
some communication system, like pipes and sockets.
BACKGROUND
The background for this specification is metakit
(https://www.equi4.com/metakit), a flexible database system
developed by Coen Siegerink (mailto:[email protected]).
TYPE DEFINITIONS
Metakit supports the following six types for its columns. The key in
the list below is the character used by metakit as indicator for that
type. See section STRUCTURE DEFINITION for the place
in which these characters are used.
COLUMN MAPPING
When metakit stores column data into a file or serialization it places
them into one or more itemvectors, the physical containers for
the data. How many itemvectors are required is dependent on the type
of the column, and on the data contained in it.
The exact contents of each itemvector are described in the upcoming
grammar. See IVecData and its variants.
VARIABLE SIZED DATA
One of the consequences of using a column-wise representation for views
is that for any insertion, deletion, or change of a row the system has
to relocate and copy all itemvectors for all the columns in the
view. This is not so big a problem for data of a fixed size, like for
the types I, F, D, and L. For them
this operation is only invoked when inserting or deleting
row. Changing the value of a cell invokes only the relocation and
copying of the itemvectors for one column, and they tend to be
relatively small.
For the example let us assume that we have items 0 and 3 and 6 all having small amounts of data, items 1 and 4 are empty, and 2 is a larger memo item. Then the situation would be: Column 0, the data: concatenated contents of items 0, 3, and 6. Column 1, the sizes: sizes for entries 0, 3, and 6, the rest zeroes. Column 2, the memos: 2 as byte-packed integer, meaning skip 2 rows the size of the data in row 2, as byte-packed integer the pointer to the data in row 2, as byte-packed integer |
0x80 : First is last byte, payload = 00000000/2 = 0/10 0x00 0x80 : Negative number payload = 0000000/2 1-complement = 1111111/0 = -1/10 0x94 : First is last byte. payload 0011000/2 = 20/10 0x00 0x94 : Negative number payload = 0011000/2 1-complement = 1100111/2 = -21/10 |
In contrast to the overall file format the lexical unit here is a single UTF-8 character (Char).