This describes a first implementation of Starchive - an associative and versioned Starkit repository.
The entire repository core consists of two sets of files:
- A huge collection of files, stored in the data/ directory, with meaningless names ("signatures") derived from the file contents and size. Each file is stored exactly once (in compressed form).
- A collection of starkit "maps", one per starkit version, in the kits/ directory. Each contains a listing of the files and directories in that starkit, as well as the corresponding names in the data/ area. These maps are relatively small, about 1% of the original starkits on average.
Access to the repository is by starkit name and version ID. If you know both, the Starchive has the information needed to reconstitute the original starkit. If you know only a starkit name, you can enumerate the available versions and their latest modification dates.
Internally, Starchive keeps all pieces separately, but has all the information needed to reconstruct each version. The advantage of this approach is that multiple versions of a starkit that share many identical files are stored very efficiently. Changing a single file and resubmitting a starkit to the archive will add one new file and one new starkit map to the Starchive repository, regardless of how many files the starkit contains.
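The storage behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual Starchive code: the class `TinyStarchive`, its methods, the in-memory dictionaries standing in for data/ and kits/, the use of zlib, and the simplified "hex digest plus size" signature are all made up for this example.

```python
import hashlib
import zlib


class TinyStarchive:
    """Hypothetical in-memory stand-in for the data/ and kits/ directories."""

    def __init__(self):
        self.data = {}  # signature -> compressed file contents (data/)
        self.kits = {}  # (kit name, version ID) -> {path: signature} (kits/)

    def submit(self, name, version, files):
        """Store a kit given as {path: bytes}; return the number of new data entries."""
        added = 0
        kit_map = {}
        for path, contents in files.items():
            # Simplified signature for this sketch: MD5 hex digest plus size.
            sig = f"{hashlib.md5(contents).hexdigest()}-{len(contents)}"
            if sig not in self.data:
                # Each distinct file is stored exactly once, in compressed form.
                self.data[sig] = zlib.compress(contents)
                added += 1
            kit_map[path] = sig
        self.kits[(name, version)] = kit_map
        return added

    def versions(self, name):
        """Enumerate the version IDs known for a kit name."""
        return sorted(v for (n, v) in self.kits if n == name)

    def reconstruct(self, name, version):
        """Rebuild the {path: bytes} contents of one kit version from its map."""
        kit_map = self.kits[(name, version)]
        return {path: zlib.decompress(self.data[sig]) for path, sig in kit_map.items()}


store = TinyStarchive()
v1 = {"main.tcl": b"puts hello\n", "lib/util.tcl": b"proc util {} {}\n"}
v2 = {**v1, "main.tcl": b"puts goodbye\n"}  # same kit, one file changed

added_v1 = store.submit("kitten", "1", v1)  # both files are new
added_v2 = store.submit("kitten", "2", v2)  # only the changed file is new
```

After the second submit the store holds three data entries and two maps: the unchanged file is shared between both versions, and either version can still be rebuilt with `reconstruct`.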
Some more comments about this approach:
- The core data is stored in individual files, not a database - this was done to allow maximum freedom in choosing database designs on top of this for searching, reporting, and repository maintenance.
- File names are based on MD5 (128 bits) and file size (27 bits), leading to a 31-character base-32 name (using chars 0..9 and A..V).
- Files are stored in compressed form, i.e. copied verbatim in and out of Starkits using the raw Metakit interface.
- The data/ directory is split into 1024 subdirectories so that millions of unique files and versions can be handled.
- The Metakit (MK) layout for starkit maps is: {name parent:I {files {name size:I date:I md5:B}}}
- Starkit map names are made from the starkit root name and its version ID, e.g. kitten-67149-18796
- Starkits can be reconstructed on the fly - even doing so during download is feasible.
- Starkit headers and starpacks are not yet tracked; this will be addressed later.
- The Starchive repository design can be used passively, i.e. as a static collection of files served by an HTTP or FTP server.
- Because of the passive option, starchives can be mirrored with tools such as rsync.
- There is no state involved between clients and the server, neither for submitting nor for fetching/updating starkits.
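The file naming scheme in the list above works out as 128 + 27 = 155 bits, which is exactly 31 base-32 characters of 5 bits each. A minimal Python sketch of such a name follows. Several details are assumptions, since the text does not specify them: the order in which digest and size are packed, the rejection of files too large for 27 bits, and the use of the first two characters (10 bits, so 1024 values) to pick a data/ subdirectory. The function names `signature` and `data_path` are hypothetical.

```python
import hashlib

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUV"  # base-32 digits 0..9 and A..V
SIZE_BITS = 27
NAME_LEN = 31  # (128 + 27) bits / 5 bits per character


def signature(contents: bytes) -> str:
    """Return a 31-character base-32 name built from the MD5 digest and the size."""
    size = len(contents)
    if size >= 1 << SIZE_BITS:
        # Assumption: larger files are simply rejected in this sketch.
        raise ValueError("file too large for a 27-bit size field")
    digest = int.from_bytes(hashlib.md5(contents).digest(), "big")
    # Assumption: digest in the high 128 bits, size in the low 27 bits.
    value = (digest << SIZE_BITS) | size
    chars = []
    for _ in range(NAME_LEN):
        chars.append(ALPHABET[value & 31])  # take 5 bits at a time, lowest first
        value >>= 5
    return "".join(reversed(chars))


def data_path(sig: str) -> str:
    """Assumed layout: the first two characters select one of 1024 subdirectories."""
    return f"data/{sig[:2]}/{sig}"


sig = signature(b"hello")
path = data_path(sig)
```

Because the alphabet matches Python's own base-32 digits, `int(sig, 32)` recovers the packed value, so the size and digest can be read back from a name without opening the file.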
This work-in-progress Starchive is at https://www.equi4.com/starch/.