Frank Sergeant asks, on 21-12-1999:
> I gather that I can keep several separate views in a single file or
> each in its own file. What are the tradeoffs? I'll have just a
> few views with large numbers of rows. [... 1,000 to 15,000 ...]
I would not worry about those amounts. The advantage of keeping it all in a single file is self-consistency - everything changes in a single transaction.
My rule of thumb is: stay under 100,000 rows per view and under 10,000 subviews per file (do count all subviews, even empty ones - i.e. a view containing a nested-view field should not have more than 10,000 rows). With many strings over 1 Kb, use memo fields, and count each memo field as a subview in the above rule.
As for combining them all into a single file: yes, as much as possible, except when these limits risk being exceeded. So a datafile with 100 unnested views of 100,000 rows plus 1 nested view of 10,000 rows is ok.
And for actively changing databases, stay a little lower and definitely do not commit on every row change.
> I must be prepared to have, eventually, well over 100,000 rows. I've
> been thinking that I'd make the ledger items view a subview of the
> customer view, as each customer's subview would then contain a
> reasonably small number of rows.
If by "eventually" you mean six months from now, say, then don't worry. Try to stay away from 100,000 subviews - that is a bit slow to open at the moment (seconds).
Time will work in your favor - I intend to take MK's scalability way up.
> I was delighted to read the description of the internals of Metakit,
> particularly about storing by columns instead of by rows and how that
> can (perhaps depending on the application) eliminate the need for
> indexes. I love it. I do have a question, though. How do you find
> the item in a column given the row number? [... variable width ...]
Ah :) - that's the fun part.
With string fields, I do scan. Lazily, and once. Opening a file, changing the last entry of a huge string view, and committing it, is slow. A more segmented design will completely solve that, I expect.
So if you expect large columns of strings (either very large strings, or lots of them - say > 10 Mb columns), then consider using some subviews to break things up. For smaller cases, it'll often be just fine.
Or you could use fields of type "B" (Bytes), which maintain a separate integer column of sizes, so there is no scanning at all. In Python and Tcl "B" behaves as a string (or byte arrays in the case of Tcl >= 8.2).
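To illustrate the difference, here is a small pure-Python sketch - not Metakit's actual storage code, and the packed layout is a simplification - contrasting the two ways of locating row i in a variable-width column: scanning the string bytes versus consulting a separate column of sizes, as "B" fields do.

```python
# Hypothetical sketch of two ways to find item i in a variable-width
# column (NOT Metakit's real internals, just the idea behind them).

def item_by_scanning(data, row):
    """Strings packed end to end with NUL terminators: finding row i
    means skipping i strings, so the work grows with the row number."""
    pos = 0
    for _ in range(row):
        pos = data.index(b"\x00", pos) + 1   # skip one string
    return data[pos:data.index(b"\x00", pos)]

def item_by_sizes(data, sizes, row):
    """With a separate per-row size column, the offset is just a prefix
    sum over the sizes - no scan of the string bytes themselves."""
    offset = sum(sizes[:row])
    return data[offset:offset + sizes[row]]

strings = [b"alpha", b"bee", b"gamma"]
packed = b"\x00".join(strings) + b"\x00"            # scan-based layout
blob, sizes = b"".join(strings), [len(s) for s in strings]

print(item_by_scanning(packed, 2))    # b'gamma'
print(item_by_sizes(blob, sizes, 2))  # b'gamma'
```

In a real implementation the prefix sums would of course be cached as offsets rather than recomputed per access; the point is only that a size column removes the need to inspect the string data at all.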
-jcw
------------------------------
21 Dec 1999: Thanks. It looks like I don't have any worries at the moment about the number of records I'll need to handle.

Just to be sure I understand, suppose I have a customers view with 10,000 customers (rows), and that there is a subview, say for phone numbers. Suppose each customer has 3 phone numbers. There are then 30,000 phone-number rows all together, but each customer has just 3 of them. Do you count this database as having one view with one subview, or as one view with 10,000 subviews (one subview for each customer)? For row counts, do you count 10,000 for the customers view and an average of 3 for the phone subview, or do you count the phone subview as having 30,000 rows?
-- Frank Sergeant ([email protected])
JC: Each of the phone subviews has 3 rows, the customer view has 10,000 rows, and there are 10,001 (sub)views in all. Open/commit performance is currently proportional to the total number of columns (i.e. #views x #cols/view) - hence the suggestion to stay under 10,000 total subviews in the entire file. Access/modify performance is proportional to column size - hence the suggestion to stay under 100,000 rows in each view. But don't let this scare you, it's a gradual trade-off - I hope to raise both limits by (several?) orders of magnitude later.
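For concreteness, the counting rules above can be checked with a few lines of plain arithmetic (no Metakit needed) using the example numbers from the question:

```python
# The example from the question: 10,000 customers, 3 phone numbers each.
customers_rows = 10_000
phones_per_customer = 3

# Counting views: one customers view, plus one phone subview per
# customer row (subviews are counted per row, not per field).
total_views = 1 + customers_rows

# Counting rows: each phone subview is counted on its own, so the
# 30,000 total phone rows never appear as one big column.
rows_per_phone_subview = phones_per_customer

print(total_views)             # 10001 (sub)views in all
print(customers_rows)          # largest single view: 10,000 rows
print(rows_per_phone_subview)  # each phone subview: 3 rows
```

So the design sits comfortably within the "under 100,000 rows per view" guideline, and its roughly 10,000 subviews is exactly the regime described as a gradual trade-off rather than a hard wall.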