Selection performance in Mk4tcl 1.1

The Tcl interface to Metakit only provides a linear-search based selection command (mk::select). There are plans to expand this to offer an interface to MK's sorted views, which are much more efficient and support binary search access to the data. A more elaborate API has been implemented in the MkWrap extension for Python, and will probably serve as model for the next Tcl interface.

For now, though, linear search is all there is. Here is a script which illustrates the performance implications:

    package require Mk4tcl
    set FILE ampleev.dat
    file delete $FILE
    mk::file open db $FILE
    mk::view layout db.list {i:I s:S}
    proc fill {{limit 10000}} {
        mk::view size db.list $limit
        for {set i 0} {$i < $limit} {incr i} {
             mk::set db.list!$i i $i s [format %06d $i]
        }
    }
    proc fetchall {} {
        set limit [mk::view size db.list]
        for {set i 0} {$i < $limit} {incr i} {
             mk::get db.list!$i
        }
    }
    proc select_I {value} {
        mk::select db.list -count 1 i $value
    }
    proc select_S {value} {
        mk::select db.list -count 1 s $value
    }
    foreach x {10 100 1000 10000 100000} {
        set count [expr {100000/$x}]
        if {$count > 1000} {set count 1000}
        set half [expr {$x/2}]
        set hstr [format %06d $half]
        puts "\n SIZE $x - $count iterations:"
        puts "  fill   = [time {fill $x} $count]"
        puts "  all    = [time {fetchall} $count]"
        puts "  I miss = [time {select_I -1} $count]"
        puts "  S miss = [time {select_S x} $count]"
        puts "  I half = [time {select_I $half} $count]"
        puts "  S half = [time {select_S $hstr} $count]"
    }

The ouput of this script, using Tclkit on Windows 98 (PII/400), is:

 SIZE 10 - 1000 iterations:
  fill   = 720 microseconds per iteration
  all    = 600 microseconds per iteration
  I miss = 220 microseconds per iteration
  S miss = 270 microseconds per iteration
  I half = 220 microseconds per iteration
  S half = 220 microseconds per iteration
 SIZE 100 - 1000 iterations:
  fill   = 7250 microseconds per iteration
  all    = 6320 microseconds per iteration
  I miss = 1710 microseconds per iteration
  S miss = 2300 microseconds per iteration
  I half = 940 microseconds per iteration
  S half = 1260 microseconds per iteration
 SIZE 1000 - 100 iterations:
  fill   = 71900 microseconds per iteration
  all    = 63200 microseconds per iteration
  I miss = 17000 microseconds per iteration
  S miss = 23600 microseconds per iteration
  I half = 8800 microseconds per iteration
  S half = 11500 microseconds per iteration
 SIZE 10000 - 10 iterations:
  fill   = 747000 microseconds per iteration
  all    = 632000 microseconds per iteration
  I miss = 170000 microseconds per iteration
  S miss = 236000 microseconds per iteration
  I half = 82000 microseconds per iteration
  S half = 115000 microseconds per iteration
 SIZE 100000 - 1 iterations:
  fill   = 8950000 microseconds per iteration
  all    = 6260000 microseconds per iteration
  I miss = 1650000 microseconds per iteration
  S miss = 2250000 microseconds per iteration
  I half = 880000 microseconds per iteration
  S half = 1160000 microseconds per iteration

The resulting datafile contains 1,100,035 bytes.

If you need to implement faster selections, one option is to store all key values in a Tcl array, with the row number as key.

-- JC

P.S. Thanks to Andrew Ampleev for bringing up this issue.