The nuts and bolts of scripting

Scripting languages are not about glue - but about nuts and bolts...

A common description of scripting languages such as Perl, Python, and Tcl is that they are good at gluing existing software components together. This paper will describe why I think that is not good enough, and how a new scripting kernel design can help. I also propose to refine the concept of gluing into two distinct mechanisms: high-level nuts & bolts and low-level glue. And last but not least, this is also somewhat of a runaway paper without a clear plot. You have been warned...

Programmer productivity

As computer hardware gets faster and software gets more complex, we care increasingly about how effective programmers are in designing, implementing, testing, and adapting new software. Scripting turns out to be extremely effective in each of these areas, replacing formal design by exploration of the problem domain and evolution of software as the requirements and technical feasibility gradually become clearer. The key to this is the dynamic nature of scripting, and the immediate way of implementing software. With scripting, experimentation is the way to move forward, to help find out what works and what doesn't, and to iterate between adding new functionality and refinements. With scripting, you don't need to make as many critical choices up front as with more static systems programming languages. Given that design is also a learning process, this means you'll make better decisions later on. Scripting is about late binding of a solution to a problem.

It's all about impedance, really

The real productivity gains of scripting come from being able to re-use and integrate more existing software components. No application developer working with Perl, Python, or Tcl would consider writing software to implement associative containers (such as hash tables), a standard networking protocol, or a new pattern-matching engine. Or even a GUI. It's all there. It works, it's documented, the bugs have been beaten out of this code long ago, and it has often been optimized by experts, with little chance of being improved through some basic tweaking. Scripting languages thrive on being able to combine commonly used techniques and ready-made components. Thanks to their very flexible data models, fetching text from a file, in whatever format it happens to be, manipulating the information, and visualizing it all happens without ever having to think about whether integers are stored as signed or not, how strings are represented internally, or even whether the data needs to be turned into a string form for presentation. Data just flows between all the modules involved in the application - or throws easily wrapped exceptions when limits are reached or unforeseen conditions prohibit normal continuation. Scripting is so easy because there is hardly ever any impedance mismatch between different parts of the code. Unlike in C, C++, Java, and Pascal, data types can vary at run-time, so you don't have to decide or specify such details.
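To make the "data just flows" point concrete, here is a minimal Python sketch. The function name `word_histogram` and the sample text are made up for illustration. It reads text from any file-like object, counts words in a ready-made associative container, and renders a crude bar chart, without a single type declaration along the way.

```python
import io
from collections import Counter

def word_histogram(stream, top=3):
    """Count words from any file-like object and render a text bar chart."""
    # Counter is a ready-made hash-table container: nothing to implement here.
    counts = Counter(word.lower() for line in stream for word in line.split())
    # Integers turn into bars and digits on the way out, with no conversions spelled out.
    return [f"{word:<8}{'#' * n} {n}" for word, n in counts.most_common(top)]

sample = io.StringIO("glue nuts bolts\nnuts bolts\nnuts\n")
for row in word_histogram(sample):
    print(row)
```

The same function works unchanged on an open file, a socket wrapped as a file, or an in-memory buffer as used here.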

But it's also a balancing act

The dynamic nature of scripting languages means that some errors will be caught later than if you had to specify exact datatypes during the usual "compilation phase". For application design, this is fine, since design and trials are taking place as a single activity anyway. But for the lower levels of an application, such as the many library modules one often uses, this freedom is unfortunate. First of all, knowing what datatypes are expected by library modules can be very helpful to understand what the module does and how to use it. Also, many library modules are very simple, and merely exist to shorten a common task. Those have often been written for very specific purposes (read: with fixed datatypes in mind) and/or with high performance in mind. Then there is usually a plethora of system interface code, which is very strictly specified (and again: reliant on fixed datatypes). A lot of this code is either simple or particularly well-optimized - in other words: dynamic types do not offer much here.

Why scripting isn't good enough

The price of highly dynamic (run-time) behavior is often a severe loss of performance. Given that scripting is first and foremost a tool to make the programmer more effective - a goal which today's languages achieve admirably well - the cost of losing performance during execution is often swept under the rug. Rightly so, since "fast enough" is all that matters, and scripting languages are very often more than fast enough at run-time despite the dynamism. Then again, few people have probably ever considered writing the scripting language itself in a scripting language. A good example of an application area which is extremely useful in scripting is the GUI layer - Tk is built on top of Tcl, and relies considerably on some of its features, but it is not written in Tcl... It is common to point to such areas as being better dealt with in C or C++. Another field where a scripting language has not been used widely for implementation is databases. It seems counter-intuitive to consider such very complex problem domains to be outside the reach of scripting. Given the common rule that 80% of performance is determined by 20% of the code, one might expect that scripting could well be used for the remaining 80% of the code (which then shrinks to perhaps just 20%, but that's a different story...). Somehow, scripting languages are not applied to these problem domains - could it be that they are not good enough?

Going back to C and C++?

John Ousterhout's "Tcl" is one example of how scripting can work on an equal basis with C (and hence C++). Tcl can be embedded in a C program, and C "extensions" can be added to a Tcl script environment. So whenever you yell "I need more speed", the Tcl community will respond with "write it as a C extension". The Tk GUI toolkit is a prime example of this, and there are hundreds of other extensions interfacing and speeding up Tcl - in C. If you want to do I/O or use system-specific facilities, the response is always: "write a C wrapper". With tools like SWIG, writing wrappers can even be automated. C also is the lingua franca of system implementation - the Perl, Python, and Tcl languages are all implemented in C. Because of this, and because C is available for just about anything that dares call itself a CPU, scripting has been implemented on an amazing range of platforms. Nothing demonstrates total vendor and industry independence better than being able to run nearly unmodified scripts, using only freely available software. In a way, using C to implement a higher level, in the form of scripting languages, shows perfect synergy. The price to pay is that C compilers are complex beasts, with complex installations, and complex versioning dependencies - especially when considering the very elaborate "run-time libraries" one has to include for serious use. Compilers are a technology about which everyone but developers couldn't care less. You want to have well-balanced wheels on your car (so the tires don't wear out too quickly) - but do you want the balancing equipment itself in your garage? Add tools like autoconf and make, and you'll have a mix which is meaningless to 99.9% of the people on this planet. Not to mention the configuration and build conflicts these tools often create.
Part of the benefit of scripting is that it lets more people do more things with a computer - and the fraction of those who want to understand what compiling means is decreasing all the time. To them, C and C++ are funny names - and show stoppers...

There has to be a better way

There is: get rid of the compiler. There are two ways to do this: one is to move to binary distribution, letting experienced C/C++ developers do the grunt work by providing binaries (yet another funny name), which get incorporated into scripting languages as "dynamic extensions". That moves the burden of solving all compilation issues, and of dealing with all platform variations, to the most qualified people. Unfortunately, the economics of this are all wrong. Why would a creative programmer, who just unleashed a new neato package for free, spend his time agonizing over all the messy details of John Doe's nearly extinct HyperCPU from Bohemia? The second way is to get rid of C and C++ entirely. Ahem, that will never happen, there simply is too much C code out there... Okay, here is a slightly less extreme goal: avoid the need for C in the most common cases: increasing the speed of specific tasks needed in a scripting language, and interfacing to existing C libraries. Then, one could focus on getting some core functionality right, but avoid the need for C in perhaps 90% of the cases where it is being used now. Now, of course, you ask: "how can one avoid C?". The answer is provocatively simple: with machine language. We don't need C for telling things to a computer, we merely chose to create C because it was so much simpler and more productive than machine language. The first implementations of C generated symbolic assembly language, remember? Then, we found out that the intermediate assembly language was irrelevant - once there was a compiler which generated it, why not simply subsume the assembler's job and go straight to machine code? Scripting is now repeating this trend for C - why not skip C (and assembly) and go straight to machine code?

Machine code - are you kidding?

No, this is a serious proposal. Does that mean we get all the machine dependencies back in? Not necessarily - one could use threaded code (as in Forth) for a general design, optimizing to more machine-specific cases later, and only when it makes sense to spend the extra effort for a particular architecture. Is this the beginning of a huge step back in abstraction, then - fiddling with concepts as primitive as registers, ALUs, and such? Not necessarily (same argument: Forth), but it must also be noted that the machine code is generated by higher-level implementations. Take Python: why not write a Python system in Python, letting it generate runnable low-level code for itself? As with compilers, there is a bootstrap problem, but apart from that it should be feasible. So, are we going to move around only scripts, or a huge set of language-specific bytecode collections? Not necessarily - Oberon uses a highly compressed form of abstract syntax trees (AST) as intermediate language, generates machine code while loading the software, and proves that it can offer higher performance than running pre-compiled machine code off the disk (due to less I/O: disks are slow, CPUs are fast). There is no reason why ASTs could not be used for highly dynamic scripting languages - it might in fact be feasible to come up with a generic AST structure which works for several languages. We may not even need to do such a good job of generating machine code, given that it ends up running so close to the metal. The key is that intermediate forms are "pure content" (ASTs), instead of some language designed a few decades ago (C), aimed at programmers.
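As a rough illustration of the "AST as intermediate form" idea, here is a sketch using Python's standard `ast` module. It is only an analogy: the low-level code produced below is CPython bytecode, not machine code, and the one-line source string is made up for the example.

```python
import ast

source = "total = sum(n * n for n in range(4))"

tree = ast.parse(source)               # source text -> abstract syntax tree
code = compile(tree, "<ast>", "exec")  # tree -> runnable low-level code (bytecode here)

namespace = {}
exec(code, namespace)                  # run the generated code
print(namespace["total"])              # 14
```

The tree is plain data: it can be inspected, transformed, or stored before anything is generated from it, which is the property the argument above relies on.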

Low-level glue is for tinkerers

Scripting, in all its high-level abstraction, desperately needs low-level code and all the tricks one can come up with to extract maximum performance from silicon. That means core data structures need to be extremely carefully designed and optimized, something which no mortal can get right the first time. There are agonizing trade-offs, brilliant tricks, extensive profiling runs, and totally unexpected low-level bugs to deal with. Yet even in all this bewildering low-level code, the 80/20 rule *still* applies. And finding those 20% remains one of the most complex tasks. So why hamper development by doing everything in C? Of course C is an excellent choice, but it hampers fast-paced tinkering and evolution enormously. I don't mean the compile-run cycle, which is almost unnoticeable these days, but the fact that compilation is platform-specific, meaning that a solution has to be built and tested again for each platform involved. That is simply not good enough - it is the stone age, when we compare it with how scripted software development races forward these days. That is why I propose to consider, ahem, Forth as the new low-level glue of scripting. While this language is absolutely horrendous for human use, it really need not play a dominant role at all. It can be the vehicle to glue C-based modular routines together, doing the impedance matching of moving arguments around, doing basic datatype conversions, basic high-speed iteration, and so on. I'm not advocating rewriting things as basic as strlen() in Forth, but simply hooking lots of these calls up as Forth primitives, so that they can be efficiently glued together. The term glue is most appropriate here, since these are bindings which form a core which gets refined and tweaked, but will rarely, if ever, be taken apart again. 
A foundation, put together by experts, with an initial spurt of creativity and evaluation, ending in a long-lasting bond between what will be sublimely efficient and well-modularized core components. Some of it will be turned into platform-specific assembly code (for which again Forth has everything in place already). These are the Formula-1 race tracks, where the most talented thrill-seekers meet to create race engines which stretch the limits of what silicon can do. Everyone else just marvels at the results, and enjoys the power that ends up being in their high-level scripting environment.
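The idea of hooking existing routines up as primitives can be sketched in a few lines. This is a toy in Python, not Forth: the names `wrap`, `run`, and `PRIMITIVES` are hypothetical, and Python's `len()` merely stands in for C's strlen(). What it shows is the glue doing the argument shuffling while the real work stays in the host routines.

```python
def wrap(func, arity):
    """Turn an ordinary host function into a stack primitive."""
    def primitive(stack):
        args = [stack.pop() for _ in range(arity)][::-1]  # restore call order
        stack.append(func(*args))
    return primitive

# Hypothetical primitive table; len() stands in for C's strlen().
PRIMITIVES = {
    "strlen": wrap(len, 1),
    "+": wrap(lambda a, b: a + b, 2),
    "swap": lambda stack: stack.append(stack.pop(-2)),
}

def run(program, primitives, stack=None):
    """Execute whitespace-separated words against a data stack."""
    stack = list(stack or [])
    for word in program.split():
        if word in primitives:
            primitives[word](stack)
        else:
            stack.append(int(word))   # anything else is an integer literal
    return stack

print(run("strlen swap strlen +", PRIMITIVES, ["nuts", "bolts"]))   # [9]
```

Adding another routine means adding one entry to the table; the glue program itself is just a string that can be changed without recompiling anything.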

High-level nuts and bolts are for productivity

With the racing engine in place, the stage is set for a really enjoyable voyage. There are many languages, and there will be many more, which will meet the needs of a huge variety of application programmers. There will never be a single best language, though some will be more popular than others - simply because some application domains and some skill levels are much more wide-spread than others. Python, Tcl, Perl - what could better illustrate how widely the workings of the human mind vary? And that's just using examples from imperative and very similar scripting languages. As each of these languages illustrates, real productivity comes not just from language features, but from the supporting and rapidly evolving set of libraries and extensions. Yet here again, we are in the stone age. Or perhaps rather: in medieval times, with lots of little kingdoms, all different, yet so extremely alike. While it is very valuable to have competition to breed variety and have Darwinian survival mechanisms bring us the best we can produce on this planet, there is a point where we should look back, evaluate what there is, and reduce the variety down to the best solutions. Variety is stifling once it comes to moving on, when you're trying to build on top of what has been done before. If I were to create an FTP-based synchronization tool, for example, why should I have to choose my development tool based on which one has the best FTP implementation? Shouldn't we be able to consider FTP a solved issue, have it integrated into several scripting languages in a generic way, and then decide how to proceed on more conceptual grounds? There will be trade-offs, but as scripting has shown, many trade-offs are fine, such as having stdio or TCP/IP used as core technologies. 
With a scripting kernel, and its goal of serving multiple languages, syntaxes, and programming models - even at the cost of slight performance loss - one could try to have FTP implemented once, and use it easily from Tcl, Perl, Python. This won't happen soon, but I expect the stdio's, regexp's, hashing techniques, and lots more to eventually sneak their way into this approach - as well as a dozen internet protocol handlers and a truckload of database connectivity solutions. With a mind open for compromises, it is not hard to pick lots of candidates for generic integration. Despite popular belief, compromising is easy - scripting compromises raw performance and access to all platform-specific features all the time, and look how far it is getting us. Taking building blocks and connecting them is at the heart of scripting. This is what makes scripting the Lego and Fischer Technik of computer software. Call it Meccano even - because tying things together, taking them apart again, adding on, fixing, and extending is what lets us all be so productive. This is why nuts and bolts makes sense: modular packages, made to become very strong, yet reconfigurable forever.

Now what do I do with my C/C++ code?

What has been described so far comes down to keeping all script languages intact, selecting generic implementations for many domains from the huge body of script libraries, and using a multi-language scripting kernel to "orthogonalize" and merge code across languages. Not as a goal in itself, but to focus our collaborative efforts and to move past issues which can be considered "solved" for most practical purposes. At the same time, the low-level code and today's scripting implementations are far too rigid. The Tk GUI is a brilliant piece of engineering. But it is too big and has stopped evolving, because such a large package should not be written entirely in C. Similarly, the Tcl, Python, and Perl implementations are stone-age monoliths. They cannot cope with change, they cannot be improved or extended easily or fast enough. Note that the issue of remaining backward compatible is not at stake - there is no reason why replacing core data structures in the scripting kernel would have to affect compatibility. But what has to be done is to replace the fabric with which Tcl, Tk, Python, and Perl have been built - without throwing out much code. This may sound impossible, but it might not be at all. Suppose one were to extend SWIG to generate trivial interfaces to Forth (I'm using pForth for now, a C-based implementation), as well as a tiny bit of Forth glue to embed calls. Then, run all of Tcl, Python, and Perl through SWIG. If - at the C level - the design is sufficiently modular, that would instantly add the capability to replace series of internal C calls with Forth, and therefore even in the scripting language itself. Take a simple task such as sourcing/using/importing a script file and executing it. We can keep things like: open/read/close file, and eval. But we could easily rewrite all the housekeeping and checking involved in doing, say, Tcl's "source" in Tcl. Or in Python. The gain? 
Modularity of the C code, with all special-casing, neat over-the-net loading ideas for later, and so on being done in fast-forward scripting. No more testing on all platforms. No more compiles. This is just scratching the surface, of course. But yes, I think one can re-engineer 80% of today's Tcl without C. Same for Tk, and all other languages. As for hashing, stdio, regexp, and such: of course that will remain C. But drastically modularized.

A tiny scripting kernel

The key to performance is modularity. First of all, 80% of performance is definitely caused by 20% of the code. The rest is totally irrelevant (in terms of speed). Figuring out what those 20% are, and getting them right is crucial. Ignoring that issue for the other 80% is equally important. Modularity helps in two ways: first of all, when you find out precisely how interactions work, you are halfway to finding out which interactions determine performance. And the second benefit of seeking extreme modularity is that one can experiment and plug in different approaches to come up with an optimal one. As operating system designs have demonstrated with micro- and nano-kernels, there is no reason why the core of an application - the implementation of a scripting language in our case - has to be large. It's all bits, representing data as well as actions. The only "large" issue is the choice and design of the core data-structures and API's. The leaner and meaner the core, the more pluggable and flexible the result. In other words, it's about time we started looking at the core of scripting as an engine. And not just for one language, but for a range of similar languages - let's face it: Perl, Python, and Tcl are really 95% the same - primitive data types, generic data structures, system calls, dynamic modules, matching / regular expressions, string manipulation, type conversions, file I/O, network interfaces, and on and on... As Forth demonstrates, when there is neither a compiler nor a monolithic bytecode/token interpreter, an inner interpreter can be just a few hundred bytes (it's partly a matter of definition, of course). Let's aim for a tiny Scripting Kernel - and see how far it gets us in linking together popular code bases and operating system interfaces. And let's plug in language front ends and clever data structure implementations.
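How small an inner interpreter can be is easiest to see in a sketch. The following is an illustration in Python of the threaded-code idea, under simplifying assumptions: a "thread" is a list whose entries are primitives (callables), nested threads (lists), or literals, and host recursion stands in for the return stack a real Forth would use. The words `dup`, `mul`, `SQUARE`, and `FOURTH` are made up for the example.

```python
def execute(thread, stack):
    """Inner interpreter: walk a thread, entry by entry."""
    for word in thread:
        if callable(word):
            word(stack)              # primitive (native code in a real Forth)
        elif isinstance(word, list):
            execute(word, stack)     # colon definition: descend into another thread
        else:
            stack.append(word)       # literal value
    return stack

def dup(stack):
    stack.append(stack[-1])

def mul(stack):
    b = stack.pop()
    a = stack.pop()
    stack.append(a * b)

SQUARE = [dup, mul]          # roughly  : square dup * ;
FOURTH = [SQUARE, SQUARE]    # roughly  : fourth square square ;

print(execute([3, FOURTH], []))   # [81]
```

Everything beyond the handful of lines in `execute` is pluggable: primitives and definitions are just entries that can be added or replaced.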

Evolution, not revolution

Introducing a scripting kernel, and taking on so many languages, and adding an entire Forth system, is going to be a huge project. Looking at what we currently have in C, I think it implies dealing with, and ripping apart, maybe up to a million lines of C code. The traditional approach is to pump up the adrenaline, get a lot of people very excited about this, announce a new project, stretch expectations to the breaking point, and kick it all off with a we-will-revolutionize-the-world type of announcement and press-release. Which then takes a few years before failing miserably. Yikes. Let's not do that. How about the following: create a three-headed monster, being a single statically-linked executable containing Tcl, Tk, Python, and Perl. Add pForth to make matters worse :) Implement the SWIG trick to bring everything into Forth as callable code. Start modularizing on a dozen different ends, given that Forth can glue A's regexp code to B's context, for example (and I/O, and channels, and stubs, and, and, and...). Get the cheapest results first, throwing out massively redundant code. Keep things working while doing all this. Postpone those ideas which cannot satisfy this requirement. Then, when it is clear what there is and how it hangs together, start introducing new data structures with the sole aim of simplifying existing ones. This will mean that the code will have periods of not working at all. Don't aim very high - not perfect data structures and algorithms, just a few which help integrate between languages. Try to get I/O, regexps, and hashing in line, for example. I think many of the activities just mentioned can be "massively parallelized" over several developers. What is happening is that a group of people will learn what modularity really means, and that a few specialists can adopt/introduce specific technologies, and see whether they work out. 
Once this is well underway, it will become obvious, I expect, how to configure single-language builds of this super-monolith. At the same time, stubs can make their appearance here, to turn the newly discovered/created modularity into shared-library modularity. This might well be a mere technicality, once we understand how modules interact. Some issues which can be tackled almost completely independently are: 1) abstract syntax parse trees, as foundation for new logic representations and to open up introspection and self-compilation, 2) using explicit call stacks to implement coroutines, generators, persistence, and resumability, 3) generating Forth code on the fly, to supplement byte-code interpretation, and to experiment with pluggable platform-specific machine code optimizers, 4) wrapping/packaging, 5)
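Item 2 in the list above, explicit call stacks, can be illustrated with a small sketch. It assumes a made-up instruction format (lists are calls, the string "add" is the only operation, anything else is a literal). The point is that once the call stack is ordinary data rather than the host's own stack, a running computation can be paused, saved, and resumed later.

```python
import pickle

def run(state, steps):
    """Advance the interpreter by at most `steps` instructions."""
    data, frames = state["data"], state["frames"]
    while frames and steps > 0:
        code, pc = frames[-1]
        if pc == len(code):
            frames.pop()                 # routine finished: return to the caller
            continue
        frames[-1][1] = pc + 1           # advance this frame's program counter
        op = code[pc]
        if isinstance(op, list):
            frames.append([op, 0])       # call: push a new [code, pc] frame
        elif op == "add":
            data.append(data.pop() + data.pop())
        else:
            data.append(op)              # literal
        steps -= 1
    return state

SUM = [1, 2, "add"]
MAIN = [SUM, 10, "add"]

state = run({"data": [], "frames": [[MAIN, 0]]}, steps=2)   # pause inside SUM
frozen = pickle.dumps(state)                                # persistence
resumed = run(pickle.loads(frozen), steps=100)              # resume elsewhere, later
print(resumed["data"])                                      # [13]
```

Coroutines and generators fall out of the same arrangement: each one is just its own `frames` list that the scheduler chooses to advance or leave alone.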

Lots of tough choices...

Evidently, there are many compromises ahead. To mention a few, and my current preferences: Unicode support will be limited to "just" UTF-8 and a corresponding set of string dissect/compare functions. Highly vectorized combination of low-level code (which is a way of saying that stub-vectors will be king). No pre-emptive threading (but explicit stacks will easily support collaborative threading). A core so tiny you will wonder where it is (meaning: a trivial Forth is the only discernible center there is). Pluggable data structures, pluggable O/S interfaces, pluggable modules, pluggable "skins" (meaning: extreme scrutiny and conservatism in module dependencies). Performance is important but will be sacrificed when modularity is at stake (rationale: clever people will find more efficient solutions later anyway, so the most important goal is to make sure they can implement their ideas when that time comes). Oh, and to kill one issue once and for all: it's open source all the way, with a Tcl/Python-like you-do-anything-you-like-with-it style license (unlike the rest, this is non-negotiable).
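What "stub-vectors" buy can be sketched as follows. This is a simplified illustration, not how any existing implementation does it: the names `stubs`, `plug`, and `bucket_for` are hypothetical. Callers only know a slot name, so the implementation behind the slot can be swapped without touching them.

```python
import zlib

stubs = {}   # the "stub vector": slot name -> current implementation

def plug(name, func):
    """Install or replace the implementation behind a named slot."""
    stubs[name] = func

def bucket_for(key, buckets=8):
    """Kernel-side code: it knows the slot name, not who implements it."""
    return stubs["hash"](key.encode("utf-8")) % buckets

plug("hash", sum)               # naive first attempt: add up the byte values
first = bucket_for("ab")

plug("hash", zlib.crc32)        # drop-in replacement, callers unchanged
second = bucket_for("ab")

print(first, second)
```

This is the trade-off stated above in miniature: every call pays for one extra lookup, and in return a better hash function can be plugged in whenever someone comes up with one.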

Winding down

As announced at the beginning, this paper has no clear ending. I think everything mentioned can be attributed to the fact that scripting is a huge step forward in programmer productivity at what I'd like to call the "nuts & bolts" level, but that today's implementations are stifling evolution at underlying levels. There is no low-level glue flexibility at all. Yet as a systems-programmer, I desperately need to fiddle at that low level. Forth may be the answer to this dilemma, since it is small, portable, high-performance, and infinitely adaptable. Unfortunately, introducing Forth at this stage means we have to rip apart highly successful (and colossal/monolithic) pieces of software. That's a monumental task, and the big question is: can the open source communities of today adapt to this challenge, and do they want to?

© June 1999
