Tech Notes: Software Toolkits

Table of contents

  1. Tools and Toolkits
  2. Software Tools and Toolkits
  3. Unix/GNU Text-Oriented Tools
  4. XML Text Oriented Tools
  5. Relational Database Tools
  6. Procedure & Object-Oriented Tools
  7. Symbolic-Expression Tools
  8. General Knowledge Representation Tools

Context for these notes

These notes are inspired by the first Tech Salon meeting, which began the conversation about software tools and toolkits. Software tools and toolkits are a very large subject area with lots of fuzzy distinctions. While this document is a work in progress, it may serve as an introduction to the topic, especially for someone joining our software tools enthusiasts group.

A human being without tools has little power in the world. A tool enables a person skilled in its use to be highly productive with specific tasks. We have developed our skills and our tools together; they are interdependent. Our tools are generally organized in toolkits oriented towards specific application areas, such as:

Please take a few moments right now to identify some of your toolkits and their application areas, possibly including the areas in the list above. Here are some questions for you to answer in regards to your toolkits:

It is interesting to contrast essential tools, such as a cooking knife, with less general, more specialized productivity tools, such as a food processor. It is also interesting to note how we will have a collection of similar tools with overlapping functionality, e.g. our set of cooking knives and of cooking pots of different sizes and shapes. We could do everything with a much reduced set of tools, but it is more convenient to have a larger kit as long as it is well organized: productivity can be lost if we're having to fuss too much with finding and managing our tools. All of these characteristics of common physical tools are also characteristics of software tools.

Returning to the toolkits you identified above, please answer these questions:

Software Tools

Software Tools are collections of deployable modules of code which make us productive in solving problems whose nature is information, including

Notice that more and more of the activity of modern life is empowered or impeded by the availability of software tools and by their quality. Notice especially the last item on the list. When we use our kit of software tools for writing software to improve that very kit we are stepping into the area of recursive self-improvement, an extremely powerful exponential process.

Toolkits are typically organized by application areas. Software tools are alternatively or additionally organized by the data formats which they understand. Tools can often be creatively employed across application areas as long as they can process the data. There can be a lot of power in keeping data formats very open and general so that more tools are potentially applicable. There is a creative tension between the efficiency of custom data formats and the flexibility of more general data formats.

Note that simple data formats basically leave all the work of understanding the meaning of the data to a human observer. For example, a table of numbers has no objective meaning. At a higher level, an XML Document or a Relational Table when accompanied by their Schemas become partially self-descriptive. Programs can automatically treat the data appropriately, which raises it to the level of information. At the highest level, Knowledge Representation formats allow the meaning of information to be put intelligently to use by general-purpose programs. It is very useful to have software toolkits spanning all of these levels.
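
As a rough illustration of the step from bare data to schema-accompanied data, the sketch below uses Python's built-in sqlite3 module. The table name `readings`, its columns and the helper `describe_columns` are invented for this example; the point is only that a program can read the schema and label the numbers without a human telling it what the columns mean.

```python
import sqlite3

# A bare table of numbers: nothing here says what the columns mean.
bare_rows = [(1, 21.5), (2, 19.0), (3, 23.25)]

def describe_columns(conn, table):
    """Return (name, declared_type) pairs read from the table's schema."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    return [(row[1], row[2]) for row in conn.execute(f"PRAGMA table_info({table})")]

conn = sqlite3.connect(":memory:")
# The schema is what makes the stored numbers partially self-descriptive.
conn.execute("CREATE TABLE readings (sensor_id INTEGER, celsius REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)", bare_rows)

columns = describe_columns(conn, "readings")
names = [name for name, _ in columns]
labeled = [dict(zip(names, row))
           for row in conn.execute("SELECT * FROM readings ORDER BY sensor_id")]
```

After this runs, `labeled` holds one dictionary per row keyed by the column names taken from the schema, which is a small step from data towards information.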

Unix/GNU Text-Oriented Tools

The Unix Operating System promoted the concept of general-purpose tools which could process nearly any kind of data as long as that data was expressed as lines of human-readable text. The GNU Project improved and extended the original Unix text toolkit to create the tools which are now standard on the GNU/Linux platform and very popular among sophisticated users and professionals on all modern platforms.

Much of the power of using lines of text to represent data comes from the ease with which humans can directly understand the data. The developers at Bell Labs who developed the first version of these tools also incorporated a sophisticated pattern matching system called Regular Expressions which made it possible to deal with complex data formats within, and sometimes across, lines of text.
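
To give a feel for what Regular Expressions do with lines of text, here is a minimal sketch using Python's standard `re` module. The log format ("DATE LEVEL message") and the names `LINE_PATTERN` and `parse_line` are assumptions made up for this illustration, not a format used by any particular tool.

```python
import re

# Hypothetical log format, invented for this sketch: "DATE LEVEL message".
LINE_PATTERN = re.compile(r"^(\d{4}-\d{2}-\d{2}) (INFO|WARN|ERROR) (.*)$")

def parse_line(line):
    """Return (date, level, message) for a matching line, else None."""
    match = LINE_PATTERN.match(line)
    return match.groups() if match else None

sample = """2024-01-05 INFO service started
not a log line
2024-01-05 ERROR disk full"""

# Lines that do not fit the pattern are simply skipped.
records = [r for r in map(parse_line, sample.splitlines()) if r is not None]
```

The same data stays readable to a human in its raw form, while the pattern lets a program pull out the fields.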

Lines of text describable by Regular Expressions start to be a poor format when the data

Some important tools which empower users working with data organized as lines of text include

Emacs
Emacs has hundreds of modes for efficiently editing diverse and complex text-based formats. Using it interactively requires learning the keymaps which bind user gestures (keystroke and mouse actions) to editing functions. You can extend Emacs or use it as an automated tool by learning Emacs Lisp, a powerful programming language with special features for editing text files and controlling interactive processes.
sed, ed, ex, vi, vim
These are some of the simpler text-editing tools which are sometimes used in preference to the more powerful but also more complex Emacs.
awk, tcl, perl, ruby
These are increasingly complex programming languages which are particularly handy for manipulating data expressed as text.
cat, grep, head, tail, cut, sort, comm, diff, tr, join, et al.
These are some of the very simple little text-processing programs which can be joined together in pipelines to accomplish remarkably complex tasks from an interactive shell or a shell-script.
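
The pipeline idea in the last entry can be imitated in Python with generators, where each small function consumes lines and yields lines. This is only a sketch of the style: the functions `grep`, `cut` and `count_sorted` below are simplified stand-ins written for this example, not the real utilities, and the sample log lines are invented.

```python
from collections import Counter

def grep(lines, needle):
    """Pass through only the lines containing needle (a plain substring, not a regex)."""
    return (line for line in lines if needle in line)

def cut(lines, field, delimiter=" "):
    """Keep one delimiter-separated field from each line (0-based, unlike cut -f)."""
    return (line.split(delimiter)[field] for line in lines)

def count_sorted(items):
    """Roughly `sort | uniq -c | sort -rn`: (item, count) pairs, most frequent first."""
    return sorted(Counter(items).items(), key=lambda pair: (-pair[1], pair[0]))

log = [
    "GET /index.html 200",
    "GET /missing 404",
    "POST /form 200",
    "GET /missing 404",
    "GET /old 404",
]

# Comparable in spirit to: grep 404 log | cut -d' ' -f2 | sort | uniq -c | sort -rn
report = count_sorted(cut(grep(log, "404"), 1))
```

Each stage knows nothing about the others beyond "lines in, lines out", which is what lets such small tools be recombined freely.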

XML Text Oriented Tools

Lines of Text and Regular Expressions cannot easily represent data which is hierarchical in nature, i.e. data involving nested patterns. These deficiencies lead many developers towards an emerging toolkit using XML-based hierarchical formats. XML formats (or XML languages) such as XHTML for web pages and ODF for office documents are still human readable text but the format of the data is given by explicit tags rather than by lines and delimiters. In fact in XML-based formats lines (and indentation) are no longer important except to make the data flow nicely when the data is being looked at by humans.

Some important tools which empower users working with XML-based text include

Emacs
Emacs has several special modes for editing XML-based text
XSLT processors such as xsltproc
allow translation of XML documents into other formats
XPath
allows selection of data from within large, complex XML documents
XML databases
allow for efficient storage and access of data in XML formats
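
As a small sketch of selecting data from a nested XML document, the example below uses `xml.etree.ElementTree` from Python's standard library, which supports only a limited subset of XPath. The document, its element names and its attributes are invented for illustration.

```python
import xml.etree.ElementTree as ET

# A small invented document; the uneven line breaks and indentation
# are deliberate, since they carry no meaning in XML.
document = """
<toolkits>
  <toolkit area="kitchen">
    <tool kind="essential">knife</tool>
    <tool kind="specialized">food processor</tool>
  </toolkit>
  <toolkit area="text"><tool kind="essential">grep</tool>
  <tool kind="essential">sort</tool></toolkit>
</toolkits>
"""

root = ET.fromstring(document)

# Select every <tool> inside the <toolkit> whose area attribute is "text".
text_tools = [t.text for t in root.findall("./toolkit[@area='text']/tool")]

# Select essential tools at any depth, in document order.
essential = [t.text for t in root.findall(".//tool[@kind='essential']")]
```

The structure comes from the explicit tags, so the selection works the same however the document is laid out on the page.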

Relational Database Tools

Representing and processing data as lines of text or as XML documents becomes a problem as the size of the dataset grows and is also a problem when a user requires complex correlations across the data. These scaling issues lead to a desire for more compact binary formats accompanied by efficient index structures. Binary formats can easily be more than ten times more compact than text and more than 50 times faster to process (because the data does not have to be parsed). Suitable choices of indexes can often reduce the time to perform a complex operation from enormously long (hours, days, even years) to a few seconds or less.
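
One way to see what an index changes is to ask the database how it plans to run a query before and after the index exists. The sketch below does this with Python's built-in sqlite3 module; the `events` table, its columns and the index name `idx_events_username` are assumptions invented for the example, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, username TEXT, kind TEXT)")
conn.executemany(
    "INSERT INTO events (username, kind) VALUES (?, ?)",
    [(f"user{i % 1000}", "login") for i in range(10_000)],
)

def query_plan(conn, sql, params=()):
    """Return SQLite's own description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the detail text

QUERY = "SELECT * FROM events WHERE username = ?"

# Without an index the only option is to scan every row.
plan_without_index = query_plan(conn, QUERY, ("user42",))

conn.execute("CREATE INDEX idx_events_username ON events (username)")

# With the index the planner can jump straight to the matching rows.
plan_with_index = query_plan(conn, QUERY, ("user42",))
```

Printing the two plans should show a full table scan in the first case and a search using the index in the second.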

Relational databases are the most popular kind of database and are very general. Many people believe that relational databases cannot efficiently express hierarchical or network-organized data, but this is not true. While the relational model underlying relational databases is very general and powerful, it is often necessary to write and maintain complex schemas to get those advantages. Many relational databases have extensions which especially cater to XML-formatted text data. Additional extensions and metaprogramming techniques can further extend the advantages of the relational model.
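
As one hedged illustration of hierarchical data in a relational table, the sketch below stores a tree as rows that each name their parent and walks it with a recursive query (`WITH RECURSIVE`, available in SQLite and in standard SQL). The `parts` table and the `descendants` helper are invented for this example, and this is only one of several possible encodings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Adjacency-list encoding: each row names its parent (NULL marks the root).
conn.execute("CREATE TABLE parts (name TEXT PRIMARY KEY, parent TEXT REFERENCES parts(name))")
conn.executemany("INSERT INTO parts VALUES (?, ?)", [
    ("bicycle", None),
    ("wheel", "bicycle"),
    ("frame", "bicycle"),
    ("spoke", "wheel"),
    ("rim", "wheel"),
])

def descendants(conn, root):
    """Return (name, depth) for everything nested under root, at any depth."""
    sql = """
        WITH RECURSIVE subtree(name, depth) AS (
            SELECT name, 1 FROM parts WHERE parent = ?
            UNION ALL
            SELECT parts.name, subtree.depth + 1
            FROM parts JOIN subtree ON parts.parent = subtree.name
        )
        SELECT name, depth FROM subtree ORDER BY depth, name
    """
    return conn.execute(sql, (root,)).fetchall()

tree = descendants(conn, "bicycle")
```

An index on the `parent` column would be the natural next step for larger trees.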

Procedure & Object-Oriented Tools

Entities (Data Objects) in Procedural Languages such as

and Functional Languages such as live in the Virtual Memory Space of the Process running a Computer Program. Normally when the Process running a Computer Program stops, the Virtual Memory Space holding the data is discarded and any objects which have not been saved in some manner outside of the Process are lost. It can also be difficult for objects running in one Process to communicate and coordinate with objects running in other Processes.

Here are some strategies for Objects to save and restore themselves between executions of their Programs by a Process:

Flat Files
Objects can serialize themselves (convert their data into a sequence of bytes) to store their contents in an ordinary data file with no special structure. The objects can later be restored by fetching and deserializing their stored bytes.
Network Object Repository
Objects can serialize themselves and transmit their contents to another process which acts as a repository from which they can be later retrieved and restored by deserialization.
Relational Database
Objects can store their contents as one or more rows of relational tables, converting the data in their fields into the values of fields (columns) of those relational rows. The objects can later be retrieved by retrieving their data using a relational query and reconstructing the object. Saving and restoring objects using relational queries is extremely expensive; however, the many benefits of relational databases can make this system worthwhile.
Saved Virtual Memory
A program can write its entire occupied virtual memory space out to a flat file using a memory map operation. This is usually integrated into a compacting garbage collection operation so that there are no unused gaps in the saved virtual memory. That entire virtual memory space can later be recovered by another run of the same program with another memory map operation. Memory maps are among the most efficient Input/Output mechanisms on modern computers. This method has been used by some Lisp systems and is the standard way of saving and restoring Smalltalk sessions.
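
Two of the strategies above, Flat Files and Relational Database, can be sketched side by side in a few lines of Python. The `Note` class, the file name `note.json` and the `notes` table are all invented for this illustration, and JSON is just one possible serialization format.

```python
import json
import os
import sqlite3
import tempfile
from dataclasses import asdict, dataclass

@dataclass
class Note:
    title: str
    body: str

def save_flat(note, path):
    """Flat file: serialize the object's fields into an ordinary file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(asdict(note), handle)

def load_flat(path):
    """Restore the object by deserializing the stored text."""
    with open(path, encoding="utf-8") as handle:
        return Note(**json.load(handle))

def save_row(note, conn):
    """Relational: one object becomes one row, one field per column."""
    conn.execute("INSERT INTO notes (title, body) VALUES (?, ?)", (note.title, note.body))

def load_row(conn, title):
    """Retrieve the row with a query and reconstruct the object from it."""
    row = conn.execute("SELECT title, body FROM notes WHERE title = ?", (title,)).fetchone()
    return Note(*row) if row else None

original = Note("tools", "keep the kit well organized")

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "note.json")
    save_flat(original, path)
    from_file = load_flat(path)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (title TEXT PRIMARY KEY, body TEXT)")
save_row(original, conn)
from_table = load_row(conn, "tools")
```

Both round trips give back an object equal to the original; the relational version costs more code but lets other tools query the stored fields.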

Within the Virtual Memory Space of an executing program, software tools are primarily organized as procedures called functions or methods. Procedures are the most efficient and flexible kind of software tool. Procedures have a very fine granularity, which means that procedural toolkits (often grouped into Interfaces, Service Packages and APIs) can become exceedingly complex. Refactoring tools can be very helpful for managing the complexity of procedural toolkits.

To coordinate their activity across multiple processes, which may span multiple computers across computer networks, objects employ many strategies, such as

Remote Procedure (i.e. Function or Method) Calls
Objects serialize a procedure call's data and transmit it across a network socket to be executed by a method of an object running in another Process. The result of the procedure call is then serialized and sent back. This can expand the cost of a method call by a factor of 1,000,000 or more if the program is not very carefully organized.
Custom Asynchronous Wire Formats
Objects send processing requests over network streams without waiting for an immediate reply, and asynchronously receive replies from previous requests. The protocol is carefully designed to minimize delays for the intended application.
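
The serialize, transmit, execute, reply cycle of a Remote Procedure Call can be sketched without any real networking. In the example below the "network" is just a function call passing bytes, and the `Calculator` class, the JSON message layout and the helper names are assumptions invented for this illustration rather than any real RPC protocol.

```python
import json

class Calculator:
    """The object living in the 'remote' process (hypothetical example class)."""
    def add(self, a, b):
        return a + b

def encode_call(method, *args):
    """Client side: serialize the method name and arguments into bytes."""
    return json.dumps({"method": method, "args": list(args)}).encode("utf-8")

def handle_request(target, request_bytes):
    """Server side: deserialize, invoke the method, serialize the result."""
    request = json.loads(request_bytes.decode("utf-8"))
    # Looking methods up by name is fine for a sketch but unsafe with untrusted input.
    result = getattr(target, request["method"])(*request["args"])
    return json.dumps({"result": result}).encode("utf-8")

def decode_reply(reply_bytes):
    """Client side: deserialize the reply and extract the result."""
    return json.loads(reply_bytes.decode("utf-8"))["result"]

# In a real system these bytes would cross a network socket; here the
# "network" is a plain function call so the sketch stays self-contained.
request = encode_call("add", 2, 3)
reply = handle_request(Calculator(), request)
answer = decode_reply(reply)
```

Even in this toy form, one addition turns into two rounds of encoding and decoding, which hints at why remote calls cost so much more than local ones.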

Symbolic-Expression Tools

When Steve Russell implemented the first version of the Lisp Programming Language (in 1958!), he discovered that he could use the same representation for the Lisp Programs as those Programs used to represent Lisp Data, a very general representation called Symbolic Expressions or S-Expressions for short. This made Lisp the first Homoiconic Language. Lisp programs consist of procedures called functions. Since Lisp functions are written as S-Expressions and Lisp functions can easily generate and manipulate S-Expressions, it is especially easy to write Lisp functions to write other Lisp functions or even whole Lisp programs. This metaprogramming capability makes Lisp a particularly powerful environment for building software tools. The original Lisp has spawned many dialects over the years and Lisp-family languages remain favorites of programmers dealing with highly challenging application areas.
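
The idea that programs and data share one representation can be imitated, loosely, in Python by modeling S-Expressions as nested lists. This is a toy sketch, not Lisp: the evaluator handles only `+` and `*` on numbers, and the names `evaluate`, `make_power` and `to_text` are invented for the example.

```python
import operator

# S-expressions modeled as nested Python lists: [operator, operand, ...].
OPERATORS = {"+": operator.add, "*": operator.mul}

def evaluate(expr):
    """Evaluate a tiny arithmetic S-expression; numbers evaluate to themselves."""
    if isinstance(expr, (int, float)):
        return expr
    op, first, *rest = expr
    result = evaluate(first)
    for operand in rest:
        result = OPERATORS[op](result, evaluate(operand))
    return result

def make_power(base, exponent):
    """Metaprogramming: build the expression for base**exponent (exponent >= 1) as data."""
    return ["*"] + [base] * exponent

def to_text(expr):
    """Render the nested lists in the familiar parenthesized notation."""
    if isinstance(expr, list):
        return "(" + " ".join(to_text(item) for item in expr) + ")"
    return str(expr)

# A program assembled by another function, then printed and run.
program = ["+", 1, make_power(["+", 1, 1], 3)]
source = to_text(program)
value = evaluate(program)
```

Here `make_power` is a function that writes part of a program, and the result can be inspected as data or evaluated as code, which is the essence of the metaprogramming described above.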

In some ways software tools based on Lisp face the same challenges as the other Object-Oriented systems mentioned above, but in some ways Lisp's structure makes things easier.

General Knowledge Representation Tools

Knowledge Representation is the more general encoding of human-meaningful information into forms which allow meaningful computation by general-purpose computer programs. Many powerful and successful knowledge representation systems have been invented and implemented along with accompanying software tools. Ultimately this area has the greatest potential for applying computation to serve human needs, yet most of today's computer programmers are mostly or entirely unaware of it! Users often say "I just want to tell the computer what I want and have it do it." The key to having a computer program understand what you want is a user interface which translates your wishes into a suitable Knowledge Representation format.

The general Wikipedia article on Knowledge Representation is a great place to start, but I'm going to list here a few of the key Knowledge Representation systems which are available and potentially useful.

Predicate Logic
Ever since Haskell Curry and William Howard figured out (starting in 1934!) that computer programs correspond to proofs of logical propositions, computer scientists have been working on ways to represent problems to be solved as problems in pure logic and then give them to an automated theorem proving program to solve. In this way all programming problems turn into Logic Programming problems. The problem to be solved can be described in a variety of forms of Predicate Logic, the most general-purpose and powerful of all Knowledge-Representation formats. It's fairly easy to represent predicate logic formulas as computer data, either directly or encoded as S-Expressions. It turns out that solving problems represented as general predicate logic formulas tends to be computationally expensive but there are software tools which can assist a programmer in refactoring predicate logic formulas into more tractable forms.
Horn Clauses
Horn Clauses are a simple and restricted form of Predicate Logic. Although less expressive than general predicate logics, Horn Clauses can represent many kinds of information and knowledge quite neatly and algorithms are known which can efficiently compute with Horn Clause databases. Many Logic Programming languages have efficient support for Horn Clause logic built-in, even compiling it to native code for efficiency rivaling that of conventional compiled programming languages like C. The Prolog language family uses Horn Clauses for both programs and data, making it another Homoiconic language. Numerous excellent software tools exist for working with Horn Clause logic programs.
Semantic Networks
Semantic Networks are the simplest method for representing knowledge as a network or graph structure. They work very well for simple constrained semantic domains but run into problems when applied to larger and more open domains. They have recently come back into vogue with the idea of the Semantic Web and many sophisticated tools are being built to make it easier to apply them in this context. My own most successful Knowledge Representation System was based on Semantic Networks, so I know them very well, including their limitations.
Frames and Misc. Ad-Hoc Systems
Frames and related ad-hoc systems can condense Semantic Networks into larger chunks (the Frames) which improve efficiency for typical use-cases. Many Frame-oriented systems have been built. One of the most complete examples of an advanced Knowledge Representation Toolkit is OpenCyc.
Schemata
Schemata are a model proposed by Cognitive Psychologists and Cognitive Scientists as the overall Knowledge-Representation system of the human mind and of any practical intelligent system. Many advanced computer knowledge-representation systems have been inspired by Schemata Theory. This was my proposed area of study in graduate school. I do not know of any good Schemata-oriented software tools currently available. Schemata continue to be a very promising area for the future and a good place to end this survey!
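
The Horn Clauses entry above can be made a little more concrete with a small sketch of computing over a Horn Clause database. The code below uses naive forward chaining (repeatedly applying rules until nothing new is derived), which is a different strategy from the backward chaining that Prolog uses. The tuple encoding of facts, the convention that variables start with `?`, and the family facts are all assumptions invented for this illustration.

```python
def is_variable(term):
    """Convention for this sketch: variables are strings starting with '?'."""
    return isinstance(term, str) and term.startswith("?")

def match(pattern, fact, bindings):
    """Extend bindings so pattern equals fact, or return None if impossible."""
    if len(pattern) != len(fact):
        return None
    bindings = dict(bindings)
    for p, f in zip(pattern, fact):
        if is_variable(p):
            if bindings.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return bindings

def satisfy(body, facts, bindings):
    """Yield every set of bindings that makes all body atoms true."""
    if not body:
        yield bindings
        return
    for fact in facts:
        extended = match(body[0], fact, bindings)
        if extended is not None:
            yield from satisfy(body[1:], facts, extended)

def forward_chain(facts, rules):
    """Apply the rules repeatedly until no new facts can be derived."""
    known = set(facts)
    while True:
        derived = {
            tuple(b.get(t, t) for t in head)
            for head, body in rules
            for b in satisfy(body, known, {})
        }
        if derived <= known:
            return known
        known |= derived

facts = {("parent", "ann", "bob"), ("parent", "bob", "cat")}
rules = [
    # ancestor(X, Y) :- parent(X, Y).
    (("ancestor", "?x", "?y"), [("parent", "?x", "?y")]),
    # ancestor(X, Z) :- parent(X, Y), ancestor(Y, Z).
    (("ancestor", "?x", "?z"), [("parent", "?x", "?y"), ("ancestor", "?y", "?z")]),
]
closure = forward_chain(facts, rules)
```

The two `parent` facts plus two rules are enough for the program to conclude that ann is an ancestor of cat, a fact nobody stated directly.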

How to contact me

Use the handy Pu'uhonua Contact Page

