The design and implementation of a language for extending applications

reprint from Proceedings of XXI Brazilian Seminar on Software and Hardware (1994) 273–283. [ps]

Abstract.

We describe the design and the implementation of Lua, a simple, yet powerful, language for extending applications. Although Lua is a procedural language, it has data description facilities, and has been extensively used in production for several tasks including user configuration, general-purpose data-entry, description of user interfaces, description of application objects, and storage of structured graphical metafiles.

Introduction

There is increasing demand for customizable applications. As applications became more complex, customization with simple parameters became impossible: users now want to make configuration decisions at execution time; users also want to write macros and scripts to increase productivity (Ryan 1990). Therefore, nowadays, larger applications almost invariably carry their own configuration or scripting languages for end-user programming. These languages are usually simple, but each has its own particular syntax. As a consequence, users have to learn (and developers have to design, implement and debug) a new language for each application.

Our first experience with a proprietary scripting language arose in a data-entry application, for which a very simple declarative language was designed (Figueiredo–Souza–Gattass–Coelho 1992). (Data-entry is an area where user defined actions are especially needed because pre-coded validation tests can hardly be adequate for all applications.) When users began to demand increasingly more power in this language, we decided that a more general approach was needed and started the design of a general-purpose embedded language. At the same time, another declarative language was being added to a different application for data description. Therefore, we decided to merge the two languages into one, and designed Lua to be a procedural language with data description facilities. Lua has since outgrown its original roots and is being used in several other industrial projects.

This paper describes the design decisions and the details of our implementation of Lua.

Extension languages

The use of languages for extending applications is now recognized as an important design technique: it allows a cleaner design for the application and it provides configuration by the user. Because most extension languages are simple, specialized to a task, they have been called “little languages” (Bentley 1986; Valdés 1991), in contrast to the “big”, mainstream languages, in which applications are written. This distinction is not so sharp nowadays, since a major part of several applications is actually written using an extension language. Extension languages come in several flavors:

What makes embedded languages different from standalone languages is that embedded languages only work embedded in a host client, called the embedding program. Moreover, the host program can usually provide domain-specific extensions to the embedded language, thus creating a version of the embedded language customized for its own purposes, possibly by providing higher level abstractions. For this, an embedded language has both a syntax for its own programs and an application program interface (API) for interfacing with hosts. Thus, unlike simpler extension languages, which are used to supply parameter values and sequences of actions to hosts, there is a two-way communication between embedded languages and host programs. Note that the application programmer interfaces with an embedded language in the mainstream language used for host programs, whereas the user interfaces with the application using solely the embedded language.

LISP has always been a popular choice for extension languages, for its simple, easily parsed syntax and built-in extensibility (Beckman 1991; Nahaboo). For instance, a major part of Emacs is actually written in its own variant of LISP; several other text editors follow the same path. However, LISP cannot be called user-friendly when it comes to customization. Neither can C and the shell languages; the latter even have a more complicated, unfamiliar syntax.

One of the fundamental decisions made in the design of Lua was that it should have a clean but familiar syntax: we quickly settled for a simplified Pascal-like syntax. We avoided a syntax based on LISP or C because it could be discouraging to outsiders or non-programmers. Thus, Lua is primarily a procedural language. Nevertheless, as already mentioned, Lua acquired data description facilities to increase its expression power.

Lua concepts

Lua is a general purpose embedded programming language designed to support procedural programming with data description facilities. Being an embedded language, Lua has no notion of a “main” program; it only works embedded in a host client (Lua is provided as a library of C functions to be linked to host applications). The host can invoke functions to execute a piece of code in Lua, can write and read Lua variables, and can register C functions to be called by Lua code. With registered C functions, Lua can be augmented to cope with rather different domains, thus creating customized programming languages sharing a syntactical framework (Beckman 1991).

This section contains a brief description of the main concepts in Lua. Some examples of actual code are included, to give a flavor of the language. A precise definition of the language can be found in its reference manual (Ierusalimschy–Figueiredo–Celes 1994).

Syntax

As mentioned before, we explicitly designed Lua to have a simple, familiar syntax. As a consequence, Lua supports an almost conventional set of statements, with implicit, but explicitly terminated, block structure. The conventional statements include simple assignment; control structures such as while-do-end, repeat-until, if-then-elseif-else-end; and function calls. Non-conventional statements include multiple assignment; local variable declarations, which can be placed anywhere inside a block; and table constructors, which may contain user defined validation functions (see below). Moreover, functions in Lua can take a variable number of parameters and can return many values. This avoids the need for passing parameters by reference when more than one result need to be returned.

Environment and modules

All statements in Lua are executed in a global environment. This environment, which keeps all global variables and functions, is initialized at the beginning of the embedding program and persists until its end. The global environment can be manipulated by Lua code or by the embedding program, which can read and write global variables using functions in the library that implements Lua.

The unit of execution of Lua is called a module. A module may contain statements and function definitions, and may be in a file or in a string inside the host program. When a module is executed, first all its functions and statements are compiled, and the functions added to the global environment; then the statements are executed in sequential order. All modifications a module effects on the global environment persist after its end. Those include modifications to global variables and definitions of new functions (a function definition is actually an assignment to a global variable; see below).

Data types and variables

Lua is a dynamically typed language: variables do not have types; only values do. All values carry their own type. Therefore, there are no type definitions in the language. The absence of variable type declarations, apparently a minor point, is actually an important factor in simplifying the language; it is frequently presented as a major feature in many variants of typed languages modified to be used as extension languages. Moreover, Lua has garbage collection: it keeps track of which values are being used and discards the ones that are not. This avoids the need for explicit managing memory allocation, a major source of programming errors. There are seven basic data types in Lua:

Lua provides some automatic type conversions. A string taking part in an arithmetic operation is converted to a number, if possible. Conversely, whenever a number is used when a string is expected, that number is converted to a string. This coercion is useful because it simplifies programs and avoids the need for explicit conversion functions.

Global variables do not need declaration; only local variables do. Any variable is assumed to be global unless explicitly declared local. Local variable declarations can be placed anywhere inside a block. Therefore, because only local variables are declared, and these declarations can be made close to the use of the variable, it is usually simple to decide whether a given variable is local or global.

Before the first assignment, the value of a variable is nil. Therefore, there are no uninitialized variables in Lua, another major source of programming errors. Nevertheless, the only valid operations on nil are assignment and equality test (the main property of nil is to be different from any other value). Therefore, using an “uninitialized” variable in a context where an “actual” value is needed (e.g., an arithmetic expression) results in an execution error, alerting the programmer that the variable was not properly initialized. Thus, the purpose of automatically initializing variables with nil is not to encourage the programmer to avoid initializing variables, but rather to enable Lua to signal the use of actually uninitialized variables.

Functions are considered first-class values in Lua: they can be stored in variables, passed as arguments to other functions and returned as results. When a function in Lua is defined, its body is compiled and stored in a global variable with the given name. Lua can call (and manipulate) functions written in both Lua and C; the latter have type Cfunction.

The type userdata is provided to allow arbitrary (void*) C pointers to be stored in Lua variables; its only valid operations in Lua are assignment and equality test.

The type table implements associative arrays, that is, arrays that can be indexed with both numbers and strings. Therefore, this type may be used not only to represent ordinary arrays, but also symbol tables, sets, records, etc. To represent a record, Lua uses the field name as an index. The language supports this representation by providing a.name as syntactic sugar for a["name"].

Associative arrays are a powerful language construct; many algorithms are simplified to the point of triviality because the required data structures and algorithms for searching them are provided by the language (Aho–Kerninghan–Weinberger 1988; Bentley 1988). For example, the core of a program that counts the occurrences of each word in a text can be written

Tables can be created in many ways. The simplest way corresponds to ordinary arrays:

There are two alternative syntaxes for creating tables without explicitly filling each entry: one for lists (@[]) and one for records (@{}). For instance, it is much easier to create a list by providing its elements, as in

The application program interface

The library that implements Lua has an API, i.e., a set of C functions for interfacing Lua with host programs (there are approximately 30 such functions). These functions characterize Lua as an embedded language, and handle the following tasks: executing Lua code contained in a file or in a string; converting values between C and Lua; reading and writing Lua objects contained in global variables; calling Lua functions; registering C functions to be called by Lua, including error handlers. A simple Lua interpreter can be written as follows:

Predefined functions and libraries

The set of predefined functions in Lua is small but powerful. Most of them provide features that allow some degree of reflexivity in the language. Such features cannot be simulated with the rest of the language nor with the standard API. The predefined functions handle the following tasks: executing a Lua module contained in a file or string; enumerating all fields of a table; enumerating all global variables; type querying and conversion.

The libraries, on the other hand, provide useful routines which are implemented directly through the standard API. Therefore, they are not necessary to the language, and are provided as separate C modules, which can be linked to applications as needed. Currently, there are libraries for string manipulation, mathematical functions, and input and output.

Persistence

The enumeration functions can be used to provide persistence of the global environment within Lua, i.e., it is possible to write Lua code that writes Lua code that, when executed, restores the values of all global variables. We now show some ways to store and retrieve values in Lua, using a text file written in the language itself as the storage media. To restore values saved in this way, it is enough to execute the output file.

Implementation

Extension languages are always interpreted, in one way or another, by the application. Simple extension languages can be interpreted directly from source code. On the other hand, embedded languages are usually powerful programming languages, with complex syntax and semantics. A more efficient implementation technique for embedded languages is to design a virtual machine suited to the needs of the language, compile extension programs into bytecodes for this machine, and then simulate the virtual machine by interpreting bytecodes (Betz 1988, 1991; Franks 1991). We have chosen this hybrid architecture for implementing Lua; it has the following advantages over direct interpretation of source code:

This architecture was pioneered in Smalltalk (Goldberg–Robson 1983; Budd 1987) (from which the term bytecodes was borrowed) and also used in the successful UCSD Pascal system based on P-code (Clark–Koehler 1982). In these systems, bytecodes for virtual machines were used both for reducing complexity and for increasing portability. This path was also used in porting the BCPL compiler (Richards–Whitby-Strevens 1980).

Code for compilation of extension programs can be built with standard tools, such as lex and yacc (Levine–Mason–Brown 1992). The existence of good tools for compiler construction, which became widely available in the late seventies, was the main reason for the sprouting of several little languages, specially in Unix environments. Our implementation of Lua uses yacc for syntactical analysis. Initially, we wrote the lexical analyzer using lex. After performance profiles with production programs, we detected that this module was responsible for almost half of the time required for loading and executing extension programs. We then rewrote this module directly in C; the new lexical analyzer is more than twice as fast as the old one.

Lua's virtual machine

The virtual machine used in our implementation of Lua is a stack machine. This means that it does not have random access memory: all temporary values and local variables are kept in a stack. Moreover, it does not have general purpose registers, only special control registers, which control the stack and the execution of programs. These registers are base of stack, top of stack and program counter.

Programs for the virtual machine are sequences of instructions, called bytecodes. The execution of programs is achieved by interpreting bytecodes, each corresponding to an instruction that operates on the top portion of the stack. For example, the statement

Many of the instructions exist for optimization only. For instance, there is a PUSH instruction, which takes a number as a parameter and pushes it onto the stack, but there are also single-byte optimized versions for pushing common values such as zero and one. Thus, we have PUSHNIL, PUSH0, PUSH2, PUSH3. Such optimizations reduce both the space required for compiled bytecodes and the time required for interpreting instructions.

Recall that Lua supports multiple assignment and multiple return values from functions. Therefore, sometimes, a list of values must be adjusted, at run time, to a given length: if there are more values than are needed, then the excess values are thrown away; if more values are needed than are present, then the list is extended with as many nil's as needed. Adjustment is done on the stack with the ADJUST instruction.

Although multiple assignment and returns are a powerful feature of Lua, they are an important source of complexity in both the compiler and the interpreter. Because there are no type declarations for functions, the compiler does not know how many values a function will return. Thus, adjustment must be done at run time. Similarly, the compiler does not know how many parameters a function takes. Because this number may vary at run time, the list of parameters is bracketed between a PUSHMARK and a CALLFUNC instruction.

One way to extend Lua with functions provided by the host would be to assign a bytecode to each such function (Betz 1988). Although this strategy would simplify the interpreter, it has the disadvantage that fewer than 200 external functions could be added, because Lua has 8-bit bytecodes and already uses about 60 of them for its primitive instructions. We have chosen to have the host explicitly register external functions, and handling these functions like native Lua functions. Thus, there is a single CALLFUNC instruction; the interpreter decides what to do based on the type of the function being called.

A rather different strategy was proposed by Franks (1991): all external functions in the host can be called by the embedded language; no explicit registration is needed. This is done by reading and interpreting the map generated by the linker. This solution is very convenient for the application programmer, but is not portable, being dependent on the format of the map file and on the relocation strategy used by the operating system (Franks used a specific compiler for DOS).

Internal data structures

As mentioned before, variables in Lua are not typed; only values are. Thus, values are implemented in a struct with two fields: a type and a union containing the actual value. These structs occur in the stack and in the symbol table, which holds all global symbols.

Numbers are stored directly into the union. Strings are kept in a single array; values of type string contain pointers to this array. Values of type function contain pointers to a bytecode array. Values of type Cfunction contain the actual pointer to the C function, as provided by the host program; the same happens for values of type userdata.

Tables are implemented as hash tables, with collisions handled by separate chaining (this explains why indices in a table are ordered arbitrarily). If a dimension is given when a table is created, then this dimension is used as the size of the hash table. Thus, by providing a dimension approximately equal to the expected number of indices in the table, few collisions will occur, resulting in very efficient index location. Moreover, if the table is used as an array, with numeric indices only, then choosing the right dimension at creation time guarantees that no collisions will occur.

All internal data structures in Lua are dynamically allocated arrays. When there are no more free slots in one of these arrays, garbage collection is automatically performed using a standard mark-and-sweep algorithm. If no space is recovered (because all values are being referenced), then the array is reallocated with double its current size.

Garbage collection is very convenient for the programmer because it avoids explicit memory management. When Lua is used as a standalone language (which it frequently is), then garbage collection is an asset. However, when Lua is used embedded in a host program (which is its main purpose), then garbage collection creates a new worry for the application programmer who needs to interface with Lua: care should be taken not to store Lua tables and strings into C variables, because these values may be reclaimed during garbage collection, if they do not have any further references within Lua's environment. More precisely, the programmer must explicitly copy these values into C variables, before returning control to Lua. Although this is a different paradigm, it is not worse than the familiar malloc-free protocol for memory management using the standard C library.

Conclusion

Lua has been extensively used in production since mid 93, for the following tasks:

The ability to load and execute Lua programs at run-time has proved to be a major component in making configuration an easy task for both users and developers. Moreover, the existence of a single general purpose embedded language discourages the multiplication of incompatible languages and encourages a better design, one that clearly separates the main technology contained in an application from its configuration issues.

Acknowledgments

We would like to thank the staff at ICAD and TeCGraf for using and testing Lua. The industrial applications mentioned in the text are being developed in partnership with the research centers at PETROBRAS (CENPES) and at ELETROBRAS (CEPEL).

References

M. Abrash, D. Illowsky, "Roll your own minilanguages with mini-interpreters", Dr. Dobb's Journal 14 (9) (Sep 1989) 52–72.

A. V. Aho, B. W. Kerninghan, P. J. Weinberger, The AWK programming language, Addison-Wesley, 1988.

B. Beckman, "A Scheme for little languages in interactive graphics", Software, Practice & Experience 21 (1991) 187–207.

J. Bentley, "Programming pearls: little languages", Communications of the ACM 29 (1986) 711–721.

D. Betz, "Your own tiny object-oriented language", Dr. Dobb's Journal 16 (9) (Sep 1991) 26–33.

R. Clark, S. Koehler, The UCSD Pascal handbook: a reference and guidebook for programmers, Prentice-Hall, 1982.

L. H. de Figueiredo, C. S. de Souza, M. Gattass, L. C. G. Coelho, "Geração de interfaces para captura de dados sobre desenhos", Anais do SIBGRAPI V (1992) 169–175 [in Portuguese].

N. Franks, "Adding an extension language to your software", Dr. Dobb's Journal 16 (9) (Sep 1991) 34–43.

A. Goldberg, D. Robson, Smalltalk-80: the language and its implementation, Addison-Wesley, 1983.

R. Ierusalimschy, L. H. de Figueiredo, W. Celes Filho, "Reference manual of the programming language Lua", Monografias em Ciência da Computação 4/94, Departamento de Informática, PUC-Rio, 1994.

C. Nahaboo, A catalog of embedded languages, available from colas@indri.inria.fr.

M. Richards, C. Whitby-Strevens, BCPL: the language and its compiler, Cambridge University Press, 1980.

R. Valdés, "Little languages, big questions", Dr. Dobb's Journal 16 (9) (Sep 1991) 16–25.

SEMISH'94 paper

The design and implementation of a language for extending applications

Abstract.

Introduction

Extension languages

Lua concepts

Syntax

Environment and modules

Data types and variables

The application program interface

Predefined functions and libraries

Persistence

Implementation

Lua's virtual machine

Internal data structures

Conclusion

Acknowledgments

References