This first edition was written for Lua 5.0. While still largely relevant for later versions, there are some differences.
The fourth edition targets Lua 5.3 and is available at Amazon and other bookstores.


29.2 – An XML Parser

Now we will look at a simplified implementation of lxp, a binding between Lua and Expat. Expat is an open source XML 1.0 parser written in C. It implements SAX, the Simple API for XML. SAX is an event-based API. That means that a SAX parser reads an XML document and, as it goes, reports to the application what it finds, through callbacks. For instance, if we instruct Expat to parse a string like

    <tag cap="5">hi</tag>
it will generate three events: a start-element event, when it reads the substring "<tag cap="5">"; a text event (also called a character data event), when it reads "hi"; and an end-element event, when it reads "</tag>". Each of these events calls an appropriate callback handler in the application.
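
The event flow described above can be sketched with a toy scanner in plain C. This is not Expat and not a real XML parser: the names toy_handlers, toy_scan, event_log, and the log_* callbacks are invented for this illustration, and the scanner assumes well-behaved input made only of start tags, end tags, and text (no self-closing tags, no comments, no '>' inside attribute values).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical callback set, loosely modelled on a SAX-style API. */
typedef struct toy_handlers {
  void (*start)(void *ud, const char *name, size_t len);
  void (*text)(void *ud, const char *s, size_t len);
  void (*end)(void *ud, const char *name, size_t len);
} toy_handlers;

/* Scan `doc` and report each piece through the handlers.
   Returns the number of events found, or -1 on an unclosed tag. */
static int toy_scan (const char *doc, const toy_handlers *h, void *ud) {
  int events = 0;
  const char *p = doc;
  while (*p != '\0') {
    if (*p == '<') {
      const char *close = strchr(p, '>');
      if (close == NULL) return -1;  /* malformed: no closing '>' */
      if (p[1] == '/') {  /* end tag: the name runs up to '>' */
        if (h->end) h->end(ud, p + 2, (size_t)(close - (p + 2)));
      }
      else {  /* start tag: the name ends at a space or at '>' */
        size_t n = strcspn(p + 1, " >");
        if (h->start) h->start(ud, p + 1, n);
      }
      p = close + 1;
    }
    else {  /* text runs up to the next '<' or the end of input */
      size_t n = strcspn(p, "<");
      if (h->text) h->text(ud, p, n);
      p += n;
    }
    events++;
  }
  return events;
}

/* Example handlers: append a short description of each event to a
   log buffer that travels as user data. */
typedef struct event_log { char buf[256]; } event_log;

static void log_event (event_log *lg, const char *kind,
                       const char *s, size_t len) {
  size_t used = strlen(lg->buf);
  snprintf(lg->buf + used, sizeof(lg->buf) - used,
           "%s(%.*s) ", kind, (int)len, s);
}

static void log_start (void *ud, const char *s, size_t n) {
  log_event(ud, "start", s, n);
}

static void log_text (void *ud, const char *s, size_t n) {
  log_event(ud, "text", s, n);
}

static void log_end (void *ud, const char *s, size_t n) {
  log_event(ud, "end", s, n);
}
```

Called on the string from the example with an empty event_log as user data, toy_scan reports three events in order: a start event for tag, a text event for hi, and an end event for tag.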

Here we will not cover the entire Expat library. We will concentrate only on those parts that illustrate new techniques for interacting with Lua. It is easy to add bells and whistles later, after we have implemented this core functionality. Although Expat handles more than a dozen different events, we will consider only the three events that we saw in the previous example (start elements, end elements, and text). The part of the Expat API that we need for this example is small. First, we need functions to create and destroy an Expat parser:

    #include <xmlparse.h>
    
    XML_Parser XML_ParserCreate (const char *encoding);
    void XML_ParserFree (XML_Parser p);
The argument encoding is optional; we will use NULL in our binding.

After we have a parser, we must register its callback handlers:

    void XML_SetElementHandler(XML_Parser p,
                               XML_StartElementHandler start,
                               XML_EndElementHandler end);
    
    void XML_SetCharacterDataHandler(XML_Parser p,
                                     XML_CharacterDataHandler hndl);
The first function registers handlers for start and end elements. The second function registers handlers for text (character data, in XML parlance).

All callback handlers receive some user data as their first parameter. The start-element handler also receives the tag name and its attributes:

    typedef void (*XML_StartElementHandler)(void *uData,
                                            const char *name,
                                            const char **atts);
The attributes come as a NULL-terminated array of strings, where each pair of consecutive strings holds an attribute name and its value. The end-element handler has only one extra parameter, the tag name:

    typedef void (*XML_EndElementHandler)(void *uData,
                                          const char *name);
Finally, a text handler receives only the text as an extra parameter. This text string is not null-terminated; instead, it has an explicit length:

    typedef void
    (*XML_CharacterDataHandler)(void *uData,
                                const char *s,
                                int len);
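
Two conventions in these signatures are easy to get wrong: the attribute array alternates names and values until a NULL name, and the character data must be handled with its explicit length. The following is a minimal sketch in plain C of how a handler might deal with both; the helper names find_attr, count_attrs, and copy_text are hypothetical and not part of Expat.

```c
#include <stddef.h>
#include <string.h>

/* Look up an attribute value in a NULL-terminated array laid out as
   name0, value0, name1, value1, ..., NULL. Returns NULL if absent.
   Assumes every name is followed by a value. */
static const char *find_attr (const char **atts, const char *name) {
  for (; atts[0] != NULL; atts += 2) {
    if (strcmp(atts[0], name) == 0)
      return atts[1];
  }
  return NULL;
}

/* Count the name/value pairs in the same kind of array. */
static size_t count_attrs (const char **atts) {
  size_t n = 0;
  while (atts[2 * n] != NULL) n++;
  return n;
}

/* Copy `len` bytes of character data into `out` and terminate it.
   The source is NOT null-terminated, so strlen and strcpy must not
   be used on it. Truncates to fit; returns the bytes copied. */
static size_t copy_text (char *out, size_t outsize,
                         const char *s, int len) {
  size_t n = (len > 0) ? (size_t)len : 0;
  if (outsize == 0) return 0;
  if (n > outsize - 1) n = outsize - 1;
  memcpy(out, s, n);
  out[n] = '\0';
  return n;
}
```

The length matters because the text pointer may point into the middle of a larger buffer: with the input "hi</tag>" and a length of 2, only "hi" belongs to the event.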

To feed text to Expat, we use the following function:

    int XML_Parse (XML_Parser p,
                   const char *s, int len, int isFinal);
Expat receives the document to be parsed in pieces, through successive calls to XML_Parse. The last argument to XML_Parse, isFinal, informs Expat whether that piece is the last one of a document. Notice that each piece of text does not need to be zero terminated; instead, we supply an explicit length. The XML_Parse function returns zero if it detects a parse error. (Expat provides auxiliary functions to retrieve error information, but we will ignore them here, for the sake of simplicity.)
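
To see why the final-piece flag is needed, consider a toy streaming checker with the same calling convention as XML_Parse. It is only a stand-in written for this illustration (toy_stream and toy_feed are hypothetical names, and the checker merely tracks whether the input ends inside a '<...>' tag), but it shows the key point: a tag split across two pieces is not an error until the parser knows that no more input will come.

```c
#include <stddef.h>

/* Hypothetical stand-in for a streaming parser state: it records only
   whether the input seen so far ends inside a '<...>' tag. */
typedef struct toy_stream { int in_tag; } toy_stream;

/* Same calling convention as XML_Parse: a piece with an explicit
   length, plus a flag marking the last piece. Returns 0 on error. */
static int toy_feed (toy_stream *st, const char *s, int len,
                     int isFinal) {
  int i;
  for (i = 0; i < len; i++) {
    if (s[i] == '<') {
      if (st->in_tag) return 0;   /* '<' inside a tag */
      st->in_tag = 1;
    }
    else if (s[i] == '>') {
      if (!st->in_tag) return 0;  /* stray '>' */
      st->in_tag = 0;
    }
  }
  /* A tag left open is fine between pieces, but not at the end. */
  if (isFinal && st->in_tag) return 0;
  return 1;
}
```

Feeding "<ta" and then "g>hi" succeeds, and a final empty piece confirms that the input ended cleanly; feeding "<tag" followed by a final empty piece fails only on that last call.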

The last function we need from Expat allows us to set the user data that will be passed to the handlers:

    void XML_SetUserData (XML_Parser p, void *uData);

Now let us have a look at how we can use this library in Lua. A first approach is a direct approach: Simply export all those functions to Lua. A better approach is to adapt the functionality to Lua. For instance, because Lua is dynamically typed, we do not need different functions to set each kind of callback. Better yet, we can avoid the callback registering functions altogether. Instead, when we create a parser, we give a callback table that contains all callback handlers, each with an appropriate key. For instance, if we only want to print a layout of a document, we could use the following callback table:

    local count = 0
    
    callbacks = {
      StartElement = function (parser, tagname)
        io.write("+ ", string.rep("  ", count), tagname, "\n")
        count = count + 1
      end,
    
      EndElement = function (parser, tagname)
        count = count - 1
        io.write("- ", string.rep("  ", count), tagname, "\n")
      end,
    }
Fed with the input "<to> <yes/> </to>", those handlers would print

    + to
    +   yes
    -   yes
    - to
With this API, we do not need functions to manipulate callbacks. We manipulate them directly in the callback table. Thus, the whole API needs only three functions: one to create parsers, one to parse a piece of text, and one to close a parser. (Actually, we will implement the last two functions as methods of parser objects.) A typical use of the API could be like this:

    p = lxp.new(callbacks)     -- create new parser
    for l in io.lines() do     -- iterate over input lines
      assert(p:parse(l))       -- parse the line
      assert(p:parse("\n"))    -- add a newline
    end
    assert(p:parse())          -- finish document
    p:close()

Now let us turn our attention to the implementation. The first decision is how to represent a parser in Lua. It is quite natural to use a userdatum, but what do we need to put inside it? At least, we must keep the actual Expat parser and the callback table. We cannot store a Lua table inside a userdatum (or inside any C structure); however, we can create a reference to the table and store the reference inside the userdatum. (Remember from Section 27.3.2 that a reference is a Lua-generated integer key in the registry.) Finally, we must be able to store a Lua state in a parser object, because a parser object is all that an Expat callback receives from our program, and the callbacks need to call Lua. Therefore, the definition for a parser object is as follows:

    #include <xmlparse.h>
    
    typedef struct lxp_userdata {
      lua_State *L;
      XML_Parser parser;           /* associated expat parser */
      int tableref;   /* table with callbacks for this parser */
    } lxp_userdata;
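
The tableref field holds an integer rather than the table itself. As a rough mental model only, the mechanism can be pictured as a slot table owned by the interpreter, with C code keeping just the slot number. The sketch below is a toy written for this illustration, not how Lua implements its registry; toy_registry, toy_ref, toy_get, toy_unref, and TOY_REFNIL are all hypothetical names, and the fixed slot count is an arbitrary assumption.

```c
#include <stddef.h>

#define TOY_SLOTS  8
#define TOY_REFNIL (-1)   /* plays the role of LUA_REFNIL */

/* Toy registry: a table owned by the "interpreter", indexed by
   integer keys. C code keeps only the integer, never the value. */
typedef struct toy_registry {
  const void *slot[TOY_SLOTS];
  int used[TOY_SLOTS];
} toy_registry;

/* Store `value` in a free slot and return its key. In this toy,
   TOY_REFNIL is also returned when no slot is free. */
static int toy_ref (toy_registry *r, const void *value) {
  int i;
  if (value == NULL) return TOY_REFNIL;
  for (i = 0; i < TOY_SLOTS; i++) {
    if (!r->used[i]) {
      r->used[i] = 1;
      r->slot[i] = value;
      return i;
    }
  }
  return TOY_REFNIL;
}

/* Fetch the value for a key, or NULL for an invalid key. */
static const void *toy_get (const toy_registry *r, int ref) {
  if (ref < 0 || ref >= TOY_SLOTS || !r->used[ref]) return NULL;
  return r->slot[ref];
}

/* Release a key; releasing TOY_REFNIL is a harmless no-op. */
static void toy_unref (toy_registry *r, int ref) {
  if (ref < 0 || ref >= TOY_SLOTS) return;
  r->used[ref] = 0;
  r->slot[ref] = NULL;
}
```

The useful property is that the structure holding the key stays valid plain C data, while the value it names remains under the control of whoever owns the registry.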

The next step is the function that creates parser objects. Here it is:

    static int lxp_make_parser (lua_State *L) {
      XML_Parser p;
      lxp_userdata *xpu;
    
      /* (1) create a parser object */
      xpu = (lxp_userdata *)lua_newuserdata(L,
                                       sizeof(lxp_userdata));
    
      /* pre-initialize it, in case of errors */
      xpu->tableref = LUA_REFNIL;
      xpu->parser = NULL;
    
      /* set its metatable */
      luaL_getmetatable(L, "Expat");
      lua_setmetatable(L, -2);
    
      /* (2) create the Expat parser */
      p = xpu->parser = XML_ParserCreate(NULL);
      if (!p)
        luaL_error(L, "XML_ParserCreate failed");
    
      /* (3) create and store reference to callback table */
      luaL_checktype(L, 1, LUA_TTABLE);
      lua_pushvalue(L, 1);  /* put table on the stack top */
      xpu->tableref = luaL_ref(L, LUA_REGISTRYINDEX);
    
      /* (4) configure Expat parser */
      XML_SetUserData(p, xpu);
      XML_SetElementHandler(p, f_StartElement, f_EndElement);
      XML_SetCharacterDataHandler(p, f_CharData);
      return 1;
    }
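The lxp_make_parser function has four main steps, marked by the numbered comments in the code. Step 1 creates the userdatum, pre-initializes its fields to consistent values, and sets its metatable; the pre-initialization matters because, if a later step raises an error, the finalizer will still find the userdatum in a state it can handle. Step 2 creates the Expat parser, stores it in the userdatum, and checks for errors. Step 3 checks that the first argument is a table (the callback table), creates a reference to it, and stores that reference in the userdatum. Step 4 configures the Expat parser: it sets the userdatum as the user data passed to the callback functions and registers the three C handlers.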

The next step is the parse method, which parses a piece of XML data. It gets two arguments: The parser object (the self of the method) and an optional piece of XML data. When called without any data, it informs Expat that the document has no more parts:

    static int lxp_parse (lua_State *L) {
      int status;
      size_t len;
      const char *s;
      lxp_userdata *xpu;
    
      /* get and check first argument (should be a parser) */
      xpu = (lxp_userdata *)luaL_checkudata(L, 1, "Expat");
      luaL_argcheck(L, xpu, 1, "expat parser expected");
    
      /* get second argument (a string) */
      s = luaL_optlstring(L, 2, NULL, &len);
    
      /* prepare environment for handlers: */
      /* put callback table at stack index 3 */
      lua_settop(L, 2);
      lua_getref(L, xpu->tableref);
      xpu->L = L;  /* set Lua state */
    
      /* call Expat to parse string */
      status = XML_Parse(xpu->parser, s, (int)len, s == NULL);
    
      /* return error code */
      lua_pushboolean(L, status);
      return 1;
    }
When lxp_parse calls XML_Parse, the latter function will call the handlers for each relevant element that it finds in the given piece of document. Therefore, lxp_parse first prepares an environment for these handlers. There is one more detail in the call to XML_Parse: Remember that the last argument to this function tells Expat whether the given piece of text is the last one. When we call parse without an argument, s will be NULL, so this last argument will be true.

Now let us turn our attention to the callback functions f_StartElement, f_EndElement, and f_CharData. All three functions have a similar structure: Each checks whether the callback table defines a Lua handler for its specific event and, if so, prepares the arguments and then calls that Lua handler.
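
Stripped of the Lua stack operations, that shared structure is a lookup by event name at call time, an early return when nothing is registered, and a call otherwise. The following plain C sketch shows only that shape; toy_entry, toy_lookup, toy_dispatch, and count_call are hypothetical names, and an array of name/function pairs stands in for the Lua callback table.

```c
#include <stddef.h>
#include <string.h>

typedef void (*toy_handler)(void *ud, const char *arg);

/* One entry of the stand-in callback table. */
typedef struct toy_entry {
  const char *key;   /* event name, e.g. "StartElement" */
  toy_handler fn;
} toy_entry;

/* Find the handler registered under `key`; NULL if there is none.
   The table ends with an entry whose key is NULL. */
static toy_handler toy_lookup (const toy_entry *table,
                               const char *key) {
  for (; table->key != NULL; table++) {
    if (strcmp(table->key, key) == 0) return table->fn;
  }
  return NULL;
}

/* Shared shape of the callbacks: look the handler up when the event
   arrives, return quietly if it is absent, otherwise call it.
   Returns 1 if a handler ran, 0 if the event was ignored. */
static int toy_dispatch (const toy_entry *table, const char *key,
                         void *ud, const char *arg) {
  toy_handler fn = toy_lookup(table, key);
  if (fn == NULL) return 0;   /* no handler: ignore the event */
  fn(ud, arg);
  return 1;
}

/* Example handler: counts how many times it was called. */
static void count_call (void *ud, const char *arg) {
  (void)arg;  /* unused in this example */
  ++*(int *)ud;
}
```

Because the lookup happens on every event, a handler that is missing costs nothing more than a failed lookup, and the set of handlers is whatever the table contains at that moment.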

Let us first see the f_CharData handler. Its code is quite simple. It calls its corresponding handler in Lua (when present) with only two arguments: the parser and the character data (a string):

    static void f_CharData (void *ud, const char *s, int len) {
      lxp_userdata *xpu = (lxp_userdata *)ud;
      lua_State *L = xpu->L;
    
      /* get handler */
      lua_pushstring(L, "CharacterData");
      lua_gettable(L, 3);
      if (lua_isnil(L, -1)) {  /* no handler? */
        lua_pop(L, 1);
        return;
      }
    
      lua_pushvalue(L, 1);  /* push the parser (`self') */
      lua_pushlstring(L, s, len);  /* push Char data */
      lua_call(L, 2, 0);  /* call the handler */
    }
Notice that all these C handlers receive a pointer to an lxp_userdata structure as their first argument, due to our call to XML_SetUserData when we create the parser. Also notice how f_CharData uses the environment set by lxp_parse. First, it assumes that the callback table is at stack index 3. Second, it assumes that the parser itself is at stack index 1 (it must be there, because it should be the first argument to lxp_parse).

The f_EndElement handler is also simple and quite similar to f_CharData. It also calls its corresponding Lua handler with two arguments: the parser and the tag name (again a string, but now null-terminated):

    static void f_EndElement (void *ud, const char *name) {
      lxp_userdata *xpu = (lxp_userdata *)ud;
      lua_State *L = xpu->L;
    
      lua_pushstring(L, "EndElement");
      lua_gettable(L, 3);
      if (lua_isnil(L, -1)) {  /* no handler? */
        lua_pop(L, 1);
        return;
      }
    
      lua_pushvalue(L, 1);  /* push the parser (`self') */
      lua_pushstring(L, name);  /* push tag name */
      lua_call(L, 2, 0);  /* call the handler */
    }

The last handler, f_StartElement, calls Lua with three arguments: the parser, the tag name, and a list of attributes. This handler is a little more complex than the others, because it needs to translate the tag's list of attributes into Lua. We will use a quite natural translation. For instance, a start tag like

    <to method="post" priority="high">
generates the following table of attributes:

    { method = "post", priority = "high" }
The implementation of f_StartElement follows:

    static void f_StartElement (void *ud,
                                const char *name,
                                const char **atts) {
      lxp_userdata *xpu = (lxp_userdata *)ud;
      lua_State *L = xpu->L;
    
      lua_pushstring(L, "StartElement");
      lua_gettable(L, 3);
      if (lua_isnil(L, -1)) {  /* no handler? */
        lua_pop(L, 1);
        return;
      }
    
      lua_pushvalue(L, 1);  /* push the parser (`self') */
      lua_pushstring(L, name);  /* push tag name */
    
      /* create and fill the attribute table */
      lua_newtable(L);
      while (*atts) {
        lua_pushstring(L, *atts++);
        lua_pushstring(L, *atts++);
        lua_settable(L, -3);
      }
    
      lua_call(L, 3, 0);  /* call the handler */
    }

The last method for parsers is close. When we close a parser, we have to free all its resources, namely the Expat structure and the callback table. Remember that, due to occasional errors during its creation, a parser may not have these resources:

    static int lxp_close (lua_State *L) {
      lxp_userdata *xpu;
    
      xpu = (lxp_userdata *)luaL_checkudata(L, 1, "Expat");
      luaL_argcheck(L, xpu, 1, "expat parser expected");
    
      /* free (unref) callback table */
      luaL_unref(L, LUA_REGISTRYINDEX, xpu->tableref);
      xpu->tableref = LUA_REFNIL;
    
      /* free Expat parser (if there is one) */
      if (xpu->parser)
        XML_ParserFree(xpu->parser);
      xpu->parser = NULL;
      return 0;
    }
Notice how we keep the parser in a consistent state as we close it, so there is no problem if we try to close it again or when the garbage collector finalizes it. Actually, we will use exactly this function as the finalizer. That ensures that every parser eventually frees its resources, even if the programmer does not close it.
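
The same discipline can be shown without Lua or Expat. In the sketch below, which uses hypothetical names throughout (toy_parser, toy_init, toy_open, toy_close, TOY_NOREF), a heap buffer stands in for the Expat parser and a plain integer for the registry reference. The point is only the pattern: initialize to an "owns nothing" state first, and reset each field as it is released.

```c
#include <stdlib.h>

#define TOY_NOREF (-1)

/* Hypothetical resource holder with the same kind of problem as
   lxp_userdata: two resources, either of which may be missing. */
typedef struct toy_parser {
  char *buffer;   /* stands in for the Expat parser */
  int ref;        /* stands in for the registry reference */
} toy_parser;

/* Put the object in a consistent "owns nothing" state. */
static void toy_init (toy_parser *p) {
  p->buffer = NULL;
  p->ref = TOY_NOREF;
}

/* Acquire resources; returns 0 on failure, leaving the object in
   a state that toy_close can still handle. */
static int toy_open (toy_parser *p, size_t size, int ref) {
  toy_init(p);
  p->buffer = malloc(size);
  if (p->buffer == NULL) return 0;
  p->ref = ref;
  return 1;
}

/* Release whatever is held and reset the fields, so that a second
   call (or a finalizer running later) finds nothing left to free.
   Returns the number of resources actually released. */
static int toy_close (toy_parser *p) {
  int released = 0;
  if (p->ref != TOY_NOREF) {
    p->ref = TOY_NOREF;
    released++;
  }
  if (p->buffer != NULL) {
    free(p->buffer);
    p->buffer = NULL;
    released++;
  }
  return released;
}
```

Calling toy_close twice on an open object releases both resources the first time and nothing the second time, which is what makes it safe to use one function both as an explicit close and as a finalizer.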

The final step is to open the library, putting all those parts together. We will use here the same scheme that we used in the object-oriented array example (Section 28.3): We will create a metatable, put all methods inside it, and make its __index field point to itself. For that, we need a list with the parser methods:

    static const struct luaL_reg lxp_meths[] = {
      {"parse", lxp_parse},
      {"close", lxp_close},
      {"__gc", lxp_close},
      {NULL, NULL}
    };
We also need a list with the functions of this library. As is common with OO libraries, this library has a single function, which creates new parsers:

    static const struct luaL_reg lxp_funcs[] = {
      {"new", lxp_make_parser},
      {NULL, NULL}
    };
Finally, the open function must create the metatable, make it point to itself (through __index), and register methods and functions:

    int luaopen_lxp (lua_State *L) {
      /* create metatable */
      luaL_newmetatable(L, "Expat");
    
      /* metatable.__index = metatable */
      lua_pushliteral(L, "__index");
      lua_pushvalue(L, -2);
      lua_rawset(L, -3);
    
      /* register methods */
      luaL_openlib (L, NULL, lxp_meths, 0);
    
      /* register functions (only lxp.new) */
      luaL_openlib (L, "lxp", lxp_funcs, 0);
      return 1;
    }