Programming in Lua : 20.3

This first edition was written for Lua 5.0. While still largely relevant for later versions, there are some differences.

20.3 – Captures

The capture mechanism allows a pattern to yank parts of the subject string that match parts of the pattern, for further use. You specify a capture by writing the parts of the pattern that you want to capture between parentheses.

When you specify captures to string.find, it returns the captured values as extra results from the call. A typical use of this facility is to break a string into parts:

    pair = "name = Anna"
    _, _, key, value = string.find(pair, "(%a+)%s*=%s*(%a+)")
    print(key, value)  --> name  Anna

The pattern '%a+' specifies a non-empty sequence of letters; the pattern '%s*' specifies a possibly empty sequence of spaces. So, in the example above, the whole pattern specifies a sequence of letters, followed by a sequence of spaces, followed by `=´, again followed by spaces plus another sequence of letters. Both sequences of letters have their patterns enclosed by parentheses, so that they will be captured if a match occurs. The find function always returns first the indices where the matching happened (which we store in the dummy variable _ in the previous example) and then the captures made during the pattern matching. Below is a similar example:

    date = "17/7/1990"
    _, _, d, m, y = string.find(date, "(%d+)/(%d+)/(%d+)")
    print(d, m, y)  --> 17  7  1990
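
When the pattern does not match, string.find returns nil and there are no captures, so it is usually worth checking the result before using it. The following is a minimal sketch of that check; parseDate is a hypothetical helper name used only for illustration, and the anchors are an added assumption that the whole string must be a date:

```lua
-- parseDate is a hypothetical helper, not part of the standard library.
-- It returns day, month, year as numbers, or nil if s is not in d/m/y form.
local function parseDate (s)
  local _, _, d, m, y = string.find(s, "^(%d+)/(%d+)/(%d+)$")
  if not d then return nil end   -- no match: find returned nil
  return tonumber(d), tonumber(m), tonumber(y)
end

print(parseDate("17/7/1990"))   --> 17  7  1990
print(parseDate("July 17"))     --> nil
```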

We can also use captures in the pattern itself. In a pattern, an item like '%d', where d is a single digit, matches only a copy of the d-th capture. As a typical use, suppose you want to find, inside a string, a substring enclosed between single or double quotes. You could try a pattern such as '["'].-["']', that is, a quote followed by anything followed by another quote; but you would have problems with strings like "it's all right". To solve that problem, you can capture the first quote and use it to specify the second one:

    s = [[then he said: "it's all right"!]]
    a, b, c, quotedPart = string.find(s, "([\"'])(.-)%1")
    print(quotedPart)   --> it's all right
    print(c)            --> "

The first capture is the quote character itself and the second capture is the contents of the quote (the substring matching the '.-').
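
To see that the back-reference really adapts to whichever quote opens the string, we can wrap the call in a small function and try both kinds of quotes. This is only a sketch; firstQuoted is a hypothetical name chosen for the example:

```lua
-- firstQuoted is a hypothetical helper: it returns the contents of the
-- first quoted part of s and the quote character that delimits it.
local function firstQuoted (s)
  local _, _, q, body = string.find(s, "([\"'])(.-)%1")
  return body, q
end

print(firstQuoted([[then he said: "it's all right"!]]))  --> it's all right   "
print(firstQuoted([[she wrote 'say "hi"' twice]]))       --> say "hi"   '
```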

The third use of captured values is in the replacement string of gsub. Like the pattern, the replacement string may contain items like '%d', which are changed to the respective captures when the substitution is made. (By the way, because of those changes, a `%´ in the replacement string must be escaped as "%%".) As an example, the following command duplicates every letter in a string, with a hyphen between the copies:

    print(string.gsub("hello Lua!", "(%a)", "%1-%1"))
      -->  h-he-el-ll-lo-o L-Lu-ua-a!  8

This one interchanges adjacent characters:

    print(string.gsub("hello Lua", "(.)(.)", "%2%1"))
      -->  ehll ouLa  4

As a more useful example, let us write a primitive format converter, which gets a string with commands written in a LaTeX style, such as

    \command{some text}

and changes them to a format in XML style,

    <command>some text</command>

For this specification, the following line does the job:

    s = string.gsub(s, "\\(%a+){(.-)}", "<%1>%2</%1>")

For instance, if s is the string

    the \quote{task} is to \em{change} that.

that gsub call will change it to

    the <quote>task</quote> is to <em>change</em> that.
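
The conversion can be packaged and checked as follows. This is a sketch under the same specification; toXML is a hypothetical name. Because '.-' stops at the first closing brace, the pattern does not handle a command nested inside another one.

```lua
-- toXML is a hypothetical wrapper around the gsub call shown above.
local function toXML (s)
  -- the extra parentheses keep only the string and drop the match count
  return (string.gsub(s, "\\(%a+){(.-)}", "<%1>%2</%1>"))
end

print(toXML("the \\quote{task} is to \\em{change} that."))
  --> the <quote>task</quote> is to <em>change</em> that.
```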

Another useful example is how to trim a string:

    function trim (s)
      return (string.gsub(s, "^%s*(.-)%s*$", "%1"))
    end

Note the judicious use of pattern formats. The two anchors (`^´ and `$´) ensure that we get the whole string. Because the '.-' tries to expand as little as possible, the two patterns '%s*' match all spaces at both extremities. Note also that, because gsub returns two values, we use extra parentheses to discard the extra result (the count).
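
A few calls illustrate the behavior, including the case of a string made only of spaces. The function is repeated here so that the example can run on its own:

```lua
local function trim (s)
  return (string.gsub(s, "^%s*(.-)%s*$", "%1"))
end

-- brackets make the remaining spaces (if any) visible
print("[" .. trim("  hello world \t\n") .. "]")   --> [hello world]
print("[" .. trim("   ") .. "]")                   --> []
print("[" .. trim("a  b") .. "]")                  --> [a  b]
```

Only the spaces at the two ends are removed; spaces inside the string are kept.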

The last use of captured values is perhaps the most powerful. We can call string.gsub with a function as its third argument, instead of a replacement string. When invoked this way, string.gsub calls the given function every time it finds a match; the arguments to this function are the captures, while the value that the function returns is used as the replacement string. As a first example, the following function does variable expansion: It substitutes the value of the global variable varname for every occurrence of $varname in a string:

    function expand (s)
      s = string.gsub(s, "$(%w+)", function (n)
            return _G[n]
          end)
      return s
    end
    
    name = "Lua"; status = "great"
    print(expand("$name is $status, isn't it?"))
      --> Lua is great, isn't it?

If you are not sure whether the given variables have string values, you can apply tostring to their values:

    function expand (s)
      return (string.gsub(s, "$(%w+)", function (n)
                return tostring(_G[n])
              end))
    end
    
    print(expand("print = $print; a = $a"))
      --> print = function: 0x8050ce0; a = nil
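
If we do not want the expansion to depend on global variables, the same technique works with any table. The sketch below assumes a hypothetical function expandWith that receives the table as a second argument and leaves unknown names untouched; neither the name nor that policy comes from the text above:

```lua
-- expandWith is a hypothetical variant of expand: it looks names up in the
-- table vars instead of the global table and keeps unknown names as they are.
local function expandWith (s, vars)
  return (string.gsub(s, "$(%w+)", function (n)
    local v = vars[n]
    if v == nil then return "$" .. n end   -- put the original text back
    return tostring(v)
  end))
end

print(expandWith("$name is $status, $who?", {name = "Lua", status = "great"}))
  --> Lua is great, $who?
```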

A more powerful example uses loadstring to evaluate whole expressions that we write in the text enclosed by square brackets preceded by a dollar sign:

    s = "sin(3) = $[math.sin(3)]; 2^5 = $[2^5]"
    
    print((string.gsub(s, "$(%b[])", function (x)
             x = "return " .. string.sub(x, 2, -2)
             local f = loadstring(x)
             return f()
           end)))
      -->  sin(3) = 0.1411200080598672; 2^5 = 32

The first match is the string "$[math.sin(3)]", whose corresponding capture is "[math.sin(3)]". The call to string.sub removes the brackets from the captured string, so the string loaded for execution will be "return math.sin(3)". The same happens for the match "$[2^5]".
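
The following sketch shows the same idea with two additions that are assumptions of this example, not part of the code above: it falls back to load when loadstring is not available (as happens in versions of Lua after 5.1), and it reports expressions that fail to compile instead of raising an error. The name evalBrackets is hypothetical. Keep in mind that this technique runs whatever code appears in the text, so it is suitable only for text that we trust.

```lua
-- loadstring exists in Lua 5.0 and 5.1; later versions offer load instead.
local compile = loadstring or load

-- evalBrackets is a hypothetical name for the substitution shown above.
local function evalBrackets (s)
  return (string.gsub(s, "$(%b[])", function (x)
    local f, err = compile("return " .. string.sub(x, 2, -2))
    if not f then return "<error: " .. err .. ">" end   -- did not compile
    return tostring(f())
  end))
end

print(evalBrackets("1 + 2 = $[1 + 2]; max = $[math.max(4, 9)]"))
  --> 1 + 2 = 3; max = 9
```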

Often we want a kind of string.gsub only to iterate on a string, without any interest in the resulting string. For instance, we could collect the words of a string into a table with the following code:

    words = {}
    string.gsub(s, "(%a+)", function (w)
      table.insert(words, w)
    end)

If s were the string "hello hi, again!", after that command the words table would be

    {"hello", "hi", "again"}

The string.gfind function offers a simpler way to write that code:

    words = {}
    for w in string.gfind(s, "(%a+)") do
      table.insert(words, w)
    end

The gfind function fits perfectly with the generic for loop. It returns a function that iterates on all occurrences of a pattern in a string.

We can simplify that code a little bit more. When we call gfind with a pattern without any explicit capture, the iterator returns the whole match at each step. Therefore, we can rewrite the previous example like this:

    words = {}
    for w in string.gfind(s, "%a+") do
      table.insert(words, w)
    end
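
In Lua 5.1 and later, this iterator is called string.gmatch. A version of the loop that runs in either case could look like the sketch below; collectWords is a hypothetical name:

```lua
-- string.gfind is the Lua 5.0 name; later versions call it string.gmatch.
local gmatch = string.gmatch or string.gfind

-- collectWords is a hypothetical helper that returns the words of s in a table.
local function collectWords (s)
  local words = {}
  for w in gmatch(s, "%a+") do   -- no explicit capture: w is the whole match
    table.insert(words, w)
  end
  return words
end

local words = collectWords("hello hi, again!")
print(table.concat(words, " | "))   --> hello | hi | again
```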

For our next example, we use URL encoding, which is the encoding used by HTTP to send parameters in a URL. This encoding represents special characters (such as `=´, `&´, and `+´) as "%XX", where XX is the hexadecimal code of the character. Then, it changes spaces to `+´. For instance, it encodes the string "a+b = c" as "a%2Bb+%3D+c". Finally, it writes each parameter name and parameter value with an `=´ in between and joins all name=value pairs with ampersands. For instance, the values

    name = "al";  query = "a+b = c"; q="yes or no"

are encoded as

    name=al&query=a%2Bb+%3D+c&q=yes+or+no

Now, suppose we want to decode this URL and store each value in a table, indexed by its corresponding name. The following function does the basic decoding:

    function unescape (s)
      s = string.gsub(s, "+", " ")
      s = string.gsub(s, "%%(%x%x)", function (h)
            return string.char(tonumber(h, 16))
          end)
      return s
    end

The first statement changes each `+´ in the string to a space. The second gsub matches all two-digit hexadecimal numerals preceded by `%´ and calls an anonymous function. That function converts the hexadecimal numeral into a number (tonumber, with base 16) and returns the corresponding character (string.char). For instance,

    print(unescape("a%2Bb+%3D+c"))  --> a+b = c

To decode the pairs name=value we use gfind. Because neither names nor values can contain `&´ or `=´, we can match them with the pattern '[^&=]+':

    cgi = {}
    function decode (s)
      for name, value in string.gfind(s, "([^&=]+)=([^&=]+)") do
        name = unescape(name)
        value = unescape(value)
        cgi[name] = value
      end
    end

That call to gfind matches all pairs in the form name=value and, for each pair, the iterator returns the corresponding captures (as marked by the parentheses in the matching string) as the values to name and value. The loop body simply calls unescape on both strings and stores the pair in the cgi table.
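
The sketch below puts the two functions together and runs them on the encoded string from the beginning of this example. It differs from the code above in two assumed details: it returns a new table instead of filling the global cgi table, and it uses string.gmatch when that name is available (Lua 5.1 and later).

```lua
local gmatch = string.gmatch or string.gfind   -- gfind is the Lua 5.0 name

local function unescape (s)
  s = string.gsub(s, "+", " ")                 -- first, `+' back to spaces
  s = string.gsub(s, "%%(%x%x)", function (h)  -- then, "%XX" to characters
        return string.char(tonumber(h, 16))
      end)
  return s
end

-- this variant of decode returns a fresh table with the decoded pairs
local function decode (s)
  local t = {}
  for name, value in gmatch(s, "([^&=]+)=([^&=]+)") do
    t[unescape(name)] = unescape(value)
  end
  return t
end

local t = decode("name=al&query=a%2Bb+%3D+c&q=yes+or+no")
print(t.name)    --> al
print(t.query)   --> a+b = c
print(t.q)       --> yes or no
```

The order of the two gsub calls in unescape matters: if we decoded "%2B" first, the resulting `+´ would then be changed to a space.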

The corresponding encoding is also easy to write. First, we write the escape function; this function encodes all special characters as a `%´ followed by the character's ASCII code in hexadecimal (the format option "%02X" produces a hexadecimal number with two digits, using 0 for padding), and then changes spaces to `+´:

    function escape (s)
      s = string.gsub(s, "([&=+%c])", function (c)
            return string.format("%%%02X", string.byte(c))
          end)
      s = string.gsub(s, " ", "+")
      return s
    end

The encode function traverses the table to be encoded, building the resulting string:

    function encode (t)
      local s = ""
      for k,v in pairs(t) do
        s = s .. "&" .. escape(k) .. "=" .. escape(v)
      end
      return string.sub(s, 2)     -- remove first `&'
    end
    
    t = {name = "al",  query = "a+b = c", q="yes or no"}
    print(encode(t)) --> q=yes+or+no&query=a%2Bb+%3D+c&name=al
      -- (pairs does not guarantee any traversal order, so the pairs may come out in another order)
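
Because pairs traverses the table in no particular order, the position of each pair in the result may change from one run or one Lua version to another. When a predictable result matters (for instance, for testing), one option is to sort the keys first. The sketch below assumes a hypothetical function encodeSorted and, like encode, assumes that all keys and values are strings:

```lua
local function escape (s)
  s = string.gsub(s, "([&=+%c])", function (c)
        return string.format("%%%02X", string.byte(c))
      end)
  s = string.gsub(s, " ", "+")
  return s
end

-- encodeSorted is a hypothetical variant of encode that visits the keys in
-- alphabetical order, so its result does not depend on the order of pairs.
local function encodeSorted (t)
  local keys = {}
  for k in pairs(t) do table.insert(keys, k) end
  table.sort(keys)
  local parts = {}
  for _, k in ipairs(keys) do
    table.insert(parts, escape(k) .. "=" .. escape(t[k]))
  end
  return table.concat(parts, "&")   -- no leading `&' to remove
end

print(encodeSorted{name = "al", query = "a+b = c", q = "yes or no"})
  --> name=al&q=yes+or+no&query=a%2Bb+%3D+c
```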