Programming in Lua : 20.4

This first edition was written for Lua 5.0. While still largely relevant for later versions, there are some differences.
The fourth edition targets Lua 5.3 and is available at Amazon and other bookstores.
By buying the book, you also help to support the Lua project.


20.4 – Tricks of the Trade

Pattern matching is a powerful tool for manipulating strings. You can perform many complex operations with only a few calls to string.gsub and string.find. However, as with any power, you must use it carefully.

Pattern matching is not a replacement for a proper parser. For quick-and-dirty programs, you can do useful manipulations on source code, but it is hard to build a quality product that way. As a good example, consider the pattern we used to match comments in a C program: '/%*.-%*/'. If your program has a string containing "/*", you will get a wrong result:

    test = [[char s[] = "a /* here";  /* a tricky string */]]
    -- the extra parentheses discard the substitution count
    print((string.gsub(test, "/%*.-%*/", "<COMMENT>")))
      --> char s[] = "a <COMMENT>

Strings with such contents are rare and, for your own use, that pattern will probably do its job. But you cannot sell a program with such a flaw.
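To make the contrast with a parser concrete, here is a minimal sketch of a hand-written scanner that copies string literals verbatim before it looks for comments. The function name stripComments is hypothetical, and the sketch assumes simplified C: it ignores character literals and // comments.

```lua
-- Hypothetical helper: replaces C block comments, skipping string literals.
-- Simplified on purpose: no character literals, no // comments.
function stripComments (src)
  local out = {}                 -- pieces of the result
  local i, n = 1, string.len(src)
  while i <= n do
    local c = string.sub(src, i, i)
    if c == '"' then
      -- copy the whole string literal, honoring backslash escapes
      local j = i + 1
      while j <= n do
        local d = string.sub(src, j, j)
        if d == '\\' then j = j + 2
        elseif d == '"' then break
        else j = j + 1 end
      end
      table.insert(out, string.sub(src, i, j))
      i = j + 1
    elseif string.sub(src, i, i + 1) == "/*" then
      -- plain (non-pattern) search for the end of the comment
      local _, e = string.find(src, "*/", i + 2, true)
      table.insert(out, "<COMMENT>")
      i = (e or n) + 1
    else
      table.insert(out, c)
      i = i + 1
    end
  end
  return table.concat(out)
end

print(stripComments('char s[] = "a /* here";  /* a tricky string */'))
  --> char s[] = "a /* here";  <COMMENT>
```

Even this small scanner is several times longer than the one-line pattern, which is the trade-off the text describes.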

Usually, pattern matching is efficient enough for Lua programs: A Pentium 333MHz (which is not a fast machine by today's standards) takes less than a tenth of a second to match all words in a text with 200K characters (30K words). But you can take precautions. You should always make the pattern as specific as possible; loose patterns are slower than specific ones.

An extreme example is '(.-)%$', which gets all text in a string up to the first dollar sign. If the subject string has a dollar sign, everything goes fine; but suppose that the string does not contain any dollar signs. The algorithm will first try to match the pattern starting at the first position of the string. It will go through the whole string, looking for a dollar sign. When the string ends, the pattern fails for the first position of the string. Then the algorithm will do the whole search again, starting at the second position of the string, only to discover that the pattern does not match there either; and so on. This takes quadratic time, which results in more than three hours on a Pentium 333MHz for a string with 200K characters. You can correct this problem simply by anchoring the pattern at the first position of the string, with '^(.-)%$'. The anchor tells the algorithm to stop the search if it cannot find a match at the first position. With the anchor, the pattern runs in less than a tenth of a second.
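The effect of the anchor can be observed with a rough timing sketch. The helper timeFind is hypothetical, and the subject is kept to 5000 characters so that the unanchored search finishes quickly. Absolute times depend on the machine, so none are shown.

```lua
-- Hypothetical helper: returns the result of string.find and the CPU time used.
function timeFind (subject, pattern)
  local start = os.clock()
  local first = string.find(subject, pattern)
  return first, os.clock() - start
end

subject = string.rep("a", 5000)        -- no dollar sign anywhere

print(timeFind(subject, "(.-)%$"))     -- nil, then the time for the retried search
print(timeFind(subject, "^(.-)%$"))    -- nil, then the time for a single attempt
```

Both calls fail to match. Only the first one retries the search from every position of the subject.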

Beware also of empty patterns, that is, patterns that match the empty string. For instance, if you try to match names with a pattern like '%a*', you will find names everywhere:

    i, j = string.find(";$%  **#$hello13", "%a*")
    print(i,j)   --> 1  0

In this example, the call to string.find has correctly found an empty sequence of letters at the beginning of the string.
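For comparison, a small sketch with the same subject string: requiring at least one letter, with '%a+', makes the search skip the punctuation and report the first actual name.

```lua
i, j = string.find(";$%  **#$hello13", "%a+")
print(i, j)   --> 10  14
```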

It never makes sense to write a pattern that begins or ends with the modifier `-´, because it will match only the empty string. This modifier always needs something around it, to anchor its expansion. Similarly, a pattern that includes '.*' is tricky, because this construction can expand much more than you intended.
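A short sketch of both pitfalls, using made-up subject strings: the greedy '.*' runs to the last closing bracket, the lazy '.-' stops at the first one, and a trailing '.-' with nothing after it matches no characters at all.

```lua
s = "a [b] c [d]"
print(string.find(s, "%[.*%]"))      --> 3  11
print(string.find(s, "%[.-%]"))      --> 3  5
print(string.find("hello", "h.-"))   --> 1  1
```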

Sometimes, it is useful to use Lua itself to build a pattern. As an example, let us see how we can find long lines in a text, say lines with 70 or more characters. Well, a long line is a sequence of 70 or more characters different from newline. We can match a single character different from newline with the character class '[^\n]'. Therefore, we can match a long line with a pattern that repeats 70 times the pattern for one character, followed by zero or more of those characters. Instead of writing this pattern by hand, we can create it with string.rep:

    pattern = string.rep("[^\n]", 70) .. "[^\n]*"
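As a usage sketch, we can apply this pattern to a made-up three-line text in which only the second line is long. The variable name text is an assumption for illustration.

```lua
pattern = string.rep("[^\n]", 70) .. "[^\n]*"

-- a 10-character line, an 80-character line, and another short line
text = "short line\n" .. string.rep("x", 80) .. "\nanother short line"

i, j = string.find(text, pattern)
print(i, j)   --> 12  91
```

The match covers exactly the long line: it starts right after the first newline and stops before the second.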

As another example, suppose you want to make a case-insensitive search. A way to do that is to replace any letter x in the pattern with the class '[xX]', that is, a class including both the upper and the lower versions of the original letter. We can automate that conversion with a function:

    function nocase (s)
      s = string.gsub(s, "%a", function (c)
            return string.format("[%s%s]", string.lower(c),
                                           string.upper(c))
          end)
      return s
    end
    
    print(nocase("Hi there!"))
      -->  [hH][iI] [tT][hH][eE][rR][eE]!
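A usage sketch follows. The function is repeated in compact form so that the block runs on its own, and the subject string is made up. Note that this conversion is only safe for search strings made of plain letters: it would also rewrite the letter in an escape such as '%a'.

```lua
-- same conversion as above, repeated so that this block is self-contained
function nocase (s)
  return (string.gsub(s, "%a", function (c)
    return "[" .. string.lower(c) .. string.upper(c) .. "]"
  end))
end

print(string.find("Say HELLO to Lua", "hello"))           --> nil
print(string.find("Say HELLO to Lua", nocase("hello")))   --> 5  9
```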

Sometimes, you want to change every plain occurrence of s1 to s2, without regarding any character as magic. If the strings s1 and s2 are literals, you can add proper escapes to magic characters while you write the strings. But if those strings are variable values, you can use another gsub to insert the escapes for you:

    s1 = string.gsub(s1, "(%W)", "%%%1")
    s2 = string.gsub(s2, "%%", "%%%%")

In the search string, we escape all non-alphanumeric characters. In the replacement string, we escape only the `%´.
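The two escaping steps can be wrapped into one helper. This is a minimal sketch; the name plainReplace is hypothetical and the sample strings are made up.

```lua
-- Hypothetical helper: replaces every plain occurrence of s1 in s with s2.
function plainReplace (s, s1, s2)
  s1 = string.gsub(s1, "(%W)", "%%%1")   -- escape magic characters in the search
  s2 = string.gsub(s2, "%%", "%%%%")     -- escape '%' in the replacement
  return (string.gsub(s, s1, s2))
end

print(plainReplace("cost: $5.00 (approx.)", "$5.00", "100%"))
  --> cost: 100% (approx.)
```

Without the escapes, the dot in "$5.00" would match any character.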

Another useful technique for pattern matching is to pre-process the subject string before the real work. A simple example of the use of pre-processing is to change to upper case all quoted strings in a text, where a quoted string starts and ends with a double quote (`"´), but may contain escaped quotes ("\""):

    follows a typical string: "This is \"great\"!".

Our approach to handling such cases is to pre-process the text so as to encode the problematic sequence to something else. For instance, we could code "\"" as "\1". However, if the original text already contains a "\1", we are in trouble. An easy way to do the encoding and avoid this problem is to code all sequences "\x" as "\ddd", where ddd is the decimal representation of the character x:

    function code (s)
      return (string.gsub(s, "\\(.)", function (x)
                return string.format("\\%03d", string.byte(x))
              end))
    end

Now any sequence "\ddd" in the encoded string must have come from the coding, because any "\ddd" in the original string has been coded, too. So the decoding is an easy task:

    function decode (s)
      return (string.gsub(s, "\\(%d%d%d)", function (d)
                return "\\" .. string.char(d)
              end))
    end
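A quick round-trip check of this scheme follows. It is a sketch with a made-up sample string; both functions are repeated so that the block runs on its own. The sample deliberately contains a literal "\034" sequence, to show that sequences already present in the original text survive the encoding.

```lua
-- same functions as above, repeated so that this block is self-contained
function code (s)
  return (string.gsub(s, "\\(.)", function (x)
            return string.format("\\%03d", string.byte(x))
          end))
end

function decode (s)
  return (string.gsub(s, "\\(%d%d%d)", function (d)
            return "\\" .. string.char(d)
          end))
end

s = [[a \" quote, a \\ backslash, and a literal \034 sequence]]
print(code(s))
  --> a \034 quote, a \092 backslash, and a literal \04834 sequence
print(decode(code(s)) == s)   --> true
```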

Now we can complete our task. As the encoded string does not contain any escaped quote ("\""), we can search for quoted strings simply with '".-"':

    s = [[follows a typical string: "This is \"great\"!".]]
    s = code(s)
    s = string.gsub(s, '(".-")', string.upper)
    s = decode(s)
    print(s)
      --> follows a typical string: "THIS IS \"GREAT\"!".

or, in a more compact notation,

    print(decode(string.gsub(code(s), '(".-")', string.upper)))

As a more complex task, let us return to our example of a primitive format converter, which changes format commands written as \command{string} to XML style:

    <command>string</command>

But now our original format is more powerful and uses the backslash character as a general escape, so that we can represent the characters `\´, `{´, and `}´, writing "\\", "\{", and "\}". To avoid our pattern matching mixing up commands and escaped characters, we should recode those sequences in the original string. However, this time we cannot code all sequences \x, because that would code our commands (written as \command) too. Instead, we code \x only when x is not a letter:

    function code (s)
      return (string.gsub(s, '\\(%A)', function (x)
               return string.format("\\%03d", string.byte(x))
             end))
    end

The decode is like that of the previous example, but it does not include the backslashes in the final string; therefore, we can call string.char directly:

    function decode (s)
      return (string.gsub(s, '\\(%d%d%d)', string.char))
    end
    
    s = [[a \emph{command} is written as \\command\{text\}.]]
    s = code(s)
    s = string.gsub(s, "\\(%a+){(.-)}", "<%1>%2</%1>")
    print(decode(s))
      -->  a <emph>command</emph> is written as \command{text}.

Our last example here deals with Comma-Separated Values (CSV), a text format supported by many programs, such as Microsoft Excel, to represent tabular data. A CSV file represents a list of records, where each record is a list of string values written in a single line, with commas between the values. Values that contain commas must be written between double quotes; if such values also have quotes, the quotes are written as two quotes. As an example, the array

    {'a b', 'a,b', ' a,"b"c', 'hello "world"!', ''}

can be represented as

    a b,"a,b"," a,""b""c",hello "world"!,

To transform an array of strings into CSV is easy. All we have to do is to concatenate the strings with commas between them:

    function toCSV (t)
      local s = ""
      for _,p in ipairs(t) do      -- ipairs keeps the fields in order
        s = s .. "," .. escapeCSV(p)
      end
      return string.sub(s, 2)      -- remove first comma
    end

If a string has commas or quotes inside, we enclose it between quotes and escape its original quotes:

    function escapeCSV (s)
      if string.find(s, '[,"]') then
        s = '"' .. string.gsub(s, '"', '""') .. '"'
      end
      return s
    end
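As an alternative sketch, the fields can be collected in a table and joined with table.concat, which avoids removing a leading comma afterwards. The name toCSVconcat is hypothetical; escapeCSV is repeated so that the block runs on its own.

```lua
-- same function as above, repeated so that this block is self-contained
function escapeCSV (s)
  if string.find(s, '[,"]') then
    s = '"' .. string.gsub(s, '"', '""') .. '"'
  end
  return s
end

-- Hypothetical variant of toCSV built on table.concat
function toCSVconcat (t)
  local parts = {}
  for i, v in ipairs(t) do
    parts[i] = escapeCSV(v)
  end
  return table.concat(parts, ",")
end

print(toCSVconcat({'a b', 'a,b', ' a,"b"c', 'hello "world"!', ''}))
  --> a b,"a,b"," a,""b""c","hello ""world""!",
```

Because escapeCSV quotes any value that contains a quote character, the fourth value comes out quoted here. That is another valid way to write the same record.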

To break a CSV into an array is more difficult, because we must avoid mixing up the commas written between quotes with the commas that separate fields. We could try to escape the commas between quotes. However, not all quote characters act as quotes; only quote characters after a comma act as a starting quote, as long as the comma itself is acting as a comma (that is, it is not between quotes). There are too many subtleties. For instance, two quotes may represent a single quote, two quotes, or nothing:

    "hello""hello", "",""

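The first field in this example is the string hello"hello (the doubled quote inside the quoted field stands for a single quote). The second field is a space followed by two quotes: because the field does not start with a quote, its quotes are taken literally. The last field is the empty string.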

We could try to use multiple gsub calls to handle all those cases, but it is easier to program this task with a more conventional approach, using an explicit loop over the fields. The main task of the loop body is to find the next comma; it also stores the field contents in a table. For each field, we explicitly test whether the field starts with a quote. If it does, we do a loop looking for the closing quote. In this loop, we use the pattern '"("?)' to find the closing quote of a field: If a quote is followed by another quote, the second quote is captured and assigned to the c variable, meaning that this is not the closing quote yet.

    function fromCSV (s)
      s = s .. ','        -- ending comma
      local t = {}        -- table to collect fields
      local fieldstart = 1
      repeat
        -- next field is quoted? (start with `"'?)
        if string.find(s, '^"', fieldstart) then
          local a, c
          local i  = fieldstart
          repeat
            -- find closing quote
            a, i, c = string.find(s, '"("?)', i+1)
          until c ~= '"'    -- quote not followed by quote?
          if not i then error('unmatched "') end
          local f = string.sub(s, fieldstart+1, i-1)
          table.insert(t, (string.gsub(f, '""', '"')))
          fieldstart = string.find(s, ',', i) + 1
        else                -- unquoted; find next comma
          local nexti = string.find(s, ',', fieldstart)
          table.insert(t, string.sub(s, fieldstart, nexti-1))
          fieldstart = nexti + 1
        end
      until fieldstart > string.len(s)
      return t
    end
    
    t = fromCSV('"hello "" hello", "",""')
    for i, s in ipairs(t) do print(i, s) end
      --> 1       hello " hello
      --> 2        ""
      --> 3