This first edition was written for Lua 5.0. While still largely relevant for later versions, there are some differences.
The third edition targets Lua 5.2 and is available at Amazon and other bookstores.
By buying the book, you also help to support the Lua project.


11.6 – String Buffers

Suppose you are building a string piecemeal, for instance reading a file line by line. Your typical code would look like this:

    -- WARNING: bad code ahead!!
    local buff = ""
    for line in io.lines() do
    buff = buff .. line .. "\n"
    end
Despite its innocent look, that code in Lua can cause a huge performance penalty for large files: For instance, it takes almost a minute to read a 350 KB file. (That is why Lua provides the io.read("*all") option, which reads the whole file in 0.02 seconds.)

Why is that? Lua uses a true garbage-collection algorithm; when it detects that the program is using too much memory, it goes through all its data structures and frees those structures that are not in use anymore (the garbage). Usually this algorithm has a good performance (it is not by chance that Lua is so fast), but the above loop takes the worst of the algorithm.

To understand what happens, let us assume that we are in the middle of the read loop; buff is already a string with 50 KB and each line has 20 bytes. When Lua concatenates buff..line.."\n", it creates a new string with 50,020 bytes and copies 50 KB from buff into this new string. That is, for each new line, Lua moves 50 KB of memory, and growing. After reading 100 new lines (only 2 KB), Lua has already moved more than 5 MB of memory. To make things worse, after the assignment

        buff = buff .. line .. "\n"
the old string is now garbage. After two loop cycles, there are two old strings making a total of more than 100 KB of garbage. So, Lua decides, quite correctly, that it is a good time to run its garbage collector and so it frees those 100 KB. The problem is that this will happen every two cycles and so Lua will run its garbage collector two thousand times before reading the whole file. Even with all this work, its memory usage will be approximately three times the file size.

This problem is not peculiar to Lua: Other languages with true garbage collection, and where strings are immutable objects, present a similar behavior, Java being the most famous example. (Java offers the structure StringBuffer to ameliorate the problem.)

Before we continue, we should remark that, despite all I said, that situation is not a common problem. For small strings, the above loop is OK. To read a whole file, we use the "*all" option, which reads it at once. However, sometimes there are no simple solutions. Then, the only solution is a more efficient algorithm. Here we show one.

Our original loop took a linear approach to the problem, concatenating small strings one by one into the accumulator. This new algorithm avoids this, using a binary approach instead. It concatenates several small strings among them and, occasionally, it concatenates the resulting large strings into larger ones. The heart of the algorithm is a stack that keeps the large strings already created in its bottom, while small strings enter through the top. The main invariant of this stack is similar to that of the popular (among programmers, at least) Tower of Hanoi: A string in the stack can never sit over a shorter string. Whenever a new string is pushed over a shorter one, then (and only then) the algorithm concatenates both. This concatenation creates a larger string, which now may be larger than its neighbor in the previous floor. If that happens, they are joined too. Those concatenations go down the stack until the loop reaches a larger string or the stack bottom.

    function newStack ()
      return {""}   -- starts with an empty string
    end
    
    function addString (stack, s)
      table.insert(stack, s)    -- push 's' into the the stack
      for i=table.getn(stack)-1, 1, -1 do
        if string.len(stack[i]) > string.len(stack[i+1]) then
          break
        end
        stack[i] = stack[i] .. table.remove(stack)
      end
    end
To get the final contents of the buffer, we just need to concatenate all strings down to the bottom. The table.concat function does exactly that: It concatenates all strings of a list.

Using this new data structure, we can rewrite our program as follows:

    local s = newStack()
    for line in io.lines() do
      addString(s, line .. "\n")
    end
    s = toString(s)
This new program reduces our original time to read a 350 KB file from 40 seconds to 0.5 seconds. The call io.read("*all") is still faster, finishing the job in 0.02 seconds.

Actually, when we call io.read("*all"), io.read uses exactly the data structure that we presented here, but implemented in C. Several other functions in the Lua libraries also use this C implementation. One of these functions is table.concat. With concat, we can simply collect all strings in a table and then concatenate all of them at once. Because concat uses the C implementation, it is efficient even for huge strings.

The concat function accepts an optional second argument, which is a separator to be inserted between the strings. Using this separator, we do not need to insert a newline after each line:

    local t = {}
    for line in io.lines() do
      table.insert(t, line)
    end
    s = table.concat(t, "\n") .. "\n"
(The io.lines iterator returns each line without the newline.) concat inserts the separator between the strings, but not after the last one, so we have to add the last newline. This last concatenation duplicates the resulting string, which can be quite big. There is no option to make concat insert this extra separator, but we can deceive it, inserting an extra empty string in t:
    table.insert(t, "")
    s = table.concat(t, "\n")
The extra newline that concat adds before this empty string is at the end of the resulting string, as we wanted.