LuaExpat
XML Expat parsing for the Lua programming language

Introduction

LuaExpat is a SAX XML parser based on the Expat library. SAX is the Simple API for XML and allows programs to:

  • process an XML document incrementally, and thus handle huge documents without memory penalties;
  • register handler functions which are called by the parser during the processing of the document, handling the document elements or text.

With an event-based API like SAX the XML document can be fed to the parser in chunks, and the parsing begins as soon as the parser receives the first document chunk. LuaExpat reports parsing events (such as the start and end of elements) directly to the application through callbacks. The parsing of huge documents can benefit from this piecemeal operation.
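
As a rough illustration of this chunked, event-driven flow, the sketch below feeds a small document to the parser in three pieces, one of which splits a tag in the middle. It assumes the lxp module is installed and uses the StartElement and EndElement callback names from LuaExpat's callback list; the document content and the events table are made up for the example.

```lua
local lxp = require "lxp"

-- Record every event in the order the parser reports it.
local events = {}
local callbacks = {
  StartElement = function(parser, name, attributes)
    events[#events + 1] = "start:" .. name
  end,
  EndElement = function(parser, name)
    events[#events + 1] = "end:" .. name
  end,
}

local parser = lxp.new(callbacks)

-- The first chunk ends in the middle of the <root> tag on purpose.
for _, chunk in ipairs({ "<ro", "ot><item/>", "</root>" }) do
  assert(parser:parse(chunk))
end

assert(parser:parse())   -- no data: closes the document
parser:close()

print(table.concat(events, " "))
```

The application never holds the whole document in memory; it only sees one event at a time.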

LuaExpat is distributed as a library and a file lom.lua that implements the Lua Object Model.

Building

LuaExpat can be built for Lua 5.1 to 5.4. The Lua library and header files for the desired version must be installed. LuaExpat also depends on Expat 2.0.0 or later, which must be installed as well.

The simplest way of building and installing LuaExpat is through LuaRocks.

LuaExpat also offers a Makefile. The file contains definitions such as paths to the external libraries and compiler options. One important definition is the version of the Lua language, which is not detected from the installed software.

Installation

Installation can be done using LuaRocks or make install.

To install manually, copy the compiled binary file to a directory in your Lua C path (package.cpath) and the Lua files ./src/lxp/*.lua to a directory in your Lua path (package.path).

Parser objects

Usually SAX implementations base all operations on the concept of a parser that allows the registration of callback functions. LuaExpat offers the same functionality but uses a different registration method, based on a table of callbacks. This table contains references to the callback functions which are responsible for the handling of the document parts. The parser will assume no behaviour for any undeclared callbacks.
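
A minimal sketch of such a callbacks table follows. It declares a single handler, StartElement (a name taken from LuaExpat's callback list), so character data and end tags are parsed but never reported. The names table and the sample document are assumptions made for the example.

```lua
local lxp = require "lxp"

-- Only StartElement is declared; everything else is parsed silently.
local names = {}
local callbacks = {
  StartElement = function(parser, name, attributes)
    names[#names + 1] = name
  end,
}

local parser = lxp.new(callbacks)

-- getcallbacks hands back the table that was given to lxp.new.
local same_table = (parser:getcallbacks() == callbacks)

assert(parser:parse("<doc><p>ignored text</p></doc>"))
assert(parser:parse())   -- no data: closes the document
parser:close()

print(table.concat(names, ","), same_table)
```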

Finishing parsing

Because the parser is a streaming parser that handles one chunk of input at a time, the following input parses without error, despite the unbalanced tags:

    <one><two>some text</two>

Only the final call to the parse method (with no data) closes the document, and only then is the error returned:

    assert(parser:parse())  -- parser is the object returned by lxp.new

Closing the document is important to ensure that the document is complete and well-formed.
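
The behaviour can be sketched as follows, assuming the lxp module is installed. An empty callbacks table is enough to show the effect; the variable names are illustrative only.

```lua
local lxp = require "lxp"

local parser = lxp.new({})   -- no callbacks are needed to show the effect

-- The chunk on its own is acceptable, so this call returns the parser.
local fed = parser:parse("<one><two>some text</two>")

-- The final call carries no data; only now can the missing </one> be detected.
local closed, msg, line, col, pos = parser:parse()
if not closed then
  print("document incomplete: " .. msg)
end

parser:close()
```

Checking the result of the final call, rather than only the results of the data-carrying calls, is what catches a truncated document.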

Constructor

lxp.new(callbacks [, separator[, merge_character_data]])
The parser is created by a call to the function lxp.new, which returns the created parser or raises a Lua error. It receives the callbacks table and optionally the parser separator character used in the namespace expanded element names. If merge_character_data is false then LuaExpat will not combine multiple CharacterData calls into one. For more info on this behaviour see CharacterData below.
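
The two optional arguments might be used as in the sketch below. It assumes the lxp module is installed; the "|" separator, the urn:example namespace and the sample text are arbitrary choices for the example.

```lua
local lxp = require "lxp"

-- With a separator, element names arrive as namespace URI, separator, local name.
local names = {}
local ns_parser = lxp.new({
  StartElement = function(parser, name) names[#names + 1] = name end,
}, "|")
assert(ns_parser:parse('<a:root xmlns:a="urn:example"><a:child/></a:root>'))
assert(ns_parser:parse())
ns_parser:close()

-- Passing false as the third argument delivers character data unmerged.
local pieces = {}
local raw_parser = lxp.new({
  CharacterData = function(parser, text) pieces[#pieces + 1] = text end,
}, nil, false)
assert(raw_parser:parse("<t>fish &amp; chips</t>"))
assert(raw_parser:parse())
raw_parser:close()

-- However many pieces arrive, concatenating them restores the full text.
print(names[1], names[2], table.concat(pieces))
```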

Methods

parser:close()
Closes the parser, freeing all memory used by it. A call to parser:close() without a previous call to parser:parse() could result in an error. Returns the parser object on success.
parser:getbase()
Returns the base for resolving relative URIs.
parser:getcallbacks()
Returns the callbacks table.
parser:parse(s)
Parses more of the document. The string s contains part (or perhaps all) of the document. When called without arguments the document is closed (but the parser still has to be closed).
The function returns the parser object on success. If the parser finds an error it returns five results: nil, msg, line, col, and pos. The last four are the error message, the line number, the column number and the absolute position of the error in the XML document.
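    local lxp = require "lxp"

    local cb = {}    -- table with callbacks
    local doc = "<root>xml doc</root>"
    -- each method returns the parser on success, so the calls can be chained
    lxp.new(cb):setencoding("UTF-8"):parse(doc):parse():close()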
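
When the input may be malformed, the five results can be captured instead of chaining. The following is a sketch under the assumption that the lxp module is installed; the broken document is invented for the example, and the close call is wrapped in pcall because closing a parser whose document failed may itself raise an error.

```lua
local lxp = require "lxp"

local parser = lxp.new({})

-- </b> does not match the open <a>, so this chunk is rejected.
local ok, msg, line, col, pos = parser:parse("<root>\n<a></b>\n</root>")
if not ok then
  print(("XML error at line %d, column %d (position %d): %s"):format(line, col, pos, msg))
end

-- Guarded: the document never completed, so close may report the failure again.
pcall(parser.close, parser)
```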
parser:pos()
Returns three results: the current parsing line, column, and absolute position.
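
For instance, a handler can record where each element starts. This is a sketch that assumes the lxp module is installed; the lines table and the sample document are illustrative.

```lua
local lxp = require "lxp"

-- Remember the line on which each element was opened.
local lines = {}
local parser = lxp.new({
  StartElement = function(p, name)
    local line, col, pos = p:pos()
    lines[name] = line
  end,
})

assert(parser:parse("<a>\n  <b/>\n</a>"))
assert(parser:parse())
parser:close()

print(lines.a, lines.b)
```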
parser:getcurrentbytecount()
Returns the number of bytes of input corresponding to the current event. This function can only be called inside a handler; in other contexts it returns 0. Do not use it inside a CharacterData handler unless CharacterData merging has been disabled (see lxp.new).
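
A small sketch of the intended use, assuming the lxp module is installed: the handler measures the start tag that triggered it. The tag string and variable names are made up for the example.

```lua
local lxp = require "lxp"

local tag = '<item id="1">'
local count_in_handler

local parser = lxp.new({
  StartElement = function(p, name, attributes)
    -- Inside the handler: size in bytes of the start tag being reported.
    count_in_handler = p:getcurrentbytecount()
  end,
})

assert(parser:parse(tag .. "text</item>"))
assert(parser:parse())
parser:close()

print(count_in_handler, #tag)
```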
parser:returnnstriplet(bool)
Instructs the parser to return namespaced names as a triplet (true), which adds the prefix to the URI and local name, or only as a duo (false). This must be set before calling parse, and it only has an effect if the parser was created with a separator. Returns the parser object.
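
The difference between the two settings can be sketched as follows, assuming the lxp module is installed. The helper first_name, the "|" separator and the urn:example namespace are hypothetical choices for the example.

```lua
local lxp = require "lxp"

-- Hypothetical helper: parse one element and return the name as reported.
local function first_name(triplet)
  local seen
  local parser = lxp.new({
    StartElement = function(p, name) seen = seen or name end,
  }, "|")
  parser:returnnstriplet(triplet)   -- set before the first call to parse
  assert(parser:parse('<a:root xmlns:a="urn:example"/>'))
  assert(parser:parse())
  parser:close()
  return seen
end

local duo = first_name(false)   -- URI and local name
local trio = first_name(true)   -- URI, local name and prefix

print(duo, trio)
```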
parser:setbase(base)
Sets the base to be used for resolving relative URIs in system identifiers. Returns the parser object on success.
parser:setblamaxamplification(max_amp)
Sets the maximum amplification (float) to be allowed. This protects against the Billion Laughs Attack. The libexpat default is 100. Returns the parser object on success.
parser:setblathreshold(threshold)
Sets the threshold (int, in bytes) after which the protection starts. This protects against the Billion Laughs Attack. The libexpat default is 8 MiB. Returns the parser object on success.
parser:setencoding(encoding)
Sets the encoding to be used by the parser. There are four built-in encodings, passed as strings: "US-ASCII", "UTF-8", "UTF-16", and "ISO-8859-1". Returns the parser object on success.
parser:stop()
Aborts the parser and prevents it from parsing any further through the data it was last passed. Use it to halt parsing, for example when an error is discovered inside a callback. The parser object cannot accept more data after this call.
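
As an illustration, the sketch below stops at the first element named bad. It assumes the lxp module is installed; the element names are invented. The result of the aborted parse call is printed rather than relied upon, and close is wrapped in pcall because the document was never completed.

```lua
local lxp = require "lxp"

local seen = {}
local parser = lxp.new({
  StartElement = function(p, name)
    seen[#seen + 1] = name
    if name == "bad" then
      p:stop()   -- give up: nothing after this element is wanted
    end
  end,
})

local ok, msg = parser:parse("<root><good/><bad/><never/></root>")
print(ok, msg)                     -- outcome of the aborted call
print(table.concat(seen, ","))     -- elements reported before the stop

-- Guarded: closing a parser with an unfinished document may raise an error.
pcall(parser.close, parser)
```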

Callbacks

The Lua callbacks define the handlers of the parser events. Passing a table to the parser constructor has some advantages over registering callbacks individually, since there is no need for the API to provide a way to manipulate callbacks.

Another difference lies in the behaviour of the callbacks during the parsing itself. The callback table contains references to the functions that can be redefined at will. The only restriction is that only the callbacks present in the table at creation time will be called.
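
Redefinition during parsing might look like the sketch below, in which the first StartElement handler replaces itself after handling the root element. It assumes the lxp module is installed; the log table and the sample document are illustrative.

```lua
local lxp = require "lxp"

local log = {}
local callbacks = {}

-- Present at creation time, so the parser dispatches to this key.
callbacks.StartElement = function(parser, name)
  log[#log + 1] = "first:" .. name
  -- Swap in a different handler for the rest of the document.
  callbacks.StartElement = function(parser, name)
    log[#log + 1] = "later:" .. name
  end
end

local parser = lxp.new(callbacks)
assert(parser:parse("<root><a/><b/></root>"))
assert(parser:parse())
parser:close()

print(table.concat(log, " "))
```

This pattern suits documents whose header needs different treatment from the body, without any extra registration calls.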