Parsing is an expensive operation that can degrade XML processing performance. A survey of four representative XML parsing models (DOM, SAX, StAX, and VTD) reveals their suitability for different types of applications.
Broadly used in database and networking applications, the Extensible Markup Language (XML) is the de facto standard for interoperable document formats. As XML becomes widespread, it is critical for application developers to understand the operational and performance characteristics of XML processing. As Figure 1 shows, XML processing occurs in four stages: parsing, access, modification, and serialization. Although parsing is the most expensive operation [1], there are no detailed studies that compare the processing steps and associated overhead costs of different parsing models, the tradeoffs in accessing and modifying parsed data, and the access and modification requirements of XML-based applications.
Figure 1 also illustrates the three-step parsing process. The first two steps, character conversion and lexical analysis, are usually invariant among different parsing models, while the third step, syntactic analysis, creates data representations based on the parsing model used.
To help developers make sensible choices for their target applications, we compared the data representations of four representative parsing models: document object model (DOM; www.w3.org/DOM), simple API for XML (SAX; www.saxproject.org), streaming API for XML (StAX; http://jcp.org/en/jsr/detail?id=173), and virtual token descriptor (VTD; http://vtd-xml.sourceforge.net). These data representations result in different operational and performance characteristics.
XML-based database and networking applications have unique requirements with respect to access and modification of parsed data. Database