Processing XML With Java - A Performance Benchmark - Ipp.pt

Transcription

International Journal of New Computer Architectures and their Applications (IJNCAA) 3(1): 72-85The Society of Digital Information and Wireless Communications (SDIWC) 2013 (ISSN: 2220-9085)Processing XML with Java – A Performance BenchmarkBruno Oliveira1,Vasco Santos1 and Orlando Belo21CIICESI, School of Management and Technology, Polytechnic of PortoFelgueiras, PORTUGAL{bmo,vsantos}@estgf.ipp.pt2Algoritmi R&D Centre, University of Minho4710-057 Braga, PORTUGALobelo@di.uminho.ptABSTRACTOver time, XML markup language has acquired aconsiderable importance in applications development,standards definition and in the representation of largevolumes of data, such as databases. Today, processingXML documents in a short period of time is a criticalactivity in a large range of applications, which imposeschoosing the most appropriate mechanism to parseXML documents quickly and efficiently. When using aprogramming language for XML processing, such asJava, it becomes necessary to use effectivemechanisms, e.g. APIs, which allow reading andprocessing of large documents in appropriated manners.This paper presents a performance study of the mainexisting Java APIs that deal with XML documents, inorder to identify the most suitable one for processinglarge XML files.KEYWORDSXML, XML Markup languages, XML Documents, JavaAPI, Performance Analysis.1 INTRODUCTIONDue to the simplicity of its hierarchical structure,XML (Extensible Markup Language) is widelyused for data representation in many applications.As a result of its portability, XML is used toensure data interchanging among systems withhigh heterogeneous natures, facilitating datacommunication and sharing, it’s platformindependent, which makes it quite attractive for themajority of applications. Associated with the XMLformat there are other languages that complementthe application area of this format, such as XSD,XSLT or XQuery. Currently, XML format is usedin the development of several types of software,including web pages, web services, networkapplications, and fully based XML databases.Access and modification operations are essential toXML files manipulation once they are affected byany increasing amount of data, by the complexityof those operations, and by shorter periods of timeneeded to process them. Coupled with this datagrowing, XML documents can reach large numberof megabytes (or even gigabytes), limiting andconditioning the technology used for developmentof applications appealing for XML dataprocessing. Also coupled with the concept ofportability, Java programming language provides aset of interfaces allowing for the manipulation ofstructured documents according to the XMLformat. Due to their portability, Java and XML arecommonly used in application development and innative XML databases for data manipulation [1].The main focus of this paper was to conduct astudy of the various parsing models and APIs(Application Programming Interfaces) for XMLprocessing using Java programming language, withthe purpose to supply a refresh benchmark to theavailable representation models, identifying whichis the most suitable for access and transformationof large XML documents. We also refer the mainadvantages identified for each representationmodel, always keeping the performance factor inmind. In the next section we will examine someinteresting related work about studies andevaluations between several Java APIs across time.Later, in section 3, we will discuss some otheroperational characteristics for memory andstreaming representation models, identifying howdocuments are processed according to each parsingmodel. Section 4 and section 5, respectively,72

International Journal of New Computer Architectures and their Applications (IJNCAA) 3(1): 72-85The Society of Digital Information and Wireless Communications (SDIWC) 2013 (ISSN: 2220-9085)presents some memory-based APIs and streamingbased APIs and their main features. After, insection 6, a brief comparison between performanceand memory consumption of memory-based APIsand streaming-based APIs will be done. We usedspecific XML instances with different sizes and wetested selected APIs (memory and streamingbased) for execution time and memoryconsumption. We also developed a specific unaryand binary transformation operations, and testedthem for execution time using best memory andstreaming APIs selected from previous tests. Next,in section 7, we compare modificationperformance of the best memory-based APIsstudied previously, exploring some configurationsin each of them that influences execution time andmemory consumption. We finish the paper insection 8 summarizing results and presentingconclusions.2 RELATED WORKIn [2] the process of handling XML documentswas described in four phases: Parsing, that isconsidered a critical step in performance, Access,Modification and Serialization (figure 1), whoseperformance is directly affected by the parsingmodels.Figure 1. Example of a XML memory tree representationAs the most critical factor of performance, parsingis characterized by the conversion of characters,mainly related to the conversion of characters intoa format that a programming languageunderstands, lexical analysis which is the processthat identifies XML elements, e.g. start node, endnode or characters, applying regular expressionsdefined by World Wide Web Consortium (W3C)1.The last step of the parsing phase is the syntacticanalysis of the document, where it is checked if thedocument complies with the rules of construction1http://www.w3.org/of an XML document. Finally, the API implementsaccess and modification operations on the dataresulted from the parsing process.Due to its complexity and importance, the parsingprocess is the most critical operation in XMLprocessing, directly conditioning processing timeand memory consumption. Several studies [3–9]have been conducted with the goal to test, improverepresentation models and APIs in XMLprocessing [10]. As Java and other technologiesevolve, it is necessary to review the newapproaches and improvements provided by severalXML parsers available.In 2001 Sosnoski condutes [11] a detailed studywith the main parsers that existed at the time. Theauthor tested DOM2, JDOM3, dom4j4, ElectricXML (no longer supported), and XML Pull Parser- XPP (no longer supported), using small files withdiverse data structures. The benchmark consists indocument build time (construct XML documentbased on text file), document navigation, modifytime, output XML document representations astext documents, amount of memory needed fordocument representation, execution time andoutput document size for Java serialization step.Later in 2002, Oren [5] proposes Piccolo XMLparser presenting a comparative study betweenparsers, which implements SAX (Simple API forXML Processing)5 interfaces. Although outdated,these study provided interesting guidelines relatedto the test methodology and conclusions about theoverall best API, which changes in subsequentstudies [6] for similar tests. Another interestingstudy was realized by Perksins et al. [9], whereauthors use a small (less than 1 KB) XMLrepresenting a typical purchase order structure totest transcoding impact and object creation ofDOM, SAX and JAX-RPC. The authors alsoexplore the navigation costs of each API andcompare the results with a specific XPath parser.In [6], authors provide a detailed study aboutperformance of VTD (Virtual Token Descriptor6)(with and without buffer reuse), SAX (Piccolo and2http://download.oracle.com/javase/6/docs/ ect.org6http://vtd-xml.sourceforge.net373

International Journal of New Computer Architectures and their Applications (IJNCAA) 3(1): 72-85The Society of Digital Information and Wireless Communications (SDIWC) 2013 (ISSN: 2220-9085)Xerces implementation), XML Pull Parser andDOM (with and without deferred node expansion).In order to provide a benchmark of each one ofAPIs tested the authors used a set of XMLexample files, which represents typical real-worldapplications. These files have several sizescategorized as Small (between 1,6 6,8 KB),Medium (10 1 MB) and Big (between 1 15MB). Tests were conducted with files in memory(same as [5]), with the purpose of reducing I/Ocosts. XML parsing performance was conductedfor testing latency, memory usage and navigationperformance. Further, Haw and Rao [3] provided acomparative study and benchmarking betweenSAX, StAX (Streaming API for XML7), DOM andElectric XML, proposing a new SAXimplementation called xParse. In that work,authors compared SAX and DOM for Xerces Javaand .NET implementations using specificoperations based on small XML files.More recently, VTD website [12] conducted abenchmark between Xerces DOM (with deferedand non-defered mode), SAX, Piccolo, XML PullParser (XPP3) and VTD, showing the globalsuperiority of VTD. Authors use fourbenchmarking processes, the first one, tests VTDand DOM for indexing-related performance usinga XML data structure from a typical sellingapplication with sizes between 6 KB and 9 MB.The tests apply specific XPath expressions to thesefiles in order to test a variety of scenarios based onfilter and select operations. Initially, XML indexfiles are loaded into memory, in order to remove aspecific node from the result set, generated by theapplication of XPath expressions. Consequently,the result sets are written to the output document.Next, authors test the parsing process, XPathevaluation and XML modification, using the samefiles and XPATH expressions for VTD (withbuffer and without buffer reuse), and DOM. Forthis benchmark, XML files are loaded intomemory and after the parsing of the document,XPath expressions are evaluated, a specific node isremoved from the result set that is then written toan output document. The third test comparesperformance of XPath expressions for a large7number of iterations for VTD, Jaxen and Xalan.Finally,thelasttestcomparesVTDimplementation in Java (with and without bufferreuse) and C language to SAX, DOM, Piccolo andXPP3 for performance and memory usage. Forthat, authors use diverse XML data structure fileswith a size between 1 KB and 26 MB(approximately). The overall results show a clearsuperiority of VTD in relation to other approaches.This last test is the most interesting for us, sincewe will focus on the similar topics in this article.However being very detailed, the benchmark fromVTD website did not focus in all topics that wewant to test (e.g. big files with more than 1GB),and some of the other benchmarks already focusedwere outdated or did no use Java programminglanguage. This is mainly caused by miscellaneousupdates and improvements in the executionenvironment, particularly in the Java VirtualMachine, which affects, as we know, runtime andeffectiveness of the operations.3 MEMORY AND STREAMING-BASEDREPRESENTATION MODELSMost memory-based APIs use a common model indata processing, where XML documents areentirely stored in memory in a tree format withmultiple nodes, descending all from a single noderepresenting the root of the tree. This kind ofschema allows the use of different methods tolocate and manipulate data contained inside thenodes. Using memory-based models implies thatthe parser partially or totally allocates memory fordata tree (figure 2) from specific XML file,making data ready for using in navigation methodsin order to process required data.Figure 2. Parsing step for memory-based modelshttp://stax.codehaus.org/Home74

International Journal of New Computer Architectures and their Applications (IJNCAA) 3(1): 72-85The Society of Digital Information and Wireless Communications (SDIWC) 2013 (ISSN: 2220-9085)Figure 3. Example of a XML memory tree representationFor each search or other kind of manipulation, it isnecessary to start processing by the root elementcontinuing in the structure hierarchy to access theremaining data (figure 3). Since all the informationis available in memory, we can traverse the tree inrandom order, changing the positioning of thenodes and performing data transformations in avery simple and accessible way. Considering itsmemory structure representation, these APIsfacilitate the process of application development,providing a wide range of search methods thatallow you to easily perform operations on theconstituent nodes of the tree. However memorybased APIs consume, in average, four to five timesmore memory than the document’s size. Forexample, a 20 megabytes document needs,depending on the representation model,approximately 100 megabytes in order to be storedin memory, which may represent a problem inprocessing large documents.Streaming-based APIs perform a sequential scan ofthe document using minimum memory resources.Typically, this type of APIs use the depth of theXML document (number of nested elements) andthe maximum data stored in XML attributes on asingle XML element. Both of these are alwayssmaller than the size of the memory-based parsingtree approach. Then, a small portion of thedocument is extracted sequentially without theneed to load the whole document structure.Usually, the parser reads the XML documentcalling a specific method for each type of event toprocess its object. Figure 4 presents the SAXconceptual model for XML processing, which issimilar to other streaming-based APIs.The parser is configured as an input source, whichis associated with a set of content managementmethods that identify, for example, the beginningor the end of the document and elements of datathat might contain errors that occurred during theparsing step. When the parser runs, some eventtriggers are captured by content managementmethods. Each time the parser detects an importantpart of the XML document it triggers theappropriate method in order to read the respectivedata block.Conceptual model from figure 4 represents pushmodel that is used by SAX API. Basically, in pushmodel parser checks for each event type retrieveby source XML file. With this approach, the parserhandles all XML events making uninterestedevents impossible to avoid, and as consequenceaccess applications must handle all eventsprovided from parser. In other way, StAXimplements pull model, which events are handledby access applications that are responsible toinvoke specific events, avoiding non-necessaryevents (figure 5).Essentially, taking into account its operationalcharacteristics, the push model is more suitablewhen we need to read all XML file, since theparser will read all XML event tokens. Howeverwhen user applications need, for some reason, to75

International Journal of New Computer Architectures and their Applications (IJNCAA) 3(1): 72-85The Society of Digital Information and Wireless Communications (SDIWC) 2013 (ISSN: 2220-9085)skip parts of XML file, then the pull model shouldbe used [2].APIs that revise some of its characteristics, withthe aim of serving specific requirements.For instance, the JDOM API allows themanipulation of XML documents with Java via atree structure representation, thus being similar toDOM. However, this API has been developedspecifically for Java language, making it muchmore intuitive for a typical Java programmer. Forexample, there is no Text class [13], since Javaprogramming language provides its own class(String class). JDOM takes advantage of Javafeatures such as: creating methods with the samename, reflection9, weak references10, and the use ofcollections such as List and Iterator [14]. JDOMAPI differs from DOM API in the use of classesinstead of interfaces, simplifying the API butlimiting flexibility.Figure 4. SAX parsing model8Streaming-based APIs are more suitable forprocessing large XML documents, because, intheory, they can process documents of infinite size.Figure 5. Push vs pull model4 MEMORY-BASED APISIncluded in JAXP package, DOM API is acollection of classes that has a set of Java methodsthat allows XML processing in memory with astructure similar to figure 3. In several cases, theDOM API is the basis for the construction of new8Image trabalhos/sem011/t2/apis xml java/For his part, the dom4j is an open-source APIbased on DOM and JDOM concepts, using aninterface and abstract base class approach, withextensive use of the Collection classes. dom4j is amore complete solution than JDOM, which givesmore emphasis to the use of the interfaces, addingmore flexibility at the cost of a little addedcomplexity [11, 12].Inspired by DOM and JDOM, the XOM API wasdesigned to be the best of both worlds. In Harold’spresentation [16], XOM is classified as an easy touse API, fast and simple. XOM makes use ofexisting Java mechanisms (like JDOM), revealinga far more restricted API that does not allowcreation of malformed documents, forcingvalidations through the use of inheritance. In suchoverview some disadvantages of JDOM werepresented, namely the one that considers itinconsistent since there are several ways toaccomplish the same tasks (like reading a childelement) and due to some gaps in the use of Javaconvention (e.g. set methods not always returnvoid).Another disadvantage listed, refers to elements ofan XML document that are represented usingobjects, which produces small memory overheads.In addition, a comparison is also provided with /05/04/understanding-weakreferences1076

International Journal of New Computer Architectures and their Applications (IJNCAA) 3(1): 72-85The Society of Digital Information and Wireless Communications (SDIWC) 2013 (ISSN: 2220-9085)dom4j that uses interfaces instead of classesresulting in a more complex API. Briefly, we cansay that dom4j is an API based on DOM (andextended), and the XOM API based on theprinciples of DOM with the main goal ofsimplifying XML processing. JDOM, dom4j andXOM have the advantage of being specificallydeveloped for the Java language, unlike other APIs(like DOM), which were developed in a genericway for several programming languages [11].XQuery is a language for extracting data from anXML document that allows the creation of a highlevel code for extraction of data, similar to whathappens with SQL language for relationaldatabases. This language will require nativesupport from the API that should interpretcommands produced from XQuery language.OJXQI (Oracle Java XQuery API) is an APIproposed by Oracle which is incorporated into itsdatabase with support for XQuery language,simplifying XML transformations through the useof a simple language, which is very similar inconstruction to SQL language.Oracle supports XQuery in two different levels:database and mid-tier. The first one applies queriesin the database environment and the second onerun queries on sources, which are not databases.Thus, it is possible to compile several clausesallowing XQuery execution, and consequently leadto a new set of results. Data from OJXQI API isentirely processed in memory, allowing thecreation of DOM objects in order to represent thedata.The last API that was analyzed, representing XMLdata in an object tree structure, is namedXerces211, and consists in a set of parsers that useDOM and SAX data models. We tested the DOMimplementation, which naturally follows the sameguidelines in terms of architecture as the previousAPIs presented.On the other hand, VTD (Virtual TokenDescriptor) API uses a different approach, havingthe premise that the creation of objects is the mainfactor of low performance. VTD API implementsarrays of integers based structure to represent datain memory, eliminating the cost of object creationresulting from the extraction process, through theuse of arrays of 64-bit integers called VTD records(figure 6).Figure 6. Representation of a VTD record12A VTD record is a binary encoding format thatspecifies how to assign tokens (identification codescomposed by length, offset, nesting depth and typeof XML tokens) in a non-extractive method. Theconcept of parsing "non-extractively" [12] meansthat XML text remains intact in memory while thetokens are represented exclusively by using rangesand sizes in bits (the contents of the string is notcopied) [2]. The process contrasts with the methodused by other extractive XML processing models(such as DOM and SAX), which allocate blocks ofmemory for document contents allocation,manipulating data directly. This manipulation canonly be performed after the parsing process hasfinished with document size as the largestbottleneck in XML data access performance.5 STREAMING-BASED APISStreaming-based APIs do not maintain long-livedstructures of documents in memory. This type ofAPIs read data as a series of events representingthem in a form of objects (like the DOM API),using a small portion of memory to process thedocument in a sequential way. Objects areassociated with different types of events and arenot maintained too long in memory unlike theapproach of memory-based APIs.The JSR (Java Specification Request) 17313defines Streaming API for XML (StAX14), thatallow parsing elements in streaming mode, and the12Figure extracted from sr/detail?id s.apache.org/xerces2-j/index.html77

International Journal of New Computer Architectures and their Applications (IJNCAA) 3(1): 72-85The Society of Digital Information and Wireless Communications (SDIWC) 2013 (ISSN: 2220-9085)extraction of information through events controlledby the application (pull model), differing fromSAX API of JAXP package, that has a managerthat takes events as convenience of the parser(push model). While StAX API allows you todiscard information in the document’s parsing asappropriate (invoking the nextEvent method), SAXparser extracts all elements even if you don’t needthem.In addition, StAX has two integrated APIs withdifferent levels of abstraction: the cursed-basedAPI, which is a lower-level API, focused onefficiency and simplicity of use, that works like astream of events, and the iterator based API thatoffers a higher level of abstraction allowingpipelining, and representing the events throughobjects. This implementation allows theprogrammer to ask (peek() method) withoutreading the event.It is possible to skip the input of both the Cursorand Event approaches. In this study we focus oncursed-based API because it is the most efficientway to read XML data [17]. In addition to SAXand StAX, we also tested XOM API withNodeFactory implementation. NodeFactory allowsparsing the XML document as Streaming like SAXand StAX.proceeds. Table 1 summarizes all APIs describedbefore.In order to test memory usage and execution timefor each API, we used two different families ofXML documents:1)one representing sales orders of a particularcompany (SalesOrderDetail), which wastaken from the Microsoft Data Warehousesamples: Adventure Works15;2) an other generated by xmlgen16 tool whichaims to represent information about a biddingweb site, from an e-commerce17 typicalapplication.6 PERFORMANCE ANALYSES OF APISTable 2 presents the size of the documents and theproperties used on tests for each API. We usedthree instances of different sizes for eachdocument type in order to test not only the size ofin-memory representation, but also the elapsedtime of parsing each document.Table 2. Documents used on testsFileFile sizeNumber ofnodes18SalesOrderDetail19,9 MB20213SAX, StAX and XOM (streaming modeimplementation) allow access to data before theparsing process is completed.SalesOrderDetail260,8 MB121317SalesOrderDetail3145,5 MB304688AuctionWebSite111,7 MB2175Table 1. APIs analysis summaryAuctionWebSite258,0 MB10875AuctionWebSite3163,4 MB30444APIParsing ModelJAXP: SaxStreaming events: push modelJAXP: StAXStreaming events: pull modelJAXP: DOMMemory: tree objectXOMMemory: tree objectOJXQIMemory: tree objectjDOMMemory: tree objectdom4jMemory: tree objectXerces2Memory: tree objectVTDMemory: array of integersThis feature allows memory consumption toremain low because processed data, and no longerin need, might be released from memory, thuskeeping memory usage low as the parsing process6.1 Memory-based APIsThe study consisted in measurements of memoryconsumption in megabytes (MB) - (figure 7), andexecution time in milliseconds (ms) - (figure ww.xml-benchmark.org17Tests realized in 2.53 Ghz Intel Core 2 Duo, 4 GB 1067 GhzDDR3, Mac OS X 10.6.4, hard drive with 5400 RPM, 1.6.0 20 –Open JDK Runtime Environment with 455 megabytes of memoryavailable18In this particular scenario, a node represents a data record. Forexample, in the SalesOrderDetail document, one node representsone sales record.1678

International Journal of New Computer Architectures and their Applications (IJNCAA) 3(1): 72-85The Society of Digital Information and Wireless Communications (SDIWC) 2013 (ISSN: 2220-9085)used by each memory-based API for thereplication of the respective XML file.Results are based on an arithmetic average resultedfrom five executions for each API for eachdocument (without considering the time of the firstexecution).&#!"!"# %&'"()&!!"%#!"%!!"'()* ,-.*-/*0(1) " #!"'()* ,-.*-/*0(1)%"'()* ,-.*-/*0(1)&" !!"2345678*9'10* "#!"2345678*9'10*%"!"2345678*9'10*&"* ),-.()Figure 7. Memory consumption in megabytes of memorybased APIsThe results shows the gain of VTD in relation toother memory-based APIs, either in terms ofmemory usage or at runtime, showing that VTDrepresentation model of data is much superior thanother APIs representation.of a DOM document in memory is higher than theXOM and OJXQI representation. When largeXML files are used, the memory-based approach isnot feasible due to inherent memory limitations.6.2 Streaming-based APIsOnce memory consumption of streaming-basedAPIs is reduced, not representing a critical point interms of processing, we only tested parsing speedin milliseconds for each API: SAX, StAX (wasdeemed the cursor-based API) and XOM(streaming-based approach) (figure 9) for each ofthe documents presented earlier.SAX and StAX are very similar in timeconsumption, which is easily expected, since themain point that distinguishes these two APIs ishow the parser handles the events processed.Considering the entire document, the results arequite similar, nevertheless XOM has a much lowerperformance compared to other streaming-basedAPIs.12000"10000"!"#8000"ms#'%!!!"' !!!"'#!!!"'!!!!"&!!!"%!!!" !!!"#!!!"!"()* ,-./ .0 1)2*'"()* ,-./ .0 1)2*#"()* ,-./ .0 1)2*3"456789: ;(21 '"456789: ;(21 #"456789: ;(21 3" %&%#'()"#Figure 8. Execution time in milliseconds of memory-basedAPIsWith the exception of VTD, no other memorybased API was able to perform the parsing of thebiggest documents with the amount of memoryavailable on Java Virtual Machine (SalesOrderDetail3 - green bar and AuctionWebSite3 –orange bar). Noteworthy is the good performancein parsing time of DOM in relation to othermemory-based APIs. Although the Is#Figure 9. Execution time in milliseconds from streamingbased APIsAs we stated before, StAX provides two mainapproaches for XML handling: Cursor API withXMLStreamReader method and Event API withXMLEventReader. Event API differ from CursorAPI in accessibility and flexibility, howeverperformance between the two approaches are verydistinctive since Cursor API is a lower level APIthat processes XML files as a stream of events.On the other way, Event API allows the processingof XML files as a series of event objects,supporting a more abstract way to handle XML79

International Journal of New Computer Architectures and their Applications (IJNCAA) 3(1): 72-85The Society of Digital Information and Wireless Communications (SDIWC) 2013 (ISSN: r"API" StAX"iterator"API"Auc9onWebSite3"StAX#APIs#Figure 10. Execution time in milliseconds from StAX cursorand Iterator APIMemory consumption is relevant between the twoapproaches. Event API consumes practically thesame memory for AuctionWebSite instances, andfor SalesOrderDetail instances consumes at most43% more memory when compared to Cursor API.6.3 Comparative analysis of two types of APIsMemory-based APIs are widely used due to thefact that, in most cases, documents beingprocessed are small enough to fit in memory.However, in cases where memory availability islimited, or the size of the XML document to beprocessed is large, streaming-based APIs are themost suitable. Project requirements are crucial todetermine the most suitable type of API used. Theneed to apply document transformation is also aconsiderable factor for API selection, oncememory-based APIs are much more suitable forthis type of operation, while streaming-based APIsare more used for forward-only applications.In order to test API performance in documenttransformations we considered SalesOrderDetaildocuments for the following APIs: SAX, StAXand VTD. Two operations were developed for eachAPI: Selection: an operation that selects a set ofelements based on a given predicate,representing forward-only access to data.Difference: an operation that removes fromthe first document all the elements that arein common with the second document,representing a random access to data.A selection operation, based on a predicate, selectsall elements where SalesOrderID has a value of43,659, producing a new document. The differenceoperation checks if an element, immediately belowthe root node of a document R, exists in adocument S thus disregarding it and keeping itonly if he doesn’t exists if document S. For thedifferenceoperationweconsideredSalesOrderDetail for both arguments in order toproduce an empty document so we couldextensively use the algori

presents some memory-based APIs and streaming-based APIs and their main features. After, in section 6, a brief comparison between performance and memory consumption of memory-based APIs and streaming-based APIs will be done. We used specific XML instances with different sizes and we tested selected APIs (memory and streaming