Semi-structured Data Management In The Enterprise: A Nimble, High .

Transcription

Semi-structured Data Management in the Enterprise: A Nimble, HighThroughput, and Scalable ApproachDavid Maluf, David Bell, Naveen Ashish, Chris Knight and Peter TranNASA Ames Research Center, Moffett Field, CA 94035, USADavid.A.Maluf@nasa.govAbstractIn this paper we describe an approach and systemfor managing enterprise semi-structured data that ishigh-throughput, nimble, and scalable. We present theNETMARK system, which provides for a “schemaless” way of managing semi-structured documents. Wedescribe in particular detail the unique underlyingdata storage approach and efficient query processingmechanisms given this storage system. We present anextensive benchmark evaluation of the NETMARKsystem and also compare it with related XMLmanagement systems. At the heart of the approach isthe philosophy of a focus on most common datamanagement requirements in the enterprise, and notburdening users and application developers withunnecessary complexity and formal schemas.1. IntroductionSearching, extracting, and integrating informationfrom documents, in a simple yet precise manner, is akey requirement in many enterprise-wide informationsystems applications. The term ‘documents’ hereincludes textual documents such as reports and otherdocuments in formats such as MSWord, PDF or others,spreadsheets, presentations in formats such as MSPowerpoint etc.1 Such information is typically “semistructured” in that there is some structure in thedocuments but not exactly a formal structure such asthat imposed by a database schema or an XMLDTD[20]. While there are indeed systems available formanaging semi-structured or unstructured data, such asDocuShare[6] from Xerox, products from companieslike Verity2 and Autonomy3 and research systems forsemi-structured or XML data management[9,11,15],we have designed a system that is significantly more1Indeed, 80% of the enterprise data is stored in suchunstructured or semi-structured documents, instead of indatabases, according to research firm Gartnerflexible and simple from a user perspective andscalable from an application development perspective.We describe the NETMARK system[13] that theNASA Ames Research Center4 has developed and thathas been used for several NASA5 enterprise data andproject management applications. The focus of thispaper will particularly be on the data storage and queryprocessing aspects of NETMARKA key distinguishing feature of the NETMARKapproach (vis-à-vis other semi-structured data or XMLdata management systems) is that it does not requireusers (or administrators) to have to formally define thesemantics of the data (schemas) in the documents orcollections of documents. Structure and semanticsinformation implicit in the document is exploitedinstead. This leads to a system where sophisticateddata integration and composition applications can bebuilt without high schema management overheads, in ahighly scalable manner.This paper is organized as follows. In section 2, thebasic document query and search paradigm aroundwhich the NETMARK approach is centered isdescribed. Section 3 contains the technical details ofthe approach, focusing first on document storage issueswhere a unique node structure representation fordocument storage is presented. This is followed by adescription of query processing details. Section 4provides extensive benchmarking results evaluatingthe performance of the NETMARK system followedby a comparison of NETMARK with related semistructured and XML data management systems inSection 5 and a conclusion.2. Document QueryingThe data querying capabilities in NETMARK arecentered around the notions of context and content indocuments. Context and Content are notions a.gov/home/index.html

facilitate perceiving and querying a document byvarious components (sections and sub-sections) asopposed to treating a document as a single unit. Adocument typically has an inherent structure, i.e., it canbe perceived to be comprised of various distinctsections and sub-sections based on that structure. Forinstance consider one of the monthly project reportsshown in Fig 1. It comprises of various sections suchas Report Month, Report Year and Performance Status,and sub-sections such as Technical Status, ScheduleStatus etc. A context is essentially a section or subsection within a document. So for instance for themonthly project report document as described above,the Report Year, Org, Performance Status, Tech MgmtComment, would all be contexts. The information (forinstance the text) within a context is referred to ascontent.enterprise data applications. A key capability is that ofcontext search. A context search query, such as“Context Budget Comment”6 will return the contentportion in the ‘Budget Comment’ sections (the text inthe Budget Comment section) in all the documents in adocument collection, as illustrated in Fig 2. A contextquery thus extracts the specified context (section) fromall documents and returns it to the user.Fig 2. Context Query ResultsFig 1. Monthly Report DocumentsFor instance any text (and figures or tables) in theSchedule Status context (sub-section) is the contentassociated with the Schedule Status context. Contextand content thus are associated in pairs, the contentbeing associated with a context. As another example,the contexts for this paper as an example of adocument, are sections such as the Abstract,Introduction, Document Querying, etc. The querycapabilities in Netmark, built around the notions ofcontext and content, have been developed based on ourknowledge of the most common and important queriesto documents and semi-structured data in typicalFig 3. Content Query Results6This is not the precise query syntax and we do not think itessential to use the formal and precise Netmark querysyntax here

In this URL we may also specify an XSLT stylesheetwhich specifies how the results are to be formatted andcomposed into a new document. Fig 5. provides anillustration of using XDB Query to query the data inNETMARK and then using XSLT to format theresults. XSLT transformation is done using the XalanXSLT processor [19].The basic query syntax for XDB queries is:https:// server address /xslt/xdbquery/{[context context keys ] [&content content keys ]} [&scope relative url to folder ] [&syntax {html,xml,ascii}] [&sxslt relative url to xsltfile ] Fig 4. Context Content Query ResultsUsers can also specifying content searches, which areessentially keyword searches that return all documentscontaining the specified search terms. For instance, acontent query such as “Content Ames” will return alldocuments that contain the term ‘Ames’ anywhere inthe document as shown in Fig 3. One can also.combine context and content searches, for instance aquerysuchas“Context BudgetComment&Content Ames” returns the “BudgetComment” contexts (sections) of all documents wherethe term ‘Ames’ occurs within the Budget Commentcontext (section) as shown in Fig 4.Thus, as opposed to conventional keyword searchsystems which treat a complete document as a singleunit, NETMARK facilitates searching with respect tospecified contexts in the documents. The NETMARKquery language is a language called XDB Query. XDBQuery allows for posing the context and content kindsof queries over XML documents, as illustrated above.We will not go into the query syntax details here butthe key features are that context and content searchspecifications are appended to a URL that is sent toNETMARK. An example of a formal XDB query t BudgetComment&content AmesFig 5. Formatting ResultsContext and content based queries provide powerfulprimitives for querying and integrating data from semistructured data documents. Also the notions of contextand content extend to documents such as presentations(in say Powerpoint) or Excel spreadsheets as well. Forinstance one might require the ‘Budget’ sections out ofeach presentation in a collection of powerpointpresentations and a context query for ‘Budget’ on thatcollection would facilitate that. There are otherparameters that can be specified in context or contentqueries such as controlling the maximum number ofdocuments returned, the “depth” of the result items andothers but we will not go into those details here.Indeed such query and formatting capabilities haveproved to be quite powerful and adequate for quicklydeveloping several large scale applications (entailingsignificant enterprise data extraction and integration)in the NASA domain. How NETMARK providessimple yet powerful document querying capabilities isdescribed in the following section

3. Technical DetailsIn this section the technical details of NETMARK aredescribed, in particular the storage and queryprocessing aspects. Note that NETMARK serves as asemi-structured data management system and alsoprovides (integrated) access to legacy data sources inenterprises[13]. The NETMARK system architecture isfirst presented. Then a unique approach to storingsemi-structured data documents is presented followedby a description of the query processing approach.3.1 Integrated Legacy Data and XML DataAccessThe NETMARK system architecture is outlined in Fig6. below. We will not delve into the completearchitectural details in this paper and refer the readerto[13] for those.CLIENTSNETMARKWEB INTERFACENETMARKDAEMON3.2.1 Node Structure RepresentationIn related systems, for instance XML data managementsystems that have been built on top of relationaldatabase systems [17], we have specified mechanismsfor mapping XML DTDs to relational schemas andthus storing XML data in corresponding relationaltables. Different XML DTDs are mapped to differentsets of relational tables. NETMARK uses a simplerand more flexible approach where the same (two)relational tables are used to represent and store the datain any semi-structured document. The approach isbased on a searchable node structure which is theeventual representation for documents that are storedin the system. ‘Raw’ documents (initially in any formatsuch as Word, PDF etc.) are first converted7 into XML,which describes the document as decomposed intovarious sections and sub-sections. The XMLrepresentation essentially captures the various contextsand content associated with each context in adocument. For instance the XML representation of thispaper would be as shown in Fig 7. . Context Abstract /Context Content This paper describes an /Content NETMARKEXTENSIBLE APIs Context Introduction /Context Content Searching, Querying and Integrating . /Content Context Document Querying /Context NETMARKXML Store Content A document typically has an inherent /Content SGMLPARSER . RelationalDatabaseFig 6. Netmark System ArchitectureWhat is to be noted for this paper is that the underlyingdata store in NETMARK is a relational databasesystem. The other components shown are those thatinsert new documents into the NETMARK data store(i.e., the NETMARK DAEMON and the SGMLPARSER described in a later sub-section) and theNETMARK APIs for end user clients.3.2 Data Storage in NETMARKIn the section we describe how semi-structureddocuments are stored internally in NETMARK. Theapproach has been to keep the underlyingrepresentation simple, yet expressive enough to storehierarchical relationships in documents.Fig 7. Context and Content SegmentationWe begin by assigning a node to each section or subsection in a document. A node is basically the‘container’ for a context or content in the document.Each node has an assembly of labels or attributes forthe node. The attributes for each node include thefollowing:DOCID: A unique number assigned to the document.NODEID: A unique identifier for each node.NODENAME: A descriptive name for the node7The NETMARK system includes converters whichautomatically convert documents in Word, PDF, Excel andother formats to XML. The converters have been built ontop of frameworks such as Apache Jakarta POI(http://jakarta.apache.org/poi/) and JPedal for PDF(http://www.jpedal.org/)

NODETYPE: Identifies the node type, which is one ofa small list of mutually exclusive node types.NODEDATA: The actual content of the node.PARENTROWID: Contains the ROWID of a parent ofthe node (if any).SIBLINGID: Contains the ROWID of a sibling of thenode (if any).Essentially a document gets divided into blocks ofcontext and associated content. A node serves to holdeach individual context and content block in adocument. For instance, for this paper, for theIntroduction context we would have a CONTEXT typenode (i.e., where NODETYPE CONTEXT) for thatcontext. The NODEDATA for this node would be thename of the context i.e., the string “Introduction”. Wewould have another node of the type TEXT where theNODEDATA would be all the text and figures in theIntroduction section. This content node would also bea child of its associated context node.Fig 8. Relational Tables for Node and FileInformationAbstract1Introduction2Document QueryingFig 9. Node Representation and Relationships forthis Paper as an Example DocumentThe tree in Fig 9. shows some of the nodes and theirrelationships corresponding to the node structurerepresentation of this paper. The arrows denote parentchild relationships between the nodes. We seeCONTEXT nodes such as ‘Abstract’ and‘Introduction’. TEXT nodes capturing content for anycontext are placed as children of their correspondingcontext nodes. For instance, the left child of theAbstract context node is the content node (containingthe text for the abstract section) for that context, etc.Also, each context node is a child of the contextimmediately preceding it in the document. ThePARENTROWID and SIBLINGID elements in eachnode are basically pointers that capture the parent childrelationships between the various nodes.To summarize the above, a raw semi-structured datadocument is first converted to XML where the variouscontexts and contents in the document are segregated.Each context or content is stored in a node structureand parent child relationships (including contextcontent association) are captured through pointersbetween nodes. For each document inserted intoNETMARK, document information is stored in theDOC table and various nodes are stored in the XMLtable, shown in Fig 8. Both the DOC and XML tablesshown in Fig 8. are relational tables in the NETMARKunderlying relational data store.3.2.2 ROWIDs for Node Access and TraversalEach node contains pointers to its parent and siblingnodes. Query processing (described shortly) requiresus to retrieve nodes related to a node (such asretrieving the parent or sibling of a node) in a very fastmanner. For this NETMARK exploits the features ofROWIDs, a data type in Oracle 9i[1], which is therelational data store over which NETMARK is built.ROWID is an Oracle data type that stores eitherphysical or logical addresses (row identifiers) to everyrow within the Oracle database. Physical ROWIDsstore the addresses of ordinary table records (excludingindexed-organized tables), clustered tables, indexes,table partitions and sub-partitions, index partitions andsub-partitions, while logical ROWIDs store the rowaddresses within indexed-organized tables for buildingsecondary indexes. Each Oracle table has an implicitpseudo-column called ROWID, which can be retrievedby a simple SELECT query on the particular table.Physical ROWIDs provide the fastest access to anyrecord within an Oracle table with a single read blockaccess, while logical ROWIDs provide fast access for

highly volatile tables. A ROWID is guaranteed to notchange unless the rows it references is deleted from thedatabase.The physical ROWIDs have two different formats,namely the legacy restricted and the new extendedROWID formats. The restricted ROWID format is forbackward compatibility to legacy Oracle databases,such as Oracle 7 and/or earlier releases. For example,the following displays a subset of the extendedROWIDs from a NETMARK generated schema. It is ageneralized 18-character format with 64 possibilitieseach:AAAAAA BBB CCCCCC DDDThe extended ROWIDs could be used to show how anOracle table is organized and structured; but moreimportantly, extended ROWIDs make very efficientand stable unique keys for information retrieval, whichwill be addressed in the sub-section below on queryprocessing3.2.3 Mapping from Hierarchical to RelationalAlso, object-relational mapping from XML torelational database schema models the data within theXML documents as a tree of objects that are specific tothe data in the document [14]. In this model, elementtype with attributes, content, or complex element typesare generally modeled as classes. Element types withparsed character data (PCDATA) and attributes aremodeled as scalar types. This model is then mapped tothe relational database using traditional objectrelational mapping techniques or via SQL3 objectviews. Therefore, classes are mapped to tables, scalartypes are mapped to columns, and object-valuedproperties are mapped to key pairs (both primary andforeign). This mapping model is limited since theobject tree structure is different for each set of XMLdocuments. On the other hand, the NETMARK SGMLparser models the document itself (similar to theDOM), and its object tree structure is the same for allXML documents. Thus, NETMARK is designed to beindependent of any particular XML document schemasand is termed to be “schema-less”.3.3 Query ProcessingThe NETMARK keyword-based context and contentsearch is performed by first querying text index for thesearch key. Each node returned from the index searchis then processed based on its designated uniqueROWID. The processing of the node involvestraversing up the tree structure via its parent or siblingnode until the first context is found. The context isidentified via its corresponding NODETYPE. Thecontext refers to here as a heading for a subsectionwithin a HTML or XML document, similar to the H1 and H2 header tags commonly found withinHTML pages. Thus, the context and content searchreturns a subsection of the document where thekeyword being searched for occurs. Once a particularCONTEXT is found, traversing back down the treestructure via the sibling node retrieves thecorresponding content text. The search result is thenrendered and displayed appropriately.Query processing in NETMARK differs from relatedXML data management systems that have been builtover relational database systems. The prime reason isthat the way XML data is stored (in a relationaldatabase) in NETMARK is fundamentally differentfrom the manner it is stored in other XML overrelational systems. One of the earlier efforts such as[16] describe techniquessuch as basic inlining and shared inlining, which areschemes for mapping (simplified) XML DTDs torelational schemas without loss of information. Aconcern from a query processing perspective is that ofthe cost of processing XML queries with (possiblylengthy) path expressions as such queries lead tomultiple joins being performed across the underlyingrelational tables, which is expensive. The queryprocessing performance under various alternativemapping schemes is evaluated in [16]. In a relatedsystem called XTABLES[11], which is also an XMLover relational system, a query processing schemebased on an intermediate query representation isdescribed. The XTABLES query processor attempts tomaximally harness the power of its underlyingrelational engine by pushing down most memory anddata-intensive computation to the underlying relationalengine. A user query, in XQuery, is first converted intoan intermediate XML Query Graph Modelrepresentation (XQGM). Rewrite optimizations areperformed on the XQGM and the data and memoryintensive part is pushed down to the relational engineas a single SQL query. A ‘tagger run-time module’constructs the XQuery result from the results of theSQL query and returns it to the user.In NETMARK, we are using ROWIDs (physicaladdress) to traverse between nodes. A ROWIDprovides the fastest access to a record or correspondingnode within a relational table, with a single block readaccess. Accessing a record based on its physicaladdress ROWID provides an efficient, constant accesstime C (machine dependent; normally in themillisecond range) that is independent of the numberof records or nodes in the database and regardless ofmaximum node depth within a node structure. Thetime to respond to a context or content query is thus

approximately proportional to log(N) (first searchtime) plus a sum of the Cs for each successive searchwhere N is the number of records or nodes.4. BenchmarkingThe NETMARK system is intended for managinglarge documents and/or large collections of documents.The question arises, as with any other datamanagement system, regarding the performance ofNETMARK. In fact performance of a couple ofdifferent aspects is of interest. We are certainlyinterested in query processing performance for variouskinds of queries. We are also interested in theperformance of the NETMARK pipeline i.e., howefficiently can NETMARK load in documents inputinto the system. For evaluating query processing,database systems have had a long standing tradition ofbenchmarking the databases against various standardbenchmarks. Benchmarks have also been proposed forXML data management systems[2]. In NETMARK ourprime focus has been on queries centered aroundcontext and content, although NETMARK doessupport full fledged Xpath queries over documents aswell. For evaluating query processing, we thus focusedon evaluating the context and content kinds of queriesin NETMARK. We have also compared NETMARKwith another XML data management system –Berkeley XML8. Also, we have evaluated theperformance of the NETMARK pipeline. All theevaluation results are presented below.4.1 Benchmarking Query ProcessingWe present below results of benchmarking theperformance of NETMARK, including a comparisonwith Berkeley XML, another industry XML datamanagement system. We tested a variety of queriescentered around context and content. The benchmarkswere conducted on a Dell Poweredge 1650 server with2 Intel Pentium III 1.2 GHz processors, with 1.3G ofRAM and running RedHat Enterprise LINUX ASRelease 3.4.1.1 Benchmarking DocumentA generated XML document with the DTD shown inFig 10. was used. We have evaluated the queryprocessing performance of both NETMARK andBerkeley XML for various document sizes, namely100 MB, 50 MB, 25 MB, and 10 MB. A variety ofcontext and content oriented queries for the testdocument(s) were tested. For context andcontext content queries we were able to lequivalent queries (in XQuery) in Berkeley XML. Forcontent queries, even with the XQuery “contains”operator, we cannot pose queries in Berkeley XMLwith semantics equivalent to the NETMARK/XDBcontent query.

Test document size 50 MB ?xml version "1.0"? !DOCTYPE products [ !ELEMENT products(product) productU (productZ) !ELEMENTproduct(item,category,vendor, vendor 2, vendor 3) !ELEMENT productU (itemU, category,vendor, vendor 2, vendor 3) !ELEMENT productZ (itemZ, category,vendor, vendor 2, vendor 3) !ELEMENT item (#PCDATA) !ELEMENT itemU (#PCDATA) !ELEMENT itemZ (#PCDATA) !ELEMENT category (#PCDATA) !ELEMENT vendor(#PCDATA) !ELEMENT vendor 2 (#PCDATA) !ELEMENT vendor 3 (#PCDATA) ] Fig 10. DTD of Benchmarking DataDocument4.1.2 ResultsBenchmarking with single XML documentTest document size 100 MBQueriesNETMARKContext Query6990 †context itemZContext Query7126context productZContext Query234context productUContext Query81context itemUContext SizeContext 00851[399]4970002190‡[32778688]context itemZContext Querycontext productZContext Querycontext productUContext Querycontext itemUContext ContentQuerycontext item&content Lemon GrassContext ContentQuerycontext category&content FruitsContent Query 1Content Query 2ResultSize4,000 *[441362]4,000[9715210]1[1561]1[540]1[482]context item&content Lemon GrassContext ContentQueryQueries[674]Test document size 25 MBQueriesNETMARKBerkeleyXMLResultSizeContext 114,239context itemZContext Querycontext productZContext Querycontext productUContext Querycontext itemU7201[399]context category&content FruitsContent Query 1685000[65557376]Content Query 22420[672]* Number on first line denotes size in number of XMLelements. Number in [] denotes size in characters.† All times are in milliseconds (ms)Context ContentQuerycontext item&content Lemon GrassContext ContentQuery1[399]context category&content FruitsContent Query 1307000Content Query 21950‡[16389344][675]

Test document size 10 MB100 documents of 1 MB eachQueriesNETMARKBerkeleyXMLResultSizeContext Query160015998400[44134]context itemZQueriesNETMARKContext Query5620context itemZContext QueryContext Query166014346context productZContext Query15214024context productUContext Query6613,885context itemUContext ]context item&content Lemon GrassContext ContentQuery41733,0001[399]context category&content FruitsContent Query 1222000Content Query 21500‡50 documents of 1 MB eachContext Query2500context itemZContext Query2610context productZContext Query415context productUContext Query599context itemUContext 540]1[482]context item&content Lemon GrassContext ContentQuerycontext category&content Fruits643415context productUContext Query599context itemUContext [540]1[482]context item&content Lemon GrassContext ContentQuery6431[399]context category&content Fruits[617]Benchmarking with multiple XML documentsQueriesContext 0]‡ Cannot express such a query in Berkeley XMLNETMARK5630context cessingBenchmarking ResultsThe primary purpose of the above benchmarkingexercise was to provide an estimate of queryprocessing times in NETMARK per se. Additionallywe also compared it with the Berkeley XML system.The above results show that NETMARK can veryefficiently process queries for documents of largesizes, and also large numbers of documentssimultaneously. The query processing time also seemsto increase with the size of the result set returned.Context and Content queries where the result set size isnot large are processed very efficiently, often in themilliseconds range. For smaller result sets,NETMARK significantly outperforms Berkeley XMLwith most queries being processed 20-40 times faster.4.2 Netmark Pipeline BenchmarkingWe also evaluated the throughput of data insertion intoNETMARK. New documents input into theNETMARK system are first converted into XML bythe NETMARK converters and then XML documentsare loaded into NETMARK node structures inrelational tables.We measured the throughput rate of documentconversion into XML. We used two datasets for thisbenchmark, a set of 50 PDF files (of eachapproximately the same size) with a total size of 17MB, and another set of 85 MS Word documents (again

of each approximately the same size) with a total sizeof 6.5 MB. The document converter was run on aDELL Latitude machine with 1 Intel Pentium 4 CPUof 1.6 GHz and 512 MB of RAM, running WindowsXP. The time taken to convert the PDF document setwas 84 seconds, thus giving a throughput rate of 1.68sec per PDF document or 4.94 sec per MB. Conversionof the MS Word document set took a total of 8 secondsthus giving a throughput rate of 0.09 sec per documentor 1.23 sec per MB.number of relational tables in order to effectivelycapture the rich XML information. Thus even simpleXML queries often get translated into expensivesequences of joins over the underlying relational data.Efforts like TIMBER are aimed at developing anefficient direct implementation of XML. While wehave compared NETMARK with at least one XMLover relational system (Berkeley XML), it will beinteresting to compare NETMARK with a native XMLsystem from a performance perspective.5. Related Work6. ConclusionsNETMARK is related to several other systems andresearch efforts in the XML systems area which hasseen a flurry of activity in recent years. There is workon XML data management systems[9,11], XMLpublishing systems[4,5,7], XML query processingaspects[8,10,17,18], integration of XML withInformation Retrieval (IR) systems[3], etc.In this paper we have focused mostly on a newapproach in NETMARK to store semi-structured(XML) data in a relational database and the queryprocessing and performance given this storageapproach. We should then compare with otherapproaches to storage and query processing in otherXML data management systems. One approach hasbeen to use relational database systems for storing andquerying XML data, which is also what NETMARKdoes, albeit differently in a schema-less way. Suchsystems essentially work by storing XML documentsin underlying relational structures that encode theXML structure through relationships between thetables. XML documents to be stored are “shredded”into rows in these tables. XML queries are convertedto SQL queries over the underlying tables, also results(essentially relational tuples) may be converted back toXML before presenting to the user[11,14,16,17]. Wehave discussed how query processing in NETMARKdiffers from such systems in the above queryprocessing sub-section and will stress again that thefocus in NETMARK is not that of supportingcomplicated XML queries with path expressions butrather on supporting context and content oriented andhierarchical queries common in enterprise applications.An alternate direction in XML data management isaround building native XML stores, i.e., building

enterprises[13]. The NETMARK system architecture is first presented. Then a unique approach to storing semi-structured data documents is presented followed by a description of the query processing approach. 3.1 Integrated Legacy Data and XML Data Access The NETMARK system architecture is outlined in Fig 6. below. We will not delve into the complete