Migration Of Relational Database To Document-Oriented Database .

Transcription

7th International Conference on Computational Intelligence, Communication Systems and Networks (CICSyN)Migration of Relational Database to Document-Oriented Database: StructureDenormalization and Data TransformationGirts Karnitis and Guntis ArnicansFaculty of ComputingUniversity of LatviaRiga, Latvia{girts.karnitis, guntis.arnicans}@lu.lv1.5 billion vehicle documents, the new system works fivetimes faster than legacy one).Various estimations exist about the future of NoSQLdatabases market, but all predict a significant increase ofshare. Company Research and Markets estimated the NoSQLmarket will be growing at a CAGR of 53.09 percent over theperiod 2013-2018 [2]. CAGR is an indicator that shows theyear-over-year growth rate of an investment over a specifiedperiod. The company Allied Market Research forecast aCAGR of 40.8% during 2014 – 2020 and the global NoSQLmarket is expected to reach 4.2 billion by 2020 [3]. Themost significant segments of applications are data storage,metadata storage, cache memory, distributed data depository,e-commerce, mobile apps, web applications, data analytics,social networking.There are five types of NoSQL databases at present:column, document, key-value, graph, multi-model. The mostimportant types for some next years are key-value stores anddocument databases. We concentrate our attention ondocument databases and call them document-orienteddatabases (several notations exit – document database,document-oriented database, document store). There is arange of document-oriented databases, for instance,MongoDB, Couchbase server, CouchDB, MarkLogic,Cloudant, Cassandra, and Clusterpoint database.Days and sometimes weeks must be spent to convert datafrom RDB to NoSQL data for the first time. Nowadays manycompanies want to test whether NoSQL based solution issuited to their needs and allocate some resources to exploretechnologies and create NoSQL based proof-of-concept.Many resources must be allocated to create data conversion,and no resources are left to create and test functionality.The problem of migrating data from relational databasesto document-oriented databases is addressed in this paper.The proposed technology or ideas can help companies tomigrate their legacy systems based on RDB to the documentoriented database. We approbated our ideas by adding newfeatures into the relational database browser DigiBrowser(http://digibrowser.com). At least companies can use theDigiBrowser for fast creating of proof-of-concept beforestarting the migration project.The remainder of the paper is organized as follows.Section 2 reveals data migration problem from legacyrelation database into a document-oriented database or otherNoSQL database. Related works are given. Section 3Abstract — Relational databases remain the leading datastorage technology. Nevertheless, many companies want toreduce operating expenses, to make scalable applications thatuse cloud computing technologies. Use of NoSQL database isone of the possible solutions, and it is forecasted that theNoSQL market will be growing at a CAGR of approximately50 percent over the next five years. The paper offers a solutionfor quick data migration from a relational database into adocument-oriented database. We have created semiautomatically two logical levels over physical data. Users canrefine generated logical data model and configure datamigration template for each needed document. Data migrationfeatures are implemented into relational database browserDigiBrowser. Real patients’ database was migrated toClasterpoint database. The offered approach provides meansto obtain at least proof-of-concept for new document-orienteddatabase solution in a couple of days.Keywords - database migration; relational database; NoSQL;document-oriented database; meta model; logical datarepresentationI.INTRODUCTIONRelational databases (RDM) have been the leading datastorage technology for years. Nevertheless, changes in thedemands for the data processing have caused emergence ofnew data storage, retrieval and processing mechanisms. Oneof such mechanisms is a NoSQL (Not only SQL) databasethat was introduced to label non-relational and distributeddata stores. To reduce operating expenses, managers of manylarge companies are looking forward to the cloud computingsolutions. These solutions help to store data efficiently, tocompute the massive amount of data, to provide highscalability, to ensure high performance and availability atlow costs. Relation databases are not appropriate for cloudcomputing, but NoSQL databases are.The market of NoSQL databases is still small but steadygrowing. More and more companies are trying to exploitNoSQL databases to improve business. The article [1]describes one such example – company CARFAX that offersonline vehicle and valuation website. CARFAX has 13.6billion records associated with 700 million vehicles, and itwas difficult to run a website that uses a relational database.The company is now running a website that is based onNoSQL database MongoDB (108 servers, 10.6 TBs of data,978-1-4673-7016-5/15 31.00 2015 IEEEDOI 10.1109/CICSyN.2015.30101113

step is to convert data from the relational database to NoSQLin compliance with the created data structures (meta-model)and conversion scope.We derived our solution from the experience of buildingbrowsers for relational databases. Paper [11] contains adescription of basic ideas and implementation of thebrowser.The browser connects to database, reads its schema, andcreates data browsing model based on obtained metadata.After that, the user can start data browsing. The userperceives data as a graph. The tool provides the possibility totraverse from one node (data item) to another and look atrelated data at any moment. In order to get details of data orto move to the other place, the user clicks on links similarlyto browsing web pages. The browser can be used by ITspecialists as well as by users without specific IT knowledge.It is allowed to modify and customize browsing model. Thus,after customization the tool is more like a simple informationsystem that allows to traverse throughout database, seeadditional calculated information, and even change data [12].explains our approach for data denormalization and creatingtemplates that define logical objects. Section 4 shows howdata is transformed according to created definitions. Resultsof approbation are in Section 5. Finally, we give conclusions.II.DATA MIGRATION PROBLEM AND SOLUTIONSData accumulated by information systems is one of theimportant assets for most of the companies. Pushed bycustomer demands and pulled by changes in technologies,companies from time to time migrate from one informationsystem to another. Hence, data from the legacy system has tobe migrated to the new system. Despite the significance ofthis process, knowledge about migration process is limited[4]. F. Matthes and C. Schulz in [4] fixed a state of the artand provided a literature review of the data migrationproblem.F. Matthes and C. Schulz define the term data migrationas follows: “Tool-supported one-time process which aims atmigrating formatted data from a source structure to a targetdata structure whereas both structures differ on a conceptualand/or physical level”.Data migration has two important steps: first,restructuring of source data according to requirements of thetarget system, and second, transferring data from the sourceto the target database. The academic literature offers severalapproaches/methods for dealing with these steps: schemaconversion, meta-modeling approach, ETL (Extract,Transform, Load), program conversion, model-drivenmigration, automated data migration.Theoretical research for RDB to XML/JSONtransformation was done when the XML format appeared,but as stated in [5] conversion standard does not exist yet.Two types of transformation approaches exist.The first type of approaches creates XML documentwithout redundancy and preserving all constraints. All datafrom RDB to XML is converted. In this approach at firstXML schema is generated from RDB, and then data istranslated. Main emphasis of authors of [6] is to preserve allRDB constraints and avoid redundancies. Different modelsare used to enrich semantic of the relational schema of anexisting database in the first step; data transformation usingthe model created is in the second step. Authors of [7] usethe so-called Canonical Data Model (CDM). Authors of [8]use extended entity relationship (EER) schema. Approachesmentioned above generate very flat XML structure. Someresearch is done to create nesting XML schemas, forexample [9].The second type of approaches is a manual creation ofconversion mapping from the existing relational database tothe existing XML schema. Tools exist to map the relationaldatabase to XML schema, for example, Altova XMLSpy.Our approach is close to the meta-modelling approach[10]. Data conversion from RDB to NoSQL consists of twosteps. The first step is to get the physical data structure fromthe existing relational database and to raise the semanticlevel of achieved data structure (creating of the so-calledmeta-model). Conversion scope is also defined. The secondIII.DATA STRUCTURE DENORNMALIZATIONWe use as an example a relational database schemashowed in Fig. 1.A. Physical level of dataLet us consider that a university application stores data ina relational database with the given data schema. Logicalobjects (business concepts [4]) used in our example areCourse, Teacher, Student, Lecture, Examination, Grade,Attendance. Data are with complex structure and tightlyconnected. For instance, a Grade not only containsinformation about the level of the grade, but also specifieswhat student has received the grade of a particular courseand what teacher has granted the grade. The object Grade is acomplex object at the logical level. On the other hand, theGrade is a table in the relational database that contains onlyessential data of the object Grade. Developer defined viewsmight exist in the relational database. The view often is somelogical object that is formed from several physical tables andcan be used as part of the target document.A part of relational database metadata used for dataconversion to NoSQL is shown in Fig. 2. A relation databasecontains Table-like-structures: Tables and Views. The Tablelike-structure has a Name. Tables and Views do not differfrom the data translation perspective. The Table-likestructure contains Attributes (columns). Each Attribute has aName and Data type. A Table has constraints. Primary Keys(PK), Unique Keys (UK) and Foreign Keys (FK) areessential for data translation because these constraints definethe relation between tables and records.Data in a database is usually normalized; therefore, onelogical level object is stored in a bunch of related databasetables. A relational model is a low-level physical model. Itshows how data is stored but does not show the naturalrelationship of data or full view of complex data objects.114102

Student CourseStudent ID*Course IDFK Course ID*1AttendanceStudent IDLecture IDAttendanceFK Student ID1*StudentStudent IDNameSurnameCourseCourse IDName11*Course TeacherCourse IDTeacher IDStart dateEnd dateFK Course ID**FK Course ID*FK Student IDLecture1FK Lecture ID 1 Lecture IDRoomDay1TimeFK Student IDCourse IDTeacher ID*Grade*Examination IDStudent IDGrading Teacher IDGrade*FK Course ID1FK Teacher IDFK Mentor ID* FK Teacher IDFK Grading Teacher ID* 1Teacher0.1Teacher ID1Name1 SurnameMentor*1Examination1Examination IDFK Examination ID* FK Responsible Teacher IDCourse IDResponsible Teacher IDDateDescriptionFigure 1. Physical data model of the example databasewhether table Course Teacher contains natural objects oronly relations between Teacher and Course. Thus, we cannotautomatically determine a semantic of distinct tables orgroup of tables.N:N type relations cannot be naturally defined in therelational model. Instead, a specific table is created for eachrelation. These tables are normal database tables with FK tothe related data tables. Sometimes additional information isstored in such table. For instance, a teacher instructs coursesfor known time periods. Tables Teacher, Course,Course Teacher contains all teaching facts (Teacher teachesCourse from Start date till End date). It is hard to tellTable-likestructure1 1.*B. First Logical Level of DataThe logical level shows how we perceive data in logicalunits or entities, but not how data is physically stored in adatabase. ER models and modeling languages are used toshow logical data model. The main differences betweenlogical and physical data models are the following: inlogical data model the names of tables and attributes arecreated in natural language; there are no specific codifiertables, but codified values are included in the main dataobjects; N:N relations are allowed; there are no surrogatekeys, each entity has a primary key that can be combinedfrom more than one attribute and can have attributes withlong data type, for example, text.We supplement the physical model with additionalinformation. That allows us to show data at the logical leveland convert data into logical documents. The metamodelcontaining information about physical and logical levels isshown in Fig. 3. Let us consider that this model describes afirst logical level of data.The Table-like-structure is the main object. That isanalogous to the Table-like-structure in physical model. TheTable-like-structure contains two types of attributes –Physical attributes, which come from a database, and Virtualattributes, which are calculated from other attributes in themoment of data processing. The objects in the raintNamePK UKFK1PK*UKFigure 2. Part of relational database metadata115103

PhysicalAttributeVirtual attributeNam eFunctionNam e in tableVis ible nam eIs vis ible1.*1FK Student IDAttribute1.*1*1*Table-likestructureNam e in DBVis ible nam eStudentGrade# i Student ID: 100002Name: ZolnowskiSurname: Paterson i Examination ID: 5000001 Student ID: 100002 Grading Teacher ID: 2001Grade: 10Atribute Pairs inLinkFK Grading Techer IDExamination# i Examination ID: 5000001 Course ID: 30006 Responsible Teacher ID: 2Date: 2008.04.07. 0:00:00Description: Home Work 1FK Responsible Techer IDFK Course IDTeacher1.*FK Mentor IDFK Mentor IDCourse1.*1FK Examination ID11From 1*To 1*LinkNam eIs VirtualTeacherTeacher# i Teacher ID: 2001Name: JohnSurname: Kennedy Mentor ID: Null# i Teacher ID: 2015Name: AnitaSurname: Coleman Mentor ID: Null# i Course ID: 30006Name: Introduction to AlgoritFigure 4. Example how business concept Grade might be formedCodificatorSimple entityComplex entitylike-structures that serves as a template draft for documentgeneration. Second, we refine the created tree by addingadditional Table-like-structures and removing excessiveTable-like-structures. At last, we choose what attributes areexcluded from the document (flag is visible is set to false forthe chosen attributes).Template draft can be generated by various algorithms. Asimple approach is demonstrated in Fig. 4. Let us choose theTLS Grade as a tls in focus. We build a graph recursivelyby adding as nodes TLSs that are on the opposite side oflinks with cardinality 1 on the opposite side. First, Gradeadds Student, Teacher, and Examination. Second,Examination adds Course and another instance of Teacher.The described approach usually collects essential data ofthe logical object. However, the target document may requiremore data. For instance, a document based on Student canrequire data about attended lectures and earned grades.Algorithm in Fig. 5 describes another approach. The ideabehind algorithm dfs selecting is traverse model graph byprinciple to add TLS as tree nodes while a path from the rootto any leave contains unique TLS. This algorithm is acombination of depth-first search in graph and backtrackingalgorithm. Unfortunately, this approach usually adds toomany TLS that are weakly related to the tls in focus.At time of writing this paper, we use the algorithm givenin Fig. 6. The idea is to exploit each link between TLSs notmore than one time, and TLS is visited (added) based on thewidth-first graph search algorithm. Any link is allowed to beused only in one direction.We search for another tree building algorithms,N:N-linkFigure 3. Part of logical model metadatahave their physical names in the relational database (Name intable), as well as logical name to be used during dataprocessing (Visible name). By default, all logical names arethe same as corresponding physical names until user changethem.Relations (Link) link together the Table-like-structureobjects. Relations contain attributes that define whichattribute names must be equal in the related Table-likestructures (like in FK-PK relationships). Physical relationsare defined in a database (Is Virtual has value false); virtualrelations are additional relations defined only in the logicalmodel (Is Virtual has value true). Both types of relations areused for navigation. The relationship is used only for oneway navigation in each particular moment. Therefore, eachrelationship can be perceived as two one-way relationships.That allows us to define relations with any cardinality (1:1,1:N, N:N).The Table-like-structure class has four subclasses:Codifier, Simple entity, Complex entity, N:N-link. Codifiercontains code, as well as coded data (there are no tables withtype Codifier in the example model; a sample is a table withcolors that contains fields id and color). A Simple entitycontains information about simple objects that are stored inone table (for instance, Course and Student). A Complexentity contains information about a complex object that isstored logically in more than one table (for instance, Grade,Lecture, Teacher, Attendance, and Examination). N:N-linktable is used to code N:N relationships in a relationaldatabase (for instance, Course Teacher, Student Course).The semi-automatic algorithm to determine Table-likestructure type exists [13]. We use this algorithm to buildlogical objects from tables in the relational database.dfs selecting(parent node, used tls)used tls.add(parent node.tls)for each constraint of parent node.tlsrelated tls get related tls(parent node.tls, constraint)child node create tree node(related tls)parent node.add child(child node)dfs selecting (child node, used tls)used tls.remove(parent node.tls)// Sample of the function callused tls create list()root node create tree node(tls in focus)dfs selecting(root node, used tls)C. Second Logical Level of DataThe next step is obtaining a second logical level of datafor each target document. The Table-like-structure (TLS) istaken as a basis for the document. Let us call this TLS astls in focus. First, we automatically create tree from Table-Figure 5. Algorithm dfs selecting for a draft of document template116104

bfs selecting(root node, used links)queue create queue()queue.enqueue(root node)dfs printing(record, parent node)if not parent node.type "N:N-link"print parent node.tls.visible nameif child node.type "Simple entity" orchild node.type "Complex entity"print "{"while not queue.is emptyparent node queue.degueue()for each link of parent node.tlsrelated tls get related tls(parent node.tls, link)if not used links.find(link)used links.add(link)child node create tree node(related tls)parent node.add child(child node)queue.enqueue(child node)for each field of recordif field.is constraintfor each link from parent node to child nodewhere child node is child of parent nodeif field.is visibleprint field.name, field.valuerelated records get related records(record, link)for each related record of related recordsdfs printing(related record, child node)elseif field.is visibleprint field.name, field.value// Sample of the function callused links create list()root node create tree node(tls in focus)bfs selecting(root node, used links)Figure 6. Algorithm bfs selecting for a draft of document templateif child node.type "Simple entity" orchild node.type "Complex entity"print "}"particularly, by using TLS types Codifier, Simple entity,Complex entity, N:N-link table.The document template refining is performed manuallyby experts. Templates can be created for any desirabletls in focus. Many templates may be defined for eachtls in focus, for instance, with few related TLSs forconcentrated documents and with many related TLSs forvery detailed documents.IV.// Sample of the function calldfs printing(record in focus, root node)Figure 7. Sample algorithm for document creationNoSQL database is an additional feature. Actually, instead ofdata browsing on the screen we write data to the file.DigiBrowser can connect to any relational databasestored in Oracle Database, Microsoft SQL Server, MySQL,PostgreSQL, and a browsing and traversing can beperformed immediately. It is possible to connect to VirtuosoUniversal Server (RDF data browsing) and XML file. lusterpoint.com) says: “We used Digibrowser todenormalize a complex hospital IT system from legacy SQLdata containing 100 tables, 500 columns and 1 millionpatients data into Clusterpoint XML database model creatinga single e-health record per patient in less than two days.DigiBrowser works for us like a charm!”. The same datamigration was performed for MongoDB. Another dataformats supported by DigiBrowser are XML and JSON.These formats allow migrating to a new system that supportsXML or JSON documents.We have measured patients’ data conversation to XMLand JSON formats. The desktop computer used formeasurement has the following characteristics: Intel i7-46002.1Ghz, RAM 8GB, SSD disk 256GB SAMSUNGMZ7TE256HMHP, MS Windows 8.1 64bit, MS SQL Server2008 with patients’ database. The total number of patientswas 1.5 million. Total database size was 4.2GB. Documentextraction templates took data from 10.5 million rows. Allconversation lasted 100 minutes (approximately 15 thousanddocuments per minute). Of course, document generationDATA TRANSFORMATIONData transformation and document generation are basedon the created template. Queries to the relational database areexecuted. All queries are generated according to templateand metamodel transformations.Final document structure and syntax depend on the targetdocument-oriented database. Various algorithms can becreated for desirable document format. As an example, weoffer a simple algorithm that exploits TLS types (Fig. 7).Essential data are TLS names, attribute names and attributevalues (field values in a relational database). Thisinformation is printed by command print. To show sublevel,we print ‘{’ for sublevel start and ‘}’ for sublevel end.Pseudo-command “for each link from parent node tochild node where child node is child of parent node” allowsto find such links between two TLS that correspond to thedocument template, and to iterate through these links.The given algorithm does not exploit link names (it ispossible to give a name to any link in the logical model).Using link names can make the document more readable. Forinstance, document Student can include Teacher data inseveral roles: a person who reads a lecture, a person who isresponsible for an examination, and a person who grades theexamination.V.RESULTSWe have implemented the ideas described in the paperinto tool DigiBrowser. Initially, DigiBrowser was designedonly for relation database browsing, and data migration to117105

[2][3][4][5]Figure 8. Source data model for patient’s document[6]speed depends on database structure, size and documenttemplate.A screenshot of medical data migration is shown inFig. 8. The left part contains document template tree. Theright part contains corresponding logical model. The bluerectangle is a tls in focus, and green rectangles are otherTLSs that belong to the document.VI.[7][8]CONCLUSIONS[9]The offered data migration from a relational databaseinto the document-oriented database can be used in practice.We included data conversion features into relational databasebrowsing tool DigiBrowser. Developers of documentoriented database Clusterpoint approbated DigiBrowser andconverted legacy patients’ database to 1.5 million patients’documents in a couple of days. IT specialists were notfamiliar with the given database and were forced to explorethe database by DigiBrowser before creating the logical datamodel and defining of converting templates.If the relational database is not huge, it is possible makemigration of the relational database on a daily basis. NoSQLdatabase can provide additional services for the legacysystem such as better information searching and presentation.Further studies have to be done to improve givenmethods. Various new algorithms might be created forrecognizing table types, defining document structure,choosing fields for migration, and forming target documents.[10][11][12][13]ACKNOWLEDGMENTThe research leading to these results has received fundingfrom the research project “Information and CommunicationTechnology Competence Center” of EU Structural funds,contract nr. L-KC-11-0003 signed between ICT CompetenceCentre and Investment and Development Agency of Latvia,Research No. 1.8 "Technology for denormalisation ofrelational data bases"REFERENCES[1]B. Butler, “The NoSQL database market is a small but increasinglyimportant segment of the database market, and it’s giving SQLdatabase vendors and users a scare”, Network World, Oct 27, torm.html118106Research and Market, “Global NoSQL Market 2014-2018”, kets.com/reports/2776270/global nosqlmarket 20142018Allied Market Research, “Global NoSQL Market (Type, Application,Verticals and Geography) Size, Global Trends, Company Profiles,Demand, Insights, Analysis, Research, Report, Opportunities,Segmentation and Forecast, 2013 - 2020”, Report, April ketF. Matthes and C. Schulz, “Towards an integrated data migrationprocess model”, Software Engineering for Business InformationSystems (sebis), 2011.B. Mohamed and A. Elalami, "The migration of data from a relationaldatabase (RDB) to an object relational (ORDB) database", Journal ofTheoretical and Applied Information Technology, 20th December2013. Vol. 58 No.2 ISSN: 1992-8645 ,E-ISSN: 1817-3195.P. Paolo and E. Quintarelli, "An algorithm for generating XMLSchemas from ER Schemas." In Proceedings of the Thirteenth ItalianSymposium on Advanced Database Systems, SEBD, 2005, pp. 192199.M. Abdelsalam, A. Ali, and N. Rossiter, "An integrated approach torelational database migration”, In the Proceeding of InternationalConference on Information and Communication Technologies-2008(IC-ICT'08), pages 01-06, Bannu, Pakistan (2008).F., Joseph, et al. "Translating relational schema with constraints intoXML schema." International Journal of Software Engineering andKnowledge Engineering 16.02, 2006, pp. 201-243.D. Lee, M. Mani, F. Chiu, and W. Chu, „NeT & CoT: translatingrelational schemas to XML schemas using semantic constraints, in:Proceedings of the 11th CIKM, 2002, pp. 282-291.M. Jeusfeld, and U. Johnen, "An executable meta model for reengineering database schemas", (P.Loucopoulos, Ed.) InternationalJournal on Cooperative Information Systems, 4(2-3), 1995, pp. 473483, Springer Berlin Heidelberg.G. Arnicans, G. Karnitis, “Prototype for Traversing and BrowsingRelated Data in a Relation Database,” in: Barzdins (ed.), ScientificPapers University of Latvia, Vol 756, Computer Science andInformation Technologies, University of Latvia, 2010, 59-74.G. Arnicans, G. Karnitis, “DigiBrowser: Tool for Relational p://digibrowser.com/download.php?id DIGIBrowser ToolForRelationaDataBrowsingAndTraversing&page documentationG. Arnicans, “Application generation for the simple database browserbased on the ER diagram,” in Janis Barzdins, editor, Databases andInformation Systems, Proceedings of the Third International BalticWorkshop, Volume 1, pages 198-209. Riga, 1998.

migration template for each needed document. Data migration features are implemented into relational database browser DigiBrowser. Real patients' database was migrated to Clasterpoint database. The offered approach provides means to obtain at least proof-of-concept for new document-oriented database solution in a couple of days.