Without Data Quality, There Is No Data Migration

Otmane Azeroual (1,*) and Meena Jha (2)

1 German Center for Higher Education Research and Science Studies (DZHW), 10117 Berlin, Germany
2 Centre for Intelligent Systems, School of Engineering and Technology, College of Information and Communication Technology (ICT), Central Queensland University, Sydney, NSW 2000, Australia; m.jha@cqu.edu.au
* Correspondence: azeroual@dzhw.eu; Tel.: +49-302064177-38

Citation: Azeroual, O.; Jha, M. Without Data Quality, There Is No Data Migration. Big Data Cogn. Comput. 2021, 5, 24. https://doi.org/10.3390/bdcc5020024
Academic Editors: Min Chen and Sabri Pllana
Received: 12 April 2021; Accepted: 16 May 2021; Published: 18 May 2021
Publisher's Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Abstract: Data migration is required to run data-intensive applications. Legacy data storage systems are not capable of accommodating the changing nature of data. In many companies, data migration projects fail because their importance and complexity are not taken seriously enough. Data migration strategies include storage migration, database migration, application migration, and business process migration. Regardless of which migration strategy a company chooses, there should always be a strong focus on data cleansing. Complete, correct, and clean data not only reduce the cost, complexity, and risk of the changeover; they also form a good basis for quick, strategic company decisions and are therefore an essential foundation for today's dynamic business processes. Data quality is thus an important issue for companies planning a data migration and should not be overlooked. In order to determine the relationship between data quality and data migration, an empirical study with 25 large German and Swiss companies was carried out to find out the importance of data quality in companies for data migration. In this paper, we present our findings regarding how data quality plays an important role in a data migration plan and must not be ignored. Without acceptable data quality, data migration is impossible.

Keywords: data quality; cleansing; data migration; dependency; structural equation models (SEM); business enterprise success

1. Introduction

Companies today often use IT systems that are old and were specially developed for the company. These systems are called legacy systems [1]. They have high operating costs, or the employees lack the know-how for maintenance work because the system is based on old programming languages and mainframes and the documentation has been lost [2]. These are often the triggers for procuring a modern, new system. When switching to the new system, the operationally relevant data must be transferred to it, so a change to a new system is associated with a migration project. Data migration moves data from one location to another; this can be a physical relocation, a logical relocation, or both. Data are switched from one format to another or from one application to another. Usually, this happens after a new system or a new location for the data is introduced. The business backdrop is typically application migration or consolidation, in which older systems are replaced or expanded with new applications that use the same dataset. Data migrations are rampant these days, as companies move from on-premise infrastructures and applications to cloud-based storage and cloud-based applications to optimize or transform their business.

Data migration is an important part of digitization in companies. Whenever companies introduce new software systems, they have to migrate existing content and information from different data sources. Therefore, quality assurance aims to find errors in the data, the data migration programs, and the underlying infrastructure [3]. In order for the data migration to take place, the data must first be cleaned, and the required data quality level must be achieved. Data cleansing finds incorrect, duplicate, inconsistent, inaccurate, incorrectly formatted, or irrelevant data in a database and corrects it. The data cleansing process consists of several successive steps or methods (such as parsing, matching, standardization, consolidation, and enrichment), some of which have to be repeated [4]. Data cleansing offers a number of advantages; for example, wrong decisions due to an inadequate database are avoided. Poor data quality can mean that a migration project is unsuccessful. It is therefore a prerequisite for the success of a data migration that measures be taken to improve and secure data quality.

Data migration is not just a process of moving data from an old data structure or database to a new one; it is also a process of correcting errors and improving overall data quality and functionality. In this paper, research questions related to data quality and the migration plan are investigated. The research provides new insights into the issue of data quality in relation to data migration. The aim is to make an important contribution to understanding the dependency of data migration on data quality. To determine the relationship between data quality and data migration, an empirical study with 25 large German and Swiss companies was carried out to find out the importance of data quality in companies for data migration. The companies surveyed are innovative solution providers (software development houses) for IT software solutions based on the latest technologies and aiming for long-term market success. The empirical study was carried out through a quantitative analysis in the form of an online survey aimed at people who have already worked on one or more migration projects. A structural equation model is created to illustrate the results.
The structural equation model makes it possible to measure the two not directly observable variables, data quality and data migration. There are two research questions:

RQ1: How can data quality affect the success of a data migration project?
RQ2: What are the factors that influence the effect of data quality on the success of the migration project, so that recommendations for a migration project can be derived?

The remainder of the paper is organized as follows. Section 2 presents concepts of data migration, Section 3 presents data quality and its impact on data migration, and Section 4 highlights the relationships between data quality and data migration. Discussion and analysis of our survey and concluding points are given in Section 5.

2. Concept of Data Migration

This section defines data migration and discusses the requirements, goals, types, and strategies of data migration.

2.1. Definition of Data Migration

The concept of migration is complex; the term is derived from the Latin "migrare", and migration is one of the great concerns of the 21st century [5]. In the Information Technology (IT) area, it can denote a complete system changeover, renewal, or modernization as well as any adaptation process of individual components of a system [6]. A partial or even complete change in the logical and/or physical representation of the data in an information system is called data migration [7]. With data migration, two problems must be addressed. "First, it is necessary to decide which database system is the target and how data can be transferred to it, and second, how applications can be converted smoothly without affecting availability or performance. It's important not to forget the significant investments that have already been made in data and applications" [8,9].

2.2. Requirements

The core area of the requirements analysis in the context of a migration is to clarify what the target system to be developed should achieve; this is the initial phase of a migration project [10]. The migration of data from one system to another can have many reasons, e.g., the introduction of a new software application or a change in technology. However, the data structure of the legacy system must be aligned with the requirements of the new data structure for the migration to be successful. Only a successful migration guarantees a conflict-free coexistence of old and new data. The data migration requirements can be divided into three phases:

Exporting and cleaning up old data: When exporting data, it must first be clarified which data should be reused at all. Basically, the data can be divided into two areas: master data and movement data. In order not to burden the new system unnecessarily, a point in time is normally defined for how far back movement data should be transferred; everything older is archived separately. Once the amount of data is staked out, the content needs to be cleaned up.

Mapping of old and new data structures: Here, the data structures of the old and the new system have to be reconciled. For this purpose, each field or value in the source data is assigned to a corresponding field in the target system. For example, it is important that the data formats match in terms of field type (text, numeric, alphanumeric, etc.) and field length. A migration tool that supports data synchronization and import can be used for this [11].

Importing the data into the new system: Merging the data during the import shows how good the preliminary work was. As a rule, errors occur in the first test, so the mapping must be adjusted. An auxiliary database with the same structure as the target system offers the possibility of checking and editing data content again [12]. It should not be forgotten to thoroughly check and, if necessary, edit all data transferred into the new system.

2.3. Goals of Data Migration

Companies are constantly confronted with the issue of data migration [13]. Data migration occurs whenever software or hardware is replaced because it is out of date. According to Jha et al. [2], business processes need to be re-engineered for the integration of structured data, unstructured data, and external data.
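The field-by-field mapping of old and new data structures described in Section 2.2, including checks on field type and field length, can be sketched as follows. All field names, types, and length limits here are hypothetical examples, not part of the study:

```python
# Sketch of the "mapping of old and new data structures" phase.
# All field names, target types, and max lengths are hypothetical.

# Mapping table: source field -> (target field, target type, max length)
FIELD_MAP = {
    "cust_no":   ("customer_id",   int, None),
    "cust_name": ("customer_name", str, 50),
    "zip":       ("postal_code",   str, 10),
}

def map_record(source: dict) -> dict:
    """Translate one source record into the target schema,
    checking field type and field length along the way."""
    target = {}
    for src_field, (tgt_field, tgt_type, max_len) in FIELD_MAP.items():
        value = tgt_type(source[src_field])  # enforce the target field type
        if max_len is not None and len(str(value)) > max_len:
            raise ValueError(f"{src_field}: value exceeds length {max_len}")
        target[tgt_field] = value
    return target

old_row = {"cust_no": "1042", "cust_name": "Acme GmbH", "zip": "10117"}
print(map_record(old_row))
```

In a real migration tool, the mapping table would of course be far larger and the checks configurable, but the principle is the same: every source field needs an explicit target, type, and length decision before the import.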
The goal of data migration could be to integrate all the different types of data in order to fulfill the changing requirements of the organization.

There are three kinds of data migration that fulfill these different requirements: update migration, ongoing migration, and replacement migration. An update migration generates a version change; the version change can be major, minor, or patch, depending on the functionality to be added to the existing legacy system [14]. An ongoing migration includes fundamental changes to the product and thus impacts the environment. It may also be necessary to use migration tools to transform the datasets; some data migration tools are Centerprise Data Integrator, CloverDX, and IBM InfoSphere. A replacement migration includes a product change or the skipping of a product generation and is associated with considerable effort, since no suitable migration tools or documents are available. In addition to these three kinds, the migration can take place in two ways: either on a key date or as a gradual migration.

Poor data quality is just one of the many challenges that must be overcome, as the data change daily [15]. Data migration will sooner or later face challenges. As part of the migration preparation, test migrations should be carried out with real data; in this way, generic validations and checks can identify errors in the data migration at an early stage so that they can be corrected. In our own experience with migration projects, several source systems are often integrated into one system. When integrating several systems into one, a master system must be defined so that duplicate data in the target system can be avoided. The data from the source system must be transferred to the target system.
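Declaring one source the master system so that duplicates do not reach the target, as described above, might look like this minimal sketch. The system names, business keys, and records are hypothetical:

```python
# Sketch: integrating two source systems into one target while
# avoiding duplicates. "crm" is declared the master system, so its
# records win whenever the same business key exists in both sources.
# System names, keys, and records are hypothetical examples.

MASTER = "crm"  # assumed master system

def integrate(sources: dict) -> dict:
    """Merge records from several source systems into one target,
    keyed by a shared business key; the master system overrides."""
    target = {}
    # Load the non-master systems first ...
    for name, records in sources.items():
        if name != MASTER:
            for rec in records:
                target[rec["key"]] = rec
    # ... then let the master system overwrite any duplicates.
    for rec in sources.get(MASTER, []):
        target[rec["key"]] = rec
    return target

sources = {
    "crm": [{"key": "C-1", "name": "Acme GmbH", "city": "Berlin"}],
    "erp": [{"key": "C-1", "name": "ACME", "city": "?"},
            {"key": "C-2", "name": "Beta AG", "city": "Zurich"}],
}
merged = integrate(sources)
print(merged)  # C-1 is taken from the master system "crm"
```

The design choice here is simple precedence by source system; real projects may instead need record-level survivorship rules, but a defined master system is the baseline that prevents duplicate keys in the target.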
In addition, the migration of data is a complex process in which special attention must be paid to the data quality of the master data. Therefore, the following goals are pursued:

1. To analyze and clean up the existing data and documents (by the project and the core organization),
2. To correctly carry out automated, semi-automated, and manual migrations of the relevant data and documents, including linking the business objects with the documents,
3. To understand the migration and validate the results obtained. The data protection requirements must be observed.

2.4. Types of Data Migration

There are several types of migration that need to be considered before deciding on a migration strategy. The most complex type is system migration, which affects the entire system. However, depending on the requirements of the migration project, it is possible to migrate only individual parts. The types of migration are surface, interface, program, and data migration [16].

During a program migration, all data remain in the old environment, and only the application logic is re-implemented. There are three variants: change the programming language in the same environment, change the environment in the same language, or migrate both the language and the environment.

A pure surface migration leaves the application logic and the data in their old environment; only the user interfaces are migrated. To do this, however, the user interface must be separated from the program logic. If this is not the case in an old system, the separation can be achieved through a renovation measure. The program logic and the user interfaces can also be migrated at the same time.

During an interface migration, the system interfaces that connect the system to other systems are migrated. This type of migration must be carried out whenever the external systems with which the system exchanges data change. How the external system changes is unimportant, be it through migration or a new development with new interface protocols. If the legacy system exchanges data via sequential export files, the migration is more complex than if modern XML files or SOAP messages have already been set up for the data exchange [17].
This is because intervening in the program code when data are exchanged via sequential files is much more complex than simply connecting the existing code to the new interface.

During a data migration, only the data from the old system are transferred; the programs themselves remain unchanged. If the system relied on a relational database, the change is relatively easy and often completely automated. However, the data must be examined closely to ensure that everything has been transferred correctly. Migrating from relational structures to an object-oriented structure is complicated [18]; this can only be automated to a limited extent, but many problems can be avoided through suitable modeling. The worst case is the migration of data based on a non-relational, outdated database. Migrating only the data is rarely successful in these cases, since both the structure and the access logic of the new database are fundamentally different from the old one. Data migration is therefore the most complex type of migration and can be a challenge for the developer.

2.5. Strategies of Data Migration

Most strategies differ in terms of the needs and goals of each organization. As Sarmah [10] put it, "A well-defined data migration strategy should address the challenges of identifying source data, interacting with ever-changing goals, meeting data quality requirements, creating appropriate project methodologies, and developing general migration skills" [10]. Basically, there are two strategies for replacing an old system: the gradual introduction or the big bang strategy, i.e., the introduction in one step. Which strategy is suitable for a particular case must be examined and defined in detail. With a big bang strategy, the old system is switched off, the new system is installed, and system parts and data are migrated within a defined period of time, often over a weekend.
With a step-by-step migration, the old system is migrated in several steps. Gradual migration is generally less critical than the big bang strategy [19]: users can slowly get used to the new features, and if the new system is not yet stable, the old system can still be used in an emergency. There are two types of step-by-step introduction:

1. The new system offers full functionality but is only available to a limited group of users. New and old systems run in parallel, and the group of users is expanded with each level. The problem here is the parallel use of the old and the new system and, in particular, the maintenance of data consistency.
2. Partial functions are provided for all users. The users work in parallel on the new and old systems. With each step, the functionality of the new system is expanded until the old system has been completely replaced.

The right strategies need to be included in a migration plan under different circumstances [20]. Data are the central part of the migration. Data from the old system may need to be transformed into a new format and loaded into the database(s) of the new system. The data migration must be planned in detail: the data flow from the source databases to the target databases is determined, and all necessary data transformations are defined. The process of migrating from a source system to a target system almost always involves the same steps. Nevertheless, the status quo of the data quality in the source systems should be recorded. To this end, it is recommended that project managers work with the defined stakeholders to create a set of data quality rules for the business areas concerned. The next section discusses the impact of data quality on data migration.

3. Data Quality and Its Impact on Data Migration

Bad data quality has different causes, and it is a challenge that should not be underestimated. With master data in particular, it can happen that the data formats of fields in the source system and the target system do not match, or that the source data have the wrong format or lie outside the valid range of values. To cope with this challenge, the source side must be cleaned up, or a validation must be installed, so that these constellations no longer occur.
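A source-side validation of the kind described above, flagging records whose format or value range does not fit the target system, could be sketched as follows. The fields, the postal-code pattern, and the percentage range are hypothetical rules chosen for illustration:

```python
import re

# Sketch: validate source records against target-system constraints
# before migration. The fields, pattern, and value range below are
# hypothetical examples of format and range rules.

RULES = {
    "postal_code": lambda v: re.fullmatch(r"\d{5}", str(v)) is not None,
    "discount":    lambda v: 0 <= float(v) <= 100,  # percentage range
}

def validate(record: dict) -> list:
    """Return the list of fields that violate a rule (empty = clean)."""
    errors = []
    for field, rule in RULES.items():
        try:
            ok = rule(record[field])
        except (KeyError, ValueError):
            ok = False  # missing field or unparsable value also fails
        if not ok:
            errors.append(field)
    return errors

rows = [
    {"postal_code": "10117", "discount": "15"},   # clean record
    {"postal_code": "1011A", "discount": "150"},  # format and range errors
]
for row in rows:
    print(row, "->", validate(row))
```

Running such checks against the full source dataset before the migration gives exactly the kind of early, generic validation the test-migration step calls for, and the error lists can feed directly into the cleansing work.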
For databases and information systems, high data quality is not merely desirable; it is one of the main criteria that determine whether a project can succeed and whether the statements obtained from it are correct. Higher quality and usability of the data have a direct and positive effect on decisive business results. As English (1999) said, "the best way to look at data quality is to examine what quality means in the general market and then translate what quality means to data and information" [21]. There is no single definition of the term data quality in the literature, which is why it is very individual and subjective. According to Würthele [22], data quality is a "multi-dimensional measure of the suitability of data to fulfill the purpose associated with its acquisition/generation. This suitability can change over time as needs change" [22]. This definition makes it clear that the quality of data depends on the point in time at which it is viewed and on the level of demand placed on the data at that point in time [23]. With the introduct
