Data Quality Problems In Data Warehousing

Transcription

2014 IJIRT, Volume 1, Issue 7, ISSN: 2349-6002

Diksha Verma, Anjli Tyagi, Deepak Sharma
Department of Information Technology, Dronacharya College of Engineering

Abstract- Data quality is critical to data warehouse and business intelligence solutions. Better informed, more reliable decisions come from using the right data quality technology during the process of loading a data warehouse. It is important that the data is accurate, complete, and consistent across data sources. Data warehousing is gaining in eminence as organizations become aware of the benefits of decision-oriented and business-intelligence-oriented databases. Over time many researchers have contributed to data quality issues, but no research has collectively gathered all the causes of data quality problems at all the phases of data warehousing, viz. 1) data sources, 2) data integration and data profiling, 3) data staging and ETL, and 4) data warehouse modeling and schema design. The purpose of this paper is to identify the reasons for data deficiencies, non-availability and reachability problems at all the aforementioned stages of data warehousing, and to formulate a descriptive classification of these causes. We have identified a possible set of causes of data quality issues from an extensive literature review and in consultation with data warehouse practitioners working in renowned IT giants in India. We hope this will help developers and implementers of warehouses to examine and analyze these issues before moving ahead with data integration and data warehouse solutions for quality decision-oriented and business-intelligence-oriented applications.

I. INTRODUCTION

Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve its quality. Data quality problems are present in single data collections, such as files and databases, e.g., due to misspellings during data entry, missing information or other invalid data. When multiple data sources need to be integrated, e.g., in data warehouses, federated database systems or global web-based information systems, the need for data cleaning increases significantly, because the sources often contain redundant data in different representations. In order to provide access to accurate and consistent data, consolidation of different data representations and elimination of duplicate information become necessary.

Implementing a data warehouse infrastructure to support business intelligence (BI) can be a daunting challenge, as highlighted in recent years by the focus on how often data warehouse implementations fail.
Data integration technologies such as Informatica PowerCenter have gone a long way towards enabling organizations to successfully deliver their data warehouse projects under budget and with greater business adoption rates. Despite this, some data warehouse implementations still fail to meet expectations because of a lack of attention to data quality. Moving data from its various sources into an easily accessible repository is only part of the challenge in delivering a data warehouse and BI. Without attention to the accuracy, consistency and timeliness of the data as part of the data integration lifecycle, BI quickly leads to poor decision-making, increased cost and lost opportunities.

According to industry analyst firm Gartner, more than 50 percent of business intelligence and customer relationship management deployments will suffer limited acceptance, if not outright failure, due to lack of attention to data quality issues. The impact of poor data quality is far-reaching, and its effects are both tangible and intangible. If data quality problems are allowed to persist, executives grow to mistrust the information in the data warehouse and will be reluctant to use it for decision-making.

7 Sources of Poor Data Quality

In recent years, corporate scandals, regulatory changes, and the collapse of major financial institutions have brought much-warranted attention to the quality of enterprise information. We have seen the rise and assimilation of tools and methodologies that promise to make data cleaner and more complete. Best practices have been developed and discussed in print and online. Data quality is no longer the domain of just the data warehouse; it is accepted as an enterprise responsibility. If we have the tools, experiences, and best practices, why, then, do we continue to struggle with the problem of data quality?

The answer lies in the difficulty of truly understanding what quality data is and in quantifying the cost of bad data. It isn't always understood why or how to correct this problem, because poor data quality presents itself in so many ways. We plug one hole in our system, only to find more problems elsewhere. If we can better understand the underlying sources of quality issues, then we can develop a plan of action to address the problem that is both proactive and strategic.

Each instance of a quality issue presents challenges in both identifying where problems exist and in quantifying the extent of the problems. Quantifying the issues is important in order to determine where our efforts should be focused first. A large number of missing email addresses may well be alarming but could have little impact if there is no process or plan for communicating by email. It is imperative to understand the business requirements and to match them against the assessment of the problem at hand. Consider the following seven sources of data quality issues:

1. Entry quality: Did the information enter the system correctly at the origin?
2. Process quality: Was the integrity of the information maintained during processing through the system?
3. Identification quality: Are two similar objects identified correctly to be the same or different?
4. Integration quality: Is all the known information about an object integrated to the point of providing an accurate representation of the object?
5. Usage quality: Is the information used and interpreted correctly at the point of access?
6. Aging quality: Has enough time passed that the validity of the information can no longer be trusted?
7. Organizational quality: Can the same information be reconciled between two systems based on the way the organization constructs and views the data?

A plan of action must account for each of these sources of error. Each case differs in its ease of detection and ease of correction. An examination of each of these sources reveals a varying amount of cost associated with each and an inconsistent amount of difficulty in addressing the problem.

Entry Quality

Entry quality is probably the easiest problem to identify but is often the most difficult to correct. Entry issues are usually caused by a person entering data into a system. The problem may be a typo or a willful decision, such as providing a dummy phone number or address. Identifying these outliers or missing data is easily accomplished with profiling tools or simple queries, as the sketch below illustrates.

The cost of entry problems depends on the use. If a phone number or email address is used only for informational purposes, then the cost of its absence is probably low. If instead a phone number is used for marketing and driving new sales, then the opportunity cost may be significant over a major percentage of records.

Addressing data quality at the source can be difficult. If data was sourced from a third party, there is usually little the organization can do. Likewise, applications that provide internal sources of data might be old and too expensive to modify. And there are few incentives for the clerks at the point of entry to obtain, verify, and enter every data point.
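As a minimal sketch of this kind of profiling, the following Python fragment counts missing email addresses and flags obviously dummy phone numbers in a small customer table. The column names, values, and dummy-number patterns are illustrative assumptions, not data from any system discussed here; a real profile would run the same tests against the source tables.

```python
import pandas as pd

# A toy customer extract; column names and values are illustrative assumptions.
customers = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "email": ["a.kumar@example.com", None, "", "r.mehta@example.com"],
    "phone": ["9811000001", "0000000000", "9999999999", "9811000002"],
})

# Entry-quality profile: NULLs and empty strings both count as missing.
missing_email = customers["email"].fillna("").str.strip().eq("")

# Willful dummy entries: numbers a clerk might type to bypass validation.
DUMMY_PHONES = {"0000000000", "9999999999", "1234567890"}
dummy_phone = customers["phone"].isin(DUMMY_PHONES)

print(f"Missing email addresses: {missing_email.sum()} of {len(customers)}")
print(f"Dummy phone numbers:     {dummy_phone.sum()} of {len(customers)}")
print(customers[missing_email | dummy_phone])  # records needing review
```

Even a query this simple yields the quantification discussed above: how many records are affected, and which ones need review.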
Process Quality

Process quality issues usually occur systematically as data is moved through an organization. They may result from a system crash, a lost file, or any other technical occurrence that arises from integrated systems. These issues are often difficult to identify, especially if the data has gone through a number of transformations on the way to its destination. Process quality can usually be remedied easily once the source of the problem is identified. Proper checks and quality control at each touchpoint along the path can help ensure that problems are rooted out, but these checks are often absent in legacy processes. Such checks can be as simple as reconciling row counts and control totals between stages, as the sketch below shows.
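A hedged sketch of such a touchpoint check follows: it compares row counts and a control total (for example, the sum of an amount column) between two hypothetical consecutive stages of a feed. The stage names, counts, and tolerance are assumptions for illustration.

```python
# Minimal touchpoint check between two hypothetical stages of a data feed.
# In practice the inputs would come from the source file and the staging table.

def check_touchpoint(stage_in: str, stage_out: str,
                     rows_in: int, rows_out: int,
                     total_in: float, total_out: float) -> bool:
    """Compare row counts and a control total between consecutive
    processing stages, and report any discrepancy found."""
    ok = True
    if rows_in != rows_out:
        print(f"{stage_in} -> {stage_out}: row count mismatch "
              f"({rows_in} vs {rows_out})")
        ok = False
    if abs(total_in - total_out) > 0.005:  # small tolerance for rounding
        print(f"{stage_in} -> {stage_out}: control total mismatch "
              f"({total_in} vs {total_out})")
        ok = False
    return ok

# A lost file or mid-run crash typically shows up as a shortfall downstream.
check_touchpoint("extract_file", "staging_table",
                 rows_in=10_000, rows_out=9_874,
                 total_in=1_250_000.00, total_out=1_234_100.50)
```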

Identification Quality

Identification quality problems result from a failure to recognize the relationship between two objects. For example, two similar products with different SKUs are incorrectly judged to be the same. Identification quality may have significant associated costs, such as mailing the same household more than once. Data quality processes can largely eliminate this problem by matching records, identifying duplicates, and placing a confidence score on the similarity of records. Ambiguously scored records can be reviewed and judged by a data steward. Still, the results are never perfect, and determining the proper business rules for matching can involve trial and error, as the sketch below suggests.
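The sketch below illustrates the scoring idea with Python's standard difflib; the thresholds that separate "duplicate", "steward review", and "distinct" are illustrative assumptions, and production matching engines use much richer, field-aware rules.

```python
from difflib import SequenceMatcher

def confidence(a: str, b: str) -> float:
    """Crude similarity score in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def classify(score: float) -> str:
    # Illustrative thresholds: auto-merge, route to a data steward, or keep apart.
    if score >= 0.90:
        return "duplicate"
    if score >= 0.70:
        return "review by data steward"
    return "distinct"

pairs = [
    ("ACME Corp., 12 Main St", "Acme Corporation, 12 Main Street"),
    ("John A. Smith", "Jon Smith"),
    ("Widget 10mm blue", "Gasket 10mm blue"),
]
for a, b in pairs:
    s = confidence(a, b)
    print(f"{s:.2f}  {classify(s):<24}  {a!r} vs {b!r}")
```

Tuning those thresholds is exactly the trial and error mentioned above: set them too loose and distinct records merge; too tight and true duplicates survive.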
Integration Quality

Integration quality, or quality of completeness, can present big challenges for large organizations. Integration quality problems occur because information is isolated by system or departmental boundaries. It might be important for an auto claims adjuster to know that a customer is also a high-value life insurance customer, but if the auto and life insurance systems are not integrated, that information will not be available.

While the desire to have integrated information may seem obvious, the reality is that it is not always apparent. Business users who are accustomed to working with one set of data may not be aware that other data exists or may not understand its value. Data governance programs that document and promote enterprise data can facilitate the development of data warehousing and master data management (MDM) systems to address integration issues.

MDM enables the process of identifying records from multiple systems that refer to the same entity. The records are then consolidated into a single master record. The data warehouse allows the transactional details related to that entity to be consolidated so that its behaviors and relationships across systems can be assessed and analyzed.
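A toy sketch of that consolidation step follows, assuming two records already identified as the same customer are merged by a simple survivorship rule (prefer the most recently updated non-empty value per field). The field names, dates, and rule are illustrative assumptions, not the behavior of any particular MDM product.

```python
from datetime import date

# Records from two systems already matched to the same customer.
auto_rec = {"name": "R. Mehta", "phone": "9811000002", "email": "",
            "updated": date(2013, 6, 1)}
life_rec = {"name": "Rohan Mehta", "phone": "", "email": "r.mehta@example.com",
            "updated": date(2014, 2, 15)}

def consolidate(records: list) -> dict:
    """Build a master record: for each field, take the non-empty value
    from the most recently updated record that has one."""
    ordered = sorted(records, key=lambda r: r["updated"], reverse=True)
    return {field: next((r[field] for r in ordered if r[field]), "")
            for field in ("name", "phone", "email")}

print(consolidate([auto_rec, life_rec]))
# {'name': 'Rohan Mehta', 'phone': '9811000002', 'email': 'r.mehta@example.com'}
```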

Usage Quality

Usage quality often presents itself when data warehouse developers lack access to legacy source documentation or subject matter experts. Without adequate guidance, they are left to guess the meaning and use of certain data elements. Another scenario occurs in organizations where users are given the tools to write their own queries or create their own reports. Incorrect usage may be difficult to detect, and its cost hard to quantify.

Thorough documentation, robust metadata, and user training are helpful and should be built into any new initiative, but gaining support for a post-implementation metadata project can be difficult. Again, this is where a data governance program should be established and a grassroots effort made to identify and document corporate systems and data definitions. This metadata can be injected into systems and processes as it becomes part of the culture to do so. This may be more effective and realistic than a big-bang approach to metadata.

Aging Quality

The most challenging aspect of aging quality is determining at which point the information is no longer valid. Usually, such decisions are somewhat arbitrary and vary by usage. For example, maintaining a former customer's address for more than five years is probably not useful. If customers haven't been heard from in several years despite marketing efforts, how can we be certain they still live at the same address? At the same time, maintaining customer address information for a homeowner's insurance claim may be necessary and even required by law. Such decisions need to be made by the business owners, and the rules should be architected into the solution. Many MDM tools provide a platform for implementing survivorship and aging rules; a simple aging rule of this kind is sketched below.
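A minimal sketch of such an aging rule follows, assuming each record carries a last-contact date and a legal-hold flag. The five-year horizon echoes the example above; everything else is an illustrative assumption.

```python
from datetime import date

# Illustrative aging rule: an address is trusted for marketing use only if
# the customer has been heard from within the last five years, but is always
# retained when a legal hold (e.g., an open insurance claim) applies.
MAX_AGE_YEARS = 5

def address_is_valid(last_contact: date, today: date, legal_hold: bool) -> bool:
    if legal_hold:  # retention may be required by law
        return True
    return (today - last_contact).days <= MAX_AGE_YEARS * 365

today = date(2014, 1, 1)
print(address_is_valid(date(2012, 5, 1), today, legal_hold=False))  # True
print(address_is_valid(date(2007, 3, 1), today, legal_hold=False))  # False: stale
print(address_is_valid(date(2007, 3, 1), today, legal_hold=True))   # True: claim
```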
Organizational Quality

Organizational quality, like entry quality, is easy to diagnose and sometimes very difficult to address. It shares much in common with process quality and integration quality but is less a technical problem than a systemic one that occurs in large organizations. Organizational issues occur when, for example, marketing tries to "tie" its calculations to finance. Financial reporting systems generally take an account view of information, which may be very different from how the company markets the product or tracks its customers. These business rules may be buried in many layers of code throughout multiple systems. However, the biggest challenge to reconciliation is getting the various departments to agree that one department's A equals another's B equals another's C plus D.

A Strategic Approach

The first step in developing a data strategy is to identify where quality problems exist. These issues are not always apparent, and it is important to develop methods for detection. A thorough approach requires inventorying the systems, documenting the business and technical rules that affect data quality, and conducting data profiling and scoring activities that give us insight into the extent of the issues.

After identifying the problems, it is important to assess the business impact and cost to the organization. The downstream effects are not always easy to quantify, especially when it is difficult to detect an issue in the first place. In addition, the cost associated with a particular issue may be small at a departmental level but much greater when viewed across the entire enterprise. The business impact will drive business involvement and investment in the effort.

Finally, once we understand the issues and their impact on the organization, we can develop a plan of action. Data quality programs are multifaceted; a single tool or project is not the answer. Addressing data quality requires changes in the way we conduct our business and in our technology framework. It requires organizational commitment and long-term vision.

The strategy for addressing data quality issues requires a blend of analysis, technology, and business involvement. When viewed from this perspective, an MDM program is an effective approach. MDM provides the framework for identifying quality problems, cleaning the data, and synchronizing it between systems. However, MDM by itself won't resolve all data quality issues.

An active data governance program empowered by chief executives is essential to making the organizational changes necessary to achieve success. The data governance council should set the standards for quality and ensure that the right systems are in place for measurement. In addition, the company should establish incentives for both users and system developers to maintain the standards.

The end result is an organization where attention to quality and excellence permeates the company. Such an approach to enterprise information quality takes dedication and requires a shift in the organization's mindset. However, the results are both achievable and profitable.

Data Warehousing

Data warehouses are one of the foundations of the Decision Support Systems of many IS operations. As defined by the "father of the data warehouse", William H. Inmon, a data warehouse is "a collection of Integrated, Subject-Oriented, Non Volatile and Time Variant databases where each unit of data is specific to some period of time. Data Warehouses can contain detailed data, lightly summarized data and highly summarized data, all formatted for analysis and decision support" (Inmon, 1996). In the "Data Warehouse Toolkit", Ralph Kimball gives a more concise definition: "a copy of transaction data specifically structured for query and analysis" (Kimball, 1998). Both definitions stress the data warehouse's analysis focus and highlight the historical nature of the data found in a data warehouse.

Figure 2: Data Warehousing Structure

Stages of Data Warehousing Susceptible to Data Quality Problems

The purpose of this paper is to formulate a descriptive taxonomy of the issues at all the stages of data warehousing. The phases are:

1) Data Source
2) Data Integration and Data Profiling
3) Data Staging and ETL
4) Database Schema (Modeling)

The quality of data can be compromised depending upon how data is received, entered, integrated, maintained, processed (extracted, transformed and cleansed) and loaded. Data is impacted by numerous processes that bring it into the data environment, most of which affect its quality to some extent. All these phases of data warehousing bear on data quality in the data warehouse. Despite all efforts, a certain percentage of dirty data still remains. This residual dirty data should be reported, stating the reasons why data cleansing failed for it.

Data quality problems can occur in many different ways [9]. The most common include:

Poor data handling procedures and processes.
Failure to adhere to data entry and maintenance procedures.
Errors in the migration process from one system to another (see the validation sketch after this list).
External and third-party data that may not fit the company's data standards or may otherwise be of questionable quality.
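As a minimal sketch of the migration check mentioned in the list above, the fragment below compares a legacy extract with the migrated rows on keys and a few critical fields. All table and field names, and the seeded defect, are illustrative assumptions.

```python
# Minimal post-migration validation sketch: compare a legacy extract with the
# migrated rows on keys and a few critical fields.

legacy = {
    "C101": {"name": "A. Kumar", "balance": 1500.00},
    "C102": {"name": "R. Mehta", "balance": 320.50},
    "C103": {"name": "S. Gupta", "balance": 0.00},
}
migrated = {
    "C101": {"name": "A. Kumar", "balance": 1500.00},
    "C102": {"name": "R. Mehta", "balance": 3205.00},  # shifted decimal point
}

# Keys present in the legacy system but lost during migration.
missing = legacy.keys() - migrated.keys()
print(f"rows lost in migration: {sorted(missing)}")

# Field-level comparison for rows present on both sides.
for key in sorted(legacy.keys() & migrated.keys()):
    for field in ("name", "balance"):
        if legacy[key][field] != migrated[key][field]:
            print(f"{key}.{field}: {legacy[key][field]!r} -> "
                  f"{migrated[key][field]!r}")
```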

The assumption throughout is that data quality issues can arise at any stage of data warehousing, viz. in data sources, in data integration and profiling, in data staging, in ETL, and in database modeling. The following model depicts the stages that are vulnerable to data quality problems.

Figure 1: Data Warehousing Structure

II. CONCLUSION

Data quality is an increasingly serious issue for organizations large and small. It is central to all data integration initiatives. Before data can be used effectively in a data warehouse, or in customer relationship management, enterprise resource planning or business analytics applications, it needs to be analyzed and cleansed. To ensure that high-quality data is sustained, organizations need to apply ongoing data cleansing processes and procedures, and to monitor and track data quality levels over time. Otherwise poor data quality will lead to increased costs, breakdowns in the supply chain and inferior customer relationship management. Defective data also hampers business decision-making and efforts to meet regulatory compliance responsibilities. The key to successfully addressing data quality is to get business professionals centrally involved in the process. Informatica Data Explorer and Informatica Data Quality are unique, easy-to-use data quality software products specifically designed to bridge the gap between, and better align, IT and the business, providing them with what they need to control data quality processes enterprise-wide in order to reduce costs, increase revenue, and improve business decision-making.

REFERENCES

1. Ranjit Singh, Dr. Kawaljeet Singh, "A Descriptive Classification of Causes of Data Quality Problems in Data Warehousing", University College of Engineering (UCoE), Punjabi University, Patiala (Punjab), India.
2. "7 Sources of Poor Data Quality", /0611/2.htm.
3. "Data Quality in Data Warehouse & Business Intelligence", Informatica Data Quality and Informatica Data Explorer.
4. Channah F. Naiman, Aris M. Ouksel (1995), "A Classification of Semantic Conflicts in Heterogeneous Database Systems", Journal of Organizational Computing, Vol. 5, 1995.
5. John Hess (1998), "Dealing With Missing Values In The Data Warehouse", a report of Stonebridge Technologies, Inc., 1998.
6. Jaideep Srivastava, Ping-Yao Chen (1999), "Warehouse Creation - A Potential Roadblock to Data Warehousing", IEEE Transactions on Knowledge and Data Engineering, Vol. 11, No. 1, January/February 1999.
7. Amit Rudra, Emilie Yeo (1999), "Key Issues in Achieving Data Quality and Consistency in Data Warehousing among Large Organizations in Australia", Proceedings of the 32nd Hawaii International Conference on System Sciences, 1999.
8. Jesús Bisbal et al. (1999), "Legacy Information Systems: Issues and Directions", IEEE Software, September/October 1999.
9. Scott W. Ambler (2001), "Challenges with legacy data: Knowing your data enemy is the first step in overcoming it", Practice Leader, Agile Development, Rational Methods Group, IBM, 01 Jul 2001.
10. Scott W. Ambler, "The Joy of Legacy Data", available 2.html.
