

Developing a Data Quality Scorecard that Measures Data Quality in a Data Warehouse

A thesis submitted for the degree of Doctor of Philosophy

By Aderibigbe Grillo

College of Engineering, Design and Physical Sciences
Brunel University
November 2018

ABSTRACT

The main purpose of this thesis is to develop a data quality scorecard (DQS) that aligns the data quality needs of the data warehouse (DW) stakeholder group with selected data quality dimensions. To comprehend the research domain, a general and systematic literature review (SLR) was carried out, after which the research scope was established. Using Design Science Research (DSR) as the methodology to structure the research, three iterations were carried out to achieve the research aim. In the first iteration, with DSR used as a paradigm, the artefact was built from the results of the general and systematic literature review, and a DQS was conceptualised. The results of the SLR and the recommendations for designing an effective scorecard provided the input for the development of the DQS. The usability of the DQS was validated with the System Usability Scale (SUS), and the results of the first iteration suggest that the DW stakeholders found the DQS useful. The second iteration evaluated the DQS further through a run-through in the fast-moving consumer goods (FMCG) domain, followed by semi-structured interviews. The thematic analysis of the interviews showed that the stakeholder participants found that the DQS is transparent, serves as an additional reporting tool, integrates, is easy to use, is consistent, and increases confidence in the data. However, the timeliness dimension was found to be redundant, necessitating a modification to the DQS. The third iteration followed the same steps as the second, but used the modified DQS in the oil and gas domain. The results from the third iteration suggest that the DQS is a useful tool that is easy to use on a daily basis. The research contributes to theory by demonstrating a novel approach to DQS design. This was achieved by ensuring that the design of the DQS aligns with the data quality concern areas of the DW stakeholders and with the data quality dimensions. Further, this research lays a good foundation for the future by establishing a DQS model that can be used as a base for further development.

This thesis is dedicated to the Grillo Family

Table of Contents

Table of Contents
List of Figures
List of tables
Acknowledgement
Chapter 1: Introduction
  1.1 Overview
  1.2 Research background
  1.3 The Research Problem
  1.4 Research Aim and Objective
  1.5 Research Methodology
  1.6 Thesis Layout
Chapter 2: Research Design
  2.1 Introduction
  2.2 Design Science Research (DSR) Paradigm
  2.3 Research Methods and Techniques
  2.4 Practical Application of DSR in this Research
    2.4.1 First DSR Iteration Cycle
    2.4.2 Second DSR Iteration Cycle
    2.4.3 Third DSR Iteration Cycle
  2.5 Summary
Chapter 3: Data warehouse, Data Quality Dimensions and Scorecards Literature
  3.1 Introduction
  3.2 The Data Warehouse Domain
    3.2.1 Data Quality
    3.2.2 Dimensions of Data Quality
    3.2.3 Data Quality Model Foundations
    3.2.4 Data Quality in Data warehouses
    3.2.5 Data Quality Tools
    3.2.6 State of the art of Data quality
  3.3 Stakeholders and Data Quality Goals
    3.3.1 Stakeholder Data Quality Perception
    3.3.2 Stakeholder Data quality Concern Areas
  3.4 Designing an effective Data Quality Scorecard – DQS
  3.5 Existing Scorecard Design – Systematic Literature Review
    3.5.1 Systematic Literature Review Analysis
    3.5.2 Limitations of Existing DQS
  3.6 Summary
Chapter 4: DQS Model Development and Validation – Iteration I
  4.1 Overview
  4.2 DQS Model Development
  4.3 Scorecard Mechanics – Electronic and Web-centric DQS Development
    4.3.1 Walkthrough of the Web-centric DQS
  4.4 DQS Validation
    4.4.1 Data Collection Techniques
    4.4.2 SUS Questionnaire Design
    4.4.3 Participants
    4.4.4 Procedure
    4.4.5 SUS Evaluation Results
    4.4.6 Analysis of results
  4.5 Summary
Chapter 5: DQS Evaluation – Iteration II
  5.1 Introduction
  5.2 About Brewing Company Ltd
    5.2.1 The Data Quality Problem at Brewery Ltd
    5.2.2 Web-centric DQS – Iteration 2
  5.3 Evaluation of DQS
    5.3.1 Participants
    5.3.2 Procedure
    5.3.3 Data Collection Mechanism
    5.3.4 Semi-Structured Interviews
    5.3.5 Results of Evaluation
    5.3.6 Analysis of Results
    5.3.7 Discussion
  5.4 Summary
Chapter 6: DQS Evaluation – Iteration III
  6.1 Introduction
  6.2 About company Oil and Gas Ltd. (O&G Ltd.)
    6.2.1 The Data Quality Problem at Oil and Gas Ltd
    6.2.2 Web-centric DQS – Iteration III
  6.3 Evaluation of DQS
    6.3.1 Participants
    6.3.2 Procedure
    6.3.3 Data Collection Mechanism
    6.3.4 Semi-Structured Interviews
    6.3.5 Results of Evaluation
    6.3.6 Analysis of Results
    6.3.7 Discussion
  6.4 Summary
Chapter 7: Conclusions and Further Research
  7.1 Overview
  7.2 Research Summary
  7.3 Research Contribution
    7.3.1 Contribution to Practice
    7.3.2 Contribution to Theory
  7.4 Reflection of Research Methodology
  7.5 Research Limitations
  7.6 Future Work
  7.7 Personal Reflection
References
Appendix
  Appendix A: Questionnaire items
  Appendix B: SUS Raw Questionnaire Results
  Appendix C: Inter-Question Correlation Matrix
  Appendix D: Qualitative Data Items – Brewery Ltd
  Appendix E: Qualitative Data Items – Oil and Gas Ltd

List of Figures

Figure 1: Data Quality Dimensions and description
Figure 2: Data Quality approaches
Figure 3: Summary of Iteration activities
Figure 4: Design Science Research Methodology
Figure 5: Design Research Outputs
Figure 6: Data Warehouse Environment
Figure 7: Dimensions of Data Quality
Figure 8: Hierarchy of Data Quality issues
Figure 9: Data warehouse quality factors
Figure 10: Capability Maturity Model Levels for data warehouse
Figure 11: Potential Data quality issue areas
Figure 12: Design and Administration Quality Dimensions
Figure 13: Data usage quality dimensions
Figure 14: DW stages susceptible to issues of data quality
Figure 15: Structure of understanding relationships between stakeholder groups and data quality dimensions in data warehouse surroundings
Figure 16: Relationship between stakeholder kinds of data quality dimensions and classifications
Figure 17: Traffic light measurement indicator
Figure 18: Relationship between stakeholder, data quality dimensions and classifications
Figure 19: Proposed DQS framework
Figure 20: Data Quality Scorecard front page
Figure 21: CSS code structure
Figure 22: DQS Landing page
Figure 23: Login page after selection of stakeholder group
Figure 24: DQS Questions aligned with DQD and stakeholder requirements
Figure 25: SUS questionnaire items
Figure 26: Summary of results of the first DSR Iteration cycle
Figure 27: Brewery Ltd. SAP benefit Chart
Figure 28: Brewery Ltd.’s SAP data warehouse Modelling workbench 1
Figure 29: Brewery Ltd.’s SAP data warehouse Modelling workbench 2
Figure 30: DQS Website Login screen
Figure 31: Stakeholder specific login screen
Figure 32: Data Producer DQS
Figure 33: Data Custodian DQS
Figure 34: Data Manager DQS
Figure 35: Data Consumer DQS
Figure 36: DQS Email screen
Figure 37: DQS report area
Figure 38: List of reports
Figure 39: Individual Stakeholder DQS report
Figure 40: Thematic map of initial 9 central themes
Figure 41: Final thematic map with 6 main themes
Figure 42: Summary of results of the second DSR Iteration cycle
Figure 43: Structured Query Language Data Quality Metric
Figure 44: DQS Login Screen
Figure 45: Stakeholder specific login screen
Figure 46: DQS version selection screen
Figure 47: Data Consumer DQS v2
Figure 48: Data Custodian DQS v2
Figure 49: Data Manager DQS v2
Figure 50: DQS Email screen v2
Figure 51: DQS report area
Figure 52: Stakeholder list of reports
Figure 53: Individual Stakeholder DQS report
Figure 54: Thematic map of initial 10 central themes
Figure 55: Final thematic map with 7 main themes
Figure 56: Summary of results of the third DSR Iteration cycle
Figure 57: Data quality scorecard stakeholder group selection screen
Figure 58: Log on screen based on initial role selection
Figure 59: Web-based scorecard
Figure 60: DQS v2 Screen

List of tables

Table 1: Paradigms, methodologies and Methods
Table 2: Design Science Research Outputs
Table 3: Measures and improvement strat
