ETL Testing Analyzer - CORE

Transcription

ETL Testing Analyzer-MASTER THESISIn Partial Fulfillmentof the Requirements for the DegreeInformation Technologies for Business IntelligencebyClaudiu Ionut MurarDecember,2014Author: Claudiu Ionut MurarSupervisors: Oscar Romero, Alberto Abello

iiABSTRACTETL testing techniques are nowadays widely used into the data integration process. Thesetechniques refer to the ability of being able to check whether the information that is loaded intothe data warehouse has correctly followed all the transformations steps. Due to the errors thatmight occur during the extraction, transformation and load stages, there is a need to monitor andhandle the errors that can cause severe data quality issues into the data warehouse.This thesis paper is based on the information gathered from a previous project that wasperformed at UPC. The main goal of this project was to help the professors from UPC who areteaching the “Database” course, to better analyze the performance of the students. Therefore, anETL system was implemented with the goal of extracting student information from multiplesources, transform and load it into the data warehouse. This information can refer to the student’spersonal data, the exercises they are performing, etc. The initial ETL design was based on thecreation of the data warehouse schema containing the main dimensions and fact tables. The mainissue is that it did not present any monitoring and error handling functionalities, even though thesystem was generating several errors every time the ETL was executed. The steps I havefollowed while working on this thesis project have been to model the initial ETL process using aBPMN representation, include error handling and monitoring functionalities and ultimately redesign the initial ETL processes using a chosen tool. Although the initial processes weremodelled using Pentaho Kettle, due to the new requirements regarding the error handling andmonitoring capabilities I had to perform a comprehensive ETL tool comparison to check what isthe tool that can better answer the requirements of this project.The last part of my thesis was to build a web application that can allow the final user of thissystem to interactively run the ETL from the application interface. The user should be allowed toinput the necessary data sources that will be fetched by the ETL extraction components andinsert the necessary database connection parameters. Moreover, after the user runs/executes theETL flow the application should display information about the existing errors that might haveoccurred during the execution, in a user friendly and understandable manner.

iiiACKNOWLEDGMENTSI am grateful to all the people who supported me throughout my thesis work. I would like toexpress my sincere gratitude to my advisors Oscar Romero and Alberto Abello for thecontinuous support of my study and research, for their patience, motivation and enthusiasm.Their guidance helped me during all the time of research and writing of this thesis.I would also like to thank the rest of my professors for their encouragement, insightfulcomments, and all the help they have provided me. Last but not the least, I would like to thankmy family: my parents Marian Murar and Titiana Murar, for supporting me spirituallythroughout my life. I am deeply grateful to my sister as well, who has been advising and morallysupporting me during the last 6 months.Barcelona, December 2014MURAR CLAUDIU IONUT

ivTABLE OF CONTENTSAbstract .iiAcknowledgments . iiiList of Tables . viiList of Figures .viii1. INTRODUCTION. 11.1BACKGROUND . 11.2 CONTEXT . 21.3 MOTIVATION . 21.4 PROBLEM AREA . 31.5 OBJECTIVES . 41.6 INITIAL PLANNING . 41.7STRUCTURE OF THE REPORT. 52. RESEACH BACKGROUND . 72.1 ETL TESTING TECHNIQUES . 72.2 ETL TESTING CHALLENGES . 72.3ETL TOOLS . 82.3.1 General Information . 82.3.2 ETL tools comparison criteria . 92.4 CONCEPTUAL DESIGN OF ETL PROCESSES .103. ETL CONCEPTUAL DESIGN USING BPMN . 123.1INITIAL ETL DESIGN .123.1.1Data Sources .123.1.2 Main dimensions and Fact Tables .143.2ETL DESIGN ERRORS .153.2.1 ETL failing causes . 163.2.2 Design consistencies . 173.2.3 Missing functionalities.173.3 BPMN REPRESENTATION: STATIC LEVEL . 173.3.1Create Monitor/Error logging tables . 183.3.2 Generate ETL ID .203.3.3 Insert Candidate Dimension . 203.3.4 Insert Delivery System Dimension . 223.3.5Insert Experiment Dimension . 233.3.6Insert Item Dimension . 233.3.7Insert Thematic Dimension . 253.3.8Insert Request Time Dimension . 253.3.9Create/Update Temporary Tables . 263.3.10 Experiments Fact Table . 263.3.11 Invalid Fact Table . 273.3.12 Valid Fact Table .273.3.12.1 Update response processing . 293.4 BPMN REPRESENTATION: DYNAMIC LEVEL . 293.4.1 CONTROL FLOW .29

v3.4.2 Insert Candidate Dimension . 313.4.2.1 Compute course and semester information . 323.4.3 Insert Delivery System Dimension . 333.4.4Insert Experiment Dimension . 343.4.5Insert Item Dimension . 343.4.6Insert Thematic Dimension . 353.4.7Insert Request Time Dimension . 353.4.8Create/Update Temporary Tables . 363.4.9 Experiments Fact Table . 373.4.10 Invalid Fact Table . 383.4.11 Valid Fact Table .393.4.11.1 Update response processing .414. ETL TOOLS . 424.1 TOOLS COMPARISON .434.2 TOOL SELECTION JUSTIFICATION . 455. ETL DESIGN USING TALEND OS . 465.1 CONTROL FLOW.465.2 CREATE LOGGING TABLES .475.2.1 Monitoring tables . 475.2.2 Error logging tables .

ETL testing concepts appeared for the specific challenges that can occur during the data integration stage. These challenges can refer to: integrating data from multiple sources and populating a data warehouse with information. When dealing with ETL operations, the first step is to extract data and prepare it for further transformations. The transform and load steps are normally applying a set .