6.0 ETL: Extraction, Transformation And Loading

Transcription

6.0 ETL: Extraction, Transformation and Loading
Eugene Rex L. Jalao, Ph.D.
Associate Professor
Department of Industrial Engineering and Operations Research
University of the Philippines Diliman
Module 2 of the Business Intelligence and Analytics Track of UP NEC and the UP Center of Business Intelligence

Outline for This Training
1. Introduction to Data Warehousing
2. DW Lifecycle and Project Management
   – Case Study on DW PM
3. Dimensional Modeling
4. Designing Fact Tables
5. Designing Dimension Tables
   – Case Study on Dimension Modeling
6. Extraction, Transformation and Loading
   – Case Study on ETL Planning
7. Transformation and Loading Methodologies
   – Case Study on ETL

Outline for This Session
- What is ETL?
- ETL Process
  – Kimball’s 38 Step Method
  – Schmitz ETL Roadmap
- HW/DB Architecture Considerations
- ETL Tools
- Extraction
  – Data Profiling
  – Source-to-Target Mapping
- Case Study

The Kimball Lifecycle
(Diagram: the Kimball Lifecycle)

Mindset
- The importance of data quality
- Scalability
- Maintainability
- It is cheaper and faster to do things right the first time

What is ETL?
- Extracting and cleaning data from source systems, transforming it to the desired dimensional form, and loading it into the target data warehouse structure (dimensional)
- According to Kimball: Extract, Clean and Conform, Deliver, Manage
- Also according to Schmitz: Extract, Clean, Transform, and Load

ETL System
(Diagram: source systems – OLTP databases, ERP, other databases, external data – feed the ETL system, which loads the data warehouse; analysts and managers/decision makers then use it for exploratory data analysis, data mining, optimization, and simulation. A "no data warehouse" option, querying the sources directly, is shown for contrast.)

Disclaimer
The target is not a copy of operational data:
- Content enhanced
- It is cleaned
- It is conformed
- Integrated
- Historical context added to historical transaction or event data

The Harsh Reality
- ETL is a major failure point in data warehousing
- Underestimating the effort involved in the ETL process is rampant
- Underestimating data quality problems is a prime culprit
- Providing for contextual history is another
  – Have management make the decision whether the goal is to provide optimal analytical capabilities to the business or lessen the ETL effort
- Scalability and performance are crucial

ETL Overview
- It is not a one-time event, as new data is added to the Data Warehouse periodically – monthly, daily, hourly
- Because ETL is an integral, ongoing, and recurring part of a data warehouse, it must be:
  – Automated
  – Well documented
  – Easily changeable

The Good News
- A standardized approach and proven techniques and templates can exponentially lessen the amount of effort required and can ensure scalability and performance
- You must start from here; you probably won’t be able to go back and re-design or re-code later

Where Do We Start?
Understand our target data warehouse environment:
- The Business Accessible Data Warehouse
- Dimensional Schema Basics
- Dimension Warehouse Keys
- Dimension Attribute History Tracking
- ETL in the DW Project

Outline for This Session
- What is ETL?
- ETL Process
  – Kimball’s 38 Step Method
  – Schmitz ETL Roadmap
- HW/DB Architecture Considerations
- ETL Tools
- Extraction
  – Data Profiling
  – Source-to-Target Mapping
- Case Study

Need for an ETL Process
- 70% of the effort to build the data warehouse is ETL
- Too many data warehouses are built by trial-and-error ETL
- Kimball and Schmitz say you need an ETL methodology for consistency and productivity
- Kimball defines 38 steps in his high-level ETL methodology
- Schmitz’s methodology focuses on HOW to build the tables using intermediate tables

ETL Process Parts
- Data Sourcing
  – Document source data
  – Analyze source data (format, values, quality)
  – Determine the amount of history to be initially loaded
    · Is it all in the same form?
    · Is there contextual history?
  – Determine the currency requirement
  – Determine how to incrementally extract new and changed data

ETL Process Parts
- Model target
- Map source elements to target elements
- Define transformation rules
- Develop the extraction and transport processes
- Develop the transformation processes
- Develop the load and update processes

The Historical Load
- Understand if all historical data is in the same format or not
- Test data quality and processes extensively with smaller samples of data
- With the caveat that you must do volume testing to make sure that your processes are scalable
- Break up the production load into manageable pieces

Dimension and Fact ETL Processes
Facts and dimensions:
- Should be designed and built separately
- Design dimensions first – they provide the keys for fact tables
- Some special dimensions, like transactions, may be processed during the fact table process

Kimball’s 38 Subsystems
- Recommended best-practices list of the components of an ETL system for any data warehouse
- Bad news: 38 components!
- Good news: exhaustive
- See Exhibit 9

Kimball’s 38 Subsystems
- Extract – 3 steps
  – Gathering raw data and writing it to disk before any processing is performed
- Transform (Clean and Conform) – 5 steps
  – Cleaning the data and creating conformed dimensions and facts
- Load (Deliver) – 17 processes
  – Structuring and loading the data into relational and multidimensional structured dimensional databases
- Manage – 13 processes
  – Managing the entire ongoing ETL process – automatically and manually

Schmitz’ ETL Method
- A one-page template provides a straightforward roadmap that can be reused to build and maintain all dimensions
- Uses intermediate tables in a staging area to organize and stage data as it goes through the ETL process
- The ETL process can be hand-coded or built with ETL tools

Schmitz’ Generic ETL Roadmap
(Diagram: Schmitz’ generic ETL roadmap)

Intermediate Tables with Naming Convention
- D: Dimension table
- F: Fact table
- S: Source table – contains all data copied directly from a source file
- X: eXtract table – contains changed source data only; changes may be from an incremental extract or derived from a full extract
- C: Clean table – contains source data rows that have been cleaned

Intermediate Tables with Naming Convention
- E: Error table – contains error rows found in source data
- M: Master table – maintains history of all clean source rows
- T: Transform table – contains the data resulting from a transformation of source data (merges, splits, transformations of clean rows from one or more source tables)

Intermediate Tables with Naming Convention
- I: Insert table – contains new data to be inserted in dimension tables
- U: Update table – contains changed data to update dimension tables
- H: History table – contains dimension history table data

Intermediate Table Formats
- S, X, C, E, and M tables are in source table (S table) format
- T, U, and I tables are in target table (D table) format without the warehouse key
- D and H tables are in D table format with the warehouse key
- F tables are in F table format (these conventions are sketched below)
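
A minimal sketch of how these conventions and formats might look for one dimension, assuming a hypothetical customer source with a single feed (all table and column names are illustrative):

    -- S: raw copy of the source table           (source format)
    CREATE TABLE s_customer (
        customer_id   INTEGER,
        customer_name VARCHAR(100),
        zip_code      CHAR(4)
    );

    -- X: changed source rows only               (source format)
    CREATE TABLE x_customer AS SELECT * FROM s_customer WHERE 1 = 0;

    -- C: cleaned rows / E: rejected error rows  (source format)
    CREATE TABLE c_customer AS SELECT * FROM s_customer WHERE 1 = 0;
    CREATE TABLE e_customer AS SELECT * FROM s_customer WHERE 1 = 0;

    -- T: transformed rows                       (target format, no warehouse key)
    CREATE TABLE t_customer (
        customer_id   INTEGER,
        customer_name VARCHAR(100),
        zip_code      CHAR(4)
    );

    -- D: the dimension itself                   (target format, with warehouse key)
    CREATE TABLE d_customer (
        customer_key  BIGINT PRIMARY KEY,
        customer_id   INTEGER,
        customer_name VARCHAR(100),
        zip_code      CHAR(4)
    );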

We Want the Optimal Balance
- Breaks development into logical, debuggable steps
- Provides restart points
- Allows processing to be done in steps as data is available
- Allows detailed auditing of each step in the process
- Does impact performance, so it needs to be done optimally

Why Intermediate Tables
It’s a tradeoff between:
- Development time
- Availability
- Restartability
- Auditing
- Performance
- Data arrival time differences

Outline for This Session
- What is ETL?
- ETL Process
  – Kimball’s 38 Step Method
  – Schmitz ETL Roadmap
- HW/DB Architecture Considerations
- ETL Tools
- Extraction
  – Data Profiling
  – Source-to-Target Mapping
- Case Study

Architectural Considerations
- Hardware and database architecture
- Amount of processing done on the source system
- What types of tools will be used

Hardware and Database Architecture Considerations
Separate hardware server for ETL processing and the Data Warehouse:
- Ideal would be to devote extra resources to the DW when ETL processing is not happening
- Same OS type for publishing efficiency
- Con: more resources

Hardware and Database Architecture Considerations
Same server for both ETL and DW (DW access definitely impacted during ETL processing):
- Separate database instances
  – Pro: can configure DB parameters for ETL; not really necessary most of the time
  – Con: more system resources used
- Same database instance
  – Pro: less system resources used
  – Con: only logical separation, by schema owner

Source Processing
- If the operational system is constrained, get data as efficiently as possible
- If the extract window on the operational system is constrained, make sure that nothing can prevent the extract from running (even a down network) during its window
- Do as much cleaning as possible here (may be better feedback to encourage source system fixes)

Source System Quality Enhancers
- Establish enforcement of required input fields like "Zip Code" (see the sketch below)
- Provide drop-down data windows/lookups
- Online de-duplication: a new entry looks for potential matches and prompts to continue with the add or accept one of the matches
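
A minimal sketch of source-side enforcement for a required field, assuming a hypothetical customer entry table and PostgreSQL-style CHECK syntax (the four-digit rule mirrors Philippine zip codes; adjust to local rules):

    -- Required fields are rejected at entry time, so blanks and
    -- malformed codes never reach the extract.
    CREATE TABLE customer_entry (
        customer_id   INTEGER PRIMARY KEY,
        customer_name VARCHAR(100) NOT NULL,
        zip_code      CHAR(4)      NOT NULL
            CHECK (zip_code ~ '^[0-9]{4}$')   -- PostgreSQL regex match
    );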

Outline for This Session
- What is ETL?
- ETL Process
  – Kimball’s 38 Step Method
  – Schmitz ETL Roadmap
- HW/DB Architecture Considerations
- ETL Tools
- Extraction
  – Data Profiling
  – Source-to-Target Mapping
- Case Study

Source to Target Packages
- Several companies have strong ETL tools and a fairly complete suite of supplementary tools
- Three general types of source-to-target tools:
  – Code generators: actually compile ETL code, typically COBOL, used by several large companies that run mainframes
  – Engine-based: easy-to-use graphic interfaces and interpreter-style programs
  – Database-based: manual coding using SQL statements augmented by scripts

Major Advantages of Engine-Based Tools
- Automatic metadata capture
  – After initial development, this is key
- Myths:
  – Standardization: ETL programmers may not be consistent with the process; better to follow standard processes
  – Faster development: you need to learn the tool first
  – Cheaper developers: experienced developers with better ETL tools are higher paid

Disadvantages of Engine-Based Tools
- Performance
  – But the difference is easing as ETL tools mature
- Databases are getting more native capabilities
  – Microsoft has its own ETL system
- Custom programming can be poor-performing and non-scalable also
- Speed of development decreases when the project gets complex

Well Known ETL Tools
Commercial:
- Ab Initio
- IBM DataStage
- Informatica PowerCenter
- Microsoft SQL Server Integration Services
- Oracle Data Integrator
- SAP BusinessObjects Data Integrator
- SAS Data Integration Studio

Well Known ETL Tools
Open-source based:
- Adeptia Integration Suite
- Apatar
- CloverETL
- Pentaho Data Integration (Kettle)
- Talend Open Studio/Integration Suite
- R/RStudio

ETL Recommendations
- Set standards and develop highly scalable templates
- Integrate DB stored procedure use
- Ensure that the tool can use native database fast bulk-load capabilities (nologging)
- Get supplemental, non-vendor qualified training
- Use high-performing, scalable methods and standards regardless of whether you use an ETL tool or custom coding

ETL Tools
- The "best" tool does not exist
- Choose based on your own needs
- Check first if the "standard tools" from the big vendors are OK

Outline for This Session
- What is ETL?
- ETL Process
  – Kimball’s 38 Step Method
  – Schmitz ETL Roadmap
- HW/DB Architecture Considerations
- ETL Tools
- Extraction
  – Data Profiling
  – Source-to-Target Mapping
- Case Study

Extraction Considerations
- Availability: data not available due to nonexistent data or poor data quality
- Restartability: restart from the beginning
- Performance: processing options
- Scalability: scaling up for future expansion
- Auditing: tracking errors
- Responsibility: have source system stewards be responsible for extraction

List of Data Extraction Issues
- Source identification: identify source applications and source structures
- Method of extraction: define whether the extraction process is manual or tool-based
- Extraction frequency: establish how frequently the data extraction must be done – daily, weekly, and so on
- Time window: for each data source, denote the time window for the extraction process

List of Data Extraction Issues
- Job sequencing: determine whether the beginning of one job in an extraction job stream has to wait until the previous job has finished successfully
- Exception handling: determine how to handle input records that cannot be extracted

Options for Data Extraction
- Immediate data extraction
  – Capture through transaction logs
  – Capture through database triggers
  – Capture in source applications
- Deferred data extraction
  – Capture based on date and time stamp
  – Capture by comparing files

Log Mining
(Diagram: the source GUI writes to the source database; its transaction logs feed the staging area, which loads the DW)

Log Mining
- DBMSs keep detailed before and after records related to each transaction
- Use these logs for extracting new or changed data
  – Use online logs for current changes
  – Use archived logs for older changes

Database Triggers
(Diagram: trigger programs on the source database write changed rows to the staging area, which loads the DW)

Database Triggers
- Occurs right at the source and is therefore quite reliable
- You can capture both before and after images (see the sketch below)
- Building and maintaining trigger programs puts an additional burden on the development effort
- Execution of trigger procedures during transaction processing of the source systems puts additional overhead on the source systems
- This option is applicable only for source data in databases
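
A minimal trigger sketch, assuming PostgreSQL, a hypothetical customer source table, and a change-capture table in a staging schema (all names are illustrative):

    -- One row per change, carrying both before and after images.
    CREATE TABLE stage.customer_changes (
        change_type CHAR(1)   NOT NULL,               -- 'I', 'U', or 'D'
        changed_at  TIMESTAMP NOT NULL DEFAULT now(),
        old_row     JSONB,                            -- before image (U, D)
        new_row     JSONB                             -- after image  (I, U)
    );

    CREATE OR REPLACE FUNCTION capture_customer_change() RETURNS trigger AS $$
    BEGIN
        IF TG_OP = 'INSERT' THEN
            INSERT INTO stage.customer_changes (change_type, new_row)
            VALUES ('I', to_jsonb(NEW));
        ELSIF TG_OP = 'UPDATE' THEN
            INSERT INTO stage.customer_changes (change_type, old_row, new_row)
            VALUES ('U', to_jsonb(OLD), to_jsonb(NEW));
        ELSE
            INSERT INTO stage.customer_changes (change_type, old_row)
            VALUES ('D', to_jsonb(OLD));
        END IF;
        RETURN NULL;  -- AFTER trigger: return value is ignored
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER customer_cdc
    AFTER INSERT OR UPDATE OR DELETE ON customer
    FOR EACH ROW EXECUTE FUNCTION capture_customer_change();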

Capture in Source Applications
(Diagram: the source program behind the source GUI writes copies of changed data to the staging area, which loads the DW)

Capture in Source Applications
- Once data is updated in the source applications, e.g. ERP or POS systems, copies of new or changed data are forwarded to the data staging area immediately or by batch
- Additional burden on developers
- Additional performance overhead on live applications

Capture Based on Date and Time Stamp
(Diagram: an extract program selects from the source database using date and time stamps and writes to the staging area, which loads the DW)

Capture Based on Date and Time Stamp
- If a source record gets deleted in between two extract runs, the information about the delete is not detected
- You can get around this by marking the source record for delete first, doing the extraction run, and then going ahead and physically deleting the record
- This means you have to add more logic to the source applications (see the sketch below)
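
A minimal timestamp-driven extract sketch, assuming the source table carries last_updated and deleted_flag columns and that the two bind variables come from an ETL control table (all assumptions; names are illustrative):

    -- Pull rows inserted, updated, or soft-deleted since the last run
    -- into the X (eXtract) table.
    INSERT INTO x_customer
    SELECT customer_id, customer_name, zip_code,
           deleted_flag,                     -- 'Y' = marked for delete
           last_updated
    FROM   customer
    WHERE  last_updated >  :last_extract_time
    AND    last_updated <= :this_extract_time;

    -- Physically delete only rows whose delete mark was captured above.
    DELETE FROM customer
    WHERE  deleted_flag = 'Y'
    AND    last_updated <= :this_extract_time;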

Capture by Comparing Files
(Diagram: a file comparison program diffs today’s extract against yesterday’s extract and writes the differences to the staging area, which loads the DW)

Capture by Comparing Files
- If none of the above techniques is feasible for specific source files in your environment, then consider this technique as the last resort
- This technique is also called the snapshot differential technique because it compares two snapshots of the source data (see the sketch below)
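
A minimal snapshot-differential sketch, assuming today’s and yesterday’s full extracts are staged as two tables (hypothetical names; IS DISTINCT FROM is PostgreSQL syntax that treats NULLs as comparable values):

    -- New or changed rows: in today's snapshot, absent or different yesterday.
    SELECT t.*
    FROM   s_customer_today t
    LEFT JOIN s_customer_yesterday y USING (customer_id)
    WHERE  y.customer_id IS NULL                              -- new row
       OR  (t.customer_name, t.zip_code)
           IS DISTINCT FROM (y.customer_name, y.zip_code);    -- changed row

    -- Deleted rows: in yesterday's snapshot but gone today.
    SELECT y.*
    FROM   s_customer_yesterday y
    LEFT JOIN s_customer_today t USING (customer_id)
    WHERE  t.customer_id IS NULL;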

Comparison of Methodologies
Log Mining
- Advantages: performance of source system not affected; no revisions of existing source applications; no internal costs
- Disadvantages: not much flexibility for source capture; cannot be used on file-based systems
Database Triggers
- Advantages: no revisions of existing source applications; no internal costs
- Disadvantages: not much flexibility for source capture; cannot be used on file-based systems; performance of source systems affected a bit
Capture in Source Applications
- Advantages: good flexibility for capture specifications; can be used on file-based systems
- Disadvantages: performance of source systems affected a bit; high internal costs due to development; major revisions on existing systems

Comparison of Methodologies
Capture Based on Date and Time Stamp
- Advantages: good flexibility for capture specifications; performance of source system not affected; can be used on file-based systems; little or no internal costs
- Disadvantages: major revisions on existing systems
Capture by Comparing Files
- Advantages: good flexibility for capture specifications; performance of source system not affected; no revisions of existing source applications; little or no internal costs
- Disadvantages: major revisions on existing systems

General Processing Options
- ETL tools
  – Most common
- Programs with database calls
  – Heavy reliance on SQL statements
- SQL set-processing statements

Set-Based vs Cursor-Based SQL Operations
- Use set-based whenever possible
- Cursor-based can cause severe performance degradation
  – Processes transactions one row at a time

Cursor-Based Example

    OPEN cursor FOR SELECT * FROM input_table;
    LOOP
        FETCH cursor INTO input_row;
        EXIT WHEN NOT FOUND;
        SELECT * INTO dim_row FROM dim_table
        WHERE  dim_table.operational_id = input_row.operational_id;
        IF found THEN
            IF dim_row differs from input_row THEN
                UPDATE dim_table SET ...;         -- changed: update in place
            END IF;                               -- unchanged: ignore
        ELSE
            INSERT INTO dim_table VALUES (...);   -- new dimension row
        END IF;
    END LOOP;
    CLOSE cursor;

Set-Based Example

    -- Update dimension rows that already exist, directly from the input table
    UPDATE dim_table dim
    SET    (attr1, attr2) = (SELECT input.attr1, input.attr2
                             FROM   input_table input
                             WHERE  input.operational_id = dim.operational_id)
    WHERE  EXISTS (SELECT 1 FROM input_table input
                   WHERE  input.operational_id = dim.operational_id);

    -- Insert input rows not yet present in the dimension
    INSERT INTO dim_table
    SELECT input.*
    FROM   input_table input
    WHERE  NOT EXISTS (SELECT 1 FROM dim_table dim
                       WHERE  dim.operational_id = input.operational_id);

Set-Based vs Row-Based
(Figure: performance comparison of set-based and row-based processing)

Set-Based vs Cursor-Based SQL Operations
Set-based:
- Pros
  – Processing efficiency (10X)
  – Scalable (nologging/parallel)
- Cons
  – Must be sure to construct auditable processes
  – Multiple steps often required (still faster than issuing thousands of logged SQL instructions)
  – Complex logic not straightforward

Set-Based vs Cursor-Based SQL Operations
Cursor-based:
- Pros
  – Most programmers are skilled at this
  – Complex logic fairly straightforward
- Cons
  – Performance
  – Scalability

Outline for This Session
- What is ETL?
- ETL Process
  – Kimball’s 38 Step Method
  – Schmitz ETL Roadmap
- HW/DB Architecture Considerations
- ETL Tools
- Extraction
  – Data Profiling
  – Source-to-Target Mapping
- Case Study

Data Quality Issues
- Three drivers of the need for data quality:
  – "If only I could see the data, I could manage it better" – heard from all levels
  – Integration of disparate data, often from distributed systems scattered worldwide
  – Increased demand for compliance
- Some data can be corrected in the ETL process, but not all; some can only be corrected by source stewards
- If data is too dirty, the DW project may have to be put on hold

About Data Profiling
- Each source must be profiled, that is, examined for data quality issues
- This applies to each table or file and the database structure itself; this is the basis for cleaning data during the ETL process
- Failure to identify these issues results in embarrassingly long and even fatal ETL processes and/or data warehouse efforts that fail due to inconsistent data

About Data Profiling
- Generally there are more issues with older mainframe databases than with newer relational databases, but both often have many problems
- Data sources can be profiled manually, usually using SQL queries or Excel AutoFilters (see the sample queries below)
- For larger or more complex sources, you may wish to employ a profiling tool such as Pentaho, Evoke, SAS DataFlux, Ascential Discovery, or others
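
A minimal manual-profiling sketch in SQL, assuming a hypothetical staged source table s_customer; each query probes one of the issue types listed on the next slide:

    -- Is the table or a key column effectively empty?
    SELECT COUNT(*) AS total_rows,
           COUNT(customer_name) AS non_null_names
    FROM   s_customer;

    -- Blank or NULL values in a text column
    SELECT COUNT(*)
    FROM   s_customer
    WHERE  customer_name IS NULL OR TRIM(customer_name) = '';

    -- Duplicate natural keys
    SELECT customer_id, COUNT(*)
    FROM   s_customer
    GROUP  BY customer_id
    HAVING COUNT(*) > 1;

    -- Value distribution: spot inconsistent or invalid codes
    SELECT status_code, COUNT(*)
    FROM   s_customer
    GROUP  BY status_code
    ORDER  BY COUNT(*) DESC;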

Common Data Quality Issues: Tables and Files
- Empty source tables (no data)
- Empty table columns
- Duplicate rows or natural keys
- Blank or null values in text/character columns
- Inconsistent values in columns
- Invalid values in columns
- Unparsed fields such as City/State/Postal Code
- Codes about a single item from two or more sources that do not match, such as a customer or a product

Common Data Quality Issues: Database Structures
- Incorrect relationships between tables, thereby violating 3NF or referential integrity
- Database structures that do not represent current business rules
- Data in two or more different files about a single subject (such as customers) where the unique identifiers for an individual instance (a customer) do not match

Reasons for "Dirty" Data
- Dummy values
- Absence of data
- Multipurpose fields
- Cryptic data
- Contradicting data
- Inappropriate use of address lines
- Violation of business rules
- Reused primary keys
- Non-unique identifiers
- Data integration problems

Data Cleansing
- Source systems contain "dirty data" that must be cleansed
- ETL software contains rudimentary data cleansing capabilities
- Specialized data cleansing software is often used; important for performing name and address correction and householding functions
- Leading data cleansing vendors include Vality (Integrity), Harte-Hanks (Trillium), and Firstlogic (i.d.Centric)

Steps in Data Cleansing
(Diagram: parsing, correcting, standardizing, matching, consolidating)

Parsing
- Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files (see the sketch below)
- Example:
  Address: 21 Osmeña St. Brgy UP Campus, Quezon City, 1101
  – Street Address: 21 Osmeña St.
  – Barangay: UP Campus
  – City: Quezon City
  – Zip Code: 1101
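
A toy parsing sketch for the address above, assuming a hypothetical raw_address table and PostgreSQL’s split_part; production cleansing tools use far more robust, rule-driven parsers:

    SELECT split_part(split_part(address, ',', 1), ' Brgy ', 1) AS street_address,
           split_part(split_part(address, ',', 1), ' Brgy ', 2) AS barangay,
           trim(split_part(address, ',', 2))                    AS city,
           trim(split_part(address, ',', 3))                    AS zip_code
    FROM   raw_address;
    -- '21 Osmeña St. Brgy UP Campus, Quezon City, 1101'
    --   -> '21 Osmeña St.' | 'UP Campus' | 'Quezon City' | '1101'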

Correcting
- Corrects parsed individual data components using sophisticated data algorithms and secondary data sources
- Examples include replacing a vanity address and adding a zip code

Standardizing
- Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules
- Examples include adding a prename, replacing a nickname, and using a preferred street name

Matching
- Searching and matching records within and across the parsed, corrected and standardized data based on predefined business rules to eliminate duplications (see the sketch below)
- Examples include identifying similar names and addresses
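
A minimal fuzzy-matching sketch, assuming PostgreSQL with the fuzzystrmatch extension and a cleaned c_customer table; the edit-distance threshold of 2 is an illustrative business rule, not a standard:

    CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

    -- Candidate duplicate pairs: same zip code, near-identical names.
    SELECT a.customer_id AS id_1, b.customer_id AS id_2,
           a.customer_name, b.customer_name
    FROM   c_customer a
    JOIN   c_customer b
      ON   a.customer_id < b.customer_id       -- list each pair once
     AND   a.zip_code = b.zip_code             -- blocking: cheap pre-filter
    WHERE  levenshtein(lower(a.customer_name), lower(b.customer_name)) <= 2;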

Consolidating
- Analyzing and identifying relationships between matched records and consolidating/merging them into ONE representation (see the sketch below)
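
A minimal survivorship sketch, assuming matched records have already been assigned a shared match_group_id and that "most recently updated wins" is the merge rule (both are assumptions; DISTINCT ON is PostgreSQL-specific):

    -- One surviving row per match group: the most recently updated record.
    SELECT DISTINCT ON (match_group_id)
           match_group_id, customer_id, customer_name, zip_code
    FROM   matched_customer
    ORDER  BY match_group_id, last_updated DESC;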

Data Quality Exercise
- Assume the legacy data below is source data for a data warehouse you are building
  – What problems are there with this data?
  – What would you do to solve these problems?
(The slide shows a sample of legacy data, not transcribed)

Outline for This Session
- What is ETL?
- ETL Process
  – Kimball’s 38 Step Method
  – Schmitz ETL Roadmap
- HW/DB Architecture Considerations
- ETL Tools
- Extraction
  – Data Profiling
  – Source-to-Target Mapping
- Case Study

Some Examples of Data Source Origins
- Home-grown operational applications
- Home-grown departmental applications
- Home-grown spreadsheets, Access databases, or personal databases
- External purchased data
- Industry applications
- Third-party ERP, SCM, SFA, HR, and Web analytics applications

TechTarget, BI Research, IT Market Strategy (2010): Sources of Data
(Chart of data sources from the survey)

A New Dimension to Data Warehousing: 2011 IOUG Data Warehousing Survey: Sources of Data
(Chart of data sources from the survey)

Full or Incremental Extracts
- Dimension changes may be hard to detect
  – No update timestamp or dimension transaction history
- Fact table data: transaction- and event-based records are usually easy to detect and extract incrementally
  – Not always: log mining and replication may help here

Avoid a Complete Refresh of a Dimension
- You will usually lose some type of history
- Consumes a lot of extra computing resources
- Should only be used in special cases
- Usually only applicable in very small data warehouses where the operational system retains all history
- You will probably still lose some contextual history

Multiple Dimension Sources: Where to Get Data?
(Diagram)

Fact Table Sources
- Mostly transactional and event-based tables or records
- New records are usually easy to detect
- Corrected or updated transaction records may not be easy to detect

High Level Source-to-Target Map (Template)
- See Exhibit 7

Develop Detailed Source-to-Target Map
- See Exhibit 7

Detailed Source-to-Target Map: Target, History, Data Quality
- See Exhibit 7

Outline for This Session
- What is ETL?
- ETL Process
  – Kimball’s 38 Step Method
  – Schmitz ETL Roadmap
- HW/DB Architecture Considerations
- ETL Tools
- Extraction
  – Data Profiling
  – Source-to-Target Mapping
- Case Study

Case Study: Data Profiling and Source-to-Target Mapping
- Northwind Database

References
- Kimball, Ralph, Margy Ross, Warren Thornthwaite, Joy Mundy, and Bob Becker. The Data Warehouse Lifecycle Toolkit, Second Edition. Wiley, 2008. ISBN 978-0-470-14977-5.
- Schmitz, Michael D. UC Irvine Data Warehousing Notes (2014), High Performance Data Warehousing.
- http://hornad.fei.tuke.sk/ 1028-2055.ppt
- Simon, Alan. CIS 391 PPT Slides.
- Jeltema, Bernie. UC Irvine Data Warehousing Notes (2014), Strategic Frameworks, Inc.