ETL Overview: Extract, Transform, Load (ETL)

Transcription

The ETL Process
Original slides were written by Torben Bach Pedersen (Aalborg University 2007, DWML course).

ETL Overview
- Extract, Transform, Load (ETL)
- General ETL issues
  - ETL/DW refreshment process
  - Building dimensions
  - Building fact tables
  - Extract
  - Transformations/cleansing
  - Load
- MS Integration Services

The ETL Process
- The most underestimated process in DW development
- The most time-consuming process in DW development
  - 80% of development time is spent on ETL!
- Extract: extract relevant data
- Transform: transform data to DW format; build keys, etc.; cleanse the data
- Load: load data into the DW; build aggregates, etc.
(The slide's diagram groups these steps into a preparation phase and an integration phase.)

Refreshment Workflow
(Diagram-only slide: the ETL/DW refreshment workflow.)
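To make the three steps concrete, here is a minimal, purely illustrative SQL sketch of the flow; all table and column names (src_orders, stage_orders, fact_orders) are hypothetical, not from the slides:

    -- Extract: pull the relevant rows from a source into the staging area
    SELECT order_id, customer_id, amount, order_date
    INTO stage_orders                    -- SELECT INTO creates the staging table
    FROM src_orders
    WHERE order_date >= '2007-01-01';

    -- Transform: cleanse and reshape to the DW format
    UPDATE stage_orders
    SET amount = 0
    WHERE amount IS NULL;                -- a sample cleansing rule

    -- Load: move the prepared rows into the warehouse fact table
    INSERT INTO fact_orders (customer_id, amount, order_date)
    SELECT customer_id, amount, order_date
    FROM stage_orders;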

ETL In The Architecture
(Diagram: data sources feed the data staging area on the ETL side (extract, transform, load); the query side holds presentation servers with data marts (aggregate-only and atomic data) connected by a bus of conformed dimensions and facts, plus query services: warehouse browsing, access and security, query management, standard reporting, activity monitor; accessed by reporting tools, desktop data access tools, and data mining.)

Data Staging Area (DSA)
- Transit storage for data in the ETL process
  - Transformations/cleansing done here
- No user queries
- Sequential operations on large data volumes
  - Performed by central ETL logic
  - No need for locking, logging, etc.
  - RDBMS or flat files? (DBMSs have become better at this)
- Finished dimensions copied from the DSA to the relevant marts
- Allows centralized backup/recovery
  - Backup/recovery facilities are needed
  - Often too time-consuming to initial-load all data marts after a failure
  - Better to do this centrally in the DSA than in all data marts

ETL Construction Process
Plan
1) Make a high-level diagram of the source-destination flow
2) Test, choose and implement an ETL tool
3) Outline complex transformations, key generation and job sequence for every destination table
Construction of dimensions
4) Construct and test the build of a static dimension
5) Construct and test the change mechanisms for one dimension
6) Construct and test the remaining dimension builds
Construction of fact tables and automation
7) Construct and test the initial fact table build
8) Construct and test the incremental update
9) Construct and test the aggregate build (you do this later)
10) Design, construct, and test ETL automation

Building Dimensions
- Static dimension table
  - DW key assignment: production keys to DW keys using a mapping table
  - Combination of data sources: find a common key?
  - Check one-one and one-many relationships using sorting
- Handling dimension changes
  - Described in the last lecture
  - Find the newest DW key for a given production key
  - The table mapping production keys to DW keys must be updated
- Load of dimensions
  - Small dimensions: replace
  - Large dimensions: load only changes
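The production-key-to-DW-key mapping can live in a small table in the DSA. A sketch under assumed names (product_key_map and its columns are illustrative, not from the slides):

    -- Key map: one row per (production key, DW key version)
    CREATE TABLE product_key_map (
        prod_key   INT,        -- production (source system) key
        dw_key     INT,        -- surrogate DW key
        valid_from DATETIME,   -- when this mapping became current
        valid_to   DATETIME    -- NULL while the mapping is current
    );

    -- Find the newest DW key for a given production key
    SELECT dw_key
    FROM product_key_map
    WHERE prod_key = 1234
      AND valid_to IS NULL;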

Building Fact Tables
- Two types of load
- Initial load
  - ETL for all data up till now
  - Done when the DW is started the first time
  - Very heavy - large data volumes
- Incremental update
  - Move only changes since the last load
  - Done periodically (e.g., monthly or weekly) after DW start
  - Less heavy - smaller data volumes
- Dimensions must be updated before facts
  - The relevant dimension rows for new facts must be in place
  - Special key considerations if the initial load must be performed again

Types of Data Sources
- Non-cooperative sources
  - Snapshot sources - provide only a full copy of the source, e.g., files
  - Specific sources - each is different, e.g., legacy systems
  - Logged sources - write a change log, e.g., DB log
  - Queryable sources - provide a query interface, e.g., RDBMS
- Cooperative sources
  - Replicated sources - publish/subscribe mechanism
  - Call-back sources - call external code (ETL) when changes occur
  - Internal action sources - only internal actions when changes occur (DB triggers is an example)
- Extract strategy depends on the source types

Extract
- Goal: fast extract of relevant data
  - Extract from source systems can take a long time
- Types of extracts:
  - Extract applications (SQL): co-existence with other applications
  - DB unload tools: faster than SQL-based extracts
  - Extract applications are the only solution in some scenarios
- Too time-consuming to ETL all data at each load
  - Extraction can take days/weeks
  - Drain on the operational systems
  - Drain on DW systems
- Extract/ETL only changes since the last load (delta)

Computing Deltas
- Delta = changes since the last load
- Store sorted total extracts in the DSA
  - The delta can easily be computed from the current and the last extract
  - (+) Always possible
  - (+) Handles deletions
  - (-) High extraction time
- Put an update timestamp on all rows (in sources)
  - Updated by a DB trigger
  - Extract only rows where timestamp > time of last extract
  - (+) Reduces extract time
  - (-) Cannot (alone) handle deletions
  - (-) Source system must be changed; operational overhead
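Both delta strategies are easy to express in SQL. A sketch, assuming hypothetical staging tables extract_current and extract_previous and a trigger-maintained updated_at column in the source:

    -- Delta from two stored total extracts: new or changed rows...
    SELECT * FROM extract_current
    EXCEPT
    SELECT * FROM extract_previous;

    -- ...and deletions (rows present before, gone now)
    SELECT * FROM extract_previous
    EXCEPT
    SELECT * FROM extract_current;

    -- Timestamp-based delta: much faster, but cannot see deletions
    SELECT *
    FROM src_orders
    WHERE updated_at > @time_of_last_extract;  -- variable set by the ETL job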

Changed Data Capture
- Messages
  - Applications insert messages in a "queue" at updates
  - (+) Works for all types of updates and systems
  - (-) Operational applications must be changed; operational overhead
- DB triggers
  - Triggers execute actions on INSERT/UPDATE/DELETE
  - (+) Operational applications need not be changed
  - (+) Enables real-time update of the DW
  - (-) Operational overhead
- Replication based on the DB log
  - Find changes directly in the DB log, which is written anyway
  - (+) Operational applications need not be changed
  - (+) No operational overhead
  - (-) Not possible in some DBMSs

Common Transformations
- Data type conversions
  - EBCDIC → ASCII/Unicode
  - String manipulations
  - Date/time format conversions
- Normalization/denormalization
  - To the desired DW format
  - Depending on the source format
- Building keys
  - A table matches production keys to surrogate DW keys
  - Correct handling of history - especially for total reload

Data Quality
- Data almost never has decent quality
- Data in the DW must be:
  - Precise: DW data must match known numbers - or an explanation is needed
  - Complete: the DW has all relevant data, and the users know it
  - Consistent: no contradictory data; aggregates fit with detail data
  - Unique: the same thing is called the same and has the same key (e.g., customers)
  - Timely: data is updated "frequently enough" and the users know when

Cleansing
- BI does not work on "raw" data
  - Pre-processing is necessary for BI analysis
- Handle inconsistent data formats
  - Spellings, codings, ...
- Remove unnecessary attributes
  - Production keys, comments, ...
- Replace codes with text (Why?)
  - City name instead of ZIP code, ...
- Combine data from multiple sources with a common key
  - E.g., customer data from customer address, customer name, ...
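Trigger-based change capture can be as simple as copying every modification into a change table that the ETL process later drains. A minimal SQL Server-style sketch (all names are illustrative; inserts and updates are lumped together for brevity):

    -- Change table drained by the ETL process
    CREATE TABLE customer_changes (
        cust_id     INT,
        change_type CHAR(1),                 -- 'U' = insert/update, 'D' = delete
        changed_at  DATETIME DEFAULT GETDATE()
    );
    GO

    -- Capture inserts and updates on the operational table
    CREATE TRIGGER trg_customer_iu ON customer
    AFTER INSERT, UPDATE AS
        INSERT INTO customer_changes (cust_id, change_type)
        SELECT cust_id, 'U' FROM inserted;
    GO

    -- Capture deletions (the case a timestamp column cannot handle)
    CREATE TRIGGER trg_customer_d ON customer
    AFTER DELETE AS
        INSERT INTO customer_changes (cust_id, change_type)
        SELECT cust_id, 'D' FROM deleted;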

Types Of Cleansing
- Conversion and normalization
  - Text coding, date formats, etc.
  - The most common type of cleansing
- Special-purpose cleansing
  - Normalize spellings of names, addresses, etc.
  - Remove duplicates, e.g., duplicate customers
- Domain-independent cleansing
  - Approximate, "fuzzy" joins on records from different sources
- Rule-based cleansing
  - User-specified rules, if-then style
  - Automatic rules: use data mining to find patterns in the data

Cleansing (cont.)
- Mark facts with a Data Status dimension
  - Normal, abnormal, outside bounds, impossible, ...
  - Facts can be taken in/out of analyses
- Mark facts with changed status
  - New customer, customer about to cancel contract, ...
- Uniform treatment of NULL
  - Use an explicit NULL value rather than "special" values (0, -1, ...)
  - Use NULLs only for measure values (estimates instead? e.g., guess a missing sales person based on customer and item)
  - Use special dimension keys for NULL dimension values
  - Why? Avoid problems in joins, since NULL is not equal to NULL

Improving Data Quality
- Appoint a "data quality administrator"
  - Responsibility for data quality
  - Includes manual inspections and corrections!
- Source-controlled improvements
  - The optimal?
- Construct programs that check data quality
  - Are totals as expected?
  - Do results agree with an alternative source?
  - Number of NULL values?
- Do not fix all problems with data quality
  - Allow management to see "weird" data in their reports?
  - Such data may be meaningful for them (e.g., fraud detection)

Load
- Goal: fast loading into the DW
  - Loading deltas is much faster than a total load
- SQL-based update is slow
  - Large overhead (optimization, locking, etc.) for every SQL call
  - DB load tools are much faster
- Indexes on tables slow the load a lot
  - Drop the index and rebuild it after the load
  - Can be done per index partition
- Parallelization
  - Dimensions can be loaded concurrently
  - Fact tables can be loaded concurrently
  - Partitions can be loaded concurrently
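The special-dimension-keys advice can be implemented by reserving a fixed "unknown" member in each dimension and mapping NULLs to it during the fact load. A sketch with hypothetical table and column names:

    -- Reserve a fixed surrogate key for unknown/missing members
    INSERT INTO dim_salesperson (sp_dw_key, sp_name)
    VALUES (0, 'Unknown');

    -- Map missing sales persons to the special key during the fact load,
    -- so later equi-joins never drop rows (NULL is not equal to NULL)
    INSERT INTO fact_sales (sp_dw_key, product_dw_key, amount)
    SELECT COALESCE(d.sp_dw_key, 0), p.product_dw_key, r.amount
    FROM raw_sales r
    LEFT JOIN dim_salesperson d ON d.sp_prod_key = r.salesperson_id
    JOIN dim_product p ON p.product_id = r.product_id;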

Load (cont.)
- Relationships in the data
  - Referential integrity and data consistency must be ensured (Why?)
  - Can be done by the loader
- Aggregates
  - Can be built and loaded at the same time as the detail data
- Load tuning
  - Load without log
  - Sort the load file first
  - Make only simple transformations in the loader
  - Use loader facilities for building aggregates
- Should the DW be on-line 24*7?
  - Use partitions or several sets of tables (like MS Analysis)

ETL Tools
- ETL tools from the big vendors
  - Oracle Warehouse Builder
  - IBM DB2 Warehouse Manager
  - Microsoft Integration Services
- Offer much functionality at a reasonable price
  - Data modeling
  - ETL code generation
  - Scheduling DW jobs
- The "best" tool does not exist
  - Choose based on your own needs
  - Check first if the "standard tools" from the big vendors are OK

Issues
- Pipes
  - Redirect output from one process to the input of another process
  - ls | grep 'a' | sort -r
- Files versus streams/pipes
  - Streams/pipes: no disk overhead, fast throughput
  - Files: easier restart, often the only possibility
- ETL tool or not
  - Code: easy start, co-existence with IT infrastructure
  - Tool: better productivity on subsequent projects
- Load frequency
  - ETL time depends on data volumes
  - Daily load is much faster than monthly
  - Applies to all steps in the ETL process

MS Integration Services
- A concrete ETL tool
- Example ETL flow
- Demo
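On SQL Server, the drop-index-then-rebuild advice combined with a bulk loader might look like this sketch (index, table and file names are hypothetical; BULK INSERT is SQL Server's bulk load facility):

    -- Drop the index so the load does not maintain it row by row
    DROP INDEX ix_fact_sales_date ON fact_sales;

    -- Bulk-load the (pre-sorted) delta file; TABLOCK helps minimal logging
    BULK INSERT fact_sales
    FROM 'C:\dsa\sales_delta.dat'
    WITH (TABLOCK);

    -- Rebuild the index once, after all rows are in
    CREATE INDEX ix_fact_sales_date ON fact_sales (date_dw_key);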

Integration Services (IS)
- Microsoft's ETL tool
  - Part of SQL Server 2005
- Tools
  - Import/export wizard - simple transformations
  - BI Development Studio - advanced development
- Functionality available in several ways
  - Through the GUI - basic functionality
  - Programming - advanced functionality

Packages
- The central concept in IS
- A package holds:
  - Sources, connections
  - Control flow
  - Tasks, workflows
  - Transformations
  - Destinations
- Arrows: green (success), red (failure)

Package Control Flow
- "Containers" provide
  - Structure to packages
  - Services to tasks
- Control flow
  - Task host container - provides services to a single task
  - Sequence container - groups tasks and containers into control flows that are subsets of the package control flow
  - For loop container - repeats tasks by testing a condition
  - Foreach loop container - repeats tasks by using an enumerator

Tasks
- Data Flow - runs data flows
- Data Preparation Tasks
  - File System - operations on files
  - FTP - up/down-load data
- Workflow Tasks
  - Execute Package - execute other IS packages, good for structure!
  - Execute Process - run an external application/batch file
- SQL Server Tasks
  - Bulk Insert - fast load of data
  - Execute SQL - execute any SQL query
- Scripting Tasks
  - Script - execute VB .NET code
- Analysis Services Tasks
  - Analysis Services Processing - process dims, cubes, models
  - Analysis Services Execute DDL - create/drop/alter cubes, models
- Maintenance Tasks - DB maintenance

Event Handlers
- Executables (packages, containers) can raise events
- Event handlers manage the events
- Similar to event handlers in languages such as Java and C#

Data Flow Elements
- Sources
  - Make external data available
  - All ODBC/OLE DB data sources: RDBMS, Excel, text files, ...
- Transformations
  - Update, summarize, cleanse, merge, distribute
- Destinations
  - Write data to a specific store
  - Create an in-memory data set
- Input, output, error output

Transformations
- Business intelligence transformations
  - Term Extraction - extracts terms from text
  - Term Lookup - looks up terms and finds term counts
- Row transformations
  - Character Map - applies string functions to character data
  - Derived Column - populates columns using expressions
- Rowset transformations (rowset = tabular data)
  - Aggregate - performs aggregations
  - Sort - sorts data
  - Percentage Sampling - creates a sample data set by setting a percentage
- Split and join transformations
  - Conditional Split - routes data rows to different outputs
  - Merge - merges two sorted data sets
  - Lookup - looks up reference values by exact match
- Other transformations
  - Export Column - inserts data from a data flow into a file
  - Import Column - reads data from a file and adds it to a data flow
  - Slowly Changing Dimension - configures the update of an SCD

A Simple IS Case
- Use BI Dev Studio/Import Wizard to copy TREO tables
- Save in
  - SQL Server
  - File system
- Look at the package structure
  - DROP, CREATE, source, transformation, destination
- Look at the package parts
- Execute the package
  - Error messages?
- Steps execute in parallel
  - But dependencies can be set up
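In plain SQL, the package structure the wizard generates (DROP, CREATE, then source to transformation to destination) corresponds roughly to the following sketch; table names are illustrative:

    -- DROP and CREATE the destination, as the generated package does
    DROP TABLE dst_customer;
    CREATE TABLE dst_customer (
        cust_id INT,
        name    VARCHAR(100)
    );

    -- Source -> (trivial) transformation -> destination in one statement
    INSERT INTO dst_customer (cust_id, name)
    SELECT cust_id, UPPER(name)          -- a sample transformation
    FROM src_customer;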

ETL Demo
- Load data into the Product dimension table
  - Construct the DW key for the table by using IDENTITY
  - Copy data to the Product dimension table
- Load data into the Sales fact table
  - Join the "raw" sales table with the other tables to get the DW keys for each sales record
  - The output of the query is written into the fact table

ETL Part of Mini Project
- Available from the mini-project web page

A Few Hints on ETL Design
(Slide truncated in the transcription.)
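A sketch of what the demo describes, with hypothetical table and column names: an IDENTITY column generates the surrogate DW key as the dimension is loaded, and the fact load joins the raw sales rows to the dimension to swap production keys for DW keys:

    -- Dimension: IDENTITY generates the surrogate DW key on insert
    CREATE TABLE dim_product (
        product_dw_key INT IDENTITY(1,1),
        product_id     INT,               -- production key
        product_name   VARCHAR(100)
    );

    INSERT INTO dim_product (product_id, product_name)
    SELECT product_id, product_name
    FROM raw_product;

    -- Fact: join raw sales to the dimension to look up the DW key
    INSERT INTO fact_sales (product_dw_key, sale_date, amount)
    SELECT p.product_dw_key, r.sale_date, r.amount
    FROM raw_sales r
    JOIN dim_product p ON p.product_id = r.product_id;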
