Benchmarking ETL Workflows - TPC

Transcription

Benchmarking ETL WorkflowsAlkis Simitsis1, Panos Vassiliadis2, Umeshwar Dayal1, Anastasios Karagiannis2, Vasiliki Tziovara21Intelligent Information Management LabHewlett-Packard LabsPalo Alto, CA, USA2University of IoanninaGreecealkis@hp.comJoint work with Qiming Chen, Bin Zhang, Ren Wupresented by Kevin Wilkinson1 2008 Hewlett-Packard Development Company, L.P.The information contained herein is subject to change without notice 2008 Hewlett-Packard

ETL workflowsDS.PS NEWDS.PS NEW1.PKEY,DS.PS OLD1.PKEYSUPPKEY 1DS.PS1.PKEY,LOOKUP PS.SKEY,SUPPKEYCOSTDATE1DIFF1DS.PS1Add SPK1SK1rejectedDS.PS OLDA2EDate 2 rejectedUrejected1DS.PS NEWDS.PS NEW2.PKEY,DS.PS OLD2.PKEYSUPPKEY 2LogLogLogDS.PS2.PKEY,LOOKUP PS.SKEY,SUPPKEYCOSTDATE SYSDATEQTY 02DIFF2DS.PS2Add SPK2NotNULLSK2rejectedDS.PS OLDAddDateCheckQTYrejected2LogLogDSAPKEY, DAYMIN(COST)S1 DAYS2 PARTSUPPFTP2Sources2A.Simitsis, P. Vassiliadis, U. Dayal, A. Karagiannis, V. Tziovara @TPC-TC’09, Lyon, France – August 24, 2009TIMEDW V1PKEY, MONTHAVG(COST)Aggregate2V2

ETL Tools Commercial Ab Initio SAP Business Objects IBM WebSphere Information Integration Informatica PowerCenter Microsoft SSIS Oracle Warehouse Builder Pervasive SAS Data Integration Studio Open Source Clover Pentaho Kettle Talend3A.Simitsis, P. Vassiliadis, U. Dayal, A. Karagiannis, V. Tziovara @TPC-TC’09, Lyon, France – August 24, 2009

ETL Tools 1. ActaWorks , Acta Technologies26. Data EXTRactor , DogHouse Enterprises2. Amadea , ISoft27. Data Flow Manager , Peter's Software3. ASG-XPATH,AllenSystemsGroup28.eIntegrationData Junction,Content, Data Junction51. DataProF , IT Consultancy Group BV76.Suite, TavizExtractorTechnology4. AT SigmaW-Import , Advanced29.EnvironmentData Manager, Joe Spanicek52. DataPropagator, IBM Technologies77.Manager, WhiteLight Technology101.iMergence, iMergenceTechnologies126.MineWorks/400, Computer Professional Systems5. AutoImport, WhiteCraneSystems30.e-SenseData Mapper,,AppliedDatabase Technology53. DataProvider, OrderSoftware tionTools6. igrationTools,Friedman&Associates54. DataPump for SAP R/3 , Transcope AG79. ETI Extract , Evolutionary Technologies,Inc.103. InfoLink/400, Batcom, DatawatchCorporation7. Blue55.DataMiner, PeopleSoftInformationBuilders, Inc.PL/Loader, Hanlon Consulting179.,, PPDInformaticsDataStageXE ,DataAscentialSoftware80.Engine, 128.FireSprout104. InfoManager, InfoManager129.Mozart, ,Magma8. Catalyst, are152.PointOut, mSEGmbHOy180.TextAgent ,SolutionsTasc,Inc.56. DataSuite, SolutionsPathlightDataSystems81.Manager, InfoPump,ComputerAssociates130.mpower,AbInitio9. acorporate153.Power*Loader181. TextPipeSoftware Australia57. Datawhere, MiabSystems Ltd.Suite , SQL Power Group82. eWorker Portal, eWorkerLegacy, orm, Cymfony , Powersoft83.131.MRESolutionsIQ10. CerebellumPortalIntegrator, CerebellumSoftware35.eXadasData3 ,, InformInformationSystems , ationTextProc200058. DataX, twork , D2KNatQuery, NatWorks,Inc11. Checkmate, BitbyBitInternational36.e-zMigrateDataBlaster, 132.2 , Bus-Tech,Inc.155.PowerMart,Ltd.Informatica183. Textractor, Textkernel59. DataXPress, etConvert,TheWorkstationGroup, Ltd.12. wTechnology156.PowerStage , Sybase184. Tilion , Tilion60. DB/Access, Datastructure85. EZ-Pickin's , ExcelSystems109. ,InfoScannerWisoSoftCom134.NGS-IQ, New Generation13. ert, MetadataInformationPartners,Software157.Rapid, Data, Open UniversalSoftware185. TransporterFountainDigital Fountain61.TextMapDBMS/Copy, ConceptualSoftware,Inc.86., SoftLink110.InScribe, CriticalPath House, Liant Software CorporationDataStager ,, ComputerNSX Software14. Compleo, Symtrax39.File-AID/ExpressDataDigger135., ssociates62. temTechConsulting15. aBaseTechnology159. SystemsRelational Tools , Princeton Softech187.NetworkViewShark, infoShark63. DEAP I , DEAP88. FileSpeed , ComputerTechnology112.ISIE, Artaud,Courthéoux& Associés137.OLAPData Migrator , Legacy to Web Solutions16. ConversionsPlus, DataViz41.FormwareDatagration, Paladyne64. DecisionBaseComputer89., CaptivaSoftware160., ReTarGet,AssociatesTominy188.Vignette Business Integration Studio , Vignette113. John Convert/VSAMHenry, Acme Software138.OmniReplicator, Lakeview Technology17. Convert/IDMS-DB,, Forecross Corporation42.FOXTROTDataImport, SpaldingSoftware65. DecisionStream,Cognos90.,EnableSoft,Inc.161. Rodin , Coglin Mill Pty Ltd.189. Visual Warehouse , IBM114.KM.Studio,Knowmadic139.OpalisRendezVous, Opalis18. ataLever66. DECISIVE 162.Advantage91. Fusion FTMS , ProginetRoll-Up, ,InfoSAGE,IronbridgeInc.Software190. Volantia , VolantiaLiveTransfer, IntellicorpOpenExchange , CorporationIST19. Gate/1DataLoad, 140.SoftwareTechnologies67. ta163. Sagent Solution , Sagent Technology, Inc.191. vTag Web , Connotate Technologies116., CorporationBMC Software141.OpenMigrator20. CrossXpress, CrossAccess45. DataManager, JoeSpanicek , PrismTech68. DETAIL,LOADPLUSStrivaTechnology164. SAS/WarehouseAdminstrator , SAS Institute 93. Génio , Hummingbird192.CommunicationsWaha , BeaconLtd.Information s142.OpenWizardProfessional, OpenDataSystems21. . DistributionAgentfor MVS, Sybase, Appligator.com94. Gladstone Conversion, Gladstone165.SchemerAdvanced193.PackageWarehouse, Taurus ComputerSoftware edSolutions,Inc.22. Cyklop, Tokab SoftwareAB Management47.GoHunterDataMiner, Gordian, Placer Group70. DocuAnalyzer,Mobius95.Data166. Scribe Integrate , Scribe Software Corporation194. Warehouse Executive , Ardent SoftwareMarketDrive, I-Impact, IncS.L.144.OracleWarehouseBuilder, Oracle Corporation23. Data, APE48.GraphicalDataMirrorConstellarHub, DataMirrorCorporation71.CycleDQ119.Now, SoftwareDQ ScriptoriaNowComponents96.Performance, VanguardSolutions167., Bunker Hill195. SeriesWarehousePlus , tSystems24. Server,DataMirrorCorporation72. DQtransformTechnologies97. Harvester , Object196.TechnologyUK WorkbenchInc.168., MetagonSERdistiller, SER SolutionsWarehouse, Systemfabrik121. Mercator, TSI International146.Outbound25. Data, XSB50.HIRELDataPipe, CrystalSoftware , Firesign Computer Company73.ExchangeDT/Studio, EmbarcaderoTechnologies98., SWSSoftware169.Signiant , Signiant197.ServicesWeb Automation , webMethods122.Meta Integration Works , Meta Integration Technologies147., Pinnacle Software74. DTS, Microsoft99. HummingbirdETL Parse-O-Matic,198.Hummingbird170. SIPINA PRO , DiagnosWeb DataLtdKit , ,GuySoftware75. eCartographyAMB Dataminers,Inc.100. iManageData , BioCompSystems171.,SpeedLoader, BenchmarkConsulting199. WebMining , Blossom Software124. MetaTrans , Metagenix149. pcMainframe , cfSOFTWARE172. SRTransport , Schema Research Corp.200. Web Replicator , Media Consulting125. MinePoint , MinePoint150. PinnPoint Plus , Pinnacle Decision Systems173. StarQuest Data Replicator , StarQuest Software201. WebFOCUS ETL Manager , Information Builders, Inc.174. StarTools , StarQuest202. WebQL , Caesius Software175. Stat/Transfer , Circle Systems203. WhizBang! Extraction Library , WhizBang! Labs176. Strategy , SPSS204. Wizport , Turning Point177. Sunopsis , Sunopsis205. Xentis , GrayMatter Software Corporation178. SyncSort Unix , Syncsort206. XSB , XSB Inc.Commercial Ab Initio SAP Business Objects IBM WebSphere Information Integration Informatica PowerCenter Microsoft SSIS Oracle Warehouse Builder Pervasive SAS Data Integration Studio Open Source Clover Pentaho Kettle Talend4206 ETL tools, current as of 2003http://www.dblab.ntua.gr/ asimi/A.Simitsis, P. Vassiliadis, U. Dayal, A. Karagiannis, V. Tziovara @TPC-TC’09, Lyon, France – August 24, 2009

Outline 5MotivationGoal of the benchmark Effectiveness EfficiencyBenchmark parameters Experimental parameters Measured effectsETL flows Micro-level: activities Macro-level: workflowsSpecific scenariosOpen issuesA.Simitsis, P. Vassiliadis, U. Dayal, A. Karagiannis, V. Tziovara @TPC-TC’09, Lyon, France – August 24, 2009

Outline 6MotivationGoal of the benchmark Effectiveness EfficiencyBenchmark parameters Experimental parameters Measured effectsETL flows Micro-level: activities Macro-level: workflowsSpecific scenariosOpen issuesA.Simitsis, P. Vassiliadis, U. Dayal, A. Karagiannis, V. Tziovara @TPC-TC’09, Lyon, France – August 24, 2009

Motivation An ETL benchmark can be used as a comparison method for ETL tools ETL methods (algorithms) ETL designs for experimenting with ETL workflows for optimizing ETL workflows logical [ICDE05, TKDE05] and physical [DOLAP07] optimization QoX-driven optimization [EDBT09, SIGMOD09] what are the important problem parameters & what are the realisticvalues for them? what test suites should we use?7A.Simitsis, P. Vassiliadis, U. Dayal, A. Karagiannis, V. Tziovara @TPC-TC’09, Lyon, France – August 24, 2009

Motivation 8Existing standards are insufficient TPC-H TPC-DSPractical cases are not publishable and hard to findWe resort in devising our own ad-hoc test scenarios either through a specific set of scenarios or, through a scenario generator (will not touch this here)A.Simitsis, P. Vassiliadis, U. Dayal, A. Karagiannis, V. Tziovara @TPC-TC’09, Lyon, France – August 24, 2009

Outline 9MotivationGoal of the benchmark Effectiveness EfficiencyBenchmark parameters Experimental parameters Measured effectsETL flows Micro-level: activities Macro-level: workflowsSpecific scenariosOpen issuesA.Simitsis, P. Vassiliadis, U. Dayal, A. Karagiannis, V. Tziovara @TPC-TC’09, Lyon, France – August 24, 2009

Goal of this work 10We are interested in understanding The important parameters to be tuned in an experiment &the appropriate values for them The appropriate measures to be measured during anexperiment The fundamental families of activities performed in an ETLscenario The frequent ways with which activities and recordsetsinterconnect in an ETL scenarioA.Simitsis, P. Vassiliadis, U. Dayal, A. Karagiannis, V. Tziovara @TPC-TC’09, Lyon, France – August 24, 2009

Fundamental goals of any ETL flow Effectiveness Quality objectives as performance, recoverability, reliability, freshness, maintainability,scalability, availability, flexibility, robustness, affordability,consistency, traceability, auditability Data should respect both database and business rules Typical questions Q1. Does the workflow execution reach the maximum possible levelof data freshness, completeness, and consistency in the warehousewithin the necessary time (or resource) constraints? Q2. Is the workflow execution resilient to occasional failures? Q3. Is the workflow easily maintainable?11A.Simitsis, P. Vassiliadis, U. Dayal, A. Karagiannis, V. Tziovara @TPC-TC’09, Lyon, France – August 24, 2009

Fundamental goals of any ETL flow Efficiency Typically ETL processes should run within strict timewindows Achieving high performance enables other qualities as well Typical questions 12Q4. How fast is the workflow executed?Q5. What degree of parallelization is required?Q6. How much pipelining does the workflow use?Q7. What resource overheads does the workflow incur at the source,intermediate (staging), and warehouse sites?A.Simitsis, P. Vassiliadis, U. Dayal, A. Karagiannis, V. Tziovara @TPC-TC’09, Lyon, France – August 24, 2009

Outline 13MotivationGoal of the benchmark Effectiveness EfficiencyBenchmark parameters Experimental parameters Measured effectsETL flows Micro-level: activities Macro-level: workflowsSpecific scenariosOpen issuesA.Simitsis, P. Vassiliadis, U. Dayal, A. Karagiannis, V. Tziovara @TPC-TC’09, Lyon, France – August 24, 2009

Experimental parameters 14Parameters for the measurement of ETL workflows: P1. the size of the workflow P2. the structure of the workflow P3. the size of input data originating from the sources, P4. the workflow selectivity P5. the values of probabilities of failure, P6. the latency of updates at the warehouse P7. the required completion time P8. the system resources (e.g., memory, processing power) P9. the “ETL workload” and the number of instances of theworkflows that should run concurrentlyA.Simitsis, P. Vassiliadis, U. Dayal, A. Karagiannis, V. Tziovara @TPC-TC’09, Lyon, France – August 24, 2009

Measures Q1. Measures for data freshness and data consistency % data that violate business rules / are not present at the DW Q2. Measures for the resilience to failures MTBF, MTTR, #rec points, resumption type, #replicas, ETL uptime Q3. Measures for maintainability (qualitative objective) Flow length, complexity, modularity, coupling Q4. Measures for the speed of the overall process Throughput of workflow execution: regular, w/ failures, avg latency per tuple inregular execution Q5. Measures for partitioning parallelism Partition type, number/length/data volume of b

Pentaho Kettle Talend 4 1. ActaWorks , Acta Technologies 26. Data EXTRactor , DogHouse Enterprises 2. Amadea , ISoft 27. Data Flow Manager , Peter's Software 3. ASG-XPATH , Allen Systems Group 28. Data Junction, Content Extractor , Data Junction 4. AT Sigma W-Import , Advanced Technologies 29. Data Manager , Joe Spanicek 5. AutoImport , White Crane Systems 30. Data