Transcription
DEEP DIVEModern Data Pipelines:Improving Speed, Governance and Analysiswww.dmradio.biz
Featured Speakers
Constraints Drive DesignWhen conditions change, objectives mustHighway design commensurate with trafficNo more Moore, massive parallelism better?Maybe some applications really should dieNo time like the present to begin anews!
ArchitectureMatters
Engineering RevolutionEnabled SkyscrapersBy Joe Mabel
Adopted by the EU, but affects the USA
Don’t Forget the Basics: Costs Matter! Modern solutions havecost structures too Project planning willalways be a moving target Build in some financialbuffers to help ensurelong-term success
Share your plans withkey stakeholders!Good communicationhelps to ensure success!
Process Matters: Continuous Improvement Is Key The faster you see value, the more engagedyour stakeholders will be Create a virtuous circle of improvement byspreading the wealth Evangelize success stories; pat your users onthe back whenever appropriate Starting small is important, but have a longterm plan in mind; this can always change “Say yes” whenever possible, even if it’s atentative “yes” for the near term
In search of theperfect data s tackA brief history of Datawarehousing, ETL, BI, andData Governance
OVERVIEW OFPRESENTATION-History - looking at trends-Dates are roughly statedTAYLOR BROWNCOO & Cofoundertaylor@fivetran.com
2000’s
2000’s Data Stackhttp://www.bogotobogo.com/Hadoop/BigData hadoop OLTP vs OLAP.php
2000’s Warehouse- OLAP CubesFast option for analytics vsOLTP DatabasesGenerally slow and expensiveinfrastructureCost for 1GB The -OLAP-cube-is-history/ba -p/231673#.WynTwhJKjMU
2000’s Data Pipelines- ETLExtract, Transformation & LoadInformatica or custom code Heavily cus tomizedT ype, column, table mappingT rans form data prior to loadAggregations performed inpipeline
2000’s Data Governance Hardened systems Centralized planning Good Data Governance
2000’s BI ToolsHeavy Monolithic BI tools for ReportingCognos, Hyperion, Microstrategy W hat happened in pas t?V ery AccurateV ery inflexible.Hardened s ys temsMonths to changehttps://www.element61.be/en/resource/sap -business-warehouse -business-objects -front -end-integration -what -available-today
2000s Total Stack 5 Toolshttp://www.bogotobogo.com/Hadoop/BigData hadoop OLTP vs OLAP.php
2000s Team Structure 6 TeamsExecutive Team / ManagementProject adoop/BigData hadoop OLTP vs OLAP.phpAnalystsBusinessUsers
2006
2006’s Challenges with OLAP Data AvailabilityInflexibilitySpeedCompromise with end usersData volumes
2006’s Warehouse- Column Store MMP On -Prem DBEach node has a portion ofthe data (sharded tables!)Column -Store designed foranalytical queriesMassively Parallel Processing(MPP) - Queries (jobs) divided upbetween the nodes in thecluster, each one does a portionof the workQueryTeradata, HP Vertica, IBMNetezza, Oracle Exadata LeaderNodeFollower Nodes
2006’s Stack
2006’s 5 Tools & 6 TeamsExecutive Team / ManagementProject ManagementEngineeringITAnalystsBusinessUsers
2008
2008’s Self Service BIAsking of data, why did thishappen?Tableau, Qlik Drill downE xploreS till in data s ilosMultiple vers ions of truth
2008’s Data GovernanceMore data, more consumers. More complex data. Multipleversion of the same truth. Decentralized BI tools.Herding Cats!
2011
2011’s Challenges with on -prem MPP Column store warehousesVariety of DataVariety of Analytics
2011’s Hadoop to the Rescue!Built to:Scale to Big DataHandle all forms of dataAllow any type of analytics
2011’s Hadoop Stack
2011’s Total Stack 6 Tools
2011’s Still complicated team structure 6 TeamsExecutive Team / ManagementProject ManagementEngineeringITAnalystsBusinessUsers
2013
2013’s Issues with HadoopEasy to dump data into a hadoop data lake hard to manage data and extract value. C omplicated low level s etup &maintenance R equires experienced developmentteamsUltimately companies end up s ending datafrom Hadoop to S QL databas e for Analytics .Dead end!
2013’s - MPP Column Store in the Cloud- Redshift!Fast, affordable EDW on AWS - awesome! MPP ScalesFar less expensive than on -prem Column Store EDWFairly easy to resize clusters etc1 GB of data 0.05
2013 Cloud -Native Self Serve BIGoal: Allow both centralized control of data, but also selfserve to entire company.LookerMake data so accessible, it starts to change the culture atthe company to be more data driven. Us ing data to try to predict futureS ingle vers ion of the truthF ull data acces s ibilityS uper fas t, query directly agains t the DW H
2015
2015’s Challenges with Redshift“There’s a 99% chance thatthe default configurationwill not work for you!” Lars Kamphttps://www.intermix.io/ - https://www.quora.com/What -problems -have-you-faced -while-working -with -Amazon -Redshift
2015s Cloud -Native Column -store MPP Data Warehouses1. Separation of compute & storage2. Zero infrastructure management3. Structured & Unstructured data4. Instantly Scalable Compute
Separation of Compute & StorageQueryLeaderNodeData is stored in flatfiles in an Object Store(S3, Google CloudStorage, etc)“Infinite storage”Data is copied onto thenodes in the cluster atcompute time
No more queue issues!ETLRun many warehouseclusters off of samedata sets!FinanceBI
Elastic ComputeRe-size cluster in seconds!
Data SharingFivetranShare data acrosscompanies!Acme CoBob’s Plumbing
How does this affect ETL?
Recap of changesWarehouses20 0 0 OL AP20 0 6 On-premC olumn S tore MMPETLBI20 0 0 MonolithicR igid B I20 0 0 C us tom E T L20 0 8 S elf S erve B I20 11 Hadoop20 0 0 C loudC olumn S tore MMP20 15 C loud NativeC olumn S tore MMP20 13 C entralizedC loud Native S elfS erve B I?
Challenges with ETL from 2000’sETL was optimized for slow on -premise OLAP datawarehouses, with massive storage constraints.Optimized for pulling from on -premise enterpriseapplications
ExtensiveSetup
OngoingMaintenance
ExtensivePlanning
2015’s Shift in company structuresWith move to cloud, IT teams areshrinking.Analyst at the front of self serve BI andwant: S imple Infras tructure fully managed s ervices wholis tic control over s tack
Other changes since 2000Rise of cloud applicationsDrop in data storage 1GB 0.02Agile workflows
CLASSIC ETLExtract -Transform -Load -Visualize -MODERN DATA STACK- Extract- Load- Transform- Visualize
Agile Cloud -Native Self Serve AnalyticsFULL SCHEMA REPLICATION(DATA LAKE)- MODERN DATA STACK- Extract- LoadSTANDARDIZED SCHEMAS- TransformSQL-BASED MODELING &TRANSFORMATIONS- VisualizeCENTRALIZED &COLLABORATIVE
MODERN DATA STACKModularizereplication(Extraction & Load)from Transformation( Data gov )- Extract- Load- Transform- Visualize
MODERN DATA STACKSimplifies yourmanagement stack3 Teams- Extract- LoadProject ManagementAnalysts- TransformEngineeringITBusinessUsersExecutive - VisualizeTeam /Management
Recap of changesWarehouses20 0 0 OL AP20 0 6 On-premC olumn S tore MMPBI20 0 0 MonolithicR igid B IETL20 0 0 C us tom E T L20 0 8 S elf S erve B I20 11 Hadoop20 0 0 C loudC olumn S tore MMP20 15 C loud NativeC olumn S tore MMP20 13 C entralizedC loud Native S elfS erve B I20 15 E L TS eparate E L & T rans form
Zero Configuration, Zero Maintenance, Data Pipelines
Fivetran helps youachieve data acces s ibilitywith its zero configuration,zero maintenancedata pipelines
Data pipeline as a serviceAPPLICATIONSDATABASESFILESEVENTSYour warehouse
DatabasesApplicationsApple S earch AdsAs anaAdR ollB ing AdsB raintree P aymentsDes k.comDoubleC lickDynamics (3 6 5 , GP , AX)E loquaFacebook Ad Ins ightsFres hdes kFrontGithubGoogle AdwordsGoogle AnalyticsGoogle P layHelp S coutHubS n lNetSuite SuiteAnalyticsOptimizelyP ardotP interes t AdsQuickB ooks OnlineR eC hargeR ecurlyFor an updated list of data sources visit fivetran.com/directoryS ailthruSalesforceS ales forceIQS AP B us ines s OneS endGridS hopifyS tripeS ugarC R MT witter AdsXeroY ahoo GeminiZendeskZendes k C hat (Zopim)ZuoraAmazon AuroraAmazon R DSAzure S QL Databas eDynamoDBGoogle C loud S QLHerokuMariaDBMongoDBMySQLOracle DBPostgreSQLSQL ServerFilesAmazon C loudfrontAmazon K ines is F irehos eAmazon S3Azure B lob S torageC S V UploadDropboxE mail C S V Inges terFTPFTPSGoogle C loud S torageGoogle S heetsSFTPEventsGoogle Analytics 3 6 0S egmentS nowplowW ebhooks
Authenticate, and we do the rest.1Pull historical Data2Normalize3Create Schema/Tables& Load DataUpdate
Fivetran Data Normalization BehaviorSOURCEWe normalizeDenormalized Data(from APIs)We replicateNormalized Schemas(Databases, SFDC, Netsuite)WAREHOUSE
Standard Schemas - ERDs
Incremental Batch UpdatesSOURCEINSERTUPDATEDELETEWAREHOUSE
Automatic Schema MigrationsSOURCEADD COLUMNREMOVE COLUMNCHANGE TYPEADD OBJECTREMOVE OBJECTWAREHOUSE
Complexity compounds. Automate,s tandardize and s implify as much of yours tack as you can.
Recap of changesWarehouses20 0 0 OL AP20 0 6 On-premC olumn S tore MMPBI20 0 0 MonolithicR igid B IETL20 0 0 C us tom E T L20 0 8 S elf S erve B I20 11 Hadoop20 0 0 C loudC olumn S tore MMP20 15 C loud NativeC olumn S tore MMP20 13 C entralizedC loud Native S elfS erve B I20 15 E L TS eparate E L & T rans form
What’s coming next?Feedback, questions, thoughts?Taylor@fivetran.com
Zero Configuration, Zero Maintenance, Data Pipelines
The faster you see value, the more engaged your stakeholders will be Create a virtuous circle of improvement by spreading t