DEEP DIVE - DATAVERSITY

Transcription

DEEP DIVE
Modern Data Pipelines: Improving Speed, Governance and Analysis
www.dmradio.biz

Featured Speakers

Constraints Drive Design
- When conditions change, objectives must
- Highway design commensurate with traffic
- No more Moore, massive parallelism better?
- Maybe some applications really should die
- No time like the present to begin anew!

Architecture Matters

Engineering Revolution Enabled Skyscrapers
By Joe Mabel

Adopted by the EU, but affects the USA

Don’t Forget the Basics: Costs Matter!
- Modern solutions have cost structures too
- Project planning will always be a moving target
- Build in some financial buffers to help ensure long-term success

Share your plans with key stakeholders!
Good communication helps to ensure success!

Process Matters: Continuous Improvement Is Key
- The faster you see value, the more engaged your stakeholders will be
- Create a virtuous circle of improvement by spreading the wealth
- Evangelize success stories; pat your users on the back whenever appropriate
- Starting small is important, but have a long-term plan in mind; this can always change
- “Say yes” whenever possible, even if it’s a tentative “yes” for the near term

In search of the perfect data stack
A brief history of Data Warehousing, ETL, BI, and Data Governance

OVERVIEW OF PRESENTATION
- History: looking at trends
- Dates are roughly stated
TAYLOR BROWN
COO & Cofounder
taylor@fivetran.com

2000’s

2000’s Data Stack
http://www.bogotobogo.com/Hadoop/BigData hadoop OLTP vs OLAP.php

2000’s Warehouse - OLAP Cubes
- Fast option for analytics vs OLTP Databases
- Generally slow and expensive infrastructure
- Cost for 1GB
The -OLAP-cube-is-history/ba -p/231673#.WynTwhJKjMU

2000’s Data Pipelines - ETL
Extract, Transformation & Load
Informatica or custom code
- Heavily customized
- Type, column, table mapping
- Transform data prior to load
- Aggregations performed in pipeline

2000’s Data Governance
- Hardened systems
- Centralized planning
- Good Data Governance

2000’s BI Tools
Heavy Monolithic BI tools for Reporting
Cognos, Hyperion, Microstrategy
- What happened in past?
- Very accurate
- Very inflexible
- Hardened systems
- Months to change
https://www.element61.be/en/resource/sap-business-warehouse-business-objects-front-end-integration-what-available-today

2000s Total Stack: 5 Tools
http://www.bogotobogo.com/Hadoop/BigData hadoop OLTP vs OLAP.php

2000s Team Structure: 6 Teams
- Executive Team / Management
- Project Management
- Engineering
- IT
- Analysts
- Business Users

2006

2006’s Challenges with OLAP
- Data Availability
- Inflexibility
- Speed
- Compromise with end users
- Data volumes

2006’s Warehouse - Column Store MPP On-Prem DB
- Each node has a portion of the data (sharded tables!)
- Column-Store designed for analytical queries
- Massively Parallel Processing (MPP): queries (jobs) are divided up between the nodes in the cluster, and each one does a portion of the work
- Teradata, HP Vertica, IBM Netezza, Oracle Exadata
Diagram: Query, Leader Node, Follower Nodes
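
The leader/follower split on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample rows, the round-robin sharding rule, and the function names (shard, follower_partial_sum, leader_query) are hypothetical and not taken from any of the products listed. Each "follower" aggregates its own shard, and the "leader" merges the partial results.

```python
from collections import defaultdict

# Hypothetical rows of a sales table: (region, amount).
ROWS = [("east", 10), ("west", 5), ("east", 7), ("north", 3), ("west", 8), ("east", 1)]

def shard(rows, n_nodes):
    """Distribute rows round-robin so each follower node holds a portion of the table."""
    shards = [[] for _ in range(n_nodes)]
    for i, row in enumerate(rows):
        shards[i % n_nodes].append(row)
    return shards

def follower_partial_sum(rows):
    """Work done on one follower node: aggregate only its local shard."""
    partial = defaultdict(int)
    for region, amount in rows:
        partial[region] += amount
    return dict(partial)

def leader_query(shards):
    """Leader node: fan the query out to every shard, then merge the partial results."""
    total = defaultdict(int)
    for shard_rows in shards:
        for region, subtotal in follower_partial_sum(shard_rows).items():
            total[region] += subtotal
    return dict(total)

result = leader_query(shard(ROWS, n_nodes=3))
```

In a real cluster the per-shard work runs in parallel on separate machines; the loop here runs it sequentially to keep the sketch self-contained.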

2006’s Stack

2006’s 5 Tools & 6 Teams
- Executive Team / Management
- Project Management
- Engineering
- IT
- Analysts
- Business Users

2008

2008’s Self Service BI
Asking of data, why did this happen?
Tableau, Qlik
- Drill down
- Explore
- Still in data silos
- Multiple versions of truth

2008’s Data Governance
More data, more consumers. More complex data. Multiple versions of the same truth. Decentralized BI tools.
Herding Cats!

2011

2011’s Challenges with on-prem MPP Column store warehouses
- Variety of Data
- Variety of Analytics

2011’s Hadoop to the Rescue!
Built to:
- Scale to Big Data
- Handle all forms of data
- Allow any type of analytics

2011’s Hadoop Stack

2011’s Total Stack: 6 Tools

2011’s Still complicated team structure: 6 Teams
- Executive Team / Management
- Project Management
- Engineering
- IT
- Analysts
- Business Users

2013

2013’s Issues with Hadoop
Easy to dump data into a Hadoop data lake, hard to manage data and extract value.
- Complicated low-level setup & maintenance
- Requires experienced development teams
Ultimately companies end up sending data from Hadoop to a SQL database for analytics.
Dead end!

2013’s - MPP Column Store in the Cloud - Redshift!
Fast, affordable EDW on AWS: awesome!
- MPP scales
- Far less expensive than on-prem Column Store EDW
- Fairly easy to resize clusters etc
- 1 GB of data: 0.05

2013 Cloud-Native Self Serve BI
Goal: Allow both centralized control of data and self-serve access for the entire company.
Looker
Make data so accessible, it starts to change the culture at the company to be more data driven.
- Using data to try to predict future
- Single version of the truth
- Full data accessibility
- Super fast, query directly against the DWH

2015

2015’s Challenges with Redshift
“There’s a 99% chance that the default configuration will not work for you!”
Lars Kamp
https://www.intermix.io/ - https://www.quora.com/What-problems-have-you-faced-while-working-with-Amazon-Redshift

2015s Cloud-Native Column-store MPP Data Warehouses
1. Separation of compute & storage
2. Zero infrastructure management
3. Structured & Unstructured data
4. Instantly Scalable Compute

Separation of Compute & Storage
- Data is stored in flat files in an Object Store (S3, Google Cloud Storage, etc): “Infinite storage”
- Data is copied onto the nodes in the cluster at compute time
Diagram: Query, Leader Node

No more queue issues!
Run many warehouse clusters off of the same data sets!
Diagram: ETL, Finance, BI

Elastic Compute
Re-size cluster in seconds!

Data Sharing
Share data across companies!
Diagram: Fivetran, Acme Co, Bob’s Plumbing

How does this affect ETL?

Recap of changes
Warehouses: 2000 OLAP; 2006 On-prem Column Store MPP; 2011 Hadoop; 2013 Cloud Column Store MPP; 2015 Cloud Native Column Store MPP
BI: 2000 Monolithic Rigid BI; 2008 Self Serve BI; 2013 Centralized Cloud Native Self Serve BI
ETL: 2000 Custom ETL; ?

Challenges with ETL from 2000’s
- ETL was optimized for slow on-premise OLAP data warehouses, with massive storage constraints.
- Optimized for pulling from on-premise enterprise applications

Extensive Setup

Ongoing Maintenance

Extensive Planning

2015’s Shift in company structures
With the move to cloud, IT teams are shrinking.
Analysts are at the front of self-serve BI and want:
- Simple infrastructure
- Fully managed services
- Holistic control over stack

Other changes since 2000
- Rise of cloud applications
- Drop in data storage cost (1GB: 0.02)
- Agile workflows

CLASSIC ETL
- Extract
- Transform
- Load
- Visualize
MODERN DATA STACK
- Extract
- Load
- Transform
- Visualize
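
The difference in step order can be sketched with Python's built-in sqlite3 module standing in for the warehouse. This is a minimal illustration, not how any particular vendor implements it: the SOURCE rows and the table and view names (etl_revenue, raw_orders, elt_revenue) are hypothetical. In the ETL path the aggregation happens in the pipeline and only the summary is loaded; in the ELT path the raw rows are loaded unchanged and the transformation is a SQL view inside the warehouse.

```python
import sqlite3

# Hypothetical raw source rows: (order_id, customer, amount_cents).
SOURCE = [(1, "acme", 1250), (2, "acme", 800), (3, "bobs", 400)]

def classic_etl(conn):
    """ETL: aggregate in the pipeline, then load only the summary table."""
    totals = {}
    for _, customer, cents in SOURCE:
        totals[customer] = totals.get(customer, 0) + cents
    conn.execute("CREATE TABLE etl_revenue (customer TEXT, total_cents INTEGER)")
    conn.executemany("INSERT INTO etl_revenue VALUES (?, ?)", sorted(totals.items()))

def modern_elt(conn):
    """ELT: load raw rows unchanged, then transform with SQL in the warehouse."""
    conn.execute(
        "CREATE TABLE raw_orders (order_id INTEGER, customer TEXT, amount_cents INTEGER)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", SOURCE)
    conn.execute(
        "CREATE VIEW elt_revenue AS "
        "SELECT customer, SUM(amount_cents) AS total_cents "
        "FROM raw_orders GROUP BY customer"
    )

conn = sqlite3.connect(":memory:")
classic_etl(conn)
modern_elt(conn)
etl_rows = conn.execute("SELECT * FROM etl_revenue ORDER BY customer").fetchall()
elt_rows = conn.execute("SELECT * FROM elt_revenue ORDER BY customer").fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
```

Both paths produce the same summary, but only the ELT path keeps the raw rows in the warehouse, so a new question can be answered with a new view instead of a pipeline change.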

Agile Cloud-Native Self Serve Analytics
MODERN DATA STACK
- Extract: FULL SCHEMA REPLICATION (DATA LAKE)
- Load: STANDARDIZED SCHEMAS
- Transform: SQL-BASED MODELING & TRANSFORMATIONS
- Visualize: CENTRALIZED & COLLABORATIVE

MODERN DATA STACK
Modularize replication (Extraction & Load) from Transformation (Data gov)
- Extract
- Load
- Transform
- Visualize

MODERN DATA STACK
Simplifies your management stack: 3 Teams
- Extract
- Load
- Transform
- Visualize
Teams shown: Project Management, Analysts, Engineering, IT, Business Users, Executive Team / Management

Recap of changes
Warehouses: 2000 OLAP; 2006 On-prem Column Store MPP; 2011 Hadoop; 2013 Cloud Column Store MPP; 2015 Cloud Native Column Store MPP
BI: 2000 Monolithic Rigid BI; 2008 Self Serve BI; 2013 Centralized Cloud Native Self Serve BI
ETL: 2000 Custom ETL; 2015 ELT (separate EL & Transform)

Zero Configuration, Zero Maintenance, Data Pipelines

Fivetran helps you achieve data accessibility with its zero configuration, zero maintenance data pipelines

Data pipeline as a service
APPLICATIONS, DATABASES, FILES, EVENTS -> Your warehouse

Applications: Apple Search Ads, Asana, AdRoll, Bing Ads, Braintree Payments, Desk.com, DoubleClick, Dynamics (365, GP, AX), Eloqua, Facebook Ad Insights, Freshdesk, Front, Github, Google Adwords, Google Analytics, Google Play, Help Scout, HubSpot, NetSuite SuiteAnalytics, Optimizely, Pardot, Pinterest Ads, QuickBooks Online, ReCharge, Recurly, Sailthru, Salesforce, SalesforceIQ, SAP Business One, SendGrid, Shopify, Stripe, SugarCRM, Twitter Ads, Xero, Yahoo Gemini, Zendesk, Zendesk Chat (Zopim), Zuora
Databases: Amazon Aurora, Amazon RDS, Azure SQL Database, DynamoDB, Google Cloud SQL, Heroku, MariaDB, MongoDB, MySQL, Oracle DB, PostgreSQL, SQL Server
Files: Amazon Cloudfront, Amazon Kinesis Firehose, Amazon S3, Azure Blob Storage, CSV Upload, Dropbox, Email CSV Ingester, FTP, FTPS, Google Cloud Storage, Google Sheets, SFTP
Events: Google Analytics 360, Segment, Snowplow, Webhooks
For an updated list of data sources visit fivetran.com/directory

Authenticate, and we do the rest.
1. Pull historical data
2. Normalize
3. Create Schema/Tables & Load Data
Update

Fivetran Data Normalization Behavior
SOURCE -> WAREHOUSE
- We normalize denormalized data (from APIs)
- We replicate normalized schemas (Databases, SFDC, Netsuite)
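
As a rough illustration of what "normalizing denormalized data from APIs" can mean, the sketch below flattens a nested API response into three relational tables. It is not Fivetran's implementation: the payload shape, the table names (customer, order, order_item), and the normalize function are all hypothetical.

```python
# Hypothetical denormalized API response: each order embeds its customer and line items.
API_ORDERS = [
    {"id": 1, "customer": {"id": 10, "name": "Acme Co"},
     "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]},
    {"id": 2, "customer": {"id": 10, "name": "Acme Co"},
     "items": [{"sku": "A", "qty": 5}]},
]

def normalize(orders):
    """Split nested order documents into customer, order, and order_item rows."""
    customers = {}   # keyed by customer id so repeated customers are stored once
    order_rows = []
    item_rows = []
    for order in orders:
        cust = order["customer"]
        customers[cust["id"]] = {"id": cust["id"], "name": cust["name"]}
        order_rows.append({"id": order["id"], "customer_id": cust["id"]})
        for item in order["items"]:
            item_rows.append(
                {"order_id": order["id"], "sku": item["sku"], "qty": item["qty"]}
            )
    return {
        "customer": list(customers.values()),
        "order": order_rows,
        "order_item": item_rows,
    }

tables = normalize(API_ORDERS)
```

The repeated customer collapses to a single row, and each child row carries a foreign key back to its parent, which is what lets the warehouse join the tables later.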

Standard Schemas - ERDs

Incremental Batch Updates
SOURCE -> WAREHOUSE
- INSERT
- UPDATE
- DELETE
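
A minimal sketch of applying one incremental batch, assuming changes arrive as (operation, primary key, row) tuples. The change format, the in-memory dict standing in for a warehouse table, and the hard-delete behavior are assumptions made for illustration; the slide does not say how deletes are represented in the warehouse.

```python
def apply_batch(warehouse, changes):
    """Apply one batch of source changes to a warehouse table keyed by primary key."""
    for op, key, row in changes:
        if op in ("INSERT", "UPDATE"):
            # Upsert, so replaying the same batch leaves the table unchanged.
            warehouse[key] = {**warehouse.get(key, {}), **row}
        elif op == "DELETE":
            warehouse.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return warehouse

# Hypothetical table state before the batch.
warehouse = {1: {"name": "Acme Co", "plan": "free"}}
batch = [
    ("INSERT", 2, {"name": "Bob's Plumbing", "plan": "free"}),
    ("UPDATE", 1, {"plan": "pro"}),
    ("DELETE", 2, None),
]
apply_batch(warehouse, batch)
```

Only the rows that changed since the last sync are touched, which is the point of incremental updates compared with reloading the whole table.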

Automatic Schema Migrations
SOURCE -> WAREHOUSE
- ADD COLUMN
- REMOVE COLUMN
- CHANGE TYPE
- ADD OBJECT
- REMOVE OBJECT
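
One way to picture the column-level cases (ADD COLUMN, REMOVE COLUMN, CHANGE TYPE) is as a diff between the source schema and the warehouse schema. The sketch below is an assumption-laden illustration: plan_migration and the example schemas are hypothetical, the generated statements use PostgreSQL-style ALTER TABLE syntax, and adding or removing whole objects is not covered.

```python
def plan_migration(table, source_cols, warehouse_cols):
    """Return ALTER TABLE statements that bring the warehouse table in line with the source."""
    statements = []
    for name, col_type in source_cols.items():
        if name not in warehouse_cols:
            statements.append(f"ALTER TABLE {table} ADD COLUMN {name} {col_type}")
        elif warehouse_cols[name] != col_type:
            statements.append(f"ALTER TABLE {table} ALTER COLUMN {name} TYPE {col_type}")
    for name in warehouse_cols:
        if name not in source_cols:
            statements.append(f"ALTER TABLE {table} DROP COLUMN {name}")
    return statements

# Hypothetical schemas: column name -> type.
source = {"id": "INTEGER", "email": "TEXT", "amount": "NUMERIC"}
warehouse = {"id": "INTEGER", "amount": "INTEGER", "legacy_flag": "BOOLEAN"}
plan = plan_migration("orders", source, warehouse)
```

Running the same diff on every sync means a schema change in the source shows up in the warehouse without anyone editing a mapping by hand.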

Complexity compounds. Automate, standardize and simplify as much of your stack as you can.

Recap of changes
Warehouses: 2000 OLAP; 2006 On-prem Column Store MPP; 2011 Hadoop; 2013 Cloud Column Store MPP; 2015 Cloud Native Column Store MPP
BI: 2000 Monolithic Rigid BI; 2008 Self Serve BI; 2013 Centralized Cloud Native Self Serve BI
ETL: 2000 Custom ETL; 2015 ELT (separate EL & Transform)

What’s coming next?
Feedback, questions, thoughts?
Taylor@fivetran.com

Zero Configuration, Zero Maintenance, Data Pipelines
