FIVE CHARACTERISTICS OF A MODERN DATA PIPELINE


Faster pipelines deliver insights faster

CONTENTS

Introduction
Continuous, extensible data processing
The elasticity and agility of the cloud
Isolated and independent resources for data processing
Democratized data and self-service management
High availability and disaster recovery
Your data pipeline checklist
A competitive advantage
About Snowflake

INTRODUCTION

Only robust end-to-end data pipelines properly equip organizations to source, collect, manage, analyze, and effectively use crucial data to generate new market opportunities and deliver cost-saving business processes. Traditional data pipelines are rigid, brittle, and difficult to change, and they do not support the constantly evolving data needs of today's organizations.

They present many challenges by:

• Taking significant time and cost to design and build
• Comprising multiple tools that are not compatible with one another and require unnecessary integration
• Requiring that data pipelines be built only by qualified IT professionals, whose skills are in short supply, thereby creating work bottlenecks
• Introducing avoidable latency that delays data extraction or transport through the pipeline
• Ignoring the demands of streaming data by handling only batch data loading
• Being rigid and therefore difficult to change and manage over time
• Relying on schema-dependent data loading processes

Modern data pipelines make it faster and easier to extract information from the data you collect. They start with extracting raw data from a number of different sources. The data is then collected and transported to a common place, typically a data repository in the cloud. From there, the data undergoes various transformations until it is usable for analytics and produces business value. The data is then loaded into a data warehouse, where it is easily managed and accessed by data science workloads, automated actions, and other computing jobs.

To address these issues, the best data pipelines have these five characteristics:

• Continuous and extensible data processing
• The elasticity and agility of the cloud
• Isolated and independent resources for data processing
• Democratized data access and self-service management
• High availability and disaster recovery

These characteristics enable organizations to leverage their data quickly, accurately, and efficiently to make quicker and better business decisions.
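To make that extract, collect, transform, and load sequence concrete, here is a minimal sketch in Snowflake-style SQL. Every object name (raw_stage, orders_raw, orders) is hypothetical, and a production pipeline would add credentials, error handling, and scheduling.

    -- Collect: reference raw files that have already landed in cloud storage
    -- (a hypothetical external stage; credentials omitted for brevity).
    CREATE STAGE raw_stage URL = 's3://example-bucket/raw/orders/';

    -- Load: copy the raw CSV files into a landing table.
    CREATE TABLE orders_raw (
      order_id    STRING,
      customer_id STRING,
      amount      NUMBER(10,2),
      ordered_at  TIMESTAMP_NTZ
    );
    COPY INTO orders_raw
      FROM @raw_stage
      FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

    -- Transform: produce an analytics-ready table for BI and data science.
    CREATE TABLE orders AS
      SELECT order_id, customer_id, amount, ordered_at
      FROM orders_raw
      WHERE amount > 0;  -- for example, drop obviously invalid records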

[Figure: Modern data pipeline architecture. Batch and streaming (web/log) data sources flow through a highly available data pipeline that feeds BI and data science workloads.]

FIVE CHARACTERISTICS OF A MODERN DATA PIPELINE

CONTINUOUS AND EXTENSIBLE DATA PROCESSING

The scourge of stale data

Traditionally, organizations extract and ingest data in prescheduled batches, typically once every hour or every night. But these batch-oriented extract, transform, and load (ETL) operations result in data that is hours or days old, which substantially reduces the value of data analytics and creates missed opportunities. Marketing campaigns that rely on even day-old data lose effectiveness. For example, an online retailer may fail to capture data that reveals a short-term buying spree for a certain type of product after a celebrity discusses, uses, or wears that product.

Many traditional ETL processes also need a dedicated service window, which often conflicts with existing service windows or, worse, does not exist at all. For global organizations active 24 hours a day, a nightly window for batch data processing is no longer realistic. And as data volume and complexity rise, it becomes difficult for ETL processes to finish within any dedicated window, causing ongoing performance issues and delaying key insights.

Immediately actionable insights

On the flip side, modern data pipelines continuously and incrementally perform the processes involved in loading data into a data warehouse, transforming it into a usable form, and then analyzing it. The business can still use analytics tools such as Tableau, Looker, or Microsoft Power BI to run queries and reports, and the results will be current and immediately actionable.

Continuous processing decreases latency at every step along the way and enables users and systems to work with data from a few minutes ago instead of a day ago. With this kind of continuous processing, security applications can target and resolve external threats on a more timely basis, for example.

Modern data pipelines can also incorporate and leverage custom code written outside the platform. Using APIs and pipelining tools, they can seamlessly stitch outside code into a data flow, avoiding complicated processes and maintaining low latency.

In terms of decision-making, managers and analysts can make better business decisions based on more-current data. Successful data-driven decision-making relies on the relevance of the data being analyzed, and just a few hours of lag time can make a major difference for organizations that must operate in real time. For example, fleet management and logistics companies need to correct dangerous driving behaviors and diagnose hazardous vehicle conditions before they cause an accident or an expensive breakdown.
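As an illustration of continuous, incremental loading, the sketch below uses Snowflake-style pipe syntax to ingest each new file shortly after it lands in cloud storage, rather than waiting for a nightly batch window. It assumes a hypothetical external stage (events_stage) wired to cloud storage event notifications; the pipe and table names are also hypothetical.

    -- A landing table for semi-structured event data.
    CREATE TABLE events_raw (v VARIANT);

    -- The pipe loads each new file soon after it arrives in the stage,
    -- replacing the nightly batch window with continuous micro-batches.
    CREATE PIPE events_pipe AUTO_INGEST = TRUE AS
      COPY INTO events_raw
      FROM @events_stage
      FILE_FORMAT = (TYPE = 'JSON');

Downstream transformations can then run incrementally against events_raw, so reports reflect data that is minutes old rather than a day old.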

THE ELASTICITY AND AGILITY OF THE CLOUD

The cost of rigidity

Legacy data pipelines run on premises using commodity hardware that is expensive to buy and manage. These solutions tend to be rigid and unable to scale easily. As a result, their creation and management involve significant upfront budgeting for peak-usage scenarios. When multiple workloads run concurrently, the competition for resources increases and performance degrades, hamstringing the business with delayed insights and soaring costs during peak times.

Flexibility equals growth

Modern data pipelines offer the instant elasticity of the cloud and a significantly lower cost structure, automatically scaling compute resources up and back as needed. They provide immediate, agile provisioning when data sets and workloads grow. These pipelines also simplify access to common shared data, and they enable businesses to quickly deploy entire pipelines without the limits of a hardware setup. The ability to dedicate independent compute resources to ETL workloads lets pipelines handle complex transformations without affecting the performance of other workloads.

With an elastic and agile data pipeline, businesses can better handle spikes and growth. For example, a seasonal business that experiences a sales spike during the holiday season can add storage and processing capacity within seconds, not days or weeks. Elastic pipelines are also primed to handle compliance requests and audits more quickly. For example, the European Union's General Data Protection Regulation (GDPR), which stipulates how organizations can collect, store, and transmit the personal data of EU residents, might require a business to run reports that demonstrate compliance. The flexibility of the cloud enables businesses to perform such analytics without impacting sales or customer service.
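A minimal sketch of that elasticity in Snowflake-style SQL: compute is provisioned on demand, suspends itself when idle, and can be resized in seconds for a seasonal spike. The warehouse name and sizes are illustrative.

    -- Provision compute that pauses when idle and resumes on demand,
    -- so capacity is paid for only while the pipeline is working.
    CREATE WAREHOUSE pipeline_wh WITH
      WAREHOUSE_SIZE = 'SMALL'
      AUTO_SUSPEND   = 60      -- suspend after 60 seconds of inactivity
      AUTO_RESUME    = TRUE;

    -- Holiday spike: scale up in seconds, not days or weeks...
    ALTER WAREHOUSE pipeline_wh SET WAREHOUSE_SIZE = 'XLARGE';

    -- ...and scale back down once the spike passes.
    ALTER WAREHOUSE pipeline_wh SET WAREHOUSE_SIZE = 'SMALL';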

ISOLATED AND INDEPENDENT RESOURCES FOR DATA PROCESSING

Sluggish performance

A common challenge with traditional ETL processes occurs when workloads compete for resources. Running multiple workloads in parallel on the same set of resources degrades the performance and response time of each, increasing the time from data collection to insight. In addition, most platforms, whether in the cloud or on premises, use an older "shared nothing" architecture that tightly couples storage, compute, and database services. This tight coupling hampers the database administrator's ability to elastically scale the database to store or analyze more data or to support more concurrent users.

Gaining speed and value

But imagine an architecture in which compute resources are separated into multiple independent clusters, the size and number of those clusters can grow and shrink instantly and nearly infinitely depending on the current load, and each cluster has access to the same shared data set that they jointly process, transform, and analyze. Thanks to cloud computing, such an architecture has become both crucial and cost-effective for today's organizations.

A modern data pipeline that features an elastic, multi-cluster, shared data architecture makes it possible to allocate multiple independent, isolated clusters for processing, data loading, transformation, and analytics while sharing the same data concurrently, without resource contention. Each cluster can read and write data with full transactional consistency, and its size and resources are based on the performance required for the workload at that time. Users can load data while it is being transformed and analyzed in other clusters, without impacting performance, because each workload has its own dedicated resources. Modern data pipelines also rely on loosely coupled components that physically separate but logically integrate storage, compute, and services such as metadata and user management. Because each component is separate, they expand and contract independently of one another.

Such an architecture frees organizations from painful trade-offs, such as being unable to load and process additional data sets due to the rigid capacity limits of a traditional pipeline, or running the risk of violating business SLAs when accommodating more use cases. An elastic, multi-cluster, shared data architecture also makes processing times predictable, because occasional spikes in data volume or load can be covered by instantly and elastically adding more resources.
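As a sketch of isolated, independent compute over shared data, the Snowflake-style SQL below gives loading and analytics their own warehouses; the second scales out to additional clusters as concurrency grows (multi-cluster scaling assumes an edition that supports it). Warehouse, stage, and table names are hypothetical, reusing the examples above.

    -- Dedicated compute for data loading and transformation.
    CREATE WAREHOUSE load_wh WITH WAREHOUSE_SIZE = 'MEDIUM';

    -- Separate compute for BI users; it adds clusters under heavy
    -- concurrency and contracts when demand falls.
    CREATE WAREHOUSE bi_wh WITH
      WAREHOUSE_SIZE    = 'LARGE'
      MIN_CLUSTER_COUNT = 1
      MAX_CLUSTER_COUNT = 4;

    -- Both warehouses work against the same shared tables, so loading
    -- never steals resources from the dashboards.
    USE WAREHOUSE load_wh;
    COPY INTO orders_raw FROM @raw_stage;

    USE WAREHOUSE bi_wh;
    SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id;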

DEMOCRATIZED DATA AND SELF-SERVICE MANAGEMENT

The inefficiency of ETL

With traditional solutions, the only way for multiple business applications to pull from centralized data is to invest in tools that extract data from data marts, transform it into the proper format for querying, and then load it into individual databases. ETL processes typically require a large set of external tools for extraction and ingestion. It often takes months for a team of experienced data engineers to set up such a process and integrate the tools, which creates bottlenecks, and still more time to establish the process required for ongoing maintenance.

Organizations often have to abandon important analytics projects because they lack the in-house expertise to create the data pipelines, or because the data is stale by the time the pipelines run. Much time is also lost conceptualizing how the data pipelines should look. In addition, traditional pipelines are unable to handle and process all types of data, whether structured, semi-structured, or unstructured.

Increased access, better insights

Modern data pipelines democratize data by increasing users' access to it and by making pipelines easier to conceptualize, create, and maintain. They also manage all types of data, including semi-structured and unstructured data. With true elasticity and workload isolation, and with advanced tools such as zero-copy cloning, users can more easily massage data to meet their needs. Businesses can use simple tools, such as SQL, to implement parts of a pipeline, making the creation, management, and monitoring of data processes largely self-service. That helps businesses directly investigate their data and discover insights, decreasing decision-making time and increasing business value.

HIGH AVAILABILITY AND DISASTER RECOVERY

The impact of downtime

If an outage occurs due to network issues, natural disasters, or viruses, the financial impact of downtime can be significant. Corporate and government mandates also require the durability and availability of data, and proven backup plans are necessary for compliance. However, fully restoring data and systems is time-consuming and carries the potential for lost revenue.

A more secure way

A modern data pipeline supported by a highly available, cloud-built environment provides quick recovery of data, no matter where the data resides or who the cloud provider is. If a disaster occurs in one region or with one cloud provider, organizations can immediately access and control the data they have replicated to a different region or a different cloud provider.
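As a sketch of cross-region disaster recovery in Snowflake-style SQL, the primary account below enables replication of a database to a secondary account in another region (or with another cloud provider), which maintains a replica it can refresh and fail over to. The organization, account, and database names are hypothetical.

    -- On the primary account: allow this database to be replicated
    -- to a secondary account in another region or cloud.
    ALTER DATABASE sales_db
      ENABLE REPLICATION TO ACCOUNTS myorg.dr_account;

    -- On the secondary account: create a local replica and keep it
    -- current so it is ready if the primary region goes down.
    CREATE DATABASE sales_db
      AS REPLICA OF myorg.primary_account.sales_db;
    ALTER DATABASE sales_db REFRESH;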

A COMPETITIVE ADVANTAGE

The massive enterprise shift to cloud-built software services, combined with ETL and data pipelines, offers organizations the potential to greatly improve and simplify their data processing.

Companies that currently rely on batch ETL processing can begin implementing continuous processing methodologies without disrupting their current processes. Instead of a costly rip-and-replace project, the implementation can be incremental and evolutionary, starting with certain types of data or areas of the business.

As you review the myriad data pipeline options available, consider that great data pipelines enable your business to gain a competitive advantage by making better, faster decisions. Just make sure your data pipeline provides continuous data processing; is elastic and agile; uses isolated, independent processing resources; increases data access; and is easy to set up and maintain.

Your data pipeline checklist

• Continuous and extensible data processing
• The elasticity and agility of the cloud
• Isolated and independent resources for data processing
• Democratized data access and self-service management
• High availability and disaster recovery

Learn more about Snowflake Data Pipelines.

ABOUT SNOWFLAKE

Snowflake is the only data warehouse built for the cloud, enabling the data-driven enterprise with instant elasticity, secure data sharing, and per-second pricing across multiple clouds. Snowflake combines the power of data warehousing, the flexibility of big data platforms, and the elasticity of the cloud at a fraction of the cost of traditional solutions. Snowflake: Your data, no limits. Find out more at snowflake.com.

© 2019 Snowflake, Inc. All rights reserved. snowflake.com #YourDataNoLimits
