7 SNOWFLAKE REFERENCE ARCHITECTURES FOR APPLICATION BUILDERS

For every data app use case, there is a modern data architecture. Discover yours.

EBOOK

Why your data platform matters
Serverless data stack reference architecture
Streaming data stack reference architecture
Machine learning and data science reference architecture
Application health and security analytics
IoT reference architecture
Customer 360 reference architecture
Embedded analytics reference architecture
Future-proof your applications
About Snowflake

WHY YOUR DATA PLATFORM MATTERS

It’s safe to say data application builders will never worry about a lack of data. Approximately 40 zettabytes (ZB) of new data was generated in 2019, and IDC predicts that with a steady growth trajectory, 175 ZB will be generated in 2025. Although these ever-increasing amounts of data present immeasurable opportunities for delivering data-driven insights to customers, there are three crucial questions every startup and established ISV provider should ask:

CAN OUR UNDERLYING ARCHITECTURE SCALE TO MEET THE NEEDS OF OUR FAST-GROWTH BUSINESS?

CAN OUR PRODUCT INGEST AND ANALYZE LARGE AMOUNTS OF STRUCTURED AND SEMI-STRUCTURED DATA TOGETHER?

MOST IMPORTANTLY, CAN WE ACCOMPLISH THESE GOALS WHILE REMAINING OPERATIONALLY EFFICIENT AND COST-EFFECTIVE?

CHAMPION GUIDES

The questions above highlight the intrinsic need for a data stack architecture that has scalability, connectivity, and support for all data types built into its design. That means selecting cloud-built infrastructure components, the most important of which is your data platform.

Today, too many organizations are burdened by infrastructure costs that arise from traditional architectures. When companies can achieve scalability only by throwing more resources at the problem, they face an expensive and never-ending problem. Traditional architectures are also riddled with operational overhead in the form of maintenance and tuning, which wastes valuable engineering time and slows growth.

As the central hub for all things data, only a cloud data platform can deliver the performance and nearly infinite autoscaling needed to launch and scale applications quickly and cost-effectively. Here’s what the Snowflake Cloud Data Platform provides:

SQL for all data
Snowflake ingests JSON, Avro, Parquet, and other data without transformations or requiring pipeline fixes every time the schema changes. With ANSI SQL, Snowflake enables your teams to query semi-structured data just as easily as structured data.

No Site Reliability Engineering/DevOps burden
As a near-zero management platform, Snowflake automatically handles provisioning, availability, tuning, data protection, and other operations, which enables you to focus on your own application rather than maintenance. Snowflake also ensures seamless connections to third-party platforms and APIs, easily fitting in with your existing environment.

High performance and unlimited concurrency
Through a multi-cluster, shared data architecture, Snowflake spins up dedicated compute clusters that support a nearly unlimited number of concurrent workloads on shared tables. There’s never contention for resources or an unhappy user.

Scalability with true elasticity
Snowflake compute resources scale up and down automatically to deliver on-demand high performance that’s cost-effective.

This ebook provides detailed reference architectures for seven use cases and design patterns, and it demonstrates the importance of a cloud-built data platform that matches scalability and connectivity expectations, both today and in the future.

SERVERLESS DATA STACK REFERENCE ARCHITECTURE

OBJECTIVE
Build data-intensive applications that run on serverless infrastructures.

Diagram components:
Client-side apps; backend apps and services
API gateway service: Amazon API Gateway, Azure API Management, Google Cloud Endpoints, Apigee API Platform
Serverless compute: AWS Lambda, Azure Functions, Google Cloud Functions
NoSQL/OLTP DB: Amazon Aurora Serverless, Azure Cosmos DB, Google Cloud Datastore
Serverless ETL: AWS Step Functions, Azure Data Factory, Google Cloud Composer, Google Cloud Dataflow
Snowflake: Native JSON Support, Zero Management, Workload Isolation

DESCRIPTION
1. The client-side app, running on mobile or web devices, invokes the application logic on the serverless compute via an API gateway service. The gateway authenticates the API calls and throttles them, based on SLAs.
2. Serverless compute runs the application logic and scales on demand, without the need to provision or manage servers. The application queries Snowflake data (5) for runtime decisions, such as delivering product recommendations or powering a dashboard for analysis.
3. An OLTP or NoSQL database provides the application with high-capacity transaction processing. This NoSQL/OLTP database can also be a serverless service.
4. An ETL serverless stack orchestrates the workflow and loads transaction data into Snowflake.
5. Snowflake ingests data in batches or in streams and makes it available to the application for queries. Snowflake scales automatically to keep pace with the data pipeline and ensure data is always fresh. Workloads are isolated in virtual warehouses where they can run and scale concurrently without resource contention. Native JSON support enables easy ingestion and querying of flexible schema data alongside structured data.
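The SLA-based throttling performed by the API gateway can be pictured as one token bucket per client. The following is a minimal sketch, not a description of how any particular gateway is implemented: the tier names, rates, and the `TokenBucket` and `bucket_for` helpers are assumptions made for illustration, and managed gateways typically expose throttling as configuration rather than application code.

```python
# Hypothetical SLA tiers: (sustained requests per second, burst capacity).
SLA_TIERS = {"free": (1.0, 2), "premium": (10.0, 20)}


class TokenBucket:
    """Allow a request only when a token is available; tokens refill over time."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum number of stored tokens (burst)
        self.tokens = float(capacity)
        self.updated = now

    def allow(self, now):
        # Refill according to the time elapsed since the previous call.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def bucket_for(tier, now=0.0):
    """Build a bucket from the (assumed) SLA tier table."""
    rate, burst = SLA_TIERS[tier]
    return TokenBucket(rate, burst, now)


# Three calls at t=0 exhaust the free-tier burst of 2; one token is back at t=1.
bucket = bucket_for("free")
decisions = [bucket.allow(now=t) for t in (0.0, 0.0, 0.0, 1.0)]
# decisions == [True, True, False, True]
```

Passing the clock value in explicitly keeps the sketch deterministic; a real deployment would read the time from the platform and keep one bucket per API key.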

STREAMING DATA STACK REFERENCE ARCHITECTURE

OBJECTIVE
Build data-intensive applications that rely on streaming data ingestion and analysis.

Diagram components:
Producer app; in-app analytics
Streaming services: Amazon Kinesis, Cloud Pub/Sub, Azure Event Hub
Cloud object storage: Amazon S3, Azure Blob Storage, Google Cloud Storage
Snowflake: Auto Ingestion Using Snowpipe, Transformation Using Streams & Tasks, Instant Scalability

DESCRIPTION
1. The producer application generates continuous data that the streaming service ingests and buffers to account for data rate differences between the producer and consumers. Depending on the application’s needs, Snowflake ingests data directly from the streaming service or via cloud object storage (2).
2. In cases where the application requires raw data to persist in cloud object storage, the streaming service processes the raw data and batches it into larger chunks, thus lowering the API storage expenses. When Amazon Kinesis is used as the streaming service, data is staged in cloud object storage before ingestion.
3. Snowflake ingests data from the streaming service into a staging table and stores the streamed data for analysis. Its Streams and Tasks features detect data changes and schedule tasks to perform any required transformations. Multiple streams and tasks can be chained to implement a complex data pipeline. Snowpipe with Auto-Ingest automates the data ingestion from cloud object storage.
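Batching streamed records into larger chunks before they reach object storage trades a little latency for fewer, larger writes. Here is a minimal sketch of that idea, assuming a hypothetical `batch_records` helper and limits chosen only for illustration; managed streaming services perform this buffering for you.

```python
import json
from typing import Iterable, Iterator


def batch_records(records: Iterable[dict], max_records: int = 3,
                  max_bytes: int = 1024) -> Iterator[str]:
    """Group records into newline-delimited JSON chunks.

    A chunk is emitted before adding a record that would exceed either limit,
    so each chunk can be written as one storage object instead of many.
    """
    lines, size = [], 0
    for record in records:
        line = json.dumps(record, sort_keys=True)
        line_size = len(line.encode("utf-8")) + 1  # +1 for the newline
        if lines and (len(lines) >= max_records or size + line_size > max_bytes):
            yield "\n".join(lines) + "\n"
            lines, size = [], 0
        lines.append(line)
        size += line_size
    if lines:  # flush the final, possibly partial, chunk
        yield "\n".join(lines) + "\n"


# Seven events become three chunks (3 + 3 + 1 records) instead of seven writes.
events = [{"device": i, "value": i * 10} for i in range(7)]
chunks = list(batch_records(events))
```

Newline-delimited JSON is used here because it is easy to split and load; the format is an assumption of the sketch, not a requirement stated in this ebook.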

MACHINE LEARNING AND DATA SCIENCE REFERENCE ARCHITECTURE

OBJECTIVE
Train machine learning (ML) models to build predictive applications, such as recommendation engines.

Diagram components:
Apps and microservices
Streaming services: Amazon Kinesis
Cloud object storage
Snowflake: Zero-copy Clones for feature engineering, Streams & Tasks, Query Data in Object Storage via External Tables
Model training: automated training platforms, custom training platforms, and machine learning libraries (services named in the diagram: Amazon SageMaker, Google ML Engine, Azure ML Service)
Model deployment for batch/real-time prediction

DESCRIPTION
1. The application produces training data, which Snowflake (3) ingests via the streaming service or via cloud object storage (2). The streaming service buffers the training data to ensure reliable and continuous ingestion.
2. When cloud object storage is used, the streaming service batches training data into larger chunks to lower the API storage expenses.
3. Snowflake ingests data into a staging table. When new data is detected, the Streams and Tasks features schedule the required transformations. Multiple streams and tasks can be chained to implement a complex data pipeline. External Tables support queries of data in cloud object storage without ingestion. Data scientists can create zero-copy clones of the training data to support feature engineering and experimentation.
4. Using the data stored in Snowflake, data scientists train models with ML platforms and available libraries. Once the model artifacts are trained, they are deployed on the training platforms or in a separate process (5) to support predictions.
5. The application performs predictions in real time or schedules batch predictions using the deployed models. For batch predictions, data is read from an input table in Snowflake, and the results are stored in an output table where they are available to the application. In cases where subsecond response time is required, predictions can also be performed using input data from the streaming service.
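The batch prediction path reads rows from an input table, scores them, and writes results to an output table. The sketch below illustrates that loop under loud assumptions: Python's built-in `sqlite3` stands in for Snowflake so the example runs locally, `score` is a hypothetical placeholder for a trained model, and the table and column names are invented for illustration.

```python
import sqlite3


def score(recency_days: float, purchases: int) -> float:
    """Hypothetical stand-in for a deployed model's predict call."""
    return round(purchases / (1.0 + recency_days), 3)


def run_batch_predictions(conn: sqlite3.Connection) -> int:
    """Score every input row and upsert the result; returns rows written."""
    rows = conn.execute(
        "SELECT customer_id, recency_days, purchases FROM prediction_input"
    ).fetchall()
    results = [(cid, score(rec, pur)) for cid, rec, pur in rows]
    conn.executemany(
        "INSERT OR REPLACE INTO prediction_output (customer_id, score) "
        "VALUES (?, ?)",
        results,
    )
    conn.commit()
    return len(results)


# Local stand-in tables (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE prediction_input "
    "(customer_id INTEGER PRIMARY KEY, recency_days REAL, purchases INTEGER)"
)
conn.execute(
    "CREATE TABLE prediction_output (customer_id INTEGER PRIMARY KEY, score REAL)"
)
conn.executemany(
    "INSERT INTO prediction_input VALUES (?, ?, ?)", [(1, 0.0, 4), (2, 3.0, 2)]
)
written = run_batch_predictions(conn)
```

Keying the output table on `customer_id` makes the job safe to rerun on a schedule, since a repeated run overwrites earlier scores instead of duplicating them.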

APPLICATION HEALTH AND SECURITY ANALYTICS REFERENCE ARCHITECTURE

OBJECTIVE
Analyze large volumes of log data to identify security threats and monitor application health.

Diagram components:
App/infrastructure logs
Log collection and aggregation systems: AWS CloudTrail
Streaming services: Amazon Kinesis, Cloud Pub/Sub, Azure Event Hub
Cloud object storage: Blob Storage
Snowflake: Snowpipe with Auto-Ingest, cost-effective log storage, scheduled tasks that invoke SQL-based checks
Rule/ML-based alerting system (e.g. SnowAlert)
Message service: Amazon SNS (email, SMS/push)
Monitoring dashboards

DESCRIPTION
1. The application and its infrastructure log large volumes of event data that can be used to monitor application health and detect malicious behavior. Log collection and aggregation systems centralize log data from multiple sources and deliver it to a streaming service (2) or to cloud object storage (3).
2. The streaming service buffers log data to ensure reliable and continuous ingestion.
3. Depending on which log collector and aggregation system is used, data can be staged in cloud object storage without the need for a streaming service.
4. Snowflake stores and analyzes the log data, which can be saved for long periods at commodity storage prices. Snowpipe with Auto-Ingest automates the ingestion from cloud object storage. Scheduled tasks invoke SQL-based queries to detect suspicious behavior or application health concerns.
5. External rule-based alerting systems, such as SnowAlert, can detect suspicious activity or health concerns. Operations teams can monitor the application via dashboards or ad hoc queries.
6. A messaging service uses email, SMS, or push notifications to notify operations teams of events that require attention.
7. SIEM systems can leverage data in Snowflake for advanced searching and alerting capabilities.
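A scheduled SQL-based check is, at its core, a query that returns rows only when something needs attention. The following sketch shows one such check for repeated failed logins. It is illustrative only: `sqlite3` stands in for Snowflake so the example runs locally, the `auth_logs` schema, window, and threshold are assumptions, and `notify` is a hypothetical hook where a messaging service would be called.

```python
import sqlite3

# Flags source IPs with too many failed logins inside a recent time window.
FAILED_LOGIN_CHECK = """
    SELECT source_ip, COUNT(*) AS failures
    FROM auth_logs
    WHERE outcome = 'FAILURE' AND ts >= :window_start
    GROUP BY source_ip
    HAVING COUNT(*) >= :threshold
    ORDER BY failures DESC, source_ip
"""


def run_check(conn, now, window_seconds=300, threshold=3):
    """Return (source_ip, failures) pairs that breach the threshold."""
    params = {"window_start": now - window_seconds, "threshold": threshold}
    return conn.execute(FAILED_LOGIN_CHECK, params).fetchall()


def notify(alerts):
    """Hypothetical hook: format messages a messaging service could deliver."""
    return [f"ALERT {ip}: {count} failed logins" for ip, count in alerts]


# Assumed log schema with epoch-second timestamps.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE auth_logs (ts INTEGER, source_ip TEXT, outcome TEXT)")
conn.executemany("INSERT INTO auth_logs VALUES (?, ?, ?)", [
    (1000, "10.0.0.5", "FAILURE"), (1010, "10.0.0.5", "FAILURE"),
    (1020, "10.0.0.5", "FAILURE"), (1030, "10.0.0.9", "FAILURE"),
    (1040, "10.0.0.9", "SUCCESS"), (100, "10.0.0.9", "FAILURE"),
])
alerts = run_check(conn, now=1100)
messages = notify(alerts)
```

In the architecture above, the scheduling would come from a task running inside the data platform; the function call here simply plays the role of one scheduled run.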

IOT REFERENCE ARCHITECTURE

OBJECTIVE
Build applications that analyze large volumes of time-series data from IoT devices and respond in real time.

Diagram components:
IoT devices
IoT message broker (MQTT): AWS IoT Core, Azure IoT Hub, Cloud IoT Core, HiveMQ
Streaming services: Amazon Kinesis, Cloud Pub/Sub, Azure Event Hub
Cloud object storage: Amazon S3, Azure Blob Storage, Google Cloud Storage
Snowflake: Native JSON Support, Aggregation Using Streams & Tasks, Time-series Optimized Data Ingestion with Snowpipe, Query Data in Object Storage via External Tables
IoT rules engine; IoT analytics

DESCRIPTION
1. Smart devices, sensors, and other IoT devices generate continuous data.
2. Due to frequently unreliable internet connectivity, IoT devices communicate using the MQTT protocol and an IoT message broker. The message broker uses a publish and subscribe mechanism to interact with other services, which subscribe to specific topics within the broker to access device data.
3. A streaming service is used to ingest and buffer real-time device data, thus ensuring reliable ingestion and delivery to a staging table in Snowflake (5).
4. In cases where the application requires it, cloud object storage is used to stage batch data prior to ingestion. For example, minute-by-minute data may be stored in cloud object storage, whereas aggregated data over a longer period may be stored in Snowflake (5).
5. Snowflake offers native support for JSON and other semi-structured data formats for easy ingestion of device data. Snowpipe automatically optimizes time-series queries by ingesting data chronologically. Snowflake’s Streams and Tasks features automate the workflows required to ingest and aggregate incoming data.
6. An IoT rules engine hosts the business logic required by the application and operates on data available in Snowflake and in the message broker. The rules engine sends messages back to control devices.
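The split between minute-by-minute data and longer-period aggregates can be made concrete with a small roll-up. This is a minimal sketch, assuming JSON readings with hypothetical `device_id`, `ts` (epoch seconds), and `temperature` fields; in the architecture above this aggregation would run inside the data platform rather than in application code.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone


def hourly_averages(json_lines):
    """Aggregate per-minute readings into (device_id, hour) -> mean temperature."""
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, count]
    for line in json_lines:
        reading = json.loads(line)
        ts = datetime.fromtimestamp(reading["ts"], tz=timezone.utc)
        # Truncate the timestamp to the start of its hour to form the bucket.
        hour = ts.replace(minute=0, second=0, microsecond=0).isoformat()
        bucket = sums[(reading["device_id"], hour)]
        bucket[0] += reading["temperature"]
        bucket[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


# Two readings one minute apart share an hour; the third falls in the next hour.
base = 1_600_000_000
readings = [
    json.dumps({"device_id": "sensor-1", "ts": base, "temperature": 20.0}),
    json.dumps({"device_id": "sensor-1", "ts": base + 60, "temperature": 22.0}),
    json.dumps({"device_id": "sensor-1", "ts": base + 3600, "temperature": 30.0}),
]
result = hourly_averages(readings)
```

Keeping only the hourly means in the analytical store, while the raw minute-level files stay in object storage, is one way to read the example given in the description; other retention splits are equally possible.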

CUSTOMER 360 REFERENCE ARCHITECTURE

OBJECTIVE
Build sales and marketing applications that use historical and real-time data to accomplish “360-degree view” customer goals, such as finding new segments and sending personalized offers.

Diagram components:
Apps and services
Application data: product data, audience data, purchase attribution data, user activity data
Cloud object storage: Amazon S3, Azure Blob Storage, Google Cloud Storage
Streaming service: Azure Event Hub, Amazon Kinesis, Cloud Pub/Sub
ETL: AWS Step Functions, Azure Data Factory, Google Cloud Dataflow
3rd party data via Snowflake Secure Data Sharing
Snowflake: Data Enrichment using Streams & Tasks, Native JSON Support, Query Data in Object Storage via External Tables, Data Monetization via Secure Data Sharing
ML models

DESCRIPTION
1. Cloud object storage stages application data, such as data on products, audiences, purchase attributions, and user activity, for ingestion.
2. A streaming service ensures reliable and continuous ingestion by buffering event data, such as clickstreams.
3. ETL services orchestrate the workflow to load data from cloud object storage into Snowflake.
4. Snowflake Secure Data Sharing enables data from third-party sources to be used without copying or moving the data.
5. Snowflake supports all the analytics workloads within the application. External Tables support queries of data in cloud object storage without ingestion. The Streams and Tasks features automate the ingestion and data enrichment process. Native support for JSON and other semi-structured formats simplifies the ingestion of event data. Secure Data Sharing enables monetization of fresh data without copying or moving the data.
6. ML models are trained to optimize offers based on historical data stored in Snowflake. The application makes real-time predictions via an API and uses Snowflake tables to store input data and batch prediction results.
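Data enrichment means joining semi-structured event data with reference data such as the product catalog. The sketch below shows the shape of that step in plain Python, assuming a hypothetical event layout and an invented `PRODUCTS` lookup; in the architecture above the same join would be expressed in SQL and automated with Streams and Tasks.

```python
import json

# Hypothetical product dimension keyed by SKU.
PRODUCTS = {
    "sku-1": {"name": "Trail Shoe", "category": "footwear"},
    "sku-2": {"name": "Rain Jacket", "category": "outerwear"},
}


def enrich(event_json, products=PRODUCTS):
    """Flatten one semi-structured event and attach product attributes."""
    event = json.loads(event_json)
    sku = event.get("item", {}).get("sku")
    product = products.get(sku, {})
    return {
        "user_id": event["user_id"],
        "action": event["action"],
        "sku": sku,
        "category": product.get("category", "unknown"),
        # Optional fields may be absent; flexible schemas must tolerate that.
        "campaign": event.get("context", {}).get("campaign"),
    }


raw_events = [
    '{"user_id": 7, "action": "view", "item": {"sku": "sku-1"}}',
    '{"user_id": 7, "action": "buy", "item": {"sku": "sku-2"}, '
    '"context": {"campaign": "spring"}}',
    '{"user_id": 9, "action": "view", "item": {"sku": "sku-404"}}',
]
enriched = [enrich(e) for e in raw_events]
```

Note how the second event carries a `context` object the others lack: the enrichment step reads it when present and falls back to `None` otherwise, which is the kind of schema drift that native semi-structured support is meant to absorb.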

EMBEDDED ANALYTICS REFERENCE ARCHITECTURE

OBJECTIVE
Build analytics-heavy applications that deliver in-app visualizations.

Diagram components:
App
API/web tier
In-memory cache: Memcached
OLTP/NoSQL DB
ETL
Snowflake: Workload Isolation, Fast Analytical Queries, Native JSON Support
In-app embedded business intelligence

DESCRIPTION
1. The application makes requests via an API or web tier, depending on whether API management is required to enforce an SLA.
2. In-memory cache provides in-session read requests to ensure millisecond response time.
3. An OLTP or NoSQL database supports the transaction workloads of the application. Snowflake (4) ingests historical transaction data via ETL infrastructure to support analytical workloads.
4. Snowflake stores all historical data and supports queries by the application and business intelligence tools (5). Virtual warehouses isolate workloads and autoscale compute resources to deliver high-performance queries and unlimited concurrency.
5. Embedded business intelligence tools or open source charting libraries support analytics from within the application.
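The in-memory cache sits in front of slower reads so that repeated in-session requests do not hit the backing store each time. The following is a minimal cache-aside sketch with a time-to-live. The `TTLCache` class is a tiny in-process stand-in for a service such as Memcached, and `load_dashboard` is a hypothetical slow path; neither reflects a specific product API.

```python
import time


class TTLCache:
    """Cache-aside helper: return a fresh cached value or load and store it."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock       # injectable so expiry can be tested
        self._store = {}         # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]        # still fresh: serve from memory
        value = loader(key)      # miss or expired: take the slow path
        self._store[key] = (now + self.ttl, value)
        return value


calls = []


def load_dashboard(key):
    """Hypothetical slow path, e.g. an analytical query against the warehouse."""
    calls.append(key)
    return {"key": key, "total": 42}


fake_now = [0.0]
cache = TTLCache(ttl_seconds=30, clock=lambda: fake_now[0])
first = cache.get_or_load("sales:today", load_dashboard)
second = cache.get_or_load("sales:today", load_dashboard)  # served from cache
fake_now[0] = 31.0
third = cache.get_or_load("sales:today", load_dashboard)   # expired, reloaded
```

The time-to-live bounds how stale a dashboard tile can be; choosing it is a trade-off between response time and freshness that depends on the application.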

FUTURE-PROOF YOUR APPLICATIONS

Regardless of the type of applications you build or what architectural design pattern you select, you must meet the core data platform requirements for scalability and connectivity if you want to attract and keep customers to grow your business. With Snowflake, you can meet customer expectations with a modern foundation for your data stack that delivers a highly performant service, both now and in the future.

Rather than spend valuable development time rearchitecting your data stack over and over again to chase ever-evolving scalability needs, a cloud data platform lets you focus on what you do best: building and improving your application to entice new customers.

And that’s something you can hang your app on.

ABOUT SNOWFLAKE

Snowflake’s cloud data platform shatters the barriers that have prevented organizations of all sizes from unleashing the true value from their data. More than 2,000 customers deploy Snowflake to advance their businesses beyond what was once possible by deriving all the insights from all their data by all their business users. Snowflake equips organizations with a single, integrated platform that offers the only data warehouse built for the cloud; instant, secure, and governed access to their entire network of data; and a core architecture to enable many types of data workloads, including a single platform for developing modern data applications. Snowflake: Data without limits. Find out more at snowflake.com

© 2020 Snowflake. All rights reserved.

CITATIONS
1 “The Digitization of the World From Edge to Core.” IDC. bit.ly/2QuFiKk
