Microsoft Azure Virtual Training Day: Delivering The Modern Data

Transcription

Microsoft AzureVirtual Training Day:Delivering the modern datawarehouse

Delivering a modern datawarehouseNicholas MooreCloud Solutions Architect

Agenda

Why modern data warehousing?

Digital transformation91%of business leaders see Digital Transformation as away of sparking innovation and finding efficiencies68%say Digital Transformation is increasing profits85%say they must offer digital services orbecome irrelevant64%say they have less than 4 years tocomplete a Digital Transformationor they may go out of business

Common challenges with on-premises solutionsData SilosPerformance ConstraintsSolution ComplexityEscalating Costs

Derive real value from your data in the cloudOne hub forall dataUnlimiteddata scaleCommon PlatformLower TCO

Common customer use casesModern data warehouseAdvanced analytics“Integrate all our data—including BigData—with our data warehouse foranalytics and reporting”“Predict next best offer andcustomer churn”Real-time analytics“Derive insights from our devicesand data streams in real-time”

Building the modern data warehouse

Modern data warehouse patternsModern data warehouseAdvanced analytics“Integrate all our data—including BigData—with our data warehouse foranalytics and reporting”“Predict next best offer andcustomer churn”Real-time analytics“Derive insights from our devicesand data streams in real-time”

Modern data warehousing patternsIngest & PrepLogs(unstructured)Media(unstructured)Model & ServeAzure Data FactoryCode-free data transformation andingestion from 90 data integrationconnectorsAzure DatabricksAzure Synapse Analytics(Data Warehouse)Up to 14x faster and costs 94% lessthan other cloud providers(Data prep)Up to 10x faster than vanilla SparkFiles(unstructured)Business/custom apps(structured)VisualizeStoreAzure Data Lake StorageHigh performance data lakeavailable in all 54 Azure regionsPower BILeader in the Magic Quadrant for BusinessIntelligence and Analytics Platforms*

Ingest and PrepIngest & PrepLogs(unstructured)Media(unstructured)Model & ServeAzure Data FactoryCode-free data transformation andingestion from 90 data integrationconnectorsAzure DatabricksAzure Synapse Analytics(Data Warehouse)Up to 14x faster and costs 94% lessthan other cloud providers(Data prep)Up to 10x faster than vanilla SparkFiles(unstructured)Business/custom apps(structured)VisualizeStoreAzure Data Lake StorageHigh performance data lakeavailable in all 54 Azure regionsPower BILeader in the Magic Quadrant for BusinessIntelligence and Analytics Platforms*

Model and ServeIngest & PrepLogs(unstructured)Media(unstructured)Model & ServeAzure Data FactoryCode-free data transformation andingestion from 90 data integrationconnectorsAzure DatabricksAzure Synapse Analytics(Data Warehouse)Up to 14x faster and costs 94% lessthan other cloud providers(Data prep)Up to 10x faster than vanilla SparkFiles(unstructured)Business/custom apps(structured)VisualizeStoreAzure Data Lake StorageHigh performance data lakeavailable in all 54 Azure regionsPower BILeader in the Magic Quadrant for BusinessIntelligence and Analytics Platforms*

VisualizeIngest & PrepLogs(unstructured)Media(unstructured)Model & ServeAzure Data FactoryCode-free data transformation andingestion from 90 data integrationconnectorsAzure DatabricksAzure Synapse Analytics(Data Warehouse)Up to 14x faster and costs 94% lessthan other cloud providers(Data prep)Up to 10x faster than vanilla SparkFiles(unstructured)Business/custom apps(structured)VisualizeStoreAzure Data Lake StorageHigh performance data lakeavailable in all 54 Azure regionsPower BILeader in the Magic Quadrant for BusinessIntelligence and Analytics Platforms*

Advanced Analytics patterns

The evolving world of veCognitive

Advanced Analytics patternsModern data warehouseAdvanced analytics“Integrate all our data—including BigData—with our data warehouse foranalytics and reporting”“Predict next best offer andcustomer churn”Real-time analytics“Derive insights from our devicesand data streams in real-time”

Advanced Analytics patternsIngest & PrepLogs(unstructured)Media(unstructured)Azure Data FactoryCode-free data transformation andingestion from 90 data integrationconnectorsAzure Databricks(Data prep)Up to 10x faster than vanilla SparkFiles(unstructured)Business/custom apps(structured)Model & ServeVisualizeAzure Synapse Analytics(Data Warehouse)Up to 14x faster and costs 94% less than othercloud providersTrainAzure Databricks(Machine Learning)Up to 10x faster than vanilla SparkStoreAzure Data Lake StorageHigh performance data lakeavailable in all 54 Azure regionsPower BILeader in the Magic Quadrant for BusinessIntelligence and Analytics Platforms*

Real-time analytics patternsModern data warehouseAdvanced analytics“Integrate all our data—including BigData—with our data warehouse foranalytics and reporting”“Predict next best offer andcustomer churn”Real-time analytics“Derive insights from our devicesand data streams in real-time”

Real-time analytics patternsIngest & PrepLogs(unstructured)Media(unstructured)Azure Data FactoryCode-free data transformation andingestion from 90 data integrationconnectorsAzure Databricks(Data prep)Up to 10x faster than vanilla SparkFiles(unstructured)Business/custom apps(structured)VisualizeAzure Synapse Analytics(Data Warehouse)Power BIUp to 14x faster and costs 94% less than othercloud providersTrainLeader in the Magic Quadrant for BusinessIntelligence and Analytics Platforms*Azure Databricks(Machine Learning)Up to 10x faster than vanilla SparkStreamingStreamingStoreAzure Data Lake StorageAzure Event HubSensors andIoTModel & ServeHigh performance data lakeavailable in all 54 Azure regions

Demo:Modern Data Warehousingand Cloud Scale Analytics

The evolution of Cloud Scale Analytics

Azure Synapse AnalyticsLimitless data warehouse with unmatched time to insightsPREVIEWQUERYUnified experienceOn-premises dataCloud dataSynapse StudioIntegrationDevices dataSaaS dataManagementMonitoringSecurityPower BIAnalytics RuntimesSQLOn DemandSQL ProvisionedSTOREAzure Data Lake StorageSparkAzure Machine Learning

Azure Synapse rienceUnmatchedsecurity

Ingesting data for analyticsworkloadsNicholas MooreCloud Solutions Architect

Agenda

What is Azure Data Factory?

Azure Data FactoryA cloud-based data integration service that allowsyou to orchestrate and automatedata movement and data transformation.

Azure Data Factory processConnect & CollectTransform& EnrichPublishMonitor

Azure Data Factory ComponentsLinked ServiceData Lake meCFControlFlowActivitiesAzure DatabricksDataset

Ingesting data

Data transformation in AzureData preparationADF CopyActivityConnect &CollectLoad processed datainto tables optimizedfor analyticsRead data from filesusing DBFSAzure DatabricksData storageData ingestionLoad flat filesinto data lakeon a scheduleApplicationsAzure DataFactoryLogs, files, and media(unstructured)Azure Storage/Data Lake StoreVisualizeLoad into SQLDW tablesAzure Synapse AnalyticsPower BIDashboardsServingBusiness and customapps (structured)Applicationsmanage theirtransactionaldata directlySQL DBExtract andtransformrelational dataTransactional storageAzure DataFactoryData prep.

Copy Activity processReads data from a source data store.Performs serialization/deserialization, compression/decompression, columnmapping, and so on. It performs these operations based on the configurationof the input dataset, output dataset, and Copy activity.Writes data to the sink/destination data store

Integration RuntimeIRIRSQL ServerAzure IntegrationRuntimeSelf-hostedIntegration Runtime

Copy files with the Copy ActivitySupported file formats:TextJSONAvroORCParquetCopy activity can compressand decompress files withThe following codecs:GzipDeflateBzip2ZipDeflate

Transforming and enriching dataNicholas MooreCloud Solutions Architect

What is Azure Data Factory?

Azure Data FactoryA cloud-based data integration service thatallows you to orchestrate and automatedata movement and data transformation.

Azure Data Factory processConnect & CollectTransform& EnrichPublishMonitor

Azure Data Factory ComponentsLinked ServiceData Lake meCFControlFlowActivitiesAzure DatabricksDataset

Component dependencies

Transforming data withthe ADF Mapping Data Flow

Data transformation in AzureData preparationLoad processed datainto tables optimizedfor analyticsRead data from filesusing DBFSAzure DatabricksData storageData ingestionApplicationsLoad flat filesinto data lakeon a scheduleAzure DataFactoryLogs, files, and media(unstructured)Azure Storage/Data Lake StoreVisualizeLoad into SQLDW tablesAzure Synapse AnalyticsPower BIDashboardsServingBusiness and customapps (structured)Applicationsmanage theirtransactionaldata directlySQL DBExtract andtransformrelational dataTransactional storageAzure DataFactoryData prep.Transform& Enrich

Methods for transforming in Azure Data FactoryComputeresourcesSSIS PackagesMapping DataFlow

Methods for transforming data in Azure Data FactoryCode free data transformation at scaleComputeresourcesSSIS PackagesMapping DataFlow

Benefits of Mapping Data FlowCode free data transformation at scalePerform data cleansing, transformation,aggregations, etc.Enables you to build resilient data flows in a codefree environmentEnable you to focus on building business logic anddata transformationUnderlying infrastructure is provisionedautomatically with cloud scale via Spark executionMapping DataFlow

Using the Mapping Data FlowCode free data transformation at scale

Starting the Mapping Data FlowCode free data transformation at scale

Transformation options in the Mapping Data Flow

Triggering and monitoring

Triggering the Mapping Data FlowCode free data transformation at scale

Demo:Transforming your datain Azure Data Factory

In Summary:TransformingData withAzure DataFactory Microsoft Corporation

Demo:Ingesting data withAzure Data Factory

Monitoring data ingestion

MonitoringActivityPipelinerunsruns Microsoft Corporation

In Summary:IngestingData withAzure DataFactory Microsoft Corporation

Data Loading Best PracticesLuis SilvaCloud Solution Architect, Data Platform

Agenda

What is Azure Synapse Analytics?

Azure Synapse AnalyticsA limitless analytics service withunmatched time to insight, thatdelivers insights from all your data,across data warehouses and big dataanalytics systems, with blazing speed

Azure Synapse AnalyticsPaaSElastic ScaleBig agement

Data Warehouse ArchitectureCompute NodeControl NodeCompute NodeCompute 101101010111010101010110Compute NodeCompute NodeCompute 101101010111010101010110

SQL Pool ScalingDW2500c (5 compute nodes)ControlQueriesEngineDMSSQL MSSQL DBSQL DBSQL DBSQL DBSQL DBDist DB 1Dist DB 2Dist DB 13Dist DB 14Dist DB 25Dist DB 26Dist DB 37Dist DB 38Dist DB 49Dist DB 50 Dist DB 12Dist DB 24Dist DB 36Dist DB 48Dist DB 60Storage

SQL Pool ScalingDW5000c (10 compute nodes)ControlQueriesEngineDMSSQL MSSQL DBSQL DBSQL DBSQL DBSQL DBSQL DBSQL DBSQL DBSQL DBSQL DBDist DB 1Dist DB 2Dist DB 7Dist DB 8Dist DB 13Dist DB 14Dist DB 19Dist DB 20Dist DB 25Dist DB 26Dist DB 31Dist DB32Dist DB 37Dist DB 38Dist DB 43Dist DB 44Dist DB 49Dist DB 50Dist DB 55Dist DB 56 Dist DB 6Dist DB 12Dist DB 18Dist DB 24Dist DB 30Dist DB 26Dist DB 42Dist DB 48Dist DB 54Dist DB 60Storage

Data Warehouse ProcessesProvisionLoadQuery

Loading design goals

Load data efficientlyLoading design goalsLoad Data non-obtrusively, respectingconcurrent queries and loadsReduce table fragmentation as much aspossibleProvide system recovery capabilities in theevent of data load failure with minimalimpact on concurrent queriesMultiple methods of loading

Data warehousing loading in AzureLoading data into a data warehouse in Azure Synapse AnalyticsData preparationLoad processed datainto tables optimizedfor analyticsRead data from filesusing DBFSAzure DatabricksData storageData ingestionLoad flat filesinto data lakeon a scheduleApplicationsAzure DataFactoryLogs, files, and media(unstructured)Azure Storage/Data Lake StoreVisualizeLoad into SQLDW tablesAzure Synapse AnalyticsPower BIDashboardsServingBusiness and customapps (structured)Applicationsmanage theirtransactionaldata directlySQL DBExtract andtransformrelational dataTransactional storageAzure DataFactoryData prep.

Loading MethodsBCPSSISPolyBaseFile basedHeterogenousFile based

PolyBase benefitsBest practices for loading large amount of dataLeverages MPP architectureAzure Data Factory supportVariety of file formatsPolyBase is designed toleverage the MPP(Massively ParallelProcessing) architecture ofAzure Synapse Analyticsand will therefore load andexport data magnitudesfaster than any other tool.Azure Data Factory alsosupports PolyBase loadsand can achieve similarperformance to runningPolyBase manuallyPolyBase supports avariety of file formatsincluding RC, ORCand Gzip files.

Components of PolyBaseExternal DataSourceExternalTablesExternal FileFormat

Loading best practices

Manage your filesCompute NodeControl NodeCompute NodeCompute 101101010111010101010110Compute NodeCompute NodeCompute 101101010111010101010110

Reduce concurrent accessCompute NodeControl NodeCompute NodeCompute 101101010111010101010110Compute NodeCompute NodeCompute 101101010111010101010110

Create a dedicated load user accountCompute NodeControl NodeCompute NodeCompute 101101010111010101010110Compute NodeCompute NodeCompute 101101010111010101010110

Manage singleton updatesCompute NodeControl NodeCompute NodeCompute 101101010111010101010110Compute NodeCompute NodeCompute 101101010111010101010110

Optimize your loadsStaging data, a 2 step processAzure Synapse AnalyticsAzure Storage/Data Lake StoreLoad intoSQL DWtablesStaging TablesProductionTables

Create statistics after loadingImprove the query performance for usersAzure Synapse AnlyticsProduction Tables

Demo:Loading data intoAzure Synapse Analytics Data Warehouse

Optimizing data warehousingquery performanceLuis SilvaCloud Solution Architect, Data Platform

Agenda

What is Azure Synapse Analytics?

Azure Synapse AnalyticsA limitless analytics service withunmatched time to insight, thatdelivers insights from all your data,across data warehouses and big dataanalytics systems, with blazing speed

Data Warehouse ArchitectureCompute NodeControl NodeCompute NodeCompute 101101010111010101010110Compute NodeCompute NodeCompute 101101010111010101010110

Data Warehouse ProcessesProvisionLoadQuery

Data warehouse performance in Azure Synapse AnalyticsData preparationLoad processed datainto tables optimizedfor analyticsRead data from filesusing DBFSAzure DatabricksData storageData ingestionLoad flat filesinto data lakeon a scheduleApplicationsAzure DataFactoryLogs, files, and media(unstructured)Azure Storage/Data Lake StoreVisualizeLoad into SQLDW tablesAzure Synapse AnalyticsPower BIDashboardsServingBusiness and customapps (structured)Applicationsmanage theirtransactionaldata directlySQL DBExtract andtransformrelational dataTransactional storageAzure DataFactoryData prep.

Maximizing Performance

Maximizing Query PerformanceRound RobinTablesHash DistributedTablesReplicatedTables

Maximizing Query PerformanceIs the default option for newly created tablesRound-robinTablesEvenly distributes the data across the availablecompute nodes in a random manner, giving aneven distribution of data across all nodesLoading into Round-robin tables is fastQueries on Round-robin tables may requiremore data movement as data is “reshuffled” toorganize the data for the queryGreat to use for loading staging tables

Maximizing Query PerformanceHash DistributedTablesDistributes rows based on the value in thedistribution column, using a deterministic hashfunction to assign each row to one distribution.Is designed to achieve high performance forqueries that run against large fact tables in a starschema.Choosing a good distribution column isimportant to ensure the hash distributionperforms wellAs a starting point, use on tables that are greaterthan 2GB in size and has frequent inserts, updatesand deletedBut don’t choose a volatile column for the hashdistributed column

Maximizing Query PerformanceA full copy of a table is placed on every singlecompute node to minimize data movementReplicated TablesWorks well for dimension tables in a starschema that are less than 2GB in size and areused regularly in queries with simplepredicatesShould not be used on dimension tables thatare updated on a regular basisYou can convert existing round-robin tables toreplicated tables to take advantage of thefeature using a CTAS statement

Create statistics after loadingAzure Synapse AnalyticsProduction Tables

Demo:Creating distributed tables

Query Performance Tuning

Overcomes the 10,000-row limit of DMV’s outputPinpoint and fix queries with plan regressionQuery Data Store View queries which produce multiple plans 7-day retention period Full query textA/B Testing with your Azure Synapse Analytics (SQLDW)Identify, improve and tune ad hoc queries Top hitting queries for performance tuning Microsoft Corporation

Query Data StoreVIEW DATABASE STATEpermissionDMVs are in UTC timezone

Query execution with Query Data StoreCONTROLQueriesEngineQDSShell DBDMSFlush to disc every 15 minutes seconds10GB is the max storage sizeRetention period is 7 daysMaximum plans per query is 200ComputeComputeComputeComputeDMSDMSDMSDMSSQL DBSQL DBSQL DBSQL DBDist DB 15Dist DB 16Dist DB 31Dist DB 32Dist DB 46Dist DB 47Dist DB 1Dist DB 2 Dist DB 15Dist DB 30Dist DB 45Dist DB 60

Azure Synapse Analytics recommendationsRecommendationgeneration (every 24 hours)Dataskew ReplicatetablesAzure zure Advisor RecommendationBladeRecommendationAPI

Select the proper table distributionIn Summary:Query PerformanceDetect data skew Use Query Data store Consider changing key columns Only as fast as your slowest distributionProvision additional adaptive cache capacityReduce tempdb contentionCreate and update statistics Microsoft Corporation

Demo:Query Performance Tuning

Thank you!

SQL On Demand Spark Analytics Runtimes Synapse Studio Unified experience Integration Management Monitoring Security . PREVIEW Azure Analytics Limitless data warehouse with unmatched time to insights STORE Azure Data Lake Storage Azure Machine Learning SQL Provisioned. Azure Synapse Analytics Powerful insights Unmatched security Unified .