Azure Data Landscape - WordPress

Transcription

Azure Big Data LandscapeA high level overview of the big data services on the Azure cloud platformRolf TesmerAzure Data Solutions Architect, Microsoft

Why is data so important?Because there’s just so much of it!CLOUDMOBILE

On-Prem vs IaaS vs PaaS vs SaaS – Which One?

* Preview Serviceshttp://azureplatform.azurewebsites.net/

AgendaKey Components of the Microsoft AzureCloud Data Platform

Introduction: Data Size Over the Years

Introduction: Big Data Definition - The Four V’sA Big Data “problem” exists when you must address more than one of the V’s.(Only one V indicates current technology is likely to satisfy your goals)VolumeThe data exceeds thephysical limits of verticalscalability, implying a scaleout solution (vs. scalingup).VelocityThe decision windowis small compared withthe data change rate.VarietyVariabilityMany differentformats makeintegration difficultand expensive.Many options orvariableinterpretationsconfound analysis.To solve the “problem” you often need specialist technologiesBusiness wish to solve the “problem” because it offers competitive advantage

Big Data Business Applications & Use CasesFraud prevention360 view of the customerAnalyze brand sentimentLocalized, personalizedpromotionsWebsite optimizationNext product to buy (NPTB)

Big Data and Data Warehousing ComparedBig Data does not negate the business drivers for a Data Warehouse. The technologies servedifference business purposes. Big Data systems can be a feeder into the Data Warehouse.FeatureBig DataData Warehousing(ADL, HDInsight, Hadoop, etc) (SQL DW, SQL in IaaS)Solution TypeEcosystem, not a productProduct/ServiceTypical Data TypeStructured, Semi-Structured, Unstructured Structured (Operational)Typical Data SizeTB – PBLinear Scale out MPPGB – TBNon-linear, Scale Up (SMP typically!)Typical Data ArtefactsFilesTables/Rows/ColumnsSchemaDefined On ReadDefined On WriteData Consistency,Quality and AccuracyLow, loose structure, no ACIDHigh, complex structure, strong ACIDAzure TechnologiesHDInsight, Data LakeVendors (Cloudera, MapR, Hortonworks)SQL DB, SQL DWSQL Relational Database in IaaS

MICROSOFT BIG DATA SOLUTIONSBig Data as part of a Data Warehousing SolutionCloudRelationalBeyond relationalFastest insightsAzure SQL Data WarehouseAzure Data LakeReal-time insights with breakthrough query performanceSQL Server in Azure VMsAzure HDInsightAnalytics built-inAzure MarketplaceHortonworks, Cloudera, MapRReal-time insights with analytics built inChoice of deploymentLeading solutions—on-premises and in the cloudPolyBaseLayers of securityInsightsLeast vulnerable database 6 years in a rowOn-premisesAny data, any scaleA hybrid solution that grows in step with customer needsMore for the priceSQL Server 2016Analytics Platform System(APS)3rd Party Hadoop DistributionsHortonworks, ClouderaCustomers do more with industry-leading TCO

AgendaLambda Architecture

BATCH LAYERSERVING LAYERSPEED LAYERWhat is the LAMBDA elemetry

Big Data Pipeline and Data Flow in Azure

AgendaWhat exactly is Unstructured, SemiStructured and Structured Data?

Considering Data Typesefficient data compression and encoding schemes with enhanced performance to handle complex data in ile-formats-its-not-just-csv-anymore

Columnar Formats: Why? ORC & PARQUET

Columnar Formats: Options

Query Times for Different FormatsReference: Unknown

Data Size for Different Formats & CompressionReference: Unknown

AgendaWhat exactly is Hadoop?

Introduction: What is Hadoop?A platform with a portfolio of projectsGoverned by Apache Software Foundation (ASF) (Open Source)Comprises core services of MapReduce, HDFS, and YARNGovernance & IntegrationData workflow,lifecycle andgovernanceFalconSqoopFlumeNFSWebHDFSData accessBatchScriptSQLNosqlStream SearchMapreducePigHive/Tez, HbaseStormHCatalog AccumuloSolrOthersSpark,in-memory,ISV enginesYARN: data operating system1 HDFS (Hadoop Distributed File System) (3xreplicas) Data management ntingData protectionProvision, manage,and monitorStorage: HDFSResources: YARNAccess: Hive, Pipeline: FalconCluster: KnoxAmbariZookeeperScheduling NOozie

The various big data solutionsCONTROLEASE OF USEAzure Data LakeAnalyticsAny Hadoop technologyIaaS HadoopWorkload optimized,managed clustersManaged HadoopSpecific apps in a multitenant form factorBig Data as-a-serviceAzure Data LakeAnalyticsAzure Data Lake StoreAzure StorageBIG DATASTORAGEAzure MarketplaceHDP CDH MapRBIG DATAANALYTICSAzure HDInsight

Context - Comparing Hadoop and SQL Server\SQL OS/ ImpalaSpark(ie Create tables, etc.)In Memory SQL Stored ProceduresReference: “Eating the Elephant” – PASS 2015 - Stuart R Ainsworth

Physical Structure Name node is critical – if down, cluster is down

Data Redundancy

AgendaKey Components of the Microsoft AzureCloud Data Platform

Azure HDInsightHadoop as a Service on Azure(PaaS)FULLY MANAGED AND SUPPORTED PaaSHadoop, Spark, Hbase, Storm, KafkaAvailable on LINUX100% OPEN SOURCE Apache HadoopClusters up and RUNNING IN MINUTES (20-30)Use familiar BI TOOLS FOR ANALYSIS like Excel

HDInsight: Azure PaaS Implementation of HadoopHDInsight Supports Several of the Hadoop Projects HIVE HiveQL is a SQL-like language (subset of SQL)(Compiled into MapReduce jobs)HBASE Columnar, NoSQL database on data in HDFSSPARK In Memory Processing on Multiple WorkloadsSTORM Stream Analytics for Near-Real TimeProcessing (similar to Azure Stream Analytics)

HDInsight Supports HiveSQL-like queries on Hadoop data in HDInsightHDInsight provides easy-to-use graphical query interface for HiveHiveQL is a SQL-like language (subset of SQL)Hive structures include well-understood database concepts such as tables, rows, columns, partitionsCompiled into MapReduce jobs that are executed on HadoopDramatic performance gains with Stinger/TezStinger is a Microsoft, Hortonworks and OSS driven initiative to bring interactive queries with HiveBrings query execution engine technology from Microsoft SQL Server to HivePerformance gains up to 100xHadoop 2.0

HDInsight Supports SparkIn Memory Processing on Multiple WorkloadsSingle execution model for multiple tasks (SQL queries, Streaming, Machine Learning, and Graph)Processing up to 100x faster performanceDeveloper friendly (Java, Python, Scala)BI tool of choice (Power BI, Tabelau, Qlik, SAP)Notebook experience (Jupyter/iPython, Zeppelin)

HDInsight Supports HBaseNoSQL database on data in HDInsightColumnar, NoSQL databaseRuns on top of the Hadoop Distributed File System (HDFS)Provides flexibility in that new columns can be added to column families at any timeHMasterCoordinationName NodeRegion ServerRegion ServerRegion ServerRegion ServerJob TrackerData NodeData NodeData NodeData NodeTask TrackerTask TrackerTask TrackerTask Tracker

HDInsight Supports StormStream analytics for Near-Real Time ProcessingConsumes millions of real-time events from a scalable event broker (ie. Apache Kafka, Azure Event Hub)Performs time-sensitive computationOutput to persistent stores, dashboards or devicesCustomizable with Java .NETDeeply integrated to Visual StudioWeb/thick clientdashboardsRabbitMQ /ActiveMQStreamprocessingSearch and queryData analytics (Excel)Devices to take action

AgendaKey Components of the Microsoft AzureCloud Data Platform

Introduction: What is Azure Data Lake Store & Analytics?Microsoft Azure Data LakeADL AnalyticsHDInsightU-SQLConsists of 2 component parts;Data Lake Store & Data Lake AnalyticsDistributed PaaS serviceBoth Instantly scale to meet performance needsYARNHDFS and ADLADL StoreAnalytics over all data(unstructured, semi-structured, structured)U-SQL to perform Analytics(simple and familiar, easily extensible)(Integrated into Visual Studio tools)Built on open standards (YARN)Can deploy other services on store (ie HDInsight)

Azure Data LakeStoreA hyper scale repository for bigdata analytics workloadsAn enterprise wide repository ofevery type of data collected in asingle place prior to any formaldefinition of requirements orschema.SCALE No limitsANY DATA Store in its native formatHADOOP FILE SYSTEM (HDFS) for the cloudNATIVELY accessible via both HDFS and ADLENTERPRISE READY access control, encryptionPERFORMANCE Optimized for analytic workloadPaaS Service managed by Microsoft

Azure Data Lake Store – Technical DetailsDurable & Highly Available Data is managed by Microsoft (PaaS)Unlimited Storage Unlimited account sizes, no limits to scale Individual file sizes to PBsSecure Secure files and folders, POSIX (ACL) Auditing and logging Encryption at restOptimised for Analytic Workloads Designed for large scale parallel processing Auto optimize to match active workloads Immediate read after writePrimary Use Cases Long term IoT storageClickstream analysisSocial analysisWeb log analysisFile based batch processingStaging files for DW loadsLong term DW archive( similar use cases to Big Data)

AUTO SCALE with no limitsAzure Data LakeAnalyticsA elastic analytics servicebuilt on Apache YARN thatprocesses all data, at any sizeU-SQL a language that unifies the benefits of SQLwith the expressive power of C#Optimized to work with ADL STOREFEDERATED QUERY with Azure data sourcesENTERPRISE READYPay & Auto Scale PER (U-SQL) ANALYTIC JOBDEVELOP jobs in Visual Studio or Azure PortalU-SQL Reference: 91959.aspxExample Code: 01/04/an-introduction-to-u-sql-in-azure-data-lake/

AgendaComparing How the TechnologiesOverlap

Hot DataCold DataStructureHDFS(HDI, HDP, Cloudera, MapR)NoSQL(CosmosDB)Azure Data Lake Store (ADLS)SQL(SQL DB, SQL DW, SQL IaaS)Cost/GBLatencyData VolumeReference: “Architect Robust Big Data Solutions with Azure Data Lake” – Matt Winter, Ignite Australia 2017HighLowRequest RateLowHighHigh

Data Processing Technology ChoicesReference: “Architect Robust Big Data Solutions with Azure Data Lake” – Matt Winter, Ignite Australia 2017

AgendaQ&A

References Big Data - aspxHadoop - https://en.wikipedia.org/wiki/Apache HadoopMap Reduce - https://en.wikipedia.org/wiki/MapReduceHive - https://en.wikipedia.org/wiki/Apache HiveSpark (core, streaming, ML, graphX) - https://en.wikipedia.org/wiki/Apache SparkStorm - https://en.wikipedia.org/wiki/Storm (event processor)Kafka - https://en.wikipedia.org/wiki/Apache KafkaSqoop - https://en.wikipedia.org/wiki/SqoopImpala - https://en.wikipedia.org/wiki/Cloudera ImpalaCloudera - https://en.wikipedia.org/wiki/ClouderaHortonworks - https://en.wikipedia.org/wiki/HortonworksData Lake - https://en.wikipedia.org/wiki/HortonworksHDInsight - aspxMahout - https://en.wikipedia.org/wiki/Apache MahoutAvro - https://wiki.apache.org/hadoop/Avro/Parquet - https://en.wikipedia.org/wiki/Apache ParquetORC - https://orc.apache.org/docs/SPARK vs IMPALA – which and when - la/Patterns and Practices - aspx

Azure Data Lake Azure HDInsight Azure Marketplace Hortonworks, Cloudera, MapR Azure SQL Data Warehouse SQL Server in Azure VMs On-s oud SQL Server 2016 Analytics Platform System (APS) 3rd Party Hadoop Distributions Hortonworks, Cloudera PolyBase Insights Fastest insights Real-time insigh