BIG DATA PROCESSING: A DEEP DIVE IN HADOOP/SPARK & AZURE SQL DW

Transcription

BIG DATA PROCESSING: A DEEP DIVE IN HADOOP/SPARK & AZURE SQL DW
Presented By: Orion Gebremedhin
Director of Technology, Data & Analytics, Neudesic LLC.
Data Platform VTSP, Microsoft Corp.
@OrionGM

TOPICS COVERED
1. Fundamentals of Big Data Platforms
2. Major Big Data Tools

SCALING UP VS. OUT
SCALE UP (SMP)
- Upgrade components or buy a bigger server each time
- Multiprocessor system where processors share resources: operating system (OS), memory, and I/O devices connected using a common bus
SCALE OUT (MPP)
- Add nodes to the cluster
- Multiple processing nodes, each with its own OS and RAM, connected over a network

INNOVATION TIMELINE (2002 to 2009)
- Doug Cutting & Mike Cafarella started working on Nutch
- Google publishes GFS & MapReduce papers
- Doug Cutting adds DFS & MapReduce support to Nutch
- Yahoo! hires Cutting, Hadoop spins out of Nutch
- NY Times converts 4TB of image archives over 100 EC2s
- Fastest sort of a TB, 3.5 mins over 910 nodes
- Facebook launches Hive: SQL support for Hadoop
- [logo] Founded
- Fastest sort of a TB, 62 secs over 1,460 nodes
- Sorted a PB in 16.25 hours over 3,658 nodes
- Doug Cutting joins Cloudera
- Hadoop Summit 2009, 750 attendees

THE FUNDAMENTALS OF HADOOP
- Hadoop evolved directly from commodity scientific supercomputing clusters developed in the 1990s
- Hadoop consists of:
  - MapReduce
  - Hadoop Distributed File System (HDFS)

WHAT’S NEW

BASICS OF MPP
400 bills at 1 bill/sec: 400 seconds

BASICS OF MPP
200 bills at 1 bill/sec: 200 seconds
200 bills at 1 bill/sec: 200 seconds
Total: 200 seconds

BASICS OF MPP
100 bills at 1 bill/sec: 100 seconds
100 bills at 1 bill/sec: 100 seconds
100 bills at 1 bill/sec: 100 seconds
100 bills at 1 bill/sec: 100 seconds
Total: 100 seconds
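The arithmetic behind the bill-counting slides can be sketched in a few lines. This is a minimal illustration, assuming the bills are split evenly among counters who work fully in parallel at the slides' rate of 1 bill per second, with no coordination overhead. The function name `elapsed_seconds` and its parameters are hypothetical, not part of the original deck.

```python
import math


def elapsed_seconds(total_bills: int, counters: int, bills_per_second: float = 1.0) -> float:
    """Wall-clock time when `counters` people count bills in parallel.

    Assumption: an even split and zero coordination cost (an idealization).
    """
    # The counter with the largest share finishes last, so round the share up.
    bills_per_counter = math.ceil(total_bills / counters)
    return bills_per_counter / bills_per_second


# The three scenarios from the slides: 1, 2, and 4 counters sharing 400 bills.
for counters in (1, 2, 4):
    print(counters, "counter(s):", elapsed_seconds(400, counters), "seconds")
```

The total is set by the slowest worker rather than by the sum of the work, which is why doubling the counters halves the elapsed time in this idealized case.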

HDFS & MAPREDUCE
The main node runs the Job Tracker and the Name Node; the Name Node controls the files.
Each worker node (1 to N) runs two processes: a Task Tracker and a Data Node.
Diagram layers:
- MapReduce: Job Tracker, Task Trackers
- HDFS cluster: Name Node, Data Nodes

BASICS OF MAPREDUCE
The main node runs the Job Tracker and the Name Node; the Name Node controls the files.
Each worker node runs two processes: a Task Tracker and a Data Node.
Diagram flow: Query -> Name Node/Job Tracker -> Data Nodes/Task Trackers -> Result

MAPREDUCE EXECUTION UNITS
The overall MapReduce word count process:
Input -> Splitting -> Mapping -> Shuffling -> Reducing -> Final Result
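The word count stages named on this slide can be imitated on a single machine. The sketch below is not Hadoop code: it is plain Python that mirrors the splitting, mapping, shuffling, and reducing steps in memory, and the function names and the sample text are assumptions chosen for illustration.

```python
from collections import defaultdict


def split_input(text: str) -> list[str]:
    # Splitting: treat each non-empty line as one input split.
    return [line for line in text.splitlines() if line.strip()]


def map_words(line: str) -> list[tuple[str, int]]:
    # Mapping: emit a (word, 1) pair for every word in the split.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: list[tuple[str, int]]) -> dict[str, list[int]]:
    # Shuffling: group all values that share the same key.
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_counts(groups: dict[str, list[int]]) -> dict[str, int]:
    # Reducing: sum the grouped values to get one count per word.
    return {word: sum(values) for word, values in groups.items()}


def word_count(text: str) -> dict[str, int]:
    mapped = [pair for line in split_input(text) for pair in map_words(line)]
    return reduce_counts(shuffle(mapped))


sample = "Deer Bear River\nCar Car River\nDeer Car Bear"
print(word_count(sample))
```

In a real cluster the map and reduce functions run on different nodes and the shuffle moves data across the network; here every stage is an ordinary function call so the data flow is easy to follow.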

SOME DISTRIBUTIONS OF APACHE HADOOP
Apache Foundation

HORTONWORKS SANDBOX

MAPREDUCE, PIG & HIVE
MAPREDUCE
- Java
- Write many lines of code
PIG
- Mostly used by Yahoo
- Most used for data processing
- Shares some constructs w/ SQL
- Is more verbose
- Needs a lot of training for users with limited procedural programming background
- Offers control over the flow of data
HIVE
- Mostly used by Facebook for analytic purposes
- Used for analytics
- Relatively easier for developers w/ SQL experience
- Less control over optimization of data flows compared to Pig
PIG & HIVE
- Not as efficient as MapReduce
- Higher productivity for data scientists and developers

THE EXPLOSION OF HADOOP

THE HISTORY OF SPARK
Timeline years: 2004, 2006, 2009, 2010, 2011, 2013, 2014
Milestones shown: MapReduce, Spark Paper, BSD Open Source, Apache, Top Level

SPARK SHARED LIBRARIES

SPARK: THE UNIFIED PLATFORM FOR BIG DATA
APIs for: Scala, Java, Python, R
Libraries: Spark SQL, Spark Streaming, MLlib (machine learning), GraphX (graph)
Engine: Spark Core

SPARK BENEFITS
Performance
- Using in-memory computing, Spark is considerably faster than Hadoop (100x in some tests).
- Can be used for batch and real-time data processing.
Developer Productivity
- Easy-to-use APIs for processing large datasets.
- Includes 100 operators for transforming data.
Unified Engine
- Integrated framework includes higher-level libraries for interactive SQL queries, processing streaming data, machine learning and graph processing.
- A single application can combine all types of processing.
Ecosystem
- Spark has built-in support for many data sources such as HDFS, RDBMS, S3, Apache Hive, Cassandra and MongoDB.
- Runs on top of the Apache YARN resource manager.

CORTANA ANALYTICS

SQL Server: BIG DATA OPTIMIZATIONS

SQL Server: APS
High-Speed Interconnect
Node 1: A-F
Node 2: G-R
Node 3: S-Z
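One way to read the node labels on this slide is as alphabetical key ranges (A to F, G to R, S to Z), with each row stored on the node that owns its range. That reading is an assumption: the slide does not say how APS actually assigns rows. Under that assumption, a minimal sketch of range-based distribution looks like this; `NODE_RANGES` and `node_for` are hypothetical names.

```python
# Assumed key ranges, taken from the slide's node labels.
NODE_RANGES = {
    "Node 1": ("A", "F"),
    "Node 2": ("G", "R"),
    "Node 3": ("S", "Z"),
}


def node_for(key: str) -> str:
    """Return the node whose range contains the first letter of `key`."""
    first = key.strip()[:1].upper()
    for node, (low, high) in NODE_RANGES.items():
        if low <= first <= high:
            return node
    raise ValueError(f"no node owns keys starting with {first!r}")


# Each node ends up holding only the rows in its own range, so a scan
# can run on all three nodes at once.
customers = ["Baker", "gomez", "Smith", "Adams", "Rivera", "Zhang"]
placement = {name: node_for(name) for name in customers}
print(placement)
```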

SQL Server: APS GROWTH TOPOLOGY
Scale Unit
Base Unit
Base Unit Extension

SQL Server: Azure SQL DW
Application or User Connection
Data Loading (PolyBase, ADF, SSIS, REST, OLE, ODBC, AZCopy, PS)
Massively Parallel Processing (MPP) Engine
Compute Nodes, each shown with DMS and a SQL DB
Azure Infrastructure and Storage
Blob Storage [WASB(S)]

SQL Server: DEPLOY OPTIONS & HYBRID SOLUTIONS

SQL Server: CONNECTING ISLANDS OF DATA WITH POLYBASE
- Provides a single T-SQL query model for PDW and Hadoop with rich features of T-SQL, including joins without ETL
- Uses the power of MPP to enhance query execution performance
- Supports Windows Azure HDInsight to enable new hybrid cloud scenarios
- Provides the ability to query non-Microsoft Hadoop distributions, such as Hortonworks and Cloudera
Diagram: Select -> PolyBase (SQL Server Parallel Data Warehouse) -> Result set
Data sources shown: Microsoft Azure HDInsight, Microsoft HDInsight, Hortonworks for Windows and Linux, Cloudera

USE CASE 3: SUPPLY CHAIN MANAGEMENT

USE CASE: SMART GRID MANAGEMENT

USE CASES: SMART LING

Neudesic partnered with one of the nation’s largest utility companies, which had recently deployed smart utility meters for power customers: nearly a million meters sending usage data every 15 minutes.
The result: an Azure hybrid big data processing solution that enabled the customer to perform gap analytics, a process for identifying gaps in the power usage readings, over 7x faster than their previous solution. Billions of smart meter reads are processed to identify the nature and duration of the gaps and so mitigate revenue losses.
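The idea of gap analytics can be illustrated for a single meter. The sketch below is not the customer solution described above; it is a small Python example that assumes readings arrive on a 15-minute grid and reports where reads are missing and for how long. The function name `find_gaps` and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta

# Reporting interval stated in the use case: one read every 15 minutes.
INTERVAL = timedelta(minutes=15)


def find_gaps(timestamps, interval=INTERVAL):
    """Return (first_missing, last_missing, missing_reads) for each gap.

    Assumption: reads are expected exactly on the interval grid.
    """
    ordered = sorted(timestamps)
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        # Number of whole intervals skipped between two consecutive reads.
        missing = (current - previous) // interval - 1
        if missing > 0:
            first_missing = previous + interval
            last_missing = previous + missing * interval
            gaps.append((first_missing, last_missing, missing))
    return gaps


# One meter with reads missing at 00:30 and 00:45.
reads = [
    datetime(2016, 1, 1, 0, 0),
    datetime(2016, 1, 1, 0, 15),
    datetime(2016, 1, 1, 1, 0),
    datetime(2016, 1, 1, 1, 15),
]
for start, end, count in find_gaps(reads):
    print(f"gap from {start:%H:%M} to {end:%H:%M}: {count} missing reads")
```

At the scale in the use case, the same per-meter check would be run in parallel across the meters, which is where a distributed engine pays off.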

USE CASES: SMART GRID

USE CASE: REAL TIME TRAFFIC ANALYSIS

REAL TIME TRAFFIC ANALYSIS

USE CASES: STREAM ANALYTICS
- Real-time fraud detection
- Connected cars
- Click-stream analysis
- Real-time financial portfolio alerts
- Smart grid, energy management
- CRM alerting sales to customer case
- Data and identity protection services
- Real-time sales tracking

ML PROBLEMS SOLVED BY AZURE
[...]tion
Clustering

INDUSTRY USE CASES: MACHINE LEARNING

THE MACHINE LEARNING WORKFLOW
Input data -> Data transformation -> Define model -> Split data -> Train model -> Score (prediction) -> Evaluate model
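The workflow steps on this slide can be walked through with a toy example. The sketch below uses only the Python standard library rather than any Azure service; the synthetic data (points on the line y = 2x + 1), the least-squares line model, and the 80/20 split are all assumptions made for illustration.

```python
from statistics import mean, pstdev

# Input data: hypothetical (x, y) pairs that follow y = 2x + 1 exactly.
data = [(float(x), 2.0 * x + 1.0) for x in range(20)]

# Data transformation: standardize the feature to zero mean, unit spread.
xs = [x for x, _ in data]
mu, sigma = mean(xs), pstdev(xs)
rows = [((x - mu) / sigma, y) for x, y in data]


# Define model: a straight line fitted by ordinary least squares.
def fit(train):
    x_bar = mean(x for x, _ in train)
    y_bar = mean(y for _, y in train)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in train)
    sxx = sum((x - x_bar) ** 2 for x, _ in train)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Split data: first 80% for training, the rest held out for testing.
cut = int(len(rows) * 0.8)
train, test = rows[:cut], rows[cut:]

# Train model.
slope, intercept = fit(train)

# Score (prediction) on the held-out rows.
predictions = [slope * x + intercept for x, _ in test]

# Evaluate model: mean absolute error between predictions and actuals.
mae = mean(abs(p - y) for p, (_, y) in zip(predictions, test))
print(f"mean absolute error: {mae:.6f}")
```

The code follows the slide's order, so the transformation is computed before the split; in practice the transformation statistics are usually derived from the training split only, to keep test data out of training.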

AZURE DATA FACTORY

HDINSIGHT

BIG DATA & ADVANCED ANALYTICS ROADSHOW
Questions?
Orion Gebremedhin
Orion.Gebremedhin@Neudesic.com
Twitter: @oriongm
Marc Lobree
Marc.Lobree@Neudesic.com
