Apache Spark 2.0 GA - Cloudera

Apache Spark 2.0 GA
The General Engine for Modern Analytic Use Cases
Cloudera, Inc. All rights reserved.

Apache Spark Drives Business Innovation
Apache Spark is driving new business value that is being harnessed by technology-forward organizations.
Drive Customer Insights: Next Best Offer (Machine Learning); Churn Analysis; Click-Stream (Stream Processing)
Improve Product and Service Efficiency: Streaming from IoT Sources; Connected Products/Services Analysis; Proactive/Predictive Maintenance
Lower Business Risk: Risk Modeling & Analysis; Network Threat Detection

Spark Addresses Common Limitations
Access and Usability: One of the key advantages of Apache Spark is its intuitive and flexible API for big-data processing, available in popular programming languages. Prior to Apache Spark, users had access only to very limited, inflexible abstractions for processing large distributed data, with poor support outside Java.
Data Processing Performance: MapReduce made big strides in enabling cost-effective batch processing of large volumes of data. However, businesses continue to see a need to shorten data processing windows and consume data faster, which requires a new framework with significantly better performance.
Machine Learning at Scale: Data science and machine learning on big data are exciting areas of focus. However, they require libraries that enable building models on large distributed data and APIs that allow flexible exploration of data.

Apache Spark
Fast and flexible general-purpose data processing for Hadoop
Data Engineering; Stream Processing; Data Science & Machine Learning
Unified API and processing engine for large-scale data

Spark Use Cases
Top use cases: Data Processing (55%), Real-Time Stream Processing (44%), Exploratory Data Science (33%), and Machine Learning (33%).
3 out of 8 are employing Spark in data science research.

Why Spark at Cloudera?
The Most Apache Spark Experience
Cloudera is the “stress free” choice for Spark
Support: Proactive support for Spark workloads.
Expertise: Most Spark users trained. Robust development community.
Experience: First to ship and support. Most customers running Spark of any commercial Hadoop distribution.
Cloudera lives where your data lives
Run Spark on-prem or in the public cloud.
Cloudera makes Spark enterprise hardened
Comprehensive management and alerting
End-to-end security and governance
Better multi-tenancy operation for multiple workloads
Out-of-the-box ready for end-to-end use cases
Spark with supported, seamless integrations with other big-data tools (Kafka, HBase, Kudu, etc.)
[Platform diagram] Process, Analyze, Serve: Batch (Spark, Hive, Pig, MapReduce), Stream (Spark), SQL (Impala), Search (Solr), SDK (Kite). Unified Services: Resource Management (YARN), Security (Sentry, RecordService). Integrate: Sqoop; Kafka, Flume.

Spark from Cloudera
57% have adopted Cloudera Spark for their most important use case, vs. 26% Hortonworks, 22% an Apache download, and 7% Databricks.
48% of respondents said they most commonly use Spark with HBase, and 41% said they use Spark with Kafka.*
*Source: Taneja Group Apache Spark Market Survey 2016 ch#.WCCdPC0rK70

The One Platform Initiative
Management: Leverage Hadoop-native resource management
Security: Full support for Hadoop security and beyond
Scale: Spark at petabyte scale
Streaming: Performance, simplification & easy management of streaming workloads
Cloud: Elastic transient workloads

Three Core Enterprise Applications
Data Engineering & Science: Process data, develop & serve predictive models
Analytic Database: ELT, reporting, exploratory business intelligence
Operational Database: Build data-driven applications to deliver real-time insights
[Platform diagram layers: Operations; Data Management; Process, Analyze, Serve; Unified Services; Store; Integrate]

Cloudera’s Data Engineering Solution
Data Science Workbench (Coming Soon): Collaborative and secure data science workbench
Navigator: Audit, lineage, encryption, key management, & policy lifecycles
Hive-on-Spark: Large-scale ETL & batch processing engine
Search: Interactive search and immediate exploration
Cloud Deployment: Easy deployment and flexible scaling
Spark: Modern real-time analytics engine
Multi-Storage, Multi-Environment

Data Processing

Common Limitations
Poor Cloud Design: ETL and batch processing workloads need large amounts of compute, but only for a window of time. Organizations therefore over-provision to meet the demands of the job, while the environment lies dormant most of the time, producing poor ROI.
Poor Performance: ETL and data processing take too long and often exclude important data sources that are needed to extract real value from the data collected. Traditional platforms only leverage structured data, but increasingly the data needed to offer true intelligence varies in format and delivery.
Limited Data Formats: Traditional platforms only leverage structured data and require a strategic approach to schema design. Introducing new data (unstructured, time series, nested, log data) is often complex, if not impossible. As a result, analysis is limited to data extracted from core systems.

Data Processing with Spark
Process large-scale unstructured and structured data in the same application
Powerful and flexible higher-order functions for arbitrary processing of structured or unstructured data:
map, flatMap, filter, union, reduceByKey, groupBy, distinct, intersection, cartesian, cogroup, sortByKey, aggregateByKey, repartition, partitionBy, coalesce, pipe, mapWith, countByKey, foreach
Keeping it simple: SQL for common operations on structured data
Optimized execution by the query processing engine
Seamlessly mix SQL and higher-order functions within the same Scala, Java, or Python Spark application
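
To make the higher-order operators above more concrete, here is a minimal plain-Python sketch of what flatMap, filter, map, and reduceByKey compute. It does not use the Spark API; the helper names (flat_map, reduce_by_key) and the sample log lines are hypothetical and chosen only for illustration. In Spark the same steps would run distributed across a cluster.

```python
from collections import defaultdict
from functools import reduce
from itertools import chain


def flat_map(func, records):
    """Apply func to each record and flatten the results (flatMap semantics)."""
    return list(chain.from_iterable(func(r) for r in records))


def reduce_by_key(func, pairs):
    """Merge all values that share a key using func (reduceByKey semantics)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


# Hypothetical unstructured input: raw log lines.
log_lines = ["error disk full", "info job done", "error disk slow"]

words = flat_map(str.split, log_lines)          # flatMap: lines -> words
long_words = [w for w in words if len(w) > 3]   # filter: drop short tokens
pairs = [(w, 1) for w in long_words]            # map: word -> (word, 1)
counts = reduce_by_key(lambda a, b: a + b, pairs)
```

The result is a dictionary of per-word counts; the same chain of operators is what a typical Spark word-count job expresses.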

Machine Learning

Machine Learning
In a recent MIT study, respondents evaluated use cases for machine learning:
76% used machine learning to target higher sales growth
40% used it to improve sales and marketing performance
10% used machine learning to increase product sales and reduce churn
Enterprises are using machine learning to serve their customers with higher relevance.
Machine learning models need to scale, and that is where Cloudera Enterprise excels.*
*Source: Forbes Online, Machine Learning Is Redefining The Enterprise In 2016

Apache Spark MLlib
Collection of mainstream machine learning algorithms built on Spark, including:
Classifiers: logistic regression, boosted trees, random forests, etc.
Clustering: k-means, Latent Dirichlet Allocation (LDA)
Recommender Systems: Alternating Least Squares
Dimensionality Reduction: Principal Component Analysis (PCA) and Singular Value Decomposition (SVD)
Feature Engineering & Selection: TF-IDF, Word2Vec, Normalizer, etc.
Statistical Functions: Chi-Squared Test, Pearson Correlation, etc.

Real Time Analysis

Spark Streaming
Real-time and continuous processing of data streams
Fault-tolerant and high-performance processing of continuous streams of data
Similar API and programming paradigm for batch and stream processing
Express complex processing logic on data streams
Focus on the processing logic instead of the stream topology
Re-use code across batch and streaming jobs
Simplified APIs for common streaming tasks:
High throughput with sub-second latency
Operations on “Rolling Windows”
Maintain and update arbitrary state for streaming events
Incremental aggregations
Combine with MLlib for predictive analytics on streaming data
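
The rolling-window and incremental-aggregation ideas above can be sketched in plain Python. This is not the Spark Streaming API; the class name SlidingWindowSum is hypothetical, and the sketch assumes events arrive in timestamp order. It shows how a running total is updated incrementally as new events enter the window and old ones leave it, instead of recomputing the sum from scratch.

```python
from collections import deque


class SlidingWindowSum:
    """Keep a running sum over the most recent window_seconds of events."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, value) pairs still inside the window
        self.total = 0

    def add(self, timestamp, value):
        """Add one event, evict expired events, and return the updated sum."""
        self.events.append((timestamp, value))
        self.total += value
        # Evict events that are a full window length or more behind the newest one.
        while self.events[0][0] <= timestamp - self.window_seconds:
            _, expired_value = self.events.popleft()
            self.total -= expired_value
        return self.total


window = SlidingWindowSum(window_seconds=10)
totals = [window.add(t, v) for t, v in [(0, 1), (5, 2), (10, 3), (16, 4)]]
```

Each call touches only the events that enter or leave the window, which is the property that keeps windowed aggregations cheap on a continuous stream.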

Spark Adoption
64% of current adopters plan to increase Apache Spark usage over the next 12 months.
Spark deployment in the public cloud is projected to increase from 23% today to 36% in the future.

Spark in the Cloud

Why Cloudera for Spark in the Cloud?
Rely on the most portable, cost-effective, cloud-ready data platform
Flexible Deployment: No vendor lock-in; Multi-cloud and on-prem; Transient and long-running clusters; Flexible cluster topologies
Flexible Pricing: Pay-as-you-go cloud usage; Traditional node-based licensing; Spot instance support; Grow/shrink clusters
Integrated Data Platform: Build end-to-end data apps; Ingest, process, explore, model, analyze, serve; Common security, governance, metadata, management
Cloud-Native: Direct Spark I/O from S3; Data/metadata persistence across cluster lifecycles; Fast self-service clusters; Single pane of glass for multi-cluster view

Data Engineering and Data Science
Two Common Workload Patterns
Batch Processing / ETL (also: Testing Environments)
Only pay for what you need, when you need it
Transient clusters; Single user; Sized to demand; Object storage centric; Cloud-native deployment
Exploratory Data Science (also: Development Environments)
Explore and analyze all data, wherever it lives, on demand
Transient or persistent; Single or multi-user; Elastic workload; HDFS or object storage; Lift-and-shift or cloud-native deployment

Spark in the Cloud
Sample Architecture
Kafka / Spark Streaming on permanent clusters, for streaming data ingest and processing
Spark batch jobs on transient clusters, for processing or machine learning, directly read/write to the object store
Interactive Spark or Impala for exploratory data science on permanent or transient clusters, directly read/write to the object store
Serving tier (e.g. HBase, Search) on permanent clusters, serving data to end applications
[Diagram labels: HBase, Search, Model Server, etc.; Object Store]

Spark 2.0
What’s New?

New unified API: Dataset API
Datasets
RDDs: Object oriented; Functional operators (map, reduceByKey, cogroup, etc.); Compile-time type safety
DataFrames: Structured; Compact binary representation; Query optimizer; Sort/shuffle without deserialization

Continued Innovation: Structured Streaming
Spark Streaming 2.0
Streams modeled as continuous DataFrames
SQL-like syntax to author stream processing
Open stream processing to a wider audience
With a wide array of built-in aggregation and statistical functions
Easier end-to-end exactly-once semantics
Out-of-order data handling
Increased performance
Growing array of streaming ML functionality
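
One way to picture out-of-order data handling is to group events by the time they occurred rather than the time they arrived. The following plain-Python sketch illustrates that idea only; it is not the Structured Streaming API, and the function name count_by_event_window and the sample events are hypothetical. A real engine would additionally bound how late an event may arrive.

```python
from collections import defaultdict


def count_by_event_window(events, window_seconds):
    """Count events per tumbling event-time window, whatever order they arrive in."""
    counts = defaultdict(int)
    for event_time, _payload in events:
        # Assign the event to the window that contains its event time.
        window_start = (event_time // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)


# Hypothetical arrivals: event times (seconds) are not in order.
arrivals = [(12, "click"), (3, "view"), (14, "click"), (7, "view")]
window_totals = count_by_event_window(arrivals, window_seconds=10)
```

Because the window is derived from the event time, late arrivals such as the events at 3 and 7 seconds still land in the first window.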

Continued Innovation: Machine Learning Persistence
Save and Load Models
Save and Load Pipelines
[Pipeline diagram stages: Bag of words; Tokenize; TF-IDF; LDA; Scale & Normalize Features; Train Classifier]
*Sequence is repeated during training and scoring
**Hyper-parameter tuning: repeat the sequence with different parameter values
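
The point of pipeline persistence is that the exact feature sequence fitted during training can be reloaded and replayed at scoring time. The sketch below illustrates this in plain Python with a tiny tokenize and TF-IDF pipeline saved as JSON. It is not the MLlib API; the class TfIdfPipeline, the smoothed IDF formula, the file name, and the sample documents are all assumptions made for illustration.

```python
import json
import math
import os
import tempfile


class TfIdfPipeline:
    """Tokenize -> term counts -> TF-IDF, with fitted state that can be saved."""

    def __init__(self, idf=None):
        self.idf = idf or {}

    @staticmethod
    def tokenize(text):
        return text.lower().split()

    def fit(self, documents):
        """Learn a smoothed inverse document frequency per term."""
        n_docs = len(documents)
        doc_freq = {}
        for doc in documents:
            for term in set(self.tokenize(doc)):
                doc_freq[term] = doc_freq.get(term, 0) + 1
        self.idf = {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}
        return self

    def transform(self, text):
        """Apply the same stages used in training to one new document."""
        counts = {}
        for term in self.tokenize(text):
            if term in self.idf:  # ignore terms never seen during fit
                counts[term] = counts.get(term, 0) + 1
        return {t: c * self.idf[t] for t, c in counts.items()}

    def save(self, path):
        with open(path, "w") as f:
            json.dump({"idf": self.idf}, f)

    @classmethod
    def load(cls, path):
        with open(path) as f:
            return cls(idf=json.load(f)["idf"])


training_docs = ["spark makes etl fast", "spark streaming handles events"]  # hypothetical
fitted = TfIdfPipeline().fit(training_docs)
model_path = os.path.join(tempfile.mkdtemp(), "pipeline.json")
fitted.save(model_path)
restored = TfIdfPipeline.load(model_path)
```

Scoring with the restored pipeline yields the same features as the original fitted object, so training and scoring jobs can run in separate processes or clusters.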

How do I get Spark 2.0?
Download our parcel: l
Read more: rk-2-0-0-beta-now-available-for-cdh

Recommended Training for Spark Users
Apache Spark Developer Training: Cloudera University’s three-day Spark course enables participants to build complete, unified big data applications.
Introduction to Machine Learning: The course provides an introduction to machine learning, including coverage of collaborative filtering, clustering, classification, algorithms, and data volume.
Data Science at Scale with Spark and Hadoop: Spark and Hadoop are transforming how data scientists work by allowing interactive and iterative data analysis at scale.

Thank You
