Harnessing Big Data With Spark

Transcription

Harnessing Big Data with Spark
Lawrence Spracklen, Alpine Data


Map Reduce
• Allows the distribution of large data computations across a cluster
• Computations typically composed of a sequence of MR operations
[Diagram: Big Data → Map() → Reduce() → Output]
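
The canonical MR computation is word count: the map stage emits (word, 1) pairs and the reduce stage sums the counts per word. Below is a minimal sketch of that pattern, written with Spark's RDD API since that is the engine this deck converges on; the HDFS paths are placeholders, not paths from the talk:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("wordcount"))

    sc.textFile("hdfs:///data/input")        // placeholder input path
      .flatMap(_.split("\\s+"))              // map stage: tokenize each line
      .map(word => (word, 1))                //            emit (word, 1) pairs
      .reduceByKey(_ + _)                    // reduce stage: sum counts per word
      .saveAsTextFile("hdfs:///data/output") // placeholder output path

    sc.stop()
  }
}
```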

MR Performance
• Multiple disk interactions required in EACH MR operation
[Diagram: disk reads and writes between the Map and Reduce stages]

Performance Hierarchy
[Chart: read bandwidth across the storage hierarchy, from roughly 0.10 GB/s, 0.10 GB/s, and 0.60 GB/s for the slower tiers up to about 80 GB/s for memory, a roughly 100X spread in read bandwidth]

Optimizing MR
• Many companies have significant legacy MR code
– Either direct MR or indirect usage via Pig
• A variety of techniques to accelerate MR
– Apache Tez
– Tachyon or Apache Ignite
– SystemML

Spark
• Several significant advancements over MR
– Generalizes two-stage MR into arbitrary DAGs
– Enables in-memory dataset caching
– Improved usability
• Reduced disk reads/writes deliver significant speedups
– Especially for iterative algorithms like ML
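
A minimal sketch of why in-memory caching helps iterative algorithms: a toy one-dimensional least-squares gradient descent that scans the same dataset on every iteration. With cache(), the input is read from disk once and served from memory thereafter. The input path and toy gradient are illustrative assumptions, not code from the talk:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeCaching {
  // Toy least-squares gradient: d/dw (w*x - y)^2 = 2 * (w*x - y) * x
  def grad(w: Double, p: (Double, Double)): Double = 2.0 * (w * p._1 - p._2) * p._1

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("iterative-caching"))

    // Placeholder input: lines of "x,y" pairs
    val points = sc.textFile("hdfs:///data/points")
      .map { line => val a = line.split(','); (a(0).toDouble, a(1).toDouble) }
      .cache() // without this, every iteration below re-reads the file from disk

    val n = points.count() // materializes the cache
    var w = 0.0
    for (_ <- 1 to 100) {
      val g = points.map(p => grad(w, p)).reduce(_ + _) / n
      w -= 0.1 * g
    }
    println(s"fitted w = $w")
    sc.stop()
  }
}
```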

Performance
[Chart: Spark vs. Hadoop MapReduce run times]

Spark Tuning
• Increased reliance on memory introduces a greater requirement for tuning
• Need to understand memory requirements for caching
• Significant performance benefits associated with "getting it right"
• Auto-tuning is coming
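
For concreteness, a sketch of the main memory knobs involved. The values below are placeholders rather than recommendations, since the right settings depend on the cluster and the size of the cached working set; the config keys shown are those of the Spark 1.6+ unified memory manager:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object TuningSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("tuning-sketch")
      .set("spark.executor.memory", "8g")          // per-executor heap (placeholder value)
      .set("spark.memory.fraction", "0.6")         // unified execution/storage pool
      .set("spark.memory.storageFraction", "0.5")  // share of that pool protected for cached data
    val sc = new SparkContext(conf)

    // If the working set may not fit in memory, a serialized, spill-to-disk
    // storage level trades some CPU for fewer evictions and OOM failures
    val data = sc.textFile("hdfs:///data/input")   // placeholder path
    data.persist(StorageLevel.MEMORY_AND_DISK_SER)
    println(data.count())
    sc.stop()
  }
}
```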

Optimization opportunities
• Spark delivers improved ML performance using reduced cluster resources
• Enables numerous opportunities
– Reduced time to insights
– Reduced cluster size
– Eliminate subsampling
– AutoML

AutoML
• Data sets increasingly large and complex
• Increasingly difficult to intuitively "know" the optimal
– Feature engineering
– Choice of algorithm
– Parameterization of algorithm(s)
• Significant manual trial-and-error
• Cult of the algorithm

Feature Engineering
• Essential for model performance, efficacy, robustness and simplicity
– Feature extraction
– Feature selection
– Feature construction
– Feature elimination
• Domain/dataset knowledge is important, but basic automation feasible
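
One example of what "basic automation" can look like: scoring features against the label with a chi-squared test and keeping the top k, using spark.ml's ChiSqSelector. The tiny inline dataset and column names are hypothetical stand-ins:

```scala
import org.apache.spark.ml.feature.{ChiSqSelector, VectorAssembler}
import org.apache.spark.sql.SparkSession

object FeatureSelectSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("feature-select").getOrCreate()

    // Hypothetical dataset: three raw (categorical-coded) features plus a label
    val df = spark.createDataFrame(Seq(
      (1.0, 0.0, 3.0, 1.0),
      (0.0, 1.0, 2.0, 0.0),
      (1.0, 1.0, 0.0, 1.0)
    )).toDF("f1", "f2", "f3", "label")

    // Assemble the raw columns into a single vector column
    val assembled = new VectorAssembler()
      .setInputCols(Array("f1", "f2", "f3"))
      .setOutputCol("features")
      .transform(df)

    // Keep the top-k features ranked by a chi-squared test against the label
    val selector = new ChiSqSelector()
      .setNumTopFeatures(2)
      .setFeaturesCol("features")
      .setLabelCol("label")
      .setOutputCol("selected")

    selector.fit(assembled).transform(assembled).show()
    spark.stop()
  }
}
```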

Algorithm selection
• Select dependent column
• Indicate classification or regression
• Press "go"
• Algorithms run in parallel across the cluster
• Minimally provides a good starting point
• Significantly reduces "busy work"
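
A sketch of the idea behind "press go", not Alpine's actual implementation: fit several candidate algorithms on a training split, score each on held-out data, and report the best as a starting point. The libsvm input path is a placeholder:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{LogisticRegression, RandomForestClassifier}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.sql.SparkSession

object AlgoSelectSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("algo-select").getOrCreate()

    // Placeholder input with "features" and "label" columns
    val df = spark.read.format("libsvm").load("hdfs:///data/train.libsvm")
    val Array(train, test) = df.randomSplit(Array(0.8, 0.2), seed = 42)

    // Candidates wrapped in pipelines so they share a common type
    val candidates = Seq(
      "logistic regression" -> new Pipeline().setStages(Array(new LogisticRegression())),
      "random forest"       -> new Pipeline().setStages(Array(new RandomForestClassifier()))
    )

    val evaluator = new BinaryClassificationEvaluator().setMetricName("areaUnderROC")
    val scored = candidates.map { case (name, pipe) =>
      name -> evaluator.evaluate(pipe.fit(train).transform(test))
    }
    val (best, auc) = scored.maxBy(_._2)
    println(s"best starting point: $best (AUC = $auc)")
    spark.stop()
  }
}
```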

Hyperparameter optimization
• Are the default parameters optimal?
• How do I adjust intelligently?
– Number of trees? Depth of trees? Splitting criteria?
• Tedious trial and error
• Overfitting danger
• Intelligent automatic search

Algorithm tuning
• Gradient boosted tree parameterization, e.g.
– Number of trees
– Maximum tree depth
– Loss function
– Minimum node split size
– Bagging rate
– Shrinkage
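
These knobs map directly onto spark.ml's GBTClassifier parameters (maxIter for the number of trees, stepSize for shrinkage, subsamplingRate for the bagging rate, and so on), so the "intelligent automatic search" from the previous slide can be sketched with ParamGridBuilder plus cross-validation, which also guards against overfitting a single split. Grid values and the input path are placeholders:

```scala
import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SparkSession

object GbtTuneSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("gbt-tune").getOrCreate()
    val df = spark.read.format("libsvm").load("hdfs:///data/train.libsvm") // placeholder

    val gbt = new GBTClassifier().setLossType("logistic")  // loss function

    val grid = new ParamGridBuilder()
      .addGrid(gbt.maxIter, Array(20, 50, 100))            // number of trees
      .addGrid(gbt.maxDepth, Array(3, 5, 8))               // maximum tree depth
      .addGrid(gbt.minInstancesPerNode, Array(1, 10))      // minimum node split size
      .addGrid(gbt.subsamplingRate, Array(0.5, 1.0))       // bagging rate
      .addGrid(gbt.stepSize, Array(0.05, 0.1))             // shrinkage
      .build()

    // k-fold cross-validation scores every grid point on held-out folds
    val cv = new CrossValidator()
      .setEstimator(gbt)
      .setEvaluator(new MulticlassClassificationEvaluator().setMetricName("accuracy"))
      .setEstimatorParamMaps(grid)
      .setNumFolds(3)

    println(cv.fit(df).bestModel.extractParamMap())
    spark.stop()
  }
}
```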

AutoML
[Diagram: AutoML flow from the data set through feature engineering, 1) investigating N ML algorithms (Alg#1 … Alg#N), 2) tuning, and feature elimination]

Spark is for large datasets
[Chart: run time vs. data set size]
• If your data fits on a single node, other high-performance options exist

Data set size
• Large data lakes can consist of many small files
• Memory per node increasing
[Chart: distribution of data set sizes]

NVDIMMs
• Driving significant increases in node memory
– Up to 10X increase in density
• Coming in late 2016

Hybrid operators
• Time consuming to maintain multiple ML libraries & manually determine the optimal choice
• Develop hybrid implementations that automatically choose the optimal approach
– Data set size
– Cluster size
– Cluster utilization
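
A sketch of the dispatch idea only; the row threshold, the core heuristic, and the stub trainers are hypothetical stand-ins, not Alpine's actual hybrid operators:

```scala
import org.apache.spark.sql.{DataFrame, Row}

object HybridOperator {
  // Hypothetical cutoff; a real operator would also weigh cluster size and utilization
  val singleNodeRowLimit = 1000000L

  // Stub trainers standing in for a single-node library and a distributed one
  def trainLocal(rows: Array[Row]): String = s"local model over ${rows.length} rows"
  def trainDistributed(df: DataFrame): String = "distributed model"

  def train(df: DataFrame, freeCores: Int): String = {
    val n = df.count()
    if (n < singleNodeRowLimit || freeCores < 8)
      trainLocal(df.collect())  // small data or busy cluster: single node avoids overheads
    else
      trainDistributed(df)      // large data and available cores: scale out
  }
}
```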

Single-node performance
[Chart: single-node performance comparison]

Single-node performance
[Chart: additional single-node performance comparison]

Operationalization
• What happens after the models are created?
• How does the business benefit from the insights?
• Operationalization is frequently the weak link
– Operationalizing PowerPoint?
– Hand-rolled scoring flows

PFA
• Portable Format for Analytics (PFA)
• Successor to PMML
• Significant flexibility in encapsulating complex data preprocessing
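
For a sense of the format, a minimal PFA document in the spirit of the spec's "hello world": PFA scoring engines are plain JSON (or YAML) declaring input and output types plus an action expression, which is what makes preprocessing steps easy to bundle alongside the model itself:

```json
{
  "input": "double",
  "output": "double",
  "action": [
    {"+": ["input", 1]}
  ]
}
```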

Conclusions
• Spark delivers significant performance improvements over MR
– Can introduce more tuning requirements
• Provides an opportunity for AutoML
– Automatically determine good solutions
• Understand when it's appropriate
• Don't forget about operationalization
