Transcription
Harnessing Big Data with SparkLawrence SpracklenAlpine Data
Alpine Data2
Map Reduce Allows the distribution of large data computationsacross a cluster Map()Reduce()BigDataOutput Computations typically composed of a sequence ofMR operations3
MR Performance Multiple disk interactions required in EACH MRoperationMap4Reduce
Performance Hierarchy0.10GB/s0.10GB/s0.60GB/s100X Read Bandwidth580GB/s
Optimizing MR Many companies have significant legacy MRcode– Either direct MR or indirect usage via Pig A variety of techniques to accelerate MR– Apache Tez– Tachyon or Apache ignite– System ML6
Spark Several significant advancements over MR– Generalizes two stage MR into arbitrary DAGs– Enables in-memory dataset caching– Improved usability Reduced disk read/writes delivers significantspeedups– Especially for iterative algorithms like ML7
Perf k-vs-hadoop-mapreduce8
Spark Tuning Increased reliance on memory introducesgreater requirement for tuning Need to understand memory requirements forcaching Significant performance benefits associatedwith “getting it right” Auto-tuning is coming .9
Optimization opportunities Spark delivers improved ML performance usingreduced cluster resources Enables numerous opportunities––––10Reduced time to insightsReduced cluster sizeEliminate subsamplingAutoML
AutoML Data sets increasingly large and complex Increasing difficult to intuitively “know” optimal– Feature engineering– Choice of algorithm– Optimize parameterization of algorithm(s) Significant manual trial-and-error Cult of the algorithm11
Feature Engineering Essential for model performance, efficacy,robustness and simplicity––––Feature extractionFeature selectionFeature constructionFeature elimination Domain/dataset knowledge is important, butbasic automation feasible12
Algorithm selection Select dependent column Indicate classification or regression Press “go”Algorithms run in parallel across clusterMinimally provides good starting pointSignificantly reduces “busy work”13
Hyperparameter optimization Are the default parameters optimal? How do I adjust intelligently– Number of trees? Depth of trees? Splittingcriteria? Tedious trial and error Overfitting danger Intelligent automatic search14
Algorithm tuning Gradient boosted tree parameterization e.g.––––––15# of treesMaximum tree depthLoss functionMinimum node split sizeBagging rateShrinkage
AutoMLDataSetFeatureengineering1)Investigate NML algorithms2) Tune g#3Alg#N162) FeatureeliminationAlg#NAlg#N
Spark is for large datasetsRun time If your data fits on a single node entations/ Other high-performance options exist17
Data set size Large data lakes canconsist of many smallfiles Memory per nodeincreasing ig-data-size-datasets.html18
NVDIMMS Driving significant increases in node memory– Up to 10X increase in density Coming in late 2016 19
Hybrid operators Time consuming to maintain multiple MLlibraries & manually determine optimal choice Develop hybrid implementations thatautomatically choose optimal approach– Data set size– Cluster size– Cluster utilization20
Single-node performance 21
Single-node performance 22
Operationalization What happens after the models are created? How does the business benefit from theinsights? Operationalization is frequently the weak link– Operationalizing PowerPoint?– Hand rolled scoring flows23
PFA Portable Format for Analytics (PFA) Successor to PMML Significant flexibility in encapsulating complexdata preprocessing24
Conclusions Spark delivers significant performanceimprovements over MR– Can introduce more tuning requirements Provides an opportunity for AutoML– Automatically determine good solutions Understand when its appropriate Don’t forget about about operationalization25
Harnessing Big Data with Spark Lawrence Spracklen Alpine Data . 2 Alpine Data . 3 Map Reduce Allows the distribution of large data computations . Portable Format for Analytics (PFA) Successor to PMML Significant flexibility in encapsulating complex data preprocessing . 25