End-to-End Big Data AI Pipeline On Ray And Apache Spark Using Analytics Zoo

Transcription

End-to-End Big Data AI Pipeline on Rayand Apache Spark using Analytics ZooJason DaiIntel FellowCVPR 2021 Tutorial

Agenda End-to-End Big Data AI Pipelines Analytics Zoo: Open Source Platform for Big Data AI Case Study Seamlessly Scaling out Big Data AI using Orca in Analytics ZooCVPR 2021 Tutorial2

Agenda End-to-End Big Data AI Pipelines Analytics Zoo: Open Source Platform for Big Data AI Case Study Seamlessly Scaling out Big Data AI using Orca in Analytics ZooCVPR 2021 Tutorial3

Open Source Big Data AI Projects at IntelDistributed deep learningframework for Apache rameML PipelineBig Data AI Platform(distributed TF, PyTorch, BigDL, Rayand OpenVINO on Apache -zooCVPR 2021 TutorialSQL SparkR Streaming GraphMLlibCoreAnalytics Zoo4

Unified Architecture for E2E AI PipelinesTraditional, Segregated Infrastructure Data Lake /WarehouseBig Data ClusterDeep Learning ClusterIntermediate FilesDistributedFile SystemUnified Architecture (using Analytics Zoo)Python API (Spark driver)Spark TaskSQL Data Lake File SystemYARN, K8s, CloudUnified Big Data AI ClusterCVPR 2021 Tutorial5

Distributed Deep Learning as Spark JobsWorkerDL App on DriverAnalyticsZooSpark(or BigDL) SparkProgram LibraryjobsSparkExecutorSpark TF / PyTorchTask / BigDL libWorkerWorkeroneDNNStandard Spark jobsWorkerSparkExecutorSpark TF / PyTorchTask / BigDL libWorkerWorkeroneDNN Standard Spark jobs Run distributed DL on existing, general-purpose Big Data clusters (Spark, Hadoop, K8s, Hosted, .) Seamless integration with Big Data ecosystem (Spark Dataframes & MLlib, Kafka, etc.) Iterative, data-parallel, synchronous SGD Each jobs runs a training iteration, each task runs the same model on a subset of the batch Efficient AllReduce built on top of existing Spark primitives* “BigDL: A Distributed Deep Learning Framework for Big Data”, ACM SoCC 2019, https://arxiv.org/abs/1804.05839CVPR 2021 Tutorial6

Agenda End-to-End Big Data AI Pipelines Analytics Zoo: Open Source Platform for Big Data AI Case Study Seamlessly Scaling out Big Data AI using Orca in Analytics ZooCVPR 2021 Tutorial7

Analytics Zoo Stack for Big Data AIChronosScalable AutoML for Time Series PredictionPPMLPrivacy Preserving Big Data Analytics & ML on SGXRayOnSpark Run Ray programs directly on Spark ClusterOrcaSeamless scale out TF, PyTorch, BigDL & OpenVINO on SparkBigDLDistributed deep learning library for SparkPowered by oneAPICVPR 2021 Tutorial8

BigDL: Distributed DL Framework for SparkKeras-like API and Spark ML Pipeline Support (Python and Scala APIs)#Keras-like API for BigDLmodel Sequential().add(InputLayer(inputShape Shape(10)) ompile( )#Spark Dataframe preprocessingtrainingDF spark.read.parquet("train data")validationDF spark.read.parquet("val data")#Spark ML Pipeline for BigDLscaler MinMaxScaler(inputCol "in", outputCol "value")estimator NNEstimator(model, CrossEntropyCriterion()) Epoch(epoch)pipeline Pipeline().setStages([scaler, estimator])pipelineModel pipeline.fit(trainingDF)predictions pipelineModel.transform(validationDF)Analytics Zoo API in blueCVPR 2021 Tutorial9

Orca: Distributed TF/PyTorch/BigDL on SparkWrite TensorFlow/PyTorch inline with Spark Program#PySpark DataFrametrain df sqlcontext.read.parquet( ).withColumn( )#TensorFlow Modelfrom tensorflow import keras model keras.Model(inputs [user, item], outputs outputs)model.compile(optimizer "adam",loss "sparse categorical crossentropy",metrics ['accuracy’])#Distributed training on Sparkfrom zoo.orca.learn.tf.estimator import Estimatorest Estimator.from keras(keras model model)est.fit(train df, feature cols ['user', 'item’], label cols ['label’])Analytics Zoo API in blueCVPR 2021 Tutorial10

RayOnSpark: Run Ray Programs Directly on Sparkfrom zoo.orca import init orca contextsc init orca context(cluster mode "yarn",init ray on spark True).,YARN/K8s/Standalone/Hostedimport ray@ray.remoteclass Counter(object):def init (self):self.n 0def inc(self):self.n 1return self.ncounters [Counter.remote() for i in range(5)]print(ray.get([c.inc.remote() for c in counters]))Analytics Zoo API in blueCVPR 2021 Tutorial11

PPML: Privacy Preserving Big Data Analytics & ML on SGXApplicationsCloudFSIHLSTrusted Big Data AI on UntrustedCloud Analytics ZooTrustedBig Data AIPlatformSecurityTechnologyE2E Big Data AI PipelineSecure DataAccessTrusted ExecutionEnvironment (SGX) Secure Gradient &Parameter SyncLibOS(Graphene,Occlum)SecureAlignment Compute / memory protected bySGX enclaves Network protected by TLS andremote attestation Storage (e.g., data and model)protected by encryptionSecure Libraries(HE, DP, MPC, User request / response protectedCrypto, )by TLS and encryptionCVPR 2021 Tutorial12

Chronos: Scalable AutoML for Time Series Predictionsc init orca context(.,init ray on spark True)auto est AutoProphet(.)#auto est AutoXGBRegressor( )data get data()search space {"changepoint prior scale": .,"seasonality prior scale": .,.}rolling, scaling, featuregeneration, etc.FeatureTransformewith tunablerparametersCVPR 2021 Tutorialtrial jobstrialbest model/parameterswith tunable parametersauto est.fit(data data,search space search space,.)best model auto est.get best model()Analytics Zoo API in blueSearchEngineModelXGBoost, Prophet, TCN,Seq2Seq, etc.Each trial runs adifferent combination ofhyper parametersSearch spacePipelineconfigured with bestparameters/modeltrial trialtrialRay TuneRayOnSpark13

Agenda End-to-End Big Data AI Pipelines Analytics Zoo: Open Source Platform for Big Data AI Case Study Seamlessly Scaling out Big Data AI using Orca in Analytics ZooCVPR 2021 Tutorial14

Burger King’s Offer Recommendation System: DeepFlame BERT ResNET50 TxT*K-Means K-MeansClustering basedon customer’sbehavior datasuch as averagespend, primaryservice channel,average ticketGPM, and visitfrequency, zoo/zoo/models/recommendation/txt.pySource: “Offer Recommendation System with Apache Spark at Burger King”, Luyang Wang and Kai Huang, Data AI Summit 2021CVPR 2021 Tutorial15

DeepFlame Overview – Model TrainingA hybrid approach that allows SME to easily maintain and modify offer rules based on segmentationswhile still allowing DL models to automatically pick the best offers according to preset offer rules.TxTSource: “Offer Recommendation System with Apache Spark at Burger King”, Luyang Wang and Kai Huang, Data AI Summit 2021CVPR 2021 Tutorial16

Offer Recommendation System In Production Only a single cluster is needed for storing data,performing data analytics and distributedtraining.POJO-style API for real-time inference withlow latency.Offline Training (Yarn Cluster)Maintain training codeSQL, MLlibResNET50BERTTxTOnline Serving (Kubernetes Cluster)Model RegistryRequestLoad/update model t new clickstream dataSource: “Offer Recommendation System with Apache Spark at Burger King”, Luyang Wang and Kai Huang, Data AI Summit 2021CVPR 2021 Tutorial17

Case Study: Image Feature Extraction at JD.comImage FeatureExtraction:Applications:Similar Image SearchQueryImage DeduplicationSearch ResultSource: “Bringing deep learning into big data analytics using BigDL”, Xianyan Jia and Zhenhua Wang, Strata Data Conference Singapore 2017CVPR 2021 Tutorial18

3.83x Speedup of E2E Inference Pipeline at JD.com3.83x speedup for end-to-end inference running BigDL on Xeon (vs. Nvidia GPU severs)** t-jdcomFor more complete information about performance and benchmark results, visit www.intel.com/benchmarks.CVPR 2021 Tutorial19

Case Study: Time Series Based Network QualityPrediction in SK ove-network-qualityCVPR 2021 Tutorial20

Up-to 3x End-to-End Speedup at SK Telecom3x speedup for E2E inference runningAnalytics Zoo on Xeon*30 50% speedup for training throughput runningAnalytics Zoo on Xeon** work-qualityFor more complete information about performance and benchmark results, visit www.intel.com/benchmarks.CVPR 2021 Tutorial21

Agenda End-to-End Big Data AI Pipelines Analytics Zoo: Open Source Platform for Big Data AI Case Study Seamlessly Scaling out Big Data AI using Orca in Analytics ZooCVPR 2021 Tutorial22

Orca: Distributed TF/PyTorch/BigDL on SparkWrite TensorFlow/PyTorch inline with Spark programPyTorch Quickstart Seamless scaling-out of standard PyTorch Dataloader and model on distributed tmlCVPR 2021 Tutorial23

Orca: Distributed TF/PyTorch/BigDL on SparkWrite TensorFlow/PyTorch inline with Spark programTensorFlow 1.15 Quickstart Seamless scaling-out of standard TensorFlow Dataset and compute graph on distributed PR 2021 Tutorial24

Orca: Distributed TF/PyTorch/BigDL on SparkWrite TensorFlow/PyTorch inline with Spark programKeras Quickstart Seamless scaling-out of standard TensorFlow Dataset and Keras model on distributed lCVPR 2021 Tutorial25

Orca: Distributed TF/PyTorch/BigDL on SparkWrite TensorFlow/PyTorch inline with Spark programDistributed Pandas with XShards for Deep Learning XShards: Seamless scaling-out of existing Python codes in a distributed and data-parallel est/doc/UseCase/xshards-pandas.htmlCVPR 2021 Tutorial26

End-to-End Big Data AI Pipelines on OrcaSeamless Scaling-Out ofEnd-to-End AI PipelineComputerVisionPipelinesDATAI N G E S T I ONFEATUREE N G I N EER I N GMassive amount ofsmall (image) files ondistributed file systemDistributed (image)preprocessing andtransformationsTRAININGDistributed trainingCVPR 2021 TutorialI N F E R EN C EDistributed inference27

Organizing Massive Amount of Image Files forDistributed ClusterHeaderColumn 1 (ids) Conventional approach Directory of many small image files Inefficient for distributed storage system Orca libraryApacheParquetformatRows(Group 1)Column 2 (images)Column 3 (labels) Rows(Group 2) Footer Storing small image files as large file(s) in Apache Parquet format Directly read as TensorFlow Dataset or PyTorch Dataload in a distributed fashionfrom orca.data import image#Support common image formats (image directory, ImageNet, VOC, COCO, etc.)image.write parquet(format, path, )#Support TensorFlow Dataset, PyTorch DataLoader, etc.data image.read parquet(format, path, )CVPR 2021 Tutorial28

Distributed YOLO V3 Training#Init Orca Contextfrom zoo.orca import init orca context, stop orca contextinit orca context(cluster mode “k8s", .)#Prepare Datafrom orca.data import imageimage.write parquet(“voc”, input path, )#Process Datadef train data creator(config, batch size):train dataset image.read parquet("tf dataset", voc train path, )train dataset train dataset.shuffle(buffer size 512)train dataset train dataset.map( )train dataset train dataset.batch(batch size)return train datasetCVPR 2021 Tutorial29

Distributed YOLO V3 Training#Define TensorFlow modelfrom tensorflow import kerasdef model creator(config):model YoloV3(DEFAULT IMAGE SIZE, training True, classes 80).optimizer keras.optimizers.Adam(lr 1e-3)loss [YoloLoss(anchors[mask], classes options.class num)for mask in anchor masks]model.compile(optimizer optimizer, loss loss,run eagerly False)return model#Distributed Trainingtrainer Estimator.from keras(model creator model creator)trainer.fit(train data creator, epochs 3, .)stop orca context()CVPR 2021 Tutorial30

Summary Analytics Zoo: Software Platform for Big Data AI E2E Big Data AI pipeline (distributed TF/PyTorch/BigDL/OpenVINO on Spark & Ray) Advanced AI workflow (AutoML, Time-Series, PPML, etc.) Github Project repo: https://github.com/intel-analytics/analytics-zoo Use case: /Application/powered-by.html Technical paper/tutorials ACM SoCC 2019 paper: https://arxiv.org/abs/1804.05839 CVPR 2021 tutorial: https://jason-dai.github.io/cvpr2021/ AAAI 2019 tutorial: https://jason-dai.github.io/aaai2019/CVPR 2021 Tutorial31

Run distributed DL on existing, general-purpose Big Data clusters (Spark, Hadoop, K8s, Hosted, .) Seamless integration with Big Data ecosystem (Spark Dataframes & MLlib, Kafka, etc.) Iterative, data-parallel, synchronous SGD Each jobs runs a training iteration, each task runs the same model on a subset of the batch