cuML: A Library for GPU Accelerated Machine Learning

Transcription

cuML: A Library for GPU Accelerated Machine Learning
Onur Yilmaz, Ph.D. (oyilmaz@nvidia.com), Senior ML/DL Scientist and Engineer
Corey Nolet (cnolet@nvidia.com), Data Scientist and Senior Engineer

About Us
Onur Yilmaz, Ph.D.
Senior ML/DL Scientist and Engineer on the RAPIDS cuML team at NVIDIA
Focuses on building single- and multi-GPU machine learning algorithms to support extreme data loads at light-speed
Ph.D. in computer engineering, focusing on ML for finance
Corey Nolet
Data Scientist & Senior Engineer on the RAPIDS cuML team at NVIDIA
Focuses on building and scaling machine learning algorithms to support extreme data loads at light-speed
Over a decade of experience building massive-scale exploratory data science & real-time analytics platforms for HPC environments in the defense industry
Working towards a Ph.D. in Computer Science, focused on unsupervised representation learning

Agenda
Introduction to cuML
Architecture Overview
cuML Deep Dive
Benchmarks
cuML Roadmap

Introduction
“Details are confusing. It is only by selection, by elimination, by emphasis, that we get to the real meaning of things.” Georgia O'Keeffe, Mother of American Modernism

Realities of Data

Problem
Data sizes continue to grow.
Better to start with as much data as possible and explore / preprocess to scale to performance needs: histograms / distributions, dimension reduction, feature selection, outlier removal, sampling (trading min(variance) against min(bias)).
Iterate. Cross validate & grid search. Iterate some more. Meet a reasonable speed vs. accuracy tradeoff.
On a massive dataset, time increases. Hours? Days?

ML Workflow Stifles Innovation
It requires exploration and iterations.
Manage Data (All Data → ETL → Structured Data) → Training (Model Training → Tuning & Selection) → Inference
Iterate. Cross validate. Grid search. Iterate some more.
Accelerating just model training does have benefit but doesn't address the whole problem. End-to-end acceleration is needed.

Architecture
“More data requires better approaches!” Xavier Amatriain, CTO, Curai

RAPIDS: Open GPU Data Science
cuDF, cuML, and cuGraph mimic well-known libraries.
[Diagram: data preparation and model training libraries built on Apache Arrow, with cuML providing the Scikit-Learn-like layer]

High-Level APIs
[Stack diagram: Dask-cuML (Python, Dask multi-GPU ML) → cuML (Scikit-Learn-like) → libcuml (CUDA/C ML algorithms, ML primitives) → multi-node & multi-GPU communications spanning Host 1 and Host 2, each with GPUs 1-4]

cuML API
GPU-accelerated machine learning at every layer.
Python: Scikit-Learn-like interface for data scientists, utilizing cuDF & NumPy.
Algorithms: CUDA C API for developers to utilize accelerated machine learning algorithms.
Primitives: Reusable building blocks for composing machine learning algorithms.
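As a rough illustration of the Python layer, a cuML estimator can be dropped in where a Scikit-Learn one would go, operating on GPU-resident cuDF data. A minimal sketch (the DataFrame contents and DBSCAN parameters here are illustrative, not from the talk):

```python
import cudf
from cuml import DBSCAN

# Build a small cuDF DataFrame directly in GPU memory
gdf = cudf.DataFrame({"x": [1.0, 1.1, 5.0, 5.1],
                      "y": [2.0, 2.1, 8.0, 8.1]})

# Same estimator API shape as scikit-learn's DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=2)
dbscan.fit(gdf)          # clustering runs on the GPU
print(dbscan.labels_)    # cluster assignments, as in scikit-learn
```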

Primitives
GPU-accelerated math optimized for feature matrices.
Linear algebra: element-wise operations, matrix multiply, norms, eigendecomposition, SVD/RSVD, transpose, QR decomposition.
Also: statistics, matrix / math, random, distance / metrics, objective functions, sparse conversions.
More to come!

Algorithms
GPU-accelerated Scikit-Learn.
Classification / Regression: Decision Trees / Random Forests, Linear Regression, Logistic Regression, K-Nearest Neighbors
Statistical Inference: Kalman Filtering, Bayesian Inference, Gaussian Mixture Models, Hidden Markov Models
Clustering: K-Means, DBSCAN, Spectral Clustering
Decomposition & Dimensionality Reduction: Principal Components, Singular Value Decomposition, UMAP, Spectral Embedding
Timeseries Forecasting: ARIMA, Holt-Winters
Recommendations: Implicit Matrix Factorization
Cross Validation, Hyper-parameter Tuning
More to come!

High-Level APIs
[The same stack, annotated: Dask provides data distribution at the Python layer; the CUDA/C ML algorithms provide model parallelism over the ML primitives and the multi-node / multi-GPU communications spanning hosts]
Portability. Efficiency. Speed.

Dask cuML
Distributed data-parallelism layer:
Distributed computation scheduler for Python
Scales up and out
Distributes data across processes
Enables model-parallel cuML algorithms
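A minimal sketch of how this layer is typically stood up, assuming the dask_cuda and dask_cudf packages are available; the file name and the commented estimator import path are illustrative rather than taken from the slides:

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask_cudf

# One Dask worker per GPU on this host; Dask schedules work across them
cluster = LocalCUDACluster()
client = Client(cluster)

# Partitions of the CSV are loaded into separate GPUs' memory
ddf = dask_cudf.read_csv("data.csv")

# A model-parallel cuML estimator (e.g. the 0.6-era Dask-cuML OLS)
# would then fit against the distributed frame, for example:
# from dask_cuml.linear_model import LinearRegression  # path is illustrative
# model = LinearRegression().fit(ddf[["f0", "f1"]], ddf["target"])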

ML Technology Stack
[Stack diagram, top to bottom: Dask cuML; Python; Cython; cuML Algorithms; cuML Prims; CUDA Libraries (e.g., cuRAND, cuBLAS); CUDA]

cuML Deep Dive
“I would posit that every scientist is a data scientist.” Arun Subramaniyan, V.P. of Data Science & Analytics, Baker Hughes, a GE Company

Linear Regression (OLS)
Python layer: the walkthrough swaps Pandas for cuDF to load data on the GPU, then swaps Scikit-Learn for cuML to fit the model; the user-facing code is nearly identical.
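The swap the slides walk through can be sketched as follows; the file and column names are placeholders, and the 'eig' / 'svd' solver options reflect cuML's documented OLS algorithms:

```python
import cudf
from cuml.linear_model import LinearRegression

gdf = cudf.read_csv("train.csv")        # loaded straight into GPU memory
X = gdf[["feature_0", "feature_1"]]
y = gdf["target"]

# 'eig' solves via eigendecomposition of the covariance matrix;
# 'svd' is slower but more numerically stable
ols = LinearRegression(fit_intercept=True, algorithm="eig")
ols.fit(X, y)                           # training runs on the GPU
preds = ols.predict(X)
```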

Linear Regression (OLS)
CUDA C layer: the Python layer dispatches into cuML's CUDA/C algorithms, which are composed from the ML-Prims.
[Diagram: at the ML-Prims level, OLS operates on a feature matrix A with columns c1 ... cN and a target vector b]
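What the diagram is pointing at: at the prims level, OLS reduces to solving min_w ||Aw - b||^2 for the feature matrix A (columns c1 ... cN) and target vector b. A NumPy sketch, under the assumption that the GPU 'svd' path performs essentially this decomposition:

```python
import numpy as np

A = np.random.rand(100, 3)             # feature matrix, one column per feature
b = A @ np.array([2.0, -1.0, 0.5])     # synthetic target with known weights

# Thin SVD: A = U S V^T, so the least-squares solution is w = V S^+ U^T b
U, S, Vt = np.linalg.svd(A, full_matrices=False)
w = Vt.T @ ((U.T @ b) / S)             # apply the pseudo-inverse to b
print(w)                               # recovers [2.0, -1.0, 0.5]
```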

Benchmarks

Algorithms
Benchmarked on DGX-1.

UMAP
Released in 0.6!
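A minimal sketch of calling the new GPU UMAP, assuming cuML 0.6 or later; the random matrix and parameter values are illustrative, not benchmark settings:

```python
import numpy as np
from cuml import UMAP

# Random data standing in for a real feature matrix
X = np.random.rand(1000, 50).astype(np.float32)

# Same interface shape as the CPU umap-learn package
umap = UMAP(n_neighbors=15, n_components=2)
embedding = umap.fit_transform(X)   # 2-D embedding computed on the GPU
```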

cuDF + XGBoost
DGX-2 vs scale-out CPU cluster:
Full end-to-end pipeline
Leveraging Dask + cuDF
Store each GPU's results in system memory, then read back in
Arrow to DMatrix (CSR) for XGBoost

cuDF + XGBoost
Scale-out GPU cluster vs DGX-2:
Full end-to-end pipeline
Leveraging Dask for multi-node cuDF
Store each GPU's results in system memory, then read back in
Arrow to DMatrix (CSR) for XGBoost
[Chart: runtime in seconds for the ETL CSV, ML Prep, and ML stages, comparing 5x DGX-1 vs DGX-2]

cuDF + XGBoost
Fully in-GPU benchmarks:
Full end-to-end pipeline
Leveraging Dask + cuDF
No data-prep time; all in memory
Arrow to DMatrix (CSR) for XGBoost
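The shape of the benchmarked pipeline, sketched for a single GPU; the file and column names are placeholders, and the host-memory round trip mirrors the "store results in system memory" step above (direct cuDF-to-DMatrix handoff came in later XGBoost releases):

```python
import cudf
import xgboost as xgb

gdf = cudf.read_csv("train.csv")           # ETL happens on the GPU
gdf["ratio"] = gdf["a"] / gdf["b"]         # example feature preparation

# Hand the prepared columns to XGBoost via host memory
X = gdf.drop(columns=["target"]).to_pandas().values
y = gdf["target"].to_pandas().values
dtrain = xgb.DMatrix(X, label=y)

# gpu_hist was the GPU tree method of this era
params = {"tree_method": "gpu_hist", "objective": "reg:squarederror"}
model = xgb.train(params, dtrain, num_boost_round=100)
```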

XGBoost
Multi-node, multi-GPU performance benchmark: 200GB CSV dataset; data preparation includes joins and variable transformations.
CPU cluster configuration: CPU nodes (61 GiB of memory, 8 vCPUs, 64-bit platform), Apache Spark.
DGX cluster configuration: DGX nodes on an InfiniBand network.
[Chart: end-to-end time in seconds. CPU clusters: 20 nodes 2290, 30 nodes 1956, 50 nodes 1999, 100 nodes 1948; GPU configurations (5x DGX-1, DGX-2): roughly 157-169]

Single Node Multi-GPU
Will be released in 0.6:
Linear Regression: 40 min reduced to 1 min; size: 225 GB; system: DGX-2
tSVD: 1.6 hr reduced to 1.5 min; size: 220 GB; system: DGX-2
Nearest Neighbors: 4 hr reduced to 30 sec; size: 128 GB; system: DGX-1

Roadmap
“Data science is the fourth pillar of the scientific method!” Jensen Huang

cuML
Single GPU and XGBoost
[Roadmap matrix of algorithms against SG (single-GPU), MG (multi-GPU), and MNG (multi-node multi-GPU) support: Gradient Boosted Decision Trees (GBDT), GLM, Logistic Regression, Random Forest (regression), K-Means, K-NN, DBSCAN, UMAP, ARIMA, Kalman Filter, Holt-Winters, Principal Components, Singular Value Decomposition]

Dask-cuML
OLS, tSVD, and KNN in RAPIDS 0.6
[Same roadmap matrix, with the multi-GPU OLS, tSVD, and KNN entries landing in 0.6]

Dask-cuML
K-Means*, DBSCAN & PCA in RAPIDS 0.7/0.8
*Deprecating the current K-Means in 0.6 for a new K-Means built on ML-Prims.
[Same roadmap matrix, with the entries planned for 0.7/0.8]

cuML 0.6
Will be released with RAPIDS 0.6 on Friday!
New algorithms:
Stochastic Gradient Descent [Single GPU]
UMAP [Single GPU]
Linear Regression (OLS) [Single Node, Multi-GPU]
Truncated SVD [Single Node, Multi-GPU]
Notable improvements:
Exposing support for hyperparameter tuning
Removing external requirement on FAISS
Lowered Nearest Neighbors memory requirement

Thank you!
Corey Nolet: @cjnolet
Onur Yilmaz
https://github.com/rapidsai/cuml
https://github.com/rapidsai/dask-cuml
