Transcription
cuML: A Library for GPU Accelerated Machine Learning
Onur Yilmaz, Ph.D. (oyilmaz@nvidia.com), Senior ML/DL Scientist and Engineer
Corey Nolet (cnolet@nvidia.com), Data Scientist and Senior Engineer
About Us
Onur Yilmaz, Ph.D.
Senior ML/DL Scientist and Engineer on the RAPIDS cuML team at NVIDIA. Focuses on building single and multi-GPU machine learning algorithms to support extreme data loads at light speed. Ph.D. in computer engineering, focusing on ML for finance.
Corey Nolet
Data Scientist & Senior Engineer on the RAPIDS cuML team at NVIDIA. Focuses on building and scaling machine learning algorithms to support extreme data loads at light speed. Over a decade of experience building massive-scale exploratory data science and real-time analytics platforms for HPC environments in the defense industry. Working toward a Ph.D. in computer science, focused on unsupervised representation learning.
Agenda
Introduction to cuML
Architecture Overview
cuML Deep Dive
Benchmarks
cuML Roadmap
Introduction
"Details are confusing. It is only by selection, by elimination, by emphasis, that we get to the real meaning of things." Georgia O'Keeffe, Mother of American Modernism
Realities of Data
Problem
Data sizes continue to grow. The goal: minimize both bias and variance.
Massive Dataset -> Histograms / Distributions -> Dimension Reduction -> Feature Selection -> Remove Outliers -> Sampling
Better to start with as much data as possible and explore / preprocess to scale to performance needs.
Iterate. Cross Validate & Grid Search. Iterate some more.
Meet a reasonable speed vs. accuracy tradeoff.
Time increases as data grows. Hours? Days?
ML Workflow Stifles Innovation
It requires exploration and iterations: Manage Data (All Data -> ETL -> Structured Data) -> Training -> Tuning & Selection -> Inference.
Iterate. Cross Validate. Grid Search. Iterate some more.
Accelerating just model training has benefit but doesn't address the whole problem. End-to-end acceleration is needed.
Architecture
"More data requires better approaches!" Xavier Amatriain, CTO, Curai
RAPIDS: Open GPU Data Science
cuDF, cuML, and cuGraph mimic well-known libraries: cuDF for data preparation (Pandas-like) and cuML for model training (Scikit-Learn-like), all built on Apache Arrow.
High-Level APIs
Python: Dask-cuML (Dask multi-GPU ML) and cuML (Scikit-Learn-like)
libcuml: CUDA/C ML algorithms and ML primitives
Multi-node & multi-GPU communications across hosts, each host holding multiple GPUs
cuML API
GPU-accelerated machine learning at every layer:
Python: Scikit-learn-like interface for data scientists, utilizing cuDF & NumPy.
Algorithms: CUDA C API for developers to utilize accelerated machine learning algorithms.
Primitives: reusable building blocks for composing machine learning algorithms.
Primitives
GPU-accelerated math optimized for feature matrices.
Linear Algebra: element-wise operations, matrix multiply, norms, eigendecomposition, SVD/RSVD, transpose, QR decomposition
Also: Statistics, Matrix / Math, Random, Distance / Metrics, Objective Functions, Sparse Conversions. More to come!
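To show how such primitives compose into a full algorithm, here is a CPU sketch of PCA built from the listed building blocks, with NumPy standing in for the GPU primitives (`pca_from_prims` is an illustrative name, not a cuML function):

```python
import numpy as np

def pca_from_prims(X, n_components):
    # Stats + element-wise primitives: center each column
    Xc = X - X.mean(axis=0)
    # Transpose + matrix-multiply primitives: covariance matrix
    cov = Xc.T @ Xc / (X.shape[0] - 1)
    # Eigendecomposition primitive
    vals, vecs = np.linalg.eigh(cov)
    # Keep the leading components and project
    top = np.argsort(vals)[::-1][:n_components]
    return Xc @ vecs[:, top]

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 7.0]])
print(pca_from_prims(X, 1).shape)  # (3, 1)
```

The same decomposition-then-project structure, with each step swapped for its GPU primitive, is what lets cuML share building blocks across PCA, tSVD, and related algorithms.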
Algorithms
GPU-accelerated Scikit-Learn.
Classification / Regression: Decision Trees / Random Forests, Linear Regression, Logistic Regression, K-Nearest Neighbors
Statistical Inference: Kalman Filtering, Bayesian Inference, Gaussian Mixture Models, Hidden Markov Models
Clustering: K-Means, DBSCAN, Spectral Clustering
Decomposition & Dimensionality Reduction: Principal Components, Singular Value Decomposition, UMAP, Spectral Embedding
Timeseries Forecasting: ARIMA, Holt-Winters
Model Selection: Cross Validation, Hyper-parameter Tuning
Recommendations: Implicit Matrix Factorization
More to come!
High-Level APIs
Python: Dask multi-GPU ML (data distribution) and the Scikit-Learn-like API
CUDA/C ML algorithms (model parallelism) and ML primitives
Multi-node / multi-GPU communications across hosts
Goals: portability, efficiency, speed
Dask cuML
Distributed data-parallelism layer:
Distributed computation scheduler for Python
Scales up and out
Distributes data across processes
Enables model-parallel cuML algorithms
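Dask distributes partitions of a dataset across workers, each of which computes a partial result that a final step combines. A CPU-only sketch of that data-parallel pattern for OLS, using the standard library in place of Dask (the function names here are illustrative, not the Dask-cuML API):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def partial_sums(chunk):
    # Each worker computes sufficient statistics for its partition
    A, b = chunk
    return A.T @ A, A.T @ b

def distributed_ols(chunks):
    # Map: compute partial A^T A and A^T b per partition in parallel
    with ThreadPoolExecutor() as ex:
        parts = list(ex.map(partial_sums, chunks))
    # Reduce: sum the partials, then solve once on a single worker
    AtA = sum(p[0] for p in parts)
    Atb = sum(p[1] for p in parts)
    return np.linalg.solve(AtA, Atb)

rng = np.random.default_rng(1)
A = rng.normal(size=(200, 2))
x_true = np.array([1.5, -2.0])
b = A @ x_true
chunks = [(A[i:i + 50], b[i:i + 50]) for i in range(0, 200, 50)]
print(np.allclose(distributed_ols(chunks), x_true))  # True
```

Dask-cuML follows the same map-then-combine shape, except the partitions live in GPU memory and the partial computations run on cuML's CUDA kernels.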
ML Technology Stack
Python -> Cython -> cuML Algorithms -> cuML Prims -> CUDA Libraries (cuRand, cuBlas) -> CUDA, with Dask cuML sitting alongside the Python layer for distributed execution.
cuML Deep Dive
"I would posit that every scientist is a data scientist." Arun Subramaniyan, V.P. of Data Science & Analytics, Baker Hughes, a GE Company
Linear Regression (OLS)
Python layer: the cuML call pattern mirrors Scikit-Learn, with cuDF DataFrames taking the place of Pandas DataFrames.
cuML Algorithms, CUDA C layer: the Python layer dispatches through Cython to the CUDA/C implementation.
cuML ML-Prims, CUDA C layer: the algorithm is composed from primitives, solving the linear system with feature matrix A (columns c1 ... cN) and target vector b.
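The ML-prims step above, solving the OLS system A x = b with a decomposition primitive, can be illustrated in NumPy (`ols_fit_svd` is a stand-in name, not a cuML symbol, and SVD is only one of the decomposition strategies an OLS solver might use):

```python
import numpy as np

def ols_fit_svd(A, b):
    """Solve min ||A x - b|| via the SVD primitive:
    x = V diag(1/s) U^T b (the pseudo-inverse applied to b)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return Vt.T @ ((U.T @ b) / s)

# Tiny example: recover known coefficients from noiseless data
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))
true_x = np.array([2.0, -1.0, 0.5])
b = A @ true_x
print(np.allclose(ols_fit_svd(A, b), true_x))  # True
```

In cuML the same steps run as GPU primitives, and the Python layer exposes only the fit/predict interface around them.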
Benchmarks
Algorithms
Benchmarked on DGX-1
UMAP
Released in 0.6!
cuDF XGBoost
DGX-2 vs. scale-out CPU cluster: full end-to-end pipeline, leveraging Dask cuDF. Each GPU's results are stored in system memory and then read back in; Arrow to DMatrix (CSR) for XGBoost.
Scale-out GPU cluster (5x DGX-1) vs. DGX-2: same pipeline, leveraging Dask for multi-node cuDF. (Chart: time in seconds for the ETL CSV, ML Prep, and ML stages.)
Fully in-GPU benchmarks: same pipeline with no data-prep time, everything in memory.
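Each of these pipelines converts Arrow-format columnar data into a CSR DMatrix before handing it to XGBoost. A dependency-free sketch of the dense-to-CSR layout (illustrative only; the real conversion happens in the Arrow and XGBoost C++ layers):

```python
def dense_to_csr(rows):
    """Convert a dense row-major matrix to CSR triples:
    non-zero values, their column indices, and row pointers."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        for j, v in enumerate(row):
            if v != 0:
                data.append(v)
                indices.append(j)
        # indptr[i+1] marks where row i's non-zeros end
        indptr.append(len(data))
    return data, indices, indptr

print(dense_to_csr([[0, 3], [4, 0]]))  # ([3, 4], [1, 0], [0, 1, 2])
```

CSR keeps only non-zero entries, which is why it is the natural handoff format for the sparse feature matrices XGBoost consumes.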
XGBoost
Multi-node, multi-GPU performance. Benchmark: 200 GB CSV dataset; data preparation includes joins and variable transformations.
CPU cluster configuration: CPU nodes (61 GiB of memory, 8 vCPUs, 64-bit platform), Apache Spark.
DGX cluster configuration: DGX nodes on an InfiniBand network.
(Chart, runtime in seconds: 20 CPU nodes: 2290 / 1956; 50 CPU nodes: 1999; 100 CPU nodes: 1948; 5x DGX-1: 169; DGX-2: 157.)
Single Node Multi-GPU
Will be released in 0.6.
Linear Regression: reduction from 40 min to 1 min (225 GB, DGX-2)
tSVD: reduction from 1.6 hr to 1.5 min (220 GB, DGX-2)
Nearest Neighbors: reduction from 4 hr to 30 sec (128 GB, DGX-1)
Roadmap
"Data science is the fourth pillar of the scientific method!" Jensen Huang
cuML
Single GPU and XGBoost. Algorithm matrix (SG = single GPU, MG = multi-GPU, MNG = multi-node multi-GPU): Gradient Boosted Decision Trees (GBDT), GLM, Logistic Regression, Random Forest (regression), K-Means, K-NN, DBSCAN, UMAP, ARIMA, Kalman Filter, Holt-Winters, Principal Components, Singular Value Decomposition.
Dask-cuML
OLS, tSVD, and KNN in RAPIDS 0.6: the same algorithm matrix, with OLS, tSVD, and K-NN gaining multi-GPU support via Dask.
Dask-cuML
K-Means*, DBSCAN & PCA in RAPIDS 0.7/0.8.
*Deprecating the current K-Means in 0.6 for a new K-Means built on ML-Prims.
cuML 0.6
Will be released with RAPIDS 0.6 on Friday!
New algorithms: Stochastic Gradient Descent [single GPU], UMAP [single GPU], Linear Regression (OLS) [single node, multi-GPU], Truncated SVD [single node, multi-GPU]
Notable improvements: exposing support for hyperparameter tuning, removing the external requirement on FAISS, lowered Nearest Neighbors memory requirement
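As a rough illustration of the per-sample update behind the new Stochastic Gradient Descent algorithm, here is a minimal NumPy sketch for least squares (the `sgd_linear` name and hyperparameters are illustrative, not cuML's API):

```python
import numpy as np

def sgd_linear(X, y, lr=0.05, epochs=200, seed=0):
    """Fit linear least squares with SGD: one gradient
    step per sample, in a shuffled order each epoch."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            # Gradient of 0.5 * (x_i . w - y_i)^2 w.r.t. w
            w -= lr * (X[i] @ w - y[i]) * X[i]
    return w

# Noiseless data, so SGD can recover the exact coefficients
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
w_true = np.array([3.0, -2.0])
y = X @ w_true
print(np.allclose(sgd_linear(X, y), w_true, atol=1e-3))  # True
```

The appeal of SGD on GPUs is that the per-sample (or per-minibatch) updates keep memory use flat while the heavy arithmetic stays in dense vector operations.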
Thank you!
Corey Nolet: @cjnolet
Onur Yilmaz
https://github.com/dask-cuml