Distributed Machine Learning Frameworks

Transcription

Distributed Machine Learning Frameworks
Amir H. Payberah
payberah@kth.se
2020-12-07

The Course Web Page: https://fid3024.github.io

Review of the Current Frameworks

TensorFlow (1/4)
- TensorFlow supports data parallelism and model partitioning (as of v0.8).
- As of v2.2, the Multi Worker Mirrored Strategy (allreduce) is integrated into TensorFlow for data parallelism (see the sketch below).
- Its update rule is synchronous, and it overlaps communication with computation.
- TensorFlow also has extensions to support different parallelization approaches.
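
The following is a minimal sketch of how tf.distribute.MultiWorkerMirroredStrategy is typically used; the cluster layout would normally come from a TF_CONFIG environment variable on each worker, and the model and data here are placeholders rather than anything from the slides.

```python
import tensorflow as tf

# Each worker reads its role from TF_CONFIG; without it, this falls back
# to a single-worker setup, which is enough to run the sketch locally.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Variables created inside the scope are mirrored on every worker.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="sgd",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# Synthetic data, only to make the sketch self-contained.
x = tf.random.normal((1024, 32))
y = tf.random.uniform((1024,), maxval=10, dtype=tf.int64)

# Gradients are averaged with allreduce after each step (synchronous updates).
model.fit(x, y, epochs=1, batch_size=64)
```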

TensorFlow (2/4)
- Mesh-TensorFlow is a language for distributed deep learning in TensorFlow.
- It is capable of specifying a broad class of distributed tensor computations.
- Mainly used for model parallelism in TensorFlow.
- A mesh is an n-dimensional array of processors, connected by a network.
- Each tensor is distributed across all processors in a mesh.

TensorFlow (3/4)
- GPipe is a pipeline-parallelism library implemented under Lingvo (a TensorFlow framework focusing on seq-to-seq models).
- It partitions operations in the forward and backward passes and allows data transfer between neighboring partitions.
[Huang et al., GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism, 2019]

TensorFlow (4/4)
- HyPar-Flow is an implementation of data, model, and hybrid parallelization on Eager TensorFlow.
- It requires only the strategy, the number of model partitions, and the number of model replicas from the user in order to exploit every possible intra-iteration parallelization.
[Awan et al., HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training with TensorFlow, 2020]

PyTorch (1/4)
- Caffe is a DL framework that does not support distributed training out of the box.
- There are many extensions of Caffe that add distributed training, either centralized or decentralized.
- FireCaffe and MPI-Caffe support data and model parallelism, respectively, on multi-GPU clusters.
- Intel-Caffe and NUMA-Caffe support data-parallel training on CPU-based clusters.
- S-Caffe is a CUDA-aware MPI runtime and Caffe for data parallelism on GPU clusters.

PyTorch (2/4)
- Chainer is a Define-by-Run (imperative) DL framework.
- It only supports data parallelism.
- It has a synchronous, decentralized design for allreduce communication.

PyTorch (3/4)
- PyTorch is a successor of Caffe2, which is inspired by Chainer.
- It is an imperative DL framework using dynamic computation graphs and automatic differentiation.
- PyTorch mainly focuses on ease of use and gives users options for training their models.
- PyTorch RPC was developed to support model parallelism (see the sketch below).
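
As a rough illustration of the RPC-based model-parallel idea, the toy sketch below places part of a computation on a second process with torch.distributed.rpc. It assumes two processes are started (one running each function) with MASTER_ADDR/MASTER_PORT set; the worker names, layer sizes, and remote_forward function are hypothetical, and a real model-parallel setup would additionally use RRefs and distributed autograd.

```python
import torch
import torch.distributed.rpc as rpc

# Runs on the remote worker; here it just applies a freshly created layer,
# so no trained state is involved in this toy example.
def remote_forward(x):
    layer = torch.nn.Linear(16, 4)
    return layer(x)

def run_worker0():
    # Rank 0 drives the computation; rank 1 only serves RPC requests.
    rpc.init_rpc("worker0", rank=0, world_size=2)
    x = torch.randn(8, 16)
    # Synchronously execute remote_forward on "worker1" and fetch the result.
    y = rpc.rpc_sync("worker1", remote_forward, args=(x,))
    print(y.shape)
    rpc.shutdown()

def run_worker1():
    rpc.init_rpc("worker1", rank=1, world_size=2)
    rpc.shutdown()  # blocks until all outstanding RPC work is done
```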

PyTorch (4/4)
- PyTorch Distributed Data Parallel (DDP) is an extra feature of PyTorch (available as of v1.5); basic usage is sketched below.
- PyTorch DDP uses several techniques to increase performance, such as:
  - Gradient bucketing (small tensors are bucketed into one allreduce operation)
  - Overlapping communication with computation
  - Skipping synchronization
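
A minimal sketch of DDP usage, assuming the script is launched with torchrun (or torch.distributed.launch) so that rank, world size, and the rendezvous address are set for each process; the model, data, and hyperparameters are placeholders.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # The launcher provides RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT.
    dist.init_process_group(backend="gloo")

    model = torch.nn.Linear(32, 4)
    ddp_model = DDP(model)  # gradients are allreduced during backward()

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    for _ in range(5):
        optimizer.zero_grad()
        x = torch.randn(16, 32)
        target = torch.randn(16, 4)
        loss = loss_fn(ddp_model(x), target)
        loss.backward()   # DDP buckets gradients and overlaps allreduce here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```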

MXNet (1/2)
- MXNet is a multi-language ML library.
- It blends declarative symbolic expression with imperative tensor computation.
- It uses a distributed key-value store for data synchronization over multiple devices (see the sketch below).
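
The key-value store interface can be sketched as follows; this uses the "local" store so it runs in a single process, whereas a cluster deployment would create a "dist_sync" or "dist_async" store through MXNet's launcher (the key number and shapes are arbitrary).

```python
import mxnet as mx

# "local" keeps the store in-process; "dist_sync"/"dist_async" use remote servers.
kv = mx.kv.create("local")

shape = (2, 3)
kv.init(3, mx.nd.ones(shape))        # key 3 initially holds a tensor of ones

kv.push(3, mx.nd.ones(shape) * 2)    # workers push updates (e.g., gradients)
out = mx.nd.zeros(shape)
kv.pull(3, out=out)                  # pull the value stored under key 3
print(out.asnumpy())
```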

MXNet (2/2)
- MXNet-MPI is an extension of MXNet that replaces each worker in a parameter-server architecture with a group of workers.
- Workers of each group are synced together using an MPI collective operation.
[Mamidala et al., MXNet-MPI: Embedding MPI Parallelism in Parameter Server Task Model for Scaling Deep Learning, 2018]

Horovod
- Horovod is a stand-alone Python library for data parallelism using an optimized ring-allreduce collective and a tensor fusion algorithm (see the sketch below).
- It works on top of another DL framework (e.g., TensorFlow, PyTorch, and MXNet).
- It has one of the most optimized asynchronous collectives.
- However, the communication overhead grows significantly with the number of nodes.
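
A minimal sketch of Horovod on top of PyTorch, assuming the script is launched with horovodrun (e.g., `horovodrun -np 4 python train.py`); the model and data are placeholders.

```python
import horovod.torch as hvd
import torch

hvd.init()

model = torch.nn.Linear(32, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Wrap the optimizer: gradients are averaged with ring-allreduce, and small
# tensors are fused into larger messages (tensor fusion) before communication.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Make sure every worker starts from the same model state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

loss_fn = torch.nn.MSELoss()
for _ in range(5):
    optimizer.zero_grad()
    x, target = torch.randn(16, 32), torch.randn(16, 4)
    loss_fn(model(x), target).backward()
    optimizer.step()  # the allreduced gradients are applied here
```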

FlexFlow
- FlexFlow can parallelize a DNN in the Sample, Operation, Attribute, and Parameter (SOAP) dimensions.
- It uses a guided randomized search of the SOAP space to find a fast parallelization strategy for a specific parallel machine.
[Jia et al., Beyond Data and Model Parallelism for Deep Neural Networks, 2019]

BigDL
- BigDL is a distributed DL framework for data parallelism on top of Spark.
- It does not support model parallelism.
- It favors coarse-grained operations where data transformations are immutable.
- It runs as a series of Spark jobs, which are scheduled by Spark.
- Because it uses Spark, it is equipped with fault tolerance and a fair load-balancing mechanism.

ZeRO and DeepSpeed
- ZeRO focuses on solving the memory limitation problem while attempting to minimize the overhead.
- It partitions activations, optimizer states, gradients, and parameters, and distributes them equally over all available nodes.
- It then employs overlapping collective operations to reconstruct the tensors as needed.
- DeepSpeed brings the ZeRO techniques through lightweight APIs compatible with PyTorch (see the sketch below).
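
The sketch below shows roughly how a PyTorch model is wrapped with DeepSpeed to enable ZeRO; the configuration values are hypothetical, the exact keyword and config-key names can vary between DeepSpeed versions, and a working GPU installation of DeepSpeed is assumed.

```python
import torch
import deepspeed

# Hypothetical config enabling ZeRO stage 2 (partitioned optimizer states
# and gradients), following DeepSpeed's JSON config schema.
ds_config = {
    "train_batch_size": 32,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    "zero_optimization": {"stage": 2},
}

model = torch.nn.Linear(32, 4)

# deepspeed.initialize wraps the model and optimizer into an engine that
# handles partitioning, communication, and the optimizer step.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

x = torch.randn(32, 32).to(model_engine.device)
target = torch.randn(32, 4).to(model_engine.device)

loss = torch.nn.functional.mse_loss(model_engine(x), target)
model_engine.backward(loss)   # gradients are reduced/partitioned by ZeRO
model_engine.step()
```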

BigDL: A Distributed Deep Learning Framework for Big Data

Big Data vs. Deep Learning Frameworks
- Big data and deep learning systems have different distributed execution models.
- Big data tasks are embarrassingly parallel and independent of each other.
- Deep learning tasks need to coordinate with and depend on each other.
- Several connectors exist, e.g., TFX, CaffeOnSpark, TensorFlowOnSpark, SageMaker.
- However, the adaptation between different frameworks can impose very large overheads in practice.

Spark Dataflow Model
- A job is described as a directed acyclic graph (DAG) dataflow.
- A dataflow is composed of any number of data sources, operators, and data sinks, connected by their inputs and outputs.
- Operators are parallelizable.

Resilient Distributed Datasets (RDD) (1/2)
- A distributed memory abstraction.
- Immutable collections of objects spread across a cluster (like a LinkedList[MyObject]).

Resilient Distributed Datasets (RDD) (2/2)
- An RDD is divided into a number of partitions, which are atomic pieces of information.
- Partitions of an RDD can be stored on different nodes of a cluster (see the sketch below).
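
A small PySpark sketch of partitioned RDDs and lazy transformations; the local[4] master and the toy numbers are only there to make it runnable on one machine.

```python
from pyspark import SparkContext

sc = SparkContext("local[4]", "rdd-demo")

# An RDD of 1000 numbers split into 4 partitions; on a real cluster each
# partition can live on a different node.
rdd = sc.parallelize(range(1000), numSlices=4)
print(rdd.getNumPartitions())              # 4

# Transformations are lazy and produce new immutable RDDs.
squares = rdd.map(lambda x: x * x)
print(squares.reduce(lambda a, b: a + b))  # the action triggers the Spark job

sc.stop()
```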

Spark Execution Model
[Dai et al., BigDL: A Distributed Deep Learning Framework for Big Data, 2019]

BigDL
- BigDL directly implements distributed deep learning support in Spark.
- A data-analytics-integrated deep learning pipeline can be executed as standard Spark jobs.

Data-Parallel Training in BigDL (1/3)
- BigDL provides synchronous data-parallel training to train an NN model. It uses:
  - an RDD of Samples, which are automatically partitioned across the Spark cluster, and
  - an RDD of models, each of which is a replica of the original NN model.
[Dai et al., BigDL: A Distributed Deep Learning Framework for Big Data, 2019]

Data-Parallel Training in BigDL (2/3)
- In each iteration, BigDL runs a single "model forward-backward" Spark job (see the sketch below).
- The job applies the functional zip operator to the co-located partitions of the two RDDs.
- Then, it computes the local gradients in parallel for each model replica.
[Dai et al., BigDL: A Distributed Deep Learning Framework for Big Data, 2019]
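
The plain-PySpark sketch below only illustrates the pattern described above and is not BigDL's API: one toy model replica per partition is zipped with the co-located sample batch, local gradients are computed per partition, and a reduce stands in for BigDL's AllReduce-like aggregation. The weight vectors, batches, and gradient formula are all toy assumptions.

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "zip-demo")

n_partitions = 2
# Toy "model replicas": one weight vector per partition.
models = sc.parallelize([[0.0, 0.0], [0.0, 0.0]], n_partitions)
# Toy sample batches: one list of (features, label) pairs per partition.
batches = sc.parallelize([
    [([1.0, 2.0], 1.0), ([2.0, 1.0], 0.0)],
    [([0.5, 0.5], 1.0), ([1.5, 0.5], 0.0)],
], n_partitions)

def local_gradients(pair):
    weights, batch = pair
    grad = [0.0, 0.0]
    for x, y in batch:
        err = sum(w * xi for w, xi in zip(weights, x)) - y
        grad = [g + err * xi for g, xi in zip(grad, x)]  # grad of squared error / 2
    return grad

# Zip co-located partitions, compute local gradients per replica, then
# aggregate them (BigDL replaces this reduce with an AllReduce-like step).
total = models.zip(batches).map(local_gradients) \
              .reduce(lambda g1, g2: [a + b for a, b in zip(g1, g2)])
print(total)
sc.stop()
```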

Data-Parallel Training in BigDL (3/3)
[Dai et al., BigDL: A Distributed Deep Learning Framework for Big Data, 2019]

Parameter Synchronization in BigDL (1/2)
- Parameter synchronization using a parameter server or AllReduce requires fine-grained data access.
- Fine-grained operations are not supported by Spark.
- BigDL directly implements an efficient AllReduce-like operation using existing primitives in Spark.

Parameter Synchronization in BigDL (2/2)
[Dai et al., BigDL: A Distributed Deep Learning Framework for Big Data, 2019]

PyTorch Distributed: Experiences on Accelerating Data Parallel Training

PyTorch (1/2)
- PyTorch organizes values into Tensors, generic n-dimensional arrays.
- A Module defines a transform from input values to output values.
  - In this class, applications provide their model at construction time.
  - Its behavior during the forward pass is specified by its forward member function.
- A Module can contain Tensors as parameters.
  - A Linear Module contains a weight and a bias parameter, whose forward function generates the output by multiplying the input with the weight and adding the bias.
- An application composes its own Module by stitching together Modules (e.g., linear, convolution) and Functions (e.g., relu, pool) in a forward function (see the sketch below).
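
A small sketch of composing a Module from Linear sub-modules and the relu Function; the layer sizes and network name are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerNet(nn.Module):
    def __init__(self):
        super().__init__()
        # The model is provided at construction time; each Linear sub-module
        # holds a weight and a bias Tensor as parameters.
        self.fc1 = nn.Linear(32, 64)
        self.fc2 = nn.Linear(64, 10)

    def forward(self, x):
        # The forward member function defines the forward-pass behavior,
        # stitching together Modules (linear) and Functions (relu).
        return self.fc2(F.relu(self.fc1(x)))

net = TwoLayerNet()
out = net(torch.randn(8, 32))   # calls forward()
print(out.shape)                # torch.Size([8, 10])
```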

PyTorch (2/2)

Data Parallelism in PyTorch (1/4)
- PyTorch provides distributed data parallel as an nn.Module class.
- All replicas start from the same initial values for model parameters.
- They synchronize gradients to keep parameters consistent across training iterations.

Data Parallelism in PyTorch (2/4)
- PyTorch offers several tools to facilitate distributed training.
- DataParallel, for data-parallel training on the same machine.
- DistributedDataParallel (DDP), for data-parallel training across GPUs and machines.
- RPC, for general distributed model-parallel training.

Data Parallelism in PyTorch (3/4)
- The DDP module enables data-parallel training across multiple processes and machines.
- AllReduce is the primitive communication API used by DDP (see the sketch below).
- It is supported by multiple communication libraries, including NCCL, Gloo, and MPI.
[Li et al., PyTorch Distributed: Experiences on Accelerating Data Parallel Training, 2020]
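
A minimal sketch of the AllReduce primitive itself via torch.distributed, assuming the script is launched with torchrun so that rank, world size, and the rendezvous address are provided; the Gloo backend is used so it also runs on CPU-only machines.

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")

rank = dist.get_rank()
t = torch.ones(4) * (rank + 1)     # each process contributes a different tensor

# In-place sum across all processes; every process ends up with the same result.
dist.all_reduce(t, op=dist.ReduceOp.SUM)
print(f"rank {rank}: {t}")

dist.destroy_process_group()
```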

Data Parallelism in PyTorch (4/4)

Gradient Reduction - Naive Solution (1/3)
- DDP guarantees correctness by letting all training processes:
  1. Start from the same model state.
  2. Consume the same gradients in every iteration.
- Step 1 can be achieved by broadcasting model states from one process to all others.
- Step 2 can be achieved by inserting a gradient synchronization phase after the local backward pass and before updating local parameters.

Gradient Reduction - Naive Solution (2/3)
- To implement step 2, PyTorch accepts custom backward hooks.
- DDP can register autograd hooks to trigger computation after every backward pass.
- When fired, each hook scans through all local model parameters and retrieves the gradient tensor from each parameter.
- Then, it uses the AllReduce collective communication call to calculate the average gradients for each parameter across all processes, and writes the result back to the gradient tensor (see the sketch below).
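
The following is a simplified sketch of this naive scheme using per-tensor hooks and one all_reduce per gradient; it is not DDP's actual implementation (which does this in a C++ Reducer), and the helper name attach_naive_grad_sync is hypothetical. It assumes a process group has already been initialized (e.g., via torchrun).

```python
import torch
import torch.distributed as dist

def attach_naive_grad_sync(model: torch.nn.Module):
    """Average every gradient across processes, one allreduce per parameter."""
    world_size = dist.get_world_size()

    def hook(grad):
        # Called during backward once this parameter's gradient is ready.
        synced = grad.clone()
        dist.all_reduce(synced, op=dist.ReduceOp.SUM)
        return synced / world_size   # the returned tensor replaces the gradient

    for p in model.parameters():
        if p.requires_grad:
            p.register_hook(hook)

# Usage sketch:
# model = torch.nn.Linear(32, 4)
# attach_naive_grad_sync(model)
# model(torch.randn(16, 32)).sum().backward()  # each gradient is allreduced
```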

Gradient Reduction - Naive Solution (3/3)
- Two performance concerns:
  1. Collective communication performs poorly on small tensors, which will be especially prominent on large models with massive numbers of small parameters.
  2. Separating gradient computation and synchronization forfeits the opportunity to overlap computation with communication due to the hard boundary in between.

Gradient Reduction - Gradient Bucketing (1/2)
- Collective communications are more efficient on large tensors.
[Li et al., PyTorch Distributed: Experiences on Accelerating Data Parallel Training, 2020]

Gradient Reduction - Gradient Bucketing (2/2)
- DDP does not launch an AllReduce immediately after each gradient tensor becomes available.
- Instead, it waits for a short period and buckets multiple gradients into one AllReduce operation (see the sketch below).
- But it does not communicate all gradients in one single AllReduce; otherwise, no communication could start before the computation is over.
- With relatively small bucket sizes, DDP can launch AllReduce operations concurrently with the backward pass to overlap communication with computation.
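
The sketch below illustrates the bucketing idea only: several gradient tensors are flattened into one buffer, allreduced once, and the averaged values are copied back. The bucket cap and the two helper functions are assumptions for illustration, not DDP's internal Reducer, and a synchronous loop after backward is used instead of DDP's hook-driven, overlapped launches.

```python
import torch
import torch.distributed as dist

def allreduce_bucket(grads, world_size):
    """Fuse several gradient tensors into one allreduce call."""
    # Flatten the bucket into a single contiguous buffer.
    flat = torch.cat([g.reshape(-1) for g in grads])
    dist.all_reduce(flat, op=dist.ReduceOp.SUM)
    flat /= world_size
    # Copy the averaged values back into the original gradient tensors.
    offset = 0
    for g in grads:
        n = g.numel()
        g.copy_(flat[offset:offset + n].view_as(g))
        offset += n

def sync_gradients_bucketed(model, bucket_cap_bytes=1 << 20):
    """Group gradients into ~1 MB buckets and allreduce them bucket by bucket."""
    world_size = dist.get_world_size()
    bucket, bucket_bytes = [], 0
    for p in model.parameters():
        if p.grad is None:
            continue
        bucket.append(p.grad)
        bucket_bytes += p.grad.numel() * p.grad.element_size()
        if bucket_bytes >= bucket_cap_bytes:
            allreduce_bucket(bucket, world_size)
            bucket, bucket_bytes = [], 0
    if bucket:
        allreduce_bucket(bucket, world_size)
```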

Overlap Computation with Communication (1/2)
- AllReduce on gradients can start before the local backward pass finishes.
- With bucketing, DDP needs to wait for all contents of the same bucket before launching communications.
- DDP registers one autograd hook for each gradient accumulator.
- A hook fires after its corresponding accumulator updates the gradients.
- When the hooks of all gradients in the same bucket have fired, the AllReduce on that bucket is triggered.

Overlap Computation with Communication (2/2)
- The reducing order must be the same across all processes; otherwise, AllReduce contents might mismatch.
- All processes must use the same bucketing order.
- No process can launch AllReduce on bucket i+1 before embarking on bucket i.
[Li et al., PyTorch Distributed: Experiences on Accelerating Data Parallel Training, 2020]

Gradient Accumulation
- Reduce gradient synchronization frequencies to speed up distributed data-parallel training (see the sketch below).
