ML Infrastructure Playbook - Lambda Labs

Transcription

ML Infrastructure Playbook
Stephen Balaban, Co-founder and CEO, Lambda

Outline
1. Introduction to Lambda
2. Lambda AI Survey Overview
3. A playbook for getting started
   a. Cloud vs On-prem vs Hybrid
   b. Incremental scaling
4. A playbook for expansion
   a. Scaling from workstations to servers
      i. Shared resources
      ii. A quick hack for sharing GPUs
      iii. Set up jupyter notebook
      iv. Need more compute
      v. Need GPUs with more memory
      vi. Co-location services
   b. Considering a software stack at every scale
      i. Small
      ii. Medium
      iii. Large
   c. Scaling from servers to clusters
      i. Network
      ii. Power
      iii. Colo / on-prem
      iv. InfiniBand
5. Finishing Thoughts

A bit about Lambda

Lambda is building a future where scaling from a single GPU to an entire data center “just works”.

Lambda provides the hardware & software to build ML infrastructure for your team:
- Tensorbook (GPU Laptop)
- Vector (GPU Workstation)
- Scalar (GPU Server)
- Echelon (GPU Cluster)
- Lambda Colo
- Lambda Cloud
- Lambda Stack: a managed, always up-to-date software stack for Deep Learning

In the beginning, there was a workstation.

“Our network takes between five and six days to train on two GTX 580 3GB GPUs. All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available.”
- Krizhevsky, Sutskever & Hinton 2012

Fast forward to 2021. Massive models reign. But most work still happens on a workstation.

Lambda Survey Results

The distribution of compute, like wealth, follows a power law.

How many total GPUs does your lab have available for training Neural Networks?
Source: Lambda AI Survey 2020-2021. N = 647

The distribution of R&D headcount also follows a power law.

How many ML researchers & engineers does your team have?
Source: Lambda AI Survey 2020-2021. N = 645

Can you guess what percentage of respondents train on-prem?

Do you train on-prem or in a public cloud?
Source: Lambda AI Survey 2020-2021. N = 645

A playbook for getting started

Decide on Cloud vs On-prem vs Hybrid

Good reasons to consider cloud:
- Need to use 100 GPUs now!
- Doing production inference
- Spiky / inconsistent workload

Good reasons to consider on-prem:
- More compute for less money
- Data sovereignty & security
- Big data sets & expensive egress

On-prem: start with a laptop or single workstation
This is what most researchers & engineers go to first and, unless you have a very good reason as to why you’re special, you should be starting here too.

Research the latest GPU benchmarks for the model you’ll be training.
MLPerf benchmarks: https://mlperf.org

Lambda GPU Benchmarks: throughput results for specific model/GPU pairs.
[Chart: multiplicative speedup per GPU, up to 3.33x.]
Source: https://lambdalabs.com/gpu-benchmarks

Incrementally add compute for each new hire

A playbook for expansion

From workstations to servers
When your team starts to share resources, that’s usually a sign they should get a centralized server to provide compute to the whole organization.

Simple hack for resource sharing
CUDA_VISIBLE_DEVICES is your friend. Give everybody ssh access and use CUDA_VISIBLE_DEVICES to mask which GPU their process uses. For example, on an 8-GPU server:

Joe’s SSH Terminal: CUDA_VISIBLE_DEVICES=0,1 python train.py
Francie’s Local Terminal: CUDA_VISIBLE_DEVICES=3,4,5,6,7 python train.py
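Before picking GPU indices, it helps to see which GPUs are already busy; nvidia-smi is the standard tool for that. Both invocations below are stock nvidia-smi usage:

# Show every GPU, its utilization, and the processes running on it
nvidia-smi
# Compact per-GPU memory summary
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv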

Simple hack for resource sharing
You can also put their allocation directly into their .bashrc

On Joe’s account:
cat >> ~/.bashrc <<EOF
export CUDA_VISIBLE_DEVICES=0
EOF

On Francie’s account:
cat >> ~/.bashrc <<EOF
export CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7
EOF

Over time, this method will become unwieldy as you try to figure out who is running what jobs on which GPUs. “Can I run a job on GPU 2?” “Is somebody else already using Node 3?” “If I kill this process, will Frank be upset?”

When you get to that point, try using a job scheduler like SLURM or Kubeflow.
[Diagram: jobs from Frank, Sue, Jack, and Jill flowing from a queue into the active jobs slots.]
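As an illustration, with SLURM each user submits a batch script and the scheduler picks the GPUs. A minimal sketch, assuming a cluster with GPUs configured as a SLURM "gres" resource (job name and file names are placeholders):

#!/bin/bash
# Placeholder job name, GPU request, and log file
#SBATCH --job-name=train
#SBATCH --gres=gpu:2
#SBATCH --output=train_%j.log

# SLURM sets CUDA_VISIBLE_DEVICES to the allocated GPUs automatically
python train.py

Submit it with sbatch train.sbatch and inspect the queue with squeue; nobody has to ask Frank anything.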

You can also set up separate notebook instances
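A minimal sketch of that, reusing the CUDA_VISIBLE_DEVICES trick so each user’s notebook server sees only their GPUs (port numbers are arbitrary examples):

# Joe’s notebook server, pinned to GPU 0
CUDA_VISIBLE_DEVICES=0 jupyter notebook --no-browser --port=8888
# Francie’s notebook server, pinned to the remaining GPUs
CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7 jupyter notebook --no-browser --port=8889

Each user can then tunnel to their own port, e.g. ssh -N -L 8888:localhost:8888 user@server.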

Considering a software stack at every scale

When you’re small, just use Lambda Stack

LAMBDA_REPO=$(mktemp) && \
wget -O${LAMBDA_REPO} https://lambdalabs.com/static/misc/lambda-stack-repo.deb && \
sudo dpkg -i ${LAMBDA_REPO} && rm -f ${LAMBDA_REPO} && \
sudo apt-get update && sudo apt-get install -y lambda-stack-cuda

https://lambdalabs.com/lambda-stack-deep-learning-software
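Lambda Stack installs PyTorch and TensorFlow system-wide, so a quick sanity check after installation might look like this (a minimal sketch; these are just the frameworks’ standard device-query calls):

# Should print True if the GPU driver and CUDA runtime are working
python3 -c "import torch; print(torch.cuda.is_available())"
# Should list the physical GPUs TensorFlow can see
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"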

Lambda Stack provides the same environment everywhere: Lambda TensorBook, Lambda Vector, Lambda Blade servers, Lambda Echelon clusters, and Lambda GPU Cloud all run Lambda Stack.

As you grow, consider using Lambda Stack with nvidia-container-toolkit

Compatible with all Docker and NGC containers:
sudo apt-get install docker.io
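With Docker and the NVIDIA container toolkit in place, running an NGC container can look like the sketch below. The image tag is an example only; pick a current one from the NGC catalog. Note that --gpus requires Docker 19.03 or newer:

# Pull and run an NGC PyTorch container interactively, with all GPUs visible
sudo docker run --gpus all --rm -it nvcr.io/nvidia/pytorch:21.07-py3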

At a certain scale, use containers and consider an MLOps platform

Lots to choose from

From servers to clusters

Network considerations
As you scale up your cluster, you’ll want to go from 1 Gbps to 10 Gbps to 100 GbE and maybe even to 200 Gbps HDR InfiniBand. This depends on your team’s need to do node-to-node communication and node-to-storage communication.
[Images: 1 Gbps Ethernet, 10 Gbps SFP Ethernet, 200 Gbps InfiniBand.]
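One way to ground that decision is to measure what the current network actually delivers node-to-node. A minimal sketch with iperf3 (hostnames are placeholders; iperf3 must be installed on both nodes):

# On node1: start an iperf3 server
iperf3 -s
# On node2: run a bandwidth test against node1
iperf3 -c node1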

Director switches: the final boss of the networking world
Use a backplane instead of cables to build out the internal (spine/leaf) network topology. To get 100% non-blocking bandwidth in the CS7520 you NEED to have all 6 spines installed. The CS7520 has 216 southbound ports (and thus 216 northbound ports). Assuming each spine has only 36 ports, how many spines do you need to support 216 ports?

36 ports / spine, 216 ports needed
216 ports / 36 (ports / spine) = 6 spine switches

For more info see: https://www.mellanox.com/related-docs/prod_ib_switch_systems/CS7520_Dismantling_Guide.pdf
Source: How to build a GPU cluster for AI

Echelon Network Topology (Single Rack Configuration)
[Diagram: Hyperplane 1 through Hyperplane N, NVMe storage node N, and a 1U management server, wired into four networks.]
- Compute Fabric: 40-port 200 Gb/s InfiniBand leaf switch
- Storage Fabric: 40-port 200 Gb/s InfiniBand leaf switch
- In-Band Management Network: 100 Gb/s Ethernet
- Out-of-Band Management Network: 1 Gb/s Ethernet
Legend: 8x 200 Gb/s QSFP56 cable; 2x 200 Gb/s QSFP56 cable; 1x 100 Gb/s QSFP28 cable; 1x 1 Gb/s cable.
Source: Lambda Echelon Whitepaper v1.2

To InfiniBand and Beyond!
[Diagram: Spine A and Spine B connected to Leaf A, Leaf B, Leaf C, and Leaf D; each leaf has 20 uplink cables.]
Legend: each switch is a 40-port 200 Gb/s IB HDR switch. Each thick grey line represents 10 InfiniBand cables. Each thin black line represents 1 InfiniBand cable.

Storage considerations
Open source storage options; proprietary storage options.
Source: Lambda Echelon Whitepaper v1.2

Power considerations

Single-phase systems:
P (watts) = V (volts) * I (amps)
We use I from the French, intensité du courant, because that’s what André-Marie Ampère used.

3-phase systems:
P = 3 * V / sqrt(3) * I
This simplifies to P = sqrt(3) * V * I, because 3 / sqrt(3) = sqrt(3).

Real-life 3-phase systems:
P = sqrt(3) * V * I * 0.8
It’s very common to see an 80% regulatory derating factor applied to PDUs.

How do PDU manufacturers calculate power capacity?
Example PDU: APC 8966. From the APC8966 data sheet:
- Nominal Input Voltage: 208V 3PH
- Nominal Output Voltage: 208V
- Input Frequency: 50/60 Hz
- Maximum Input Current: 60A
- Maximum Line Current: 48A
- Regulatory Derated Input Current (North America): 48A
- Load Capacity: 17300VA
- Number of Power Cords: 1
- Input Connections: IEC 60309 60A 3P+PE

Plug in the numbers from the data sheet:
P = sqrt(3) * V * I * 0.8
P = sqrt(3) * 208 * 60 * 0.8
P = 17292.7953
P ≈ 17.3 kVA

See how they derate the maximum input current of 60A to the “Regulatory Derated Input Current (North America)” of 48A? That’s 48A = 60A * 0.8. That’s where the 0.8 came from in the previous slide.
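If you want to sanity-check that arithmetic yourself, a one-liner from any shell reproduces the data-sheet number (values as on the slide):

# sqrt(3) * 208V * 60A * 0.8 derating
python3 -c "import math; print(math.sqrt(3) * 208 * 60 * 0.8)"
# prints 17292.79..., which APC rounds to the 17300VA load capacity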

Plug types frequently seen in HPC
- IEC 60309 60A 3-phase plug, 208V. Blue means the system is between 200 and 250V.
- IEC 60309 60A 3-phase plug, 415V. Red means the system is above 400V.
- NEMA L15-30P 30A 3-phase plug, 208V.

Receptacles & plugs, continued

Plug           Plugs into                       Max Amps
IEC C13 Plug   IEC C14 Receptacle (on server)   15A (max power 2.5kW)
IEC C14 Plug   IEC C13 Receptacle (on PDU)      15A (max power 2.5kW)
IEC C19 Plug   IEC C20 Receptacle (on server)   20A (max power 3.2kW)
IEC C20 Plug   IEC C19 Receptacle (on PDU)      20A (max power 3.2kW)

IEC stands for International Electrotechnical Commission, an international standards organization headquartered in Switzerland.

Colo vs on-prem

For a comprehensive clustering guide: https://www.youtube.com/watch?v=rfu5FwncZ6s

Finishing thoughts

Lambda is building a future where scaling from a single GPU to an entire data center “just works”.