Transcription
ML Infrastructure Playbook
Stephen Balaban, Co-founder and CEO, Lambda
LAMBDALABS.COM
Outline
1. Introduction to Lambda
2. Lambda AI Survey Overview
3. A playbook for getting started
   a. Cloud vs On-prem vs Hybrid
   b. Incremental scaling
4. A playbook for expansion
   a. Scaling from workstations to servers
      i. Shared resources
      ii. A quick hack for sharing GPUs
      iii. Set up Jupyter Notebook
      iv. Need more compute
      v. Need GPUs with more memory
      vi. Co-location services
   b. Considering a software stack at every scale
      i. Small
      ii. Medium
      iii. Large
   c. Scaling from servers to clusters
      i. Network
      ii. Power
      iii. Colo / on-prem
      iv. InfiniBand
5. Finishing Thoughts
A bit about Lambda
Lambda is building a future where scaling from a single GPU to an entire data center “just works”.
Lambda provides the hardware & software to build ML infrastructure for your team: TensorBook (GPU Laptop), Vector (GPU Workstation), Scalar (GPU Server), Echelon (GPU Cluster), Lambda Colo, Lambda Cloud, and Lambda Stack, a managed, always up-to-date software stack for Deep Learning.
In the beginning, there was a workstation.
“Our network takes between five and six days to train on two GTX 580 3GB GPUs. All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available.”
- Krizhevsky, Sutskever & Hinton 2012
Fast forward to 2021. Massive models reign. But most work still happens on a workstation.
Lambda Survey Results
The distribution of compute, like wealth, follows a power law.
How many total GPUs does your lab have available for training Neural Networks?
Source: Lambda AI Survey 2020-2021. N = 647
The distribution of R&D headcount also follows a power law.
How many ML researchers & engineers does your team have?
Source: Lambda AI Survey 2020-2021. N = 645
Can you guess what percentage of respondents train on-prem?
Do you train on-prem or in a public cloud?
Source: Lambda AI Survey 2020-2021. N = 645
A playbook for getting started
Decide on Cloud vs On-prem vs Hybrid
Good reasons to consider cloud:
- Need to use 100 GPUs now!
- Doing production inference
- Spiky / inconsistent workload
Good reasons to consider on-prem:
- More compute for less money
- Data sovereignty & security
- Big data sets & expensive egress
On-prem: start with a laptop or single workstation. This is what most researchers & engineers go to first and, unless you have a very good reason as to why you’re special, you should be starting here too.
Research the latest GPU benchmarks for the model you’ll be training.
https://mlperf.org
Lambda GPU Benchmarks: throughput results for specific model/GPU pairs, shown as multiplicative speedups (up to 3.33x).
Source: https://lambdalabs.com/gpu-benchmarks
Incrementally add compute for each new hire.
A playbook for expansion
From workstations to servers: when your team starts to share resources, that’s usually a sign they should get a centralized server to provide compute to the whole organization.
Simple hack for resource sharing: CUDA_VISIBLE_DEVICES is your friend. Give everybody ssh access and use CUDA_VISIBLE_DEVICES to mask which GPUs their process uses. For example, on an 8-GPU server:
Joe’s SSH Terminal: CUDA_VISIBLE_DEVICES=0,1 python train.py
Francie’s Local Terminal: CUDA_VISIBLE_DEVICES=3,4,5,6,7 python train.py
Simple hack for resource sharing: you can also put their allocation directly into their .bashrc.
On Joe’s account:
cat >> ~/.bashrc <<EOF
export CUDA_VISIBLE_DEVICES=0
EOF
On Francie’s account:
cat >> ~/.bashrc <<EOF
export CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7
EOF
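The masking above can be simulated without a GPU: CUDA-aware frameworks read CUDA_VISIBLE_DEVICES and renumber the listed physical devices as local devices 0..N-1. A minimal sketch of that renumbering (the function is ours, for illustration; it is not Lambda's code, and note that an *unset* variable actually means "all GPUs visible"):

```python
import os

def visible_gpu_mapping(env=os.environ):
    """Return {local_index: physical_index} as a CUDA framework would see it."""
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    if not raw:
        # Simplification: treat unset/empty as "no GPUs" for this sketch
        return {}
    physical = [int(x) for x in raw.split(",")]
    # Frameworks renumber the visible devices starting from 0
    return {local: phys for local, phys in enumerate(physical)}

# Francie's allocation from the slide above
os.environ["CUDA_VISIBLE_DEVICES"] = "3,4,5,6,7"
print(visible_gpu_mapping())  # local cuda:0 is physical GPU 3
```

So when Francie's training script asks for `cuda:0`, it actually runs on physical GPU 3, and Joe's GPUs 0-1 are never touched.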
Over time, this method will become unwieldy as you try to figure out who is running what jobs on which GPUs. “Can I run a job on GPU 2?” “Is somebody else already using Node 3?” “If I kill this process, will Frank be upset?”
When you get to that point, try using a job scheduler like SLURM or Kubeflow. Figure: Frank’s, Sue’s, Jack’s, and Jill’s jobs are inserted into a queue, and the scheduler promotes them to active jobs as resources free up.
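A minimal SLURM batch script illustrating the queue model above (the partition name, GPU count, and train.py are placeholders for your own setup, not from the slides):

```shell
#!/bin/bash
#SBATCH --job-name=train-model      # shows up in squeue output
#SBATCH --partition=gpu             # hypothetical partition name
#SBATCH --gres=gpu:2                # request 2 GPUs on one node
#SBATCH --time=24:00:00             # wall-clock limit

# SLURM sets CUDA_VISIBLE_DEVICES for the allocated GPUs automatically,
# so no manual masking is needed once a scheduler is in place.
srun python train.py
```

Submit with `sbatch train.sbatch` and inspect the queue with `squeue`; nobody has to ask whether Frank is using Node 3.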
You can also set up separate notebook instances.
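One way to do this on a shared server is to give each user their own Jupyter instance pinned to their GPUs. A sketch (the ports, GPU split, and use of nohup are illustrative assumptions, not from the slides):

```shell
# Joe: GPU 0, notebook on port 8888
CUDA_VISIBLE_DEVICES=0 nohup jupyter notebook --no-browser --port 8888 &
# Francie: GPUs 1-7, notebook on port 8889
CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7 nohup jupyter notebook --no-browser --port 8889 &
```

Each user can then reach their own notebook via SSH port forwarding, e.g. `ssh -L 8888:localhost:8888 user@server`.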
Considering a software stack at every scale
When you’re small, just use Lambda Stack.
LAMBDA_REPO=$(mktemp) && \
wget -O${LAMBDA_REPO} https://lambdalabs.com/static/misc/lambda-stack-repo.deb && \
sudo dpkg -i ${LAMBDA_REPO} && rm -f ${LAMBDA_REPO} && \
sudo apt-get update && sudo apt-get install -y lambda-stack-cuda
For more info see: https://lambdalabs.com/lambda-stack-deep-learning-software
Lambda Stack provides the same environment everywhere: Lambda TensorBook, Lambda Vector, Lambda Blade (servers), Lambda Echelon (clusters), and Lambda GPU Cloud all run Lambda Stack.
As you grow, consider using Lambda Stack + nvidia-container-toolkit.
Compatible with all Docker and NGC containers: sudo apt-get install docker.io
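With docker.io and nvidia-container-toolkit installed, NGC containers can then be run with GPU access. A sketch (the container tag is an example, and `--gpus` assumes the NVIDIA container runtime is configured):

```shell
# Expose all GPUs to the container; use --gpus '"device=0,1"' to mask a subset
docker run --gpus all --rm -it nvcr.io/nvidia/pytorch:23.10-py3 \
  python -c "import torch; print(torch.cuda.device_count())"
```

This keeps the host on Lambda Stack while letting each project pin its own framework versions inside a container.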
At a certain scale, use containers and consider an MLOps platform.
Lots to choose from.
From servers to clusters
Network considerations: as you scale up your cluster, you’ll want to go from 1 Gbps Ethernet to 10 Gbps SFP+ Ethernet to 100 GbE, and maybe even to 200 Gbps HDR InfiniBand. This depends on your team’s need to do node-to-node communication and node-to-storage communication.
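To see why link speed matters for node-to-node communication, here is a rough back-of-the-envelope calculation (ours, not from the slides; it ignores latency, protocol overhead, and all-reduce algorithm details). The 10 GB payload is a hypothetical gradient exchange for a large model:

```python
def transfer_seconds(gigabytes, gbps):
    """Ideal seconds to move `gigabytes` of data over a `gbps` link."""
    gigabits = gigabytes * 8  # bytes -> bits
    return gigabits / gbps

payload_gb = 10  # hypothetical per-step gradient exchange
for gbps in (1, 10, 200):
    print(f"{gbps:>4} Gbps: {transfer_seconds(payload_gb, gbps):.1f} s")
# 1 Gbps: 80 s per exchange; 200 Gbps InfiniBand: 0.4 s
```

Even under these idealized assumptions, a 1 Gbps link turns every synchronization step into more than a minute of waiting, which is why multi-node training pushes teams toward 100 GbE or HDR InfiniBand.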
Director switches: the final boss of the networking world. They use a backplane instead of cables to build out the internal (spine-leaf) network topology. To get 100% non-blocking bandwidth in the CS7520 you NEED to have all 6 spines installed. The CS7520 has 216 southbound ports (and thus 216 northbound ports). Assuming each spine has only 36 ports, how many spines do you need to support 216 ports? 216 ports / (36 ports per spine) = 6 spine switches.
For more info see: https://www.mellanox.com/related-docs/prod_ib_switch_systems/CS7520_Dismantling_Guide.pdf
Source: How to build a GPU cluster for AI
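The spine-count arithmetic above generalizes to any spine-leaf design. A one-liner (the function name is ours; the port counts are the CS7520 figures from the slide):

```python
import math

def spines_needed(southbound_ports, ports_per_spine):
    """Spine switches required for fully non-blocking bandwidth."""
    return math.ceil(southbound_ports / ports_per_spine)

# CS7520: 216 southbound ports, 36 ports per spine
print(spines_needed(216, 36))  # 6
```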
Echelon Network Topology (Single Rack Configuration)
Storage Fabric: 40-port 200 Gb/s InfiniBand leaf switch
Compute Fabric: 40-port 200 Gb/s InfiniBand leaf switch
Nodes: Hyperplane 1 through Hyperplane N, NVMe Storage Node N, 1U Management Server
In-Band Management Network: 100 Gb/s Ethernet
Out-of-Band Management Network: 1 Gb/s Ethernet
Legend: 8x 200 Gb/s QSFP56 cable; 2x 200 Gb/s QSFP56 cable; 1x 100 Gb/s QSFP28 cable; 1x 1 Gb/s cable
Source: Lambda Echelon Whitepaper v1.2
To InfiniBand and Beyond!
Figure: two spine switches (Spine A, Spine B) above four leaf switches (Leaf A through Leaf D), each a 40-port 200 Gb/s IB HDR switch, with 20 downlinks per leaf.
Legend: each thick grey line represents 10 InfiniBand cables; each thin black line represents 1 InfiniBand cable.
Storage considerations: there are both open source storage options and proprietary storage options.
Source: Lambda Echelon Whitepaper v1.2
Power considerations
Single-phase systems: P (watts) = V (volts) * I (amps). We use I from the French intensité du courant, because that’s what André-Marie Ampère used.
3-phase systems: P = 3 * (V / sqrt(3)) * I, i.e. three phases, each carrying phase voltage V / sqrt(3) at current I. This simplifies to P = sqrt(3) * V * I, because 3 / sqrt(3) = sqrt(3).
Real-life 3-phase systems: P = sqrt(3) * V * I * 0.8. It’s very common to see an 80% regulatory derating factor applied to PDUs.
How do PDU manufacturers calculate power capacity? Example PDU: APC 8966. From the APC8966 data sheet:
Nominal Input Voltage: 208V 3PH
Nominal Output Voltage: 208V
Input frequency: 50/60 Hz
Number of Power Cords: 1
Input Connections: IEC 60309 60A 3P+PE
Maximum Input Current: 60A
Maximum Line Current: 48A
Regulatory Derated Input Current (North America): 48A
Load Capacity: 17300VA
Plug the numbers into P = sqrt(3) * V * I * 0.8:
P = sqrt(3) * 208 * 60 * 0.8 = 17292.8 VA ≈ 17.3 kVA
See how they derate the maximum input current of 60A to the “Regulatory Derated Input Current (North America)” of 48A? That’s 48A = 60A * 0.8, which is where the 0.8 came from on the previous slide.
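The datasheet arithmetic above can be checked with a few lines (the function name is ours, for illustration):

```python
import math

def pdu_capacity_va(volts, amps, derating=0.8):
    """3-phase PDU capacity: P = sqrt(3) * V * I * derating."""
    return math.sqrt(3) * volts * amps * derating

# APC 8966: 208V 3-phase, 60A max input, 80% regulatory derating
capacity = pdu_capacity_va(208, 60)
print(f"{capacity:.1f} VA")  # ~17292.8 VA, i.e. the 17.3 kVA load capacity
```

The same helper lets you sanity-check whether a planned rack of GPU servers actually fits under a PDU’s derated capacity before you order hardware.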
Plug types frequently seen in HPC:
IEC 60309 60A 3-phase plug, 208V. Blue means the system is between 200 and 250V.
IEC 60309 60A 3-phase plug, 415V. Red means the system is above 400V.
NEMA L15-30P 30A 3-phase plug, 208V.
Receptacles & plugs, continued:
IEC C13 plug, plugs into IEC C14 receptacle (on server); max 15A (max power 2.5kW)
IEC C14 plug, plugs into IEC C13 receptacle (on PDU); max 15A (max power 2.5kW)
IEC C19 plug, plugs into IEC C20 receptacle (on server); max 20A (max power 3.2kW)
IEC C20 plug, plugs into IEC C19 receptacle (on PDU); max 20A (max power 3.2kW)
IEC stands for International Electrotechnical Commission, an international standards organization headquartered in Switzerland.
Colo vs on-prem
For a comprehensive clustering guide: https://www.youtube.com/watch?v=rfu5FwncZ6s
Finishing thoughts