Determinism In Deep Learning (S9911)

Transcription

Determinism in Deep Learning (S9911)
Duncan Riach, GTC 2019

RANDOMNESS
- Pseudo-random number generation
- Random mini-batching
- Stochastic gradient descent
- Data augmentation
- Regularization / generalization

DETERMINISM
- Elimination of truly random effects
- Bit-exact reproducibility from run to run
  - Same model weights
  - Same inference results
  - Same graph generated

GOALS
- Reasonably high performance
- No changes to models

GUARANTEED FOR SAME
- number of GPUs
- GPU architecture
- driver version
- CUDA version
- cuDNN version
- framework version
- distribution setup

ADVANTAGES
- AUDITING: in safety-critical applications
- EXPERIMENTATION: hold all independent variables constant
- DEBUGGING: reproduce a failure in a long run
- REGRESSION: refactor without introducing bugs

[Figure: probability of accuracy (on a given run) vs. model accuracy, showing a “correct” range, a reference distribution, and a distribution after a change]

BELIEFS
- “TensorFlow is inherently non-deterministic.”
- “GPUs are inherently non-deterministic.”
- “This problem can’t be solved.”
- “Nobody cares about this.”
- “Non-determinism is required for high performance.”
- “It’s easy. Just set the seeds.”

HYPOTHESES
- random seeds
- tf.reduce_sum / tf.reduce_mean
- broadcast addition (for adding bias)
- TensorFlow autotune
- gate_gradients
- TensorRT
- asynchronous reductions
- GEMM split between thread-blocks
- Eigen kernels
- max-pooling
- distributed gradient update
- multi-threading in the data loader
- image and video decoding
- data augmentation
- CPU compute
- CUDA atomicAdd()

TWO-SIGMA BLOG POST
“A Workaround for Non-Determinism in TensorFlow”
bit.ly/two-sigma-determinism
- tf.reduce_sum()
- add bias using tf.add()

WORK-AROUND PART 1
Replace tf.reduce_sum() with a matrix multiplication against a vector of ones:

    input = tf.constant([[1, 2, 3], [4, 5, 6]])
    a = tf.reshape(input, [1, -1])    # 1 2 3 4 5 6
    b = tf.ones_like(a)               # 1 1 1 1 1 1
    deterministic_sum = tf.matmul(a, b, transpose_b=True)    # 21
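
A minimal check of the workaround (my own illustration, not from the slides): both forms compute the same value, but the matmul form fixes the reduction order, which is the point of the blog post's workaround.

    import tensorflow as tf  # TF 1.x style, as used in the talk

    x = tf.constant([[1., 2., 3.], [4., 5., 6.]])
    flat = tf.reshape(x, [1, -1])
    ones = tf.ones_like(flat)
    det_sum = tf.matmul(flat, ones, transpose_b=True)  # reduction order is fixed
    ref_sum = tf.reduce_sum(x)                          # the op the blog post worked around

    with tf.Session() as sess:
        print(sess.run([det_sum, ref_sum]))  # both evaluate to 21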

WORK-AROUND PART 2
Layer outputs: o = i * w + b. Fold the bias addition into the matrix multiplication by appending a column of ones to the layer inputs i (per batch) and appending the bias row b to the weight matrix w:

    deterministic_mm_with_bias = tf.matmul(concat_1(i), concat(w, b))

[Figure: a 3x2 batch of layer inputs extended with a ones column, multiplied by a 2x4 weight matrix extended with the 1x4 bias row, giving the 3x4 layer output o.]
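
A sketch of the same trick as a self-contained function (the function name and layout are mine, not from the slide): the bias is appended to the weights and a ones column to the inputs, so the bias add happens inside the GEMM rather than as a separate broadcast add.

    import tensorflow as tf  # TF 1.x style

    def deterministic_dense(i, w, b):
        """Compute i @ w + b as a single matmul.
        i: [batch, in_dim], w: [in_dim, out_dim], b: [out_dim]."""
        ones = tf.ones_like(i[:, :1])                            # [batch, 1] column of ones
        i_ext = tf.concat([i, ones], axis=1)                     # [batch, in_dim + 1]
        w_ext = tf.concat([w, tf.reshape(b, [1, -1])], axis=0)   # [in_dim + 1, out_dim]
        return tf.matmul(i_ext, w_ext)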

BUT NOW
- tf.reduce_sum() is deterministic
- tf.add() is deterministic

SOLVE A REAL PROBLEM
- Project MagLev: at-scale machine-learning platform
- 2D object detection model for autonomous vehicles
- Production scale:
  - Millions of trainable variables
  - Millions of training examples

bit.ly/how-to-debug

HOW TO DEBUG
- Determine what is working
- Determine precisely what is not working
- Generate hypotheses
- Test hypotheses using divide and conquer

GENERIC DEEP LEARNING PROCESS
[Figure: a data loader feeds examples from a store into the model (forward ops and variables), producing a prediction; the loss function compares the prediction with the target; back-prop ops then produce data gradients and weight gradients.]

DETERMINISM DEBUG TOOL
- Insert probe ops at various places in the graph
- Train the model twice
- Identifies the location and step of non-determinism injection

DETERMINISM DEBUG TOOL

    from tensorflow-determinism import probe
    tensorflow_op_output = probe.monitor(tensorflow_op_output, "name for place in graph")

DETERMINISM DEBUG TOOL
- Inserts back-propagatable monitor ops for: list, named-tuple, dict, or element
- element is int, float, string, or tf.Tensor (including zero-dimensional tensor)
- recursively, e.g. list-of-named-tuples-of-elements

DETERMINISM DEBUG TOOL
Some of the other types of monitors:
- probe.monitor_keras(): for monitoring the output of a Keras layer
- probe.monitor_gradients(): place between compute_gradients() and apply_gradients()
- probe.summarize_trainable_variables(): use before training, after each step, or at the end of training
Also monitoring tools for tf.estimator and tf.keras, gradients, and trainable variables


CONVOLUTION: Back-Prop to Weight Gradients
[Figure: for each output channel, the weight gradients ∂loss/∂weights of the C x H x W filter are produced by a reduction over the output gradients ∂loss/∂(convolution output) across the N x H x W convolution output.]

CONVOLUTION: Back-Prop to Data Gradients
[Figure: for each batch index, the input gradients ∂loss/∂input over the C-input x H x W input are produced by a reduction over the output gradients ∂loss/∂output across the C-output x H x W output.]

CONVOLUTION: Matrix-Multiplication Hierarchical Reduction
[Figure: partial gradients from each thread-block are combined by a final reduction into the final gradients.]

CONVOLUTION: CUDA atomicAdd()
[Figure: warps A and B both call atomicAdd() on address 0x10000, which initially holds 1.0; warp A adds 5.0 and warp B adds 4.0. The atomics unit serializes the read-modify-write operations near memory, so the final value is 10.0, but the order in which the two additions are applied is not defined.]

CONVOLUTION: atomicAdd() Advantages
- Serializes operations without stalling parallel threads
- Assures atomic read-modify-write of memory, i.e. avoids race conditions
- Very easy to program: no need to synchronize between thread-blocks
- Very fast read-modify-write loop near memory/cache

CONVOLUTION: Floating-Point Rounding Errors
A + B = B + A
(A + B) + C != A + (B + C), usually
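
A quick way to see the non-associativity that makes reduction order matter (my own illustration, not from the slides):

    # Floating-point addition is commutative but not associative, so the order
    # in which partial sums are combined changes the final bits of the result.
    a, b, c = 0.1, 0.2, 0.3
    print((a + b) + c == a + (b + c))   # False
    print((a + b) + c, a + (b + c))     # 0.6000000000000001 vs 0.6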

CONVOLUTION: Root Cause and Solution
- Root causes: CUDA atomicAdd() and TensorFlow cuDNN auto-tuning
- TF_CUDNN_DETERMINISTIC disables auto-tuning and selects deterministic cuDNN convolution algorithms
- Added to TensorFlow master branch: bit.ly/tf-pr-24747

    export TF_CUDNN_DETERMINISTIC=true
    python tf_training_script.py

or, in the script itself:

    #!/usr/bin/python
    import os
    import tensorflow as tf
    os.environ['TF_CUDNN_DETERMINISTIC'] = 'true'
    # build a graph

BIAS ADDITION: Root Cause
[Figure: the bias gradient ∂loss/∂bias (a single value per output channel) is produced by a reduction of the output gradients ∂loss/∂(bias add output) over the N x H x W bias-add output.]
tensorflow.python.ops.nn.bias_add() uses CUDA atomicAdd()

BIAS ADDITION: Temporary Solution
- Dynamically patch tensorflow.python.ops.nn.bias_add()
- Use deterministic ops, including implicit broadcasting:

    if data_format == 'NCHW':
        value = tf.math.add(value, tf.reshape(bias, (1, tf.size(bias), 1, 1)))
    elif data_format == 'NHWC' or data_format is None:
        value = tf.math.add(value, bias)

    from tensorflow-determinism import patch
    patch.bias_add()
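
For reference, the same replacement logic as a self-contained function (the function name is mine; the body mirrors the snippet above):

    import tensorflow as tf  # TF 1.x style

    def deterministic_bias_add(value, bias, data_format=None):
        """Add bias via broadcasting instead of tf.nn.bias_add(),
        avoiding the CUDA atomicAdd() path in its back-prop."""
        if data_format == 'NCHW':
            # Reshape bias to [1, C, 1, 1] so it broadcasts over N, H, W.
            return tf.math.add(value, tf.reshape(bias, (1, tf.size(bias), 1, 1)))
        # 'NHWC' or unspecified: bias broadcasts over the last dimension.
        return tf.math.add(value, bias)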

RARER NON-DETERMINISM
- tf.nn.fused_batch_norm() back-prop
  - Approximately every 10 steps
  - Temporary solution: run on CPU
- gate_gradients = tf.train.Optimizer.GATE_OP (default)
  - optimizer.compute_gradients() parameter
  - Approximately every 100 steps
  - GATE_GRAPH is guaranteed to be deterministic (see the sketch below)
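
A minimal sketch of switching the gradient gating (TF 1.x API; the variable, loss, and learning rate are toy placeholders of mine):

    import tensorflow as tf  # TF 1.x style

    w = tf.Variable([1.0, 2.0])
    loss = tf.reduce_sum(tf.square(w))    # toy loss for illustration
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
    # GATE_GRAPH waits for all gradients to be computed before any are used;
    # the talk notes this setting is guaranteed to be deterministic.
    grads_and_vars = optimizer.compute_gradients(
        loss, gate_gradients=tf.train.Optimizer.GATE_GRAPH)
    train_op = optimizer.apply_gradients(grads_and_vars)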

RAREST NON-DETERMINISM
- Every few thousand steps, at random locations
- Changed from a Pascal to a Volta card: non-determinism persisted
- Added the ability to dump and compare probed tensors between runs
- Suspected memory allocation and ownership (time / location)
- Ran on a cluster: fully deterministic
- Updated my driver: fully deterministic locally
- Possible causes: off-by-one memory allocation, incorrect cache invalidation, race conditions, clock speed, interface trims
- batch-norm and gate_gradients fixes not required

INTERIM STATUS
- Autonomous-vehicle production model training fully deterministically and correctly on millions of examples
- TensorFlow determinism debugging tool developed
- Deterministic cuDNN convolution fixes upstreamed to TensorFlow master branch

SINGLE-GPU PERFORMANCE
Proprietary AV perception model, with the unoptimized bias-add solution
[Chart: deterministic vs. non-deterministic training throughput; deterministic shows a 6% decrease]

MULTI-GPU WITH HOROVOD
- Based on the single-GPU determinism recipe
- Two GPUs: deterministic out of the box
- More than two GPUs: non-deterministic
- Horovod uses NCCL2 ring-allreduce

RING-ALLREDUCE
[Figure: four GPUs arranged in a ring, each holding a chunk (A, B, C, D), exchanging and reducing chunks over three steps.]
Patarasuk, Pitch & Yuan, Xin (2007). Bandwidth Efficient All-reduce Operation on Tree Topologies. 1-8. 10.1109/IPDPS.2007.370405.
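
To make the communication pattern concrete, here is a tiny synchronous pure-Python simulation of ring-allreduce (reduce-scatter followed by all-gather). It is my own illustration of the algorithm, not how NCCL is implemented:

    def ring_allreduce(node_data):
        """node_data: list of n lists, each a vector split into n numeric chunks.
        Returns the per-node data after allreduce (every node holds the sum)."""
        n = len(node_data)
        data = [list(chunks) for chunks in node_data]

        # Phase 1, reduce-scatter: after n-1 steps, node i owns the fully
        # reduced chunk (i + 1) % n.
        for step in range(n - 1):
            received = [data[(i - 1) % n][(i - 1 - step) % n] for i in range(n)]
            for i in range(n):
                data[i][(i - 1 - step) % n] += received[i]

        # Phase 2, all-gather: circulate the reduced chunks around the ring so
        # every node ends up with the complete summed vector.
        for step in range(n - 1):
            received = [data[(i - 1) % n][(i - step) % n] for i in range(n)]
            for i in range(n):
                data[i][(i - step) % n] = received[i]

        return data

    # Example: 4 "GPUs", each holding a 4-chunk gradient vector.
    gpus = [[1.0 * (g + 1)] * 4 for g in range(4)]   # grads of 1.0, 2.0, 3.0, 4.0
    print(ring_allreduce(gpus))                      # every node: [10.0, 10.0, 10.0, 10.0]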

HOROVOD TENSOR FUSION
- Batch-reduces partial gradient tensors as they become ready
- Order of reduction changes on each training step (apparently)
- For now, disable Tensor Fusion (an in-script alternative is sketched below):

    HOROVOD_FUSION_THRESHOLD=0 python train.py
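
As an alternative to the command-line form above, the variable can be set from inside the script; the assumption here (mine, not from the slides) is that it must be set before Horovod is initialized:

    import os
    # Disable Horovod Tensor Fusion so gradient tensors are reduced one at a
    # time, in a fixed order.
    os.environ['HOROVOD_FUSION_THRESHOLD'] = '0'

    import horovod.tensorflow as hvd
    hvd.init()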

MULTI-GPU PERFORMANCE
Using the single-GPU determinism recipe
[Chart: multi-GPU training throughput with and without Tensor Fusion]

ANOTHER REAL PROBLEM: GE Healthcare
- Segmentation and labeling: CT, Bone VCAR
- Alerts for critical conditions: X-Ray, GE Critical Care Suite
- Optimal scans: MR, GE MR AIRx

MAX-POOLING
[Figure: 2x2 max-pooling over the 3x4 input
 1 2 1 1
 1 3 1 1
 1 1 1 1
reduces to the 2x3 output
 3 3 1
 3 3 1]

MAX-POOLING: Root Cause & Solution
- Root cause: CUDA atomicAdd()
- TF_CUDNN_DETERMINISTIC selects the deterministic cuDNN max-pooling algorithm
- Added to TensorFlow master branch: bit.ly/tf-pr-25269

    export TF_CUDNN_DETERMINISTIC=true
    python tf_training_script.py

or, in the script itself:

    #!/usr/bin/python
    import os
    import tensorflow as tf
    os.environ['TF_CUDNN_DETERMINISTIC'] = 'true'
    # build a graph

CPU NON-DETERMINISM
- Noticed while I was debugging the distilled model
- Much greater variance than on GPU
- Injection occurring at the weight-update step
- Solution: use a single CPU thread (see the sketch below)
  - session_config.intra_op_parallelism_threads = 1 (default: 2)
  - session_config.inter_op_parallelism_threads = 1 (default: 5)
- Only needed when running on CPU (vs. GPU)
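
A minimal sketch of applying the single-thread session configuration (TF 1.x; the session usage is illustrative):

    import tensorflow as tf  # TF 1.x style

    # Restrict TensorFlow to a single thread within and between ops so that
    # CPU reductions always run in the same order.
    session_config = tf.ConfigProto(
        intra_op_parallelism_threads=1,
        inter_op_parallelism_threads=1)

    with tf.Session(config=session_config) as sess:
        pass  # build and run the graph here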

RESULTS (each row: sum of weights, final loss)

Training five times with no fixes (CPU):
  -13.4960977323353291   6.1724668502807614
   -9.3681446192786098   6.3305957317352295
   -9.1963089210912585   6.3364742755889889
  -13.6303959703072906   6.1670220375061033
   -9.0079690776765347   6.3340478420257567

Training five times with no fixes (GPU):
  -13.5144761633127928   6.1083775520324703
  -13.5144743174314499   6.1083775520324703
  -13.5144757004454732   6.1083775520324703
  -13.5144734960049391   6.1083775997161869
  -13.5144746471196413   6.1083775997161869

Training twice with all fixes (CPU):
   -9.6487178248353302   6.1068549633026121
   -9.6487178248353302   6.1068549633026121

Training twice with all fixes (GPU):
  -13.5144764725118876   6.1083775997161869
  -13.5144764725118876   6.1083775997161869

Training bigger config twice with all fixes (CPU):
   -8.8775541735813022   4.1930521011352537   (66.96 s)
   -8.8775541735813022   4.1930521011352537   (66.70 s)

Training bigger config twice with all fixes (GPU):
    3.7987217940390110   3.9343416929244994   (2.43 s)
    3.7987217940390110   3.9343416929244994   (2.41 s)

COMPLETE RECIPE
1. Set TF_CUDNN_DETERMINISTIC=true
   - Disables TensorFlow cuDNN auto-tuning
   - Uses deterministic cuDNN convolution back-prop algorithms
   - Uses deterministic cuDNN max-pooling algorithm
2. Dynamically patch tf.nn.bias_add()
3. Set the random seed for all random number generators:
   random.seed(SEED), np.random.seed(SEED), tf.set_random_seed(SEED)
4. HOROVOD_FUSION_THRESHOLD=0 for more than 2 GPUs
(a combined sketch of the recipe follows below)
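
Pulling the recipe together as one script-level sketch (SEED is a placeholder; the bias_add patch import is copied from the earlier slide and is commented out because its real package name may differ):

    import os
    os.environ['TF_CUDNN_DETERMINISTIC'] = 'true'     # step 1
    os.environ['HOROVOD_FUSION_THRESHOLD'] = '0'      # step 4, if using more than 2 GPUs

    import random
    import numpy as np
    import tensorflow as tf

    SEED = 123                                        # step 3: seed everything
    random.seed(SEED)
    np.random.seed(SEED)
    tf.set_random_seed(SEED)

    # Step 2: dynamically patch tf.nn.bias_add(), as shown earlier:
    # from tensorflow-determinism import patch   (spelling as on the slide)
    # patch.bias_add()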

TENSORFLOW & CUDA ATOMICS
- Analysis of TF v1.12, v1.13.1, and master branch (on 2019-03-03)
- About 13 ops use CUDA atomicAdd()
- There are ten other CUDA atomic operations, e.g. atomicCAS()
- ‘atomic’ is present in 167 files in the TensorFlow repo; some of these may be related to CUDA atomics
- CUDA atomics are not always associated with non-determinism
- There are faster, deterministic ways to reduce within thread-blocks, i.e. logarithmic tree reductions using inter-thread shuffling
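
A pure-Python illustration of the fixed-order tree-reduction idea (my own sketch; the real CUDA implementation uses warp shuffles, which this does not model):

    def tree_reduce(values):
        """Logarithmic, fixed-order pairwise reduction: because the pairing
        pattern is identical on every run, the floating-point result is
        bit-exact run to run, unlike an arrival-order atomicAdd sum."""
        vals = list(values)
        while len(vals) > 1:
            nxt = []
            for i in range(0, len(vals) - 1, 2):
                nxt.append(vals[i] + vals[i + 1])   # same pairing every pass
            if len(vals) % 2:
                nxt.append(vals[-1])                # carry an odd element forward
            vals = nxt
        return vals[0]

    data = [0.1 * i for i in range(1000)]
    # The tree-reduced value may differ from the sequential sum in its last
    # bits, but it is identical on every run because the order is fixed.
    print(tree_reduce(data), sum(data))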

INFERENCE
- All forward propagation (of course)
  - Probably no need to set TF_CUDNN_DETERMINISTIC=true
  - Possible issues with “deconvolution”
- Disable TensorFlow cuDNN auto-tuning: set TF_CUDNN_USE_AUTOTUNE=false
- TensorRT: 500 CUDA kernels, all of them deterministic
  - Timing-based auto-tuning running on the target architecture can produce different graphs on each run
  - We’re working on adding a mechanism to TensorRT to address this

PYTORCH
- Set all the seeds:
  random.seed(SEED), np.random.seed(SEED), os.environ['PYTHONHASHSEED'] = str(SEED),
  torch.manual_seed(SEED), torch.cuda.manual_seed_all(SEED)
- torch.backends.cudnn.deterministic = True
  - Covers convolution and max-pooling
- I hear that some ops may still be non-deterministic

PLAN
- Release the current solution in the NGC TensorFlow container
- TF_CUDNN_DETERMINISTIC in TensorFlow v2.0 (end of year)
- Make bias_add deterministic at the CUDA kernel level
- Open-source the determinism debug tool
- Add a single deterministic switch for all of TensorFlow
- Improve deterministic performance of Horovod
- Deterministic simulated environments for reinforcement learning

CREDITS
Tero Karras, Tim Zaman, Hao Wu, Jose Alvarez Lopez, Ben Barsdell, Rakesh Ranjan, Simon Layton, John Montrym, Jorge Albericio Latorre, Nicolas Koumchatzky, Carl Case, Nathan Luehr, Conrado Silva Miranda, Jussi Rasanen, Dilip Sequeira, Mikko Ronkainen, Xiang Bo Kong, Sharan Chetlur, Luke Durant, Kevin Brown, Marc Edgar, Cindy Riach, Mostafa Hagog, Yifang Xu, William Zhang, Lauri Peltonen, Joey Conway, Matthijs De Smedt, Kevin Vincent, Bryan Catanzaro, Michael O’Connor, Stephen Warren, Bob Keating, Andrew Kerr

TAKEAWAYS
- Neither TensorFlow nor GPUs are inherently non-deterministic
- The root cause is asynchronous floating-point operations
- Use CUDA floating-point atomic operations with care
- Deterministic kernels are often already available
- This was a hard problem to solve, but not impossible
- It’s a very important topic; a lot of people care about it
- New tools and methodology for debugging
- Automated vigilance is warranted

CALL TO ACTION
- watch: github.com/NVIDIA/tensorflow-determinism
- follow: twitter.com/DuncanARiach
- connect: www.linkedin.com/in/duncanriach
- email: duncan@nvidia.com
