TensorFlow W/XLA: TensorFlow, Compiled!

Transcription

TensorFlow w/XLA: TensorFlow, Compiled!
Expressiveness with performance

Jeff Dean, Google Brain team (g.co/brain)
Presenting work done by the XLA team and the Google Brain team

Pre-release documentation (or search the GitHub repository for resources/xla_prerelease.html)

"It takes a village to raise a compiler." - Ancient proverb

Why Did We Build TensorFlow?
- Wanted a system that was flexible, scalable, and production-ready
- DistBelief, our first system, was good on two of these, but lacked flexibility
- Most existing open-source packages were also good on two of the three, but not all three

TensorFlow Goals
- Establish a common platform for expressing machine learning ideas and systems
- Make this platform the best in the world for both research and production use
- Open source it so that it becomes a platform for everyone, not just Google

Facts and Figures
- Launched on Nov. 9, 2015
- Reasonably fully-featured: auto differentiation, queues, control flow, a fairly comprehensive set of ops, ...
- Tutorials made the system accessible
- Out-of-the-box support for CPUs, GPUs, multiple devices, multiple platforms

Some Stats
- 500 contributors, most of them outside Google
- 11,000 commits since Nov. 2015
- 1M binary downloads
- #16 most popular repository on GitHub by stars
- Used in ML classes at quite a few universities now: Toronto, Berkeley, Stanford, ...
- Many companies/organizations using TensorFlow: Google, DeepMind, OpenAI, Twitter, Snapchat, Airbus, Uber, ...

TensorFlow Strengths
- Flexible
- Expressive
- Extensible

Just-In-Time Compilation
via XLA, the "Accelerated Linear Algebra" compiler

TF graphs go in, optimized & specialized assembly comes out:

    ... (%rdx), %rax
    vmovaps (%rax), %xmm0
    vmulps  %xmm0, %xmm0, %xmm0
    vmovaps %xmm0, (%rdi)
    ...

Let's explain that!

Demo: Inspect JIT code in a TensorFlow iPython shell
- XLA:CPU
- XLA:GPU

What's JIT all about?
- Program built at runtime
- Low-overhead compilation
- Dim variables (e.g. batch size) can bind very late
- Prototype with the freedom of TF development (a usage sketch follows below)
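As a rough illustration of this workflow, here is a minimal sketch using the TF 1.x-era Python API as it stood around the pre-release; the global_jit_level knob, shapes, and values are assumptions for the example, not the definitive interface. The graph is written normally, and the session is simply asked to auto-cluster compilable ops and hand them to XLA:

    import numpy as np
    import tensorflow as tf

    # Ask the session to find JIT-compilable clusters and hand them to XLA.
    # (Pre-release / TF 1.x-era knob; the exact option may differ in your build.)
    config = tf.ConfigProto()
    config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

    x = tf.placeholder(tf.float32, shape=[None, 1024])   # batch dim binds late
    w = tf.Variable(tf.random_normal([1024, 1024]))
    y = tf.nn.relu(tf.matmul(x, w))

    with tf.Session(config=config) as sess:
        sess.run(tf.global_variables_initializer())
        out = sess.run(y, feed_dict={x: np.random.rand(8, 1024)})  # dims bound here
        print(out.shape)

Note how the batch dimension is left unknown at graph-construction time and only binds when data is fed, which is exactly the "dims can bind very late" point above.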

TF-Level Block Diagram
Target graphs explicitly at an XLA "device"
[Diagram: TensorFlow graph layer and TF Auto-JIT feeding either the existing TensorFlow core (TF CPU / GPU / TPU ops) or the XLA backends (XLA:CPU, XLA:GPU, XLA:TPU)]

TF-Level Block Diagram
Or let TF find JIT-compilable op clusters for you!
[Same diagram as above]

TF-Level Block Diagram
Things that don't compile can still be placed on existing devices
[Same diagram as above]
(A device-placement sketch follows below.)
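As a hedged sketch against the pre-release API (the XLA_CPU device and its naming are assumptions about a build with XLA enabled), a subgraph can be pinned explicitly to an XLA "device", while everything else keeps running on the existing devices:

    import tensorflow as tf

    # Explicitly target an XLA "device"; the placed cluster is JIT-compiled as a whole.
    # (Assumes a build with the XLA_CPU device registered; device naming may vary.)
    with tf.device("/device:XLA_CPU:0"):
        a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
        b = tf.nn.softmax(tf.matmul(a, a))

    # Ops left outside the scope are placed on the ordinary CPU/GPU devices.
    with tf.Session() as sess:
        print(sess.run(b))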

Complementary Attributes!
Flexible / Expressive / Extensible
Think & write this way: Interpreted, Dynamic, Stateful, "Black-Box" Modular
...but get the optimization benefits of these: Compiled, Static, Pure, Primitives

What has us excited?
Server-side speedups
- XLA's JIT compilation and specialization
- Significant performance wins
- SyntaxNet latency reductions: 200µs → 5µs (extreme case)

What has us excited?
Mobile footprint reductions
- XLA's Ahead-of-Time compilation
- Turns models into executables
- Eliminates much of the TensorFlow runtime
- Cross-compile for ARM, PPC, x86
- LSTM model for mobile: 1MB → 10s of KBs

What has us excited?
Whole-Program Analysis made easy
- XLA's High-Level Optimizer
- Reusable toolkit of global optimizations
- Layout (e.g. dim order, cache-line padding) is parameterized
- Mix & match platform-agnostic & target-specific passes

Caveats?
It's still early days!
- Not all TensorFlow ops compile (note: some won't compile by design, e.g. DynamicStitch)
- Wins are accumulating day by day, but not everything is faster yet
- We haven't devoted equal time to all platforms; with the community we believe we could do much more!
- Open-source release in O(1 month) - best time to start the dialogue :-)

(That being said...)
Benchmark Results: TF:XLA:GPU vs TF:GPU

[Benchmark charts: "XLA gives 30% speedup", "XLA gives 20% speedup"]
Increasing complexity from "toy demo" to "large, complex neural nets"...

[Benchmark charts: "XLA gives 50% speedup", "XLA gives 80% speedup"]
Ah, more real! LSTMs have element-wise ops the compiler "fuses" - more on that later.

[Benchmark charts: "XLA gives 20% speedup", "XLA gives 20% speedup"]
Very real: Neural Machine Translation! (https://goo.gl/SzbQCS)
Full-model runs also indicate 20% speedup

Yay!
[Benchmark chart: "XLA gives 20% speedup"]
New compiler optimizations tend to benefit across many models

Compilation benefits
- Specializes the code for your computation
- Eliminates op dispatch overhead
- Fuses ops: avoids round trips to memory (see the sketch after this list)
- Analyzes buffers: reuses memory, updates in-place
- Unrolls, vectorizes via known dimensions
- Reduced executable size: generate only what you need!
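To illustrate the fusion point, here is the kind of element-wise LSTM-cell arithmetic mentioned in the benchmarks, written as ordinary TensorFlow ops (shapes invented for the example). Interpreted op-by-op, every multiply, add, sigmoid, and tanh makes its own round trip to memory; compiled with the JIT shown earlier, XLA can fuse the whole expression into a single kernel specialized to the known dimensions.

    import numpy as np
    import tensorflow as tf

    # Element-wise LSTM-cell math built from primitive ops.
    c = tf.placeholder(tf.float32, [128, 256])                    # cell state
    i, f, o, g = (tf.placeholder(tf.float32, [128, 256]) for _ in range(4))

    new_c = tf.sigmoid(f) * c + tf.sigmoid(i) * tf.tanh(g)        # fusable chain
    new_h = tf.sigmoid(o) * tf.tanh(new_c)

    with tf.Session() as sess:
        feed = {t: np.random.rand(128, 256) for t in (c, i, f, o, g)}
        print(sess.run(new_h, feed).shape)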

Under the Hood

XLA program = static, decomposed TF ops
- Math-looking primitive ops
- Make macro-ops by composition
- Supports many neural net definitions

Classic TensorFlow example
[Graph diagram: examples, weights, and biases flowing through MatMul, Add, Max(0.0, ...), and Softmax]
Math! We get it.

Classic TensorFlow example
[Graph diagram: examples and weights → MatMul; biases → Add; Max(0.0, ...); Softmax; labels]
Mathier!

Classic TensorFlow example
[Same graph diagram as above]
Aha, one of these things is not like the others.

A key question:
Why write every new macro-op in C++? Why can't we just compose them out of existing TF ops?
An answer: you don't want to pay a performance penalty.
But what if op composition had the performance of C++?

TensorFlow:XLA bridge does built-in op decomposition for you
The kind of stuff C++ SoftMax code has inside:

    auto weighted = Dot(input, weights);
    auto weighted_sum = Add(weighted, biases, /*broadcast=*/{1});
    auto max_activation = Reduce(
        weighted_sum, Constant(MinValue(F32)), Max, /*reduce_dims=*/{1});
    auto activations_normalized =
        Exp(Sub(weighted_sum, max_activation, /*broadcast=*/{0}));
    auto activations_sum = Reduce(
        activations_normalized, Constant(0.0f), Add, /*reduce_dims=*/{1});
    auto predicted = Div(
        activations_normalized, activations_sum, /*broadcast=*/{0});

primitive operation composition → fused & optimized composite kernel

Automatic Operation Fusion
- XLA composes & specializes primitive operations
- Note: this is all expressible in TensorFlow (see the sketch below); it just wasn't done that way due to performance concerns
- XLA removes the performance concern
- Avoids a combinatorial explosion of hand-written op fusions (e.g. for a custom LSTM cell): macro-ops * primitives * dim sizes * backends * devices!
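For comparison, here is roughly the same softmax decomposition written with primitive TensorFlow ops in Python (a sketch mirroring the bridge code above, not the actual bridge implementation). This is exactly the kind of composition the slide says was avoided for performance reasons; with XLA fusing the primitives, it no longer needs to be a hand-written C++ macro-op.

    import tensorflow as tf

    def softmax_from_primitives(examples, weights, biases):
        # Mirrors the bridge's decomposition: dot, broadcast add, max-reduce,
        # exp, sum-reduce, divide. XLA can fuse these into one composite kernel.
        weighted = tf.matmul(examples, weights)
        weighted_sum = weighted + biases
        max_activation = tf.reduce_max(weighted_sum, axis=1, keep_dims=True)
        normalized = tf.exp(weighted_sum - max_activation)
        return normalized / tf.reduce_sum(normalized, axis=1, keep_dims=True)

Called with a [batch, features] examples tensor, a [features, classes] weight matrix, and a [classes] bias vector, this builds the same computation the monolithic Softmax op represents.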

XLA APIs
(never seen by normal TensorFlow users)

XLA Block Diagram
[Diagram: TensorFlow → ComputationBuilder API, which builds "HLO IR" → High-Level Optimizer (HLO): target-independent → lowering to "LLO IR" → Low-Level Optimizer (LLO): target-specific → assembled code generation, with an Executor API and a code cache]

XLA is Designed for Reuse
Retargetability & pragmatism
- Pluggable backends
- HLO pass "toolkit"
- Can emit calls to libraries like BLAS or cuDNN
- Either use LLVM, or bring your own Low-Level Optimizer

Minimal XLA backend:
- An LLVM pipeline
- A StreamExecutor plugin

XLA: let's instantiate it for different platforms!
[Diagram: TensorFlow → ComputationBuilder API → High-Level Optimizer (HLO) → Low-Level Optimizer (LLO) → in-memory executable object & code cache, with an Executor API, TransferManager, and StreamExecutor]

XLA:CPU
[Same diagram, instantiated with StreamExecutor:Host and LLVM:{target}, producing an in-memory {ARM, PPC, x86} JIT blob]

XLA:GPU:CUDA
[Same diagram, instantiated with StreamExecutor:CUDA and LLVM:NVPTX, producing in-memory kernels & library calls]

XLA:GPU:OpenCL
[Same diagram, instantiated with StreamExecutor:OpenCL and LLVM:{target}, producing in-memory kernels & library calls]

{CPU, GPU} HLO pipeline; one slide each

cpu_compiler.cc
Mixes target-independent & target-dependent passes in one pipeline:

    HloPassPipeline pipeline("CPU");
    pipeline.AddPass<Inliner>()
        .AddPass<ConvCanonicalization>()
        .AddPass<HloPassFix<ReshapeMover>>()
        .AddPass<HloSubcomputationUnification>()
        .AddPass<HloCSE>(/*is_layout_sensitive=*/false)
        .AddPass<CpuInstructionFusion>()
        .AddPass<CpuLayoutAssignment>()
        .AddPass<HloPassFix<AlgebraicSimplifier>>(
            /*is_layout_sensitive=*/true, /*add_bitcasts=*/true)
        .AddPass<HloCSE>(/*is_layout_sensitive=*/true)
        .AddPass<CopyInsertion>()
        .AddPass<ParallelizationPreparation>();
    pipeline.Run(hlo_module);

gpu_compiler.cc
Passes are reused across targets; specialize/optimize for the runtime-observed device.
(Not shown: buffer assignment & stream assignment too!)

    HloPassPipeline pipeline("GPU");
    pipeline.AddPass<ConvolutionFolding>()
        .AddPass<ReshapeMover>()
        .AddPass<TransposeFolding>()
        .AddPass<HloSubcomputationUnification>()
        .AddPass<HloCSE>(/*is_layout_sensitive=*/false)
        .AddPass<HloPassFix<ReduceFactorizer>>(
            device_desc.threads_per_core_limit() * device_desc.core_count())
        .AddPass<HloPassFix<AlgebraicSimplifier>>(false)
        .AddPass<ReduceSplitter>()
        .AddPass<GpuInstructionFusion>(/*may_duplicate=*/false)
        .AddPass<PadInsertion>()
        .AddPass<GpuLayoutAssignment>()
        .AddPass<HloPassFix<AlgebraicSimplifier>>(
            /*is_layout_sensitive=*/true, /*add_bitcasts=*/true)
        .AddPass<HloCSE>(/*is_layout_sensitive=*/true)
        .AddPass<GpuCopyInsertion>();
    pipeline.Run(hlo_module);

XLA: Prototype to Deployment
Potential at various phases of the lifecycle:
- JIT compilation when prototyping
- Compilation caching as you scale
- AoT compilation for mobile/embedded & latency
- Control & observe static properties of the program, e.g. peak memory usage

Future Work
- ALWAYS MORE PERFORMANCE!
- Multi-device-targeting compilation
- Cross-layer optimizations
- Sparse operation support
- Feedback-directed optimization & auto-tuning

Conclusions:
- XLA release for TensorFlow is coming soon!
- Performance will improve across the board
- Write the code naturally; let the compiler deal with performance
- Modular infrastructure
- Whole-program optimization
- Mix compilation & library techniques
- Easy to target a wide variety of different kinds of HW
- Pre-release documentation (or search the TensorFlow GitHub repository for resources/xla_prerelease.html)

Backup slides in case the internet doesn't work for the video

TensorFlow w/XLA: TensorFlow, Compiled!
Expressiveness with performance
Jeff Dean, Google Brain team, g.co/brain