Systems for Machine Learning and Machine Learning for Systems - Jeff Dean

Transcription

Machine Learning for Systems and Systems for Machine Learning
Jeff Dean, Google Brain team
g.co/brain
Presenting the work of many people at Google

Systems for Machine Learning

General Purpose Processor Performance Trends
Single-core performance plateauing after decades of exponential growth
Graph from 40 Years of Microprocessor Trend Data, Karl Rupp, CC-BY 4.0.

Just when deep learning is creating insatiable computation demands
Training powerful but computationally expensive deep models on terabyte- or petabyte-sized training datasets
Plus techniques like AutoML ("Learning to learn", Neural Architecture Search, etc.) can multiply desired training computation by 5-1000X
Inference using expensive deep models in systems with:
hundreds of thousands of requests per second
latency requirements of tens of milliseconds
billions of users

More computational power needed
Deep learning is transforming how we design computers

Special computation properties:
reduced precision ok (about 1.2, about 0.6, about 0.7, NOT 1.21042, 0.61127, 0.73989343)
handful of specific operations
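To make "reduced precision ok" concrete, here is a minimal NumPy sketch (an illustration only, not the TPU's actual arithmetic) that emulates bfloat16-style truncation by keeping only the sign, exponent, and top 7 mantissa bits of a float32:

```python
import numpy as np

def to_bfloat16_like(x):
    # Keep sign (1 bit), exponent (8 bits), and the top 7 mantissa bits;
    # zero out the low 16 bits of the float32 representation.
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

vals = np.array([1.21042, 0.61127, 0.73989343], dtype=np.float32)
print(to_bfloat16_like(vals))  # -> [1.203125, 0.609375, 0.73828125]: "about 1.2, 0.6, 0.7"
```

The point of the slide is that neural nets tolerate this kind of coarseness, which is what lets an accelerator trade mantissa bits for many more multipliers.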

Tensor Processing Unit v1
Google-designed chip for neural net inference
In production use for 36 months: used on search queries, for neural machine translation, for speech, for image recognition, for the AlphaGo match, ...
In-Datacenter Performance Analysis of a Tensor Processing Unit, Jouppi, Young, Patil, Patterson et al., ISCA 2017, arxiv.org/abs/1704.04760

TPUv1 is a huge help for inference
But what about training?
Speeding up training hugely important:
for researcher productivity, and
for increasing scale of problems that can be tackled

Tensor Processing Unit v2
Google-designed device for neural net training and inference

TPUv2 Chip
16 GB of HBM (2 x 8 GB)
600 GB/s mem BW
Scalar/vector units: 32b float
MXU: 32b float accumulation but reduced precision for multipliers
45 TFLOPS
Two 128x128 MXUs

Tensor Processing Unit v2
180 teraflops of computation, 64 GB of HBM memory, 2400 GB/s mem BW
Designed to be connected together into larger configurations

TPU Pod
64 2nd-gen TPUs
11.5 petaflops
4 terabytes of HBM memory

Programmed via TensorFlow
Same program will run with only minor modifications on CPUs, GPUs, & TPUs
Same program scales via synchronous data parallelism without modification on TPU pods
Offered via Google Cloud: Cloud TPU - a host with a 180 TFLOPS TPUv2 device attached
g.co/tpusignup
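As a rough illustration of the "same program, minor modifications" claim, here is a hedged sketch in the TF 1.x Estimator style of the era. The exact production setup is not shown in the slides; the use_tpu flag, layer sizes, and the CrossShardOptimizer toggle below are just one plausible way to write it (training mode only, for brevity):

```python
import tensorflow as tf

def model_fn(features, labels, mode, params):
    # The model itself is identical on CPU, GPU, and TPU.
    hidden = tf.layers.dense(features["x"], 128, tf.nn.relu)
    logits = tf.layers.dense(hidden, 10)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

    optimizer = tf.train.AdamOptimizer(params["lr"])
    if params["use_tpu"]:
        # Wraps the optimizer so gradients are summed across TPU shards:
        # this is the synchronous data parallelism the slide refers to.
        optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)
    train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())

    if params["use_tpu"]:
        return tf.contrib.tpu.TPUEstimatorSpec(mode=mode, loss=loss, train_op=train_op)
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)
```

Only the optimizer wrapper and the EstimatorSpec type change between targets; the rest of the model code is shared.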

Accelerated Linear Algebra (XLA)
JIT / AOT compiler for linear algebra
Targets multiple backends, e.g. CPUs, GPUs, and TPUs
Compiler, runtime, and accelerator-specific optimizer
Compiler plus CPU and GPU backends open-sourced as part of TensorFlow
The life of a neural network: model.py -> TF Estimator code -> TF Graph -> XLA (target-independent optimizations, then target-specific code generation)
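For readers who want to try the open-sourced compiler path, a minimal sketch of turning on XLA JIT compilation in TF 1.x follows (later TensorFlow versions expose the same idea through tf.function(jit_compile=True)); the toy graph is illustrative only:

```python
import tensorflow as tf

# Ask TensorFlow to JIT-compile eligible subgraphs with XLA (TF 1.x style).
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

with tf.Session(config=config) as sess:
    x = tf.random_normal([1024, 1024])
    y = tf.nn.relu(tf.matmul(x, x))   # candidate for XLA fusion and code generation
    sess.run(y)
```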

Some TPU Success Stories
Internal search ranking model training:
14.2X: 9 hours on 1/4 pod vs. 132 hours on 275 high-end CPU machines
Internal image model training:
9.8X: 22 hours on 1/4 pod vs. 216 hours on previous production setup
WaveNet production model inference:
Generates speech at 20X real time

Some TPU Success Stories
ResNet-50 to 76% accuracy:
1402 minutes (23 hours 22 minutes) on a single TPUv2 device
45 minutes on 1/2 pod (32 TPUv2 devices, 31.2X speedup)
ResNet-50 to 75% accuracy:
22 minutes on full pod (64 TPUv2 devices)
Same code, no special tricks
Plug: Come see Sam Smith's talk on "Don't Decay the Learning Rate, Increase the Batch Size" tomorrow at 8:50 AM and Chris Ying's talk "ImageNet is the new MNIST" at 9:30 AM, both in the Deep Learning at Supercomputing Scale workshop in 101B

TPU Scaling for ResNet-50

More than just ImageNet
Transformer model from "Attention is All You Need" (Vaswani et al., NIPS 2017)
WMT'14 English-German translation task
Adam optimizer - same learning rate schedule across configurations
Batch size (i/o tokens) | # TPUs | Time to PPL 4.8
16k / 16k | 1 | 17.9 hours
32k / 32k | 4 | 3.5 hours
256k / 256k | 16 | 1.1 hours
1M / 1M | 64 | 0.5 hours

Making 1000 Cloud TPUs available for free to top researchers who are committed to open machine learning research
We're excited to see what researchers will do with much more computation!
g.co/tpusignup

What should we build in future ML accelerators?

ML Arxiv Papers per Year

If you start an ASIC machine learning accelerator design today, it...
Starts to get deployed into production in 2 years
Must remain relevant through 5 years from now
Can We See The Future Clearly Enough? What should we bet on?

Some Example Questions
Precision: Will very-low-precision training (1-4 bit weights, 1-4 bit activations) work in general across all problems we care about?
Sparsity and embeddings: How should we handle:
Dynamic routing like the sparsely-gated Mixture of Experts work (ICLR'17)
Very large embeddings for some problems (e.g. 1B items x 1000D)
Batch size: Should we build machines for very large batch sizes? Or batch size 1?
Training algorithms: Will SGD-like algorithms remain the dominant training paradigm? Or will large-batch second-order methods like K-FAC be better?

Machine Learning for Systems

Learning Should Be Used Throughout our Computing Systems
Traditional low-level systems code (operating systems, compilers, storage systems) does not make extensive use of machine learning today
This should change!
A few examples and some opportunities.

Machine Learning for Higher Performance Machine Learning Models

For large models, model parallelism is important
But getting good performance given multiple computing devices is non-trivial and non-obvious

[Figure: a sequence model with LSTM 1, LSTM 2, Attention, and Softmax layers processing tokens A B C D, with the layers split across GPU1-GPU4.]
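A hedged sketch of what hand-written model parallelism like the figure's looks like in TF 1.x; the layer sizes, vocabulary size, and device strings below are illustrative, not the slide's actual model:

```python
import tensorflow as tf

inputs = tf.placeholder(tf.float32, [8, 16, 256])   # [batch, time, features]

with tf.device("/gpu:0"):                            # first LSTM layer on GPU 0
    lstm1 = tf.nn.rnn_cell.LSTMCell(512)
    out1, _ = tf.nn.dynamic_rnn(lstm1, inputs, dtype=tf.float32, scope="lstm1")

with tf.device("/gpu:1"):                            # second LSTM layer on GPU 1
    lstm2 = tf.nn.rnn_cell.LSTMCell(512)
    out2, _ = tf.nn.dynamic_rnn(lstm2, out1, dtype=tf.float32, scope="lstm2")

with tf.device("/gpu:2"):                            # softmax layer on GPU 2
    logits = tf.layers.dense(out2, 32000)
```

Choosing these tf.device annotations by hand is exactly the non-trivial part, which motivates learning the placement instead, as described next.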

Reinforcement Learning for Higher Performance Machine Learning Models
Placement model (trained via RL) gets the graph and a set of devices as input, and outputs a device placement for each graph node
Measured time per step gives the RL reward signal
Device Placement Optimization with Reinforcement Learning, Azalia Mirhoseini, Hieu Pham, Quoc Le, Mohammad Norouzi, Samy Bengio, Benoit Steiner, Yuefeng Zhou, Naveen Kumar, Rasmus Larsen, and Jeff Dean, ICML 2017, arxiv.org/abs/1706.04972

Device Placement with Reinforcement Learning
Placement model (trained via RL) gets the graph and a set of devices as input, and outputs a device placement for each graph node; measured time per step gives the RL reward signal
19.3% faster vs. expert human for a neural translation model
19.7% faster vs. expert human for an InceptionV3 image model
Plug: Come see Azalia Mirhoseini's talk on "Learning Device Placement" tomorrow at 1:30 PM in the Deep Learning at Supercomputing Scale workshop in 101B
Device Placement Optimization with Reinforcement Learning, Mirhoseini et al., ICML 2017, arxiv.org/abs/1706.04972
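A toy sketch of the loop described above (not the paper's actual sequence-to-sequence policy network): a per-node softmax policy proposes a placement, the measured step time supplies a negative reward, and a REINFORCE update with a moving-average baseline nudges the policy toward faster placements. The cost model and problem sizes are invented stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, num_devices = 20, 4
logits = np.zeros((num_nodes, num_devices))         # per-node placement policy (softmax logits)
node_cost = rng.uniform(0.5, 1.5, size=num_nodes)   # invented per-op compute costs

def measure_step_time(placement):
    # Stand-in for actually running and timing the graph: step time is roughly
    # the most heavily loaded device, plus measurement noise.
    load = np.zeros(num_devices)
    for n, d in enumerate(placement):
        load[d] += node_cost[n]
    return load.max() + rng.normal(0, 0.01)

baseline, lr = None, 0.1
for _ in range(500):
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    placement = [rng.choice(num_devices, p=p) for p in probs]
    reward = -measure_step_time(placement)          # faster step -> higher reward
    baseline = reward if baseline is None else 0.9 * baseline + 0.1 * reward
    advantage = reward - baseline
    for n, d in enumerate(placement):               # REINFORCE: d log pi / d logits = onehot - probs
        grad = -probs[n]
        grad[d] += 1.0
        logits[n] += lr * advantage * grad
```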

Learned Index Structures, not Conventional Index Structures

B-Trees are ModelsThe Case for Learned Index Structures, Tim Kraska, Alex Beutel, Ed Chi, Jeffrey Dean & Neoklis Polyzotis, arxiv.org/abs/1712.01208

Indices as CDFsThe Case for Learned Index Structures, Tim Kraska, Alex Beutel, Ed Chi, Jeffrey Dean & Neoklis Polyzotis, arxiv.org/abs/1712.01208

Does it Work?
Index of 200M web service log records
Type | Config | Lookup time | Speedup vs. B-tree | Size (MB) | Size vs. B-tree
B-tree | page size: 128 | 260 ns | 1.0X | 12.98 MB | 1.0X
Learned index | 2nd stage size: 10000 | 222 ns | 1.17X | 0.15 MB | 0.01X
Learned index | 2nd stage size: 50000 | 162 ns | 1.60X | 0.76 MB | 0.05X
Learned index | 2nd stage size: 100000 | 144 ns | 1.67X | 1.53 MB | 0.12X
Learned index | 2nd stage size: 200000 | 126 ns | 2.06X | 3.05 MB | 0.23X
The Case for Learned Index Structures, Tim Kraska, Alex Beutel, Ed Chi, Jeffrey Dean & Neoklis Polyzotis, arxiv.org/abs/1712.01208
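To see why "indices as CDFs" works, here is a toy sketch (not the paper's recursive model index): a model that approximates the keys' cumulative distribution predicts a record's position, and a bounded local search corrects the prediction. A simple linear fit stands in for the learned model, and the data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
keys = np.sort(rng.lognormal(size=100_000))          # sorted keys of the "table"
positions = np.arange(len(keys))

# "Model" of the CDF: here just a linear fit from key -> position.
slope, intercept = np.polyfit(keys, positions, deg=1)
pred = (slope * keys + intercept).astype(np.int64)
max_err = int(np.abs(pred - positions).max())        # worst-case model error

def lookup(key):
    guess = int(slope * key + intercept)
    lo = max(0, guess - max_err)
    hi = min(len(keys), guess + max_err + 1)
    # Search only within the model's error bound instead of the whole array.
    return lo + int(np.searchsorted(keys[lo:hi], key))

i = lookup(keys[12_345])
assert keys[i] == keys[12_345]
```

A better model shrinks max_err, which shrinks both the search window and the index size; that trade-off is what the table above is measuring.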

Hash TablesThe Case for Learned Index Structures, Tim Kraska, Alex Beutel, Ed Chi, Jeffrey Dean & Neoklis Polyzotis, arxiv.org/abs/1712.01208

Bloom Filters
Model is a simple RNN
W is the number of units in the RNN layer, E is the width of the character embedding
2X space improvement over a Bloom filter at the same false positive rate
The Case for Learned Index Structures, Tim Kraska, Alex Beutel, Ed Chi, Jeffrey Dean & Neoklis Polyzotis, arxiv.org/abs/1712.01208
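The learned Bloom filter idea can be sketched in a few lines. This is a toy illustration: a stand-in scoring function replaces the paper's character-RNN, and the backup filter parameters are arbitrary. The key invariant is that real keys the model would reject go into a small conventional backup filter, so there are never false negatives.

```python
import hashlib

class BloomFilter:
    """Plain Bloom filter used as the backup structure."""
    def __init__(self, num_bits, num_hashes=4):
        self.bits = bytearray(num_bits)
        self.num_bits, self.num_hashes = num_bits, num_hashes
    def _indexes(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits
    def add(self, key):
        for idx in self._indexes(key):
            self.bits[idx] = 1
    def __contains__(self, key):
        return all(self.bits[idx] for idx in self._indexes(key))

class LearnedBloomFilter:
    """Model scores membership; real keys the model would reject are stored in a
    small backup Bloom filter, so there are no false negatives."""
    def __init__(self, model, threshold, keys, backup_bits):
        self.model, self.threshold = model, threshold
        self.backup = BloomFilter(backup_bits)
        for k in keys:
            if model(k) < threshold:      # model alone would miss this key
                self.backup.add(k)
    def __contains__(self, key):
        return self.model(key) >= self.threshold or key in self.backup

# Usage with an invented stand-in scorer (the paper trains a small character RNN):
keys = [f"user-{i}" for i in range(10_000)]
toy_model = lambda k: 1.0 if k.startswith("user-") else 0.0
lbf = LearnedBloomFilter(toy_model, threshold=0.5, keys=keys, backup_bits=8 * 1024)
assert "user-42" in lbf and "not-a-key" not in lbf
```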

Machine Learning for Improving Datacenter Efficiency

Machine Learning to Reduce Cooling Cost in Datacenters
[Chart: ML Control On vs. ML Control Off]
Collaboration between DeepMind and Google Datacenter operations teams.
See the DeepMind blog post "DeepMind AI reduces Google data centre cooling bill by 40%".

Where Else Could We Use Learning?

Computer Systems are Filled With Heuristics
Compilers, networking code, operating systems, ...
Heuristics have to work well "in general case"
Generally don't adapt to actual pattern of usage
Generally don't take into account available context

Anywhere We're Using Heuristics To Make a Decision!
Compilers: instruction scheduling, register allocation, loop nest parallelization strategies, ...
Networking: TCP window size decisions, backoff for retransmits, data compression, ...
Operating systems: process scheduling, buffer cache insertion/replacement, file system prefetching, ...
Job scheduling systems: which tasks/VMs to co-locate on the same machine, which tasks to pre-empt, ...
ASIC design: physical circuit layout, test case selection, ...

Anywhere We've Punted to a User-Tunable Performance Option!
Many programs have huge numbers of tunable command-line flags, usually not changed from their defaults
--eventmanager_threads=16
--bigtable_scheduler_batch_size=8
--mapreduce_merge_memory=134217728
--lexicon_cache_size=1048576
--storage_server_rpc_freelist_size=128
...

Meta-learn everything
ML:
learning placement decisions
learning fast kernel implementations
learning optimization update rules
learning input preprocessing pipeline steps
learning activation functions
learning model architectures for specific device types, or that are fast for inference on mobile device X
learning which pre-trained components to reuse, ...
Computer architecture/datacenter networking design:
learning best design properties by exploring design space automatically (via simulator)

Keys for Success in These Settings
(1) Having a numeric metric to measure and optimize
(2) Having a clean interface to easily integrate learning into all of these kinds of systems
Current work: exploring APIs and implementations
Basic ideas:
Make a sequence of choices in some context
Eventually get feedback about those choices
Make this all work with very low overhead, even in distributed settings
Support many implementations of core interfaces
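The slide only names the idea; below is a hypothetical sketch of what such a "choices plus delayed feedback" interface might look like. The class name, methods, and the simple epsilon-greedy learner are invented for illustration and are not an actual Google API.

```python
import random
from collections import defaultdict

class ChoicePoint:
    """A low-overhead decision point embedded in systems code: pick one of
    `options` given a small context, and learn from feedback that arrives later."""
    def __init__(self, name, options, epsilon=0.1):
        self.name, self.options, self.epsilon = name, options, epsilon
        self.value = defaultdict(lambda: defaultdict(float))  # context -> option -> avg reward
        self.count = defaultdict(lambda: defaultdict(int))

    def choose(self, context):
        if random.random() < self.epsilon:                    # explore occasionally
            return random.choice(self.options)
        scores = self.value[context]
        return max(self.options, key=lambda o: scores[o])     # exploit best so far

    def feedback(self, context, option, reward):
        self.count[context][option] += 1
        c = self.count[context][option]
        v = self.value[context][option]
        self.value[context][option] = v + (reward - v) / c    # running average

# Example: picking a buffer-cache insertion policy per workload type.
cache_policy = ChoicePoint("cache_insert", ["lru", "lfu", "bypass"])
policy = cache_policy.choose(context="scan-heavy")
# ... later, after observing the hit rate that resulted from this decision ...
cache_policy.feedback("scan-heavy", policy, reward=0.42)
```

The interface deliberately separates choose() from feedback(), since in real systems the reward (hit rate, latency, throughput) often arrives long after the decision was made.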

Conclusions
ML hardware is in its infancy. Even faster systems and wider deployment will lead to many more breakthroughs across a wide range of domains.
Learning in the core of all of our computer systems will make them better and more adaptive. There are many opportunities for this.
More info about our work at g.co/brain
