Deep Learning Cookbook: Technology Recipes to Run Deep Learning Workloads

Transcription

Deep Learning Cookbook: technology recipes to run deep learning workloads
Natalia Vassilieva, Sergey Serebryakov

Deep learning applications

- Vision: search & information extraction, security/video surveillance, self-driving cars, medical imaging, robotics
- Speech: interactive voice response (IVR) systems, voice interfaces (mobile, cars, gaming, home), security (speaker identification), health care, simultaneous interpretation
- Text: search and ranking, sentiment analysis, machine translation, question answering
- Other: recommendation engines, advertising, fraud detection, AI challenges, drug discovery, sensor data analysis, diagnostic support

Deep learning ecosystem: software (frameworks such as Keras) running on hardware.

How to pick the right hardware/software stack? Does one size fit all?

Applications break down

By data type:
- Images: tissue classification in medical images
- Video: video surveillance
- Speech: speech recognition
- Text: sentiment analysis
- Sensor: predictive maintenance
- Other: fraud detection

By task:
- Detection: look for a known object/pattern
- Generation: generate content
- Classification: assign a label from a predefined set of labels
- Anomaly detection: look for abnormal, unknown patterns

Types of artificial neural networks: topology is chosen to fit the data characteristics.
- Images: convolutional networks (CNN)
- Speech, time series, sequences: fully connected (FC) and recurrent (RNN) networks
(The slide shows example diagrams: input layer, hidden layers 1-3, output layer.)
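The ecosystem slide above mentions Keras; as an illustration, here is a minimal Keras sketch of the two topologies. All layer sizes are illustrative choices, not taken from the deck.

```python
# Minimal Keras sketch contrasting the two topologies above.
# Layer sizes are illustrative, not from the slides.
from tensorflow.keras import layers, models

# Convolutional network for images (e.g. 224x224 RGB, 10 classes)
cnn = models.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(224, 224, 3)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(10, activation="softmax"),
])

# Recurrent network for sequences (e.g. 100 steps of 40-dim features)
rnn = models.Sequential([
    layers.LSTM(128, input_shape=(100, 40)),
    layers.Dense(10, activation="softmax"),
])

cnn.summary()
rnn.summary()
```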

One size does NOT fit all. The right stack depends on:
- Application
- Data type
- Data size
- Model (topology of the artificial neural network): how many layers, how many neurons per layer, connections between neurons (types of layers)

Popular models

Name                Type  Model size (# params)  Model size (MB)  GFLOPs (forward pass)
AlexNet             CNN   60,965,224             233 MB           0.7
GoogleNet           CNN   6,998,552              27 MB            1.6
VGG-16              CNN   138,357,544            528 MB           15.5
VGG-19              CNN   143,667,240            548 MB           19.6
ResNet50            CNN   25,610,269             98 MB            3.9
ResNet101           CNN   44,654,608             170 MB           7.6
ResNet152           CNN   60,344,387             230 MB           11.3
Eng Acoustic Model  RNN   34,678,784             132 MB           0.035
TextCNN             CNN   151,690                0.6 MB           0.009
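The "Model size (MB)" column follows directly from the parameter count at 4 bytes per FP32 weight; a quick sanity check in Python, using figures from the table above:

```python
# Sanity check: model size in MB = #params * 4 bytes (FP32 weights).
# Parameter counts are from the table above.
params = {
    "AlexNet": 60_965_224,
    "GoogleNet": 6_998_552,
    "VGG-16": 138_357_544,
    "ResNet152": 60_344_387,
}
for name, n in params.items():
    mb = n * 4 / 2**20          # 4 bytes per FP32 parameter
    print(f"{name:10s} {n:>12,d} params ~ {mb:6.0f} MB")
# AlexNet ~233 MB, GoogleNet ~27 MB, VGG-16 ~528 MB, ResNet152 ~230 MB
```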

Compute requirements

Example: ResNet152 (CNN), 60,344,387 parameters, 230 MB, 11.3 GFLOPs per forward pass.
Training data: 14M images (ImageNet).
FLOPs per epoch: 3 × 11.3×10^9 × 14×10^6 ≈ 5×10^17, where the factor of 3 accounts for the backward pass costing roughly twice the forward pass (the arithmetic is reproduced in code below).
Training at 1 epoch per hour requires ~140 TFLOPS sustained.

Today's hardware:
- Google TPU2: 180 TFLOPS tensor ops
- NVIDIA Tesla V100: 15 TFLOPS SP (30 TFLOPS FP16, 120 TFLOPS tensor ops), 12 GB memory
- NVIDIA Tesla P100: 10.6 TFLOPS SP, 16 GB memory
- NVIDIA Tesla K40: 4.29 TFLOPS SP, 12 GB memory
- NVIDIA Tesla K80: 5.6 TFLOPS SP (8.74 TFLOPS SP with GPU boost), 24 GB memory
- Intel Xeon Phi: 2.4 TFLOPS SP
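The same arithmetic in a few lines of Python (all figures are from the slide; the 3x forward+backward factor is as stated above):

```python
# Reproduce the slide's arithmetic for ResNet152 on ImageNet.
forward_gflops = 11.3          # GFLOPs per image, forward pass (table above)
fwd_bwd_factor = 3             # backward pass ~2x forward, so 3x total
images = 14e6                  # ImageNet training set size

flops_per_epoch = fwd_bwd_factor * forward_gflops * 1e9 * images
print(f"FLOPs per epoch: {flops_per_epoch:.1e}")   # ~4.7e17, i.e. ~5 * 10^17

sustained = flops_per_epoch / 3600                 # one epoch per hour
print(f"Sustained compute for 1 epoch/hour: {sustained / 1e12:.0f} TFLOPS")
# ~132 TFLOPS; the slide rounds 5e17 / 3600 to ~140 TFLOPS
```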

Model parallelism
- Can be achieved with scalable distributed matrix operations
- Requires a certain compute/bandwidth ratio

Let's assume:
n – input size = batch size = output size
γ – compute power of a device (FLOPS)
β – bandwidth (memory or interconnect)
p – number of compute devices

For a distributed n×n matrix multiplication (SUMMA-style):
T_compute = 2n³ / (pγ)
T_communicate ≈ 8n² / β for FP32 (4 bytes per value)
Communication does not dominate as long as β ≥ 4pγ / n.

Example: n = 2000, γ = 15 TFLOPS. With p = 1, β ≥ 30 GB/s suffices; with p = 10, β ≥ 300 GB/s is needed (see the sketch below).

"SUMMA: Scalable Universal Matrix Multiplication Algorithm", R.A. van de Geijn, J. Watts
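Plugging the slide's numbers into the balance condition above; a sketch only, and min_bandwidth is a helper name introduced here, not from the deck:

```python
# Model parallelism: bandwidth needed so communication does not dominate compute.
# Balance condition from the cost model above: 8n^2/beta <= 2n^3/(p*gamma),
# which rearranges to beta >= 4*p*gamma/n.
def min_bandwidth(n, gamma, p):
    """Minimum bandwidth in bytes/s for n x n FP32 matrix ops on p devices."""
    return 4 * p * gamma / n

n, gamma = 2000, 15e12                     # n = 2000, gamma = 15 TFLOPS
for p in (1, 10):
    print(f"p = {p:2d}: beta >= {min_bandwidth(n, gamma, p) / 1e9:.0f} GB/s")
# p =  1: beta >=  30 GB/s  (GPU memory bandwidth easily covers this)
# p = 10: beta >= 300 GB/s  (well beyond PCIe or InfiniBand links)
```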

Data parallelism

T_compute(p, c, γ) = c / (pγ)
T_communicate = f(p, w, β)

where:
p – number of workers (nodes)
γ – the computational power of the node
c – the computational complexity of the model
β – bandwidth
w – the size of the weights in bits

Predicted scaling is shown for NVIDIA K40 (γ ≈ 4 TFLOPS) with PCIe v3 (β ≈ 16 GB/s) and with InfiniBand (β ≈ 56 Gb/s); a rough comparison is sketched below.
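The deck leaves T_communicate as a function of p, w and β without spelling it out. For a rough comparison of the two interconnects, the sketch below assumes a simple exchange model, T_communicate ≈ 2w/β (each worker sends and receives the full gradient once per batch); that model and the batch size are assumptions, while the VGG-16 figures come from the table above.

```python
# Data parallelism: when does gradient exchange start to dominate?
# T_compute = c / (p * gamma); T_communicate ~ 2*w/beta is an assumed
# full-gradient-exchange model, used here for illustration only.
def batch_times(c, w, gamma, beta, p):
    t_compute = c / (p * gamma)    # per-worker compute time per batch
    t_comm = 2 * w / beta          # send + receive the full gradient
    return t_compute, t_comm

c = 3 * 15.5e9 * 64                # VGG-16: 15.5 GFLOPs fwd, ~3x with bwd, batch 64
w = 528 * 2**20                    # VGG-16 weights in bytes (528 MB)
gamma = 4e12                       # NVIDIA K40, ~4 TFLOPS

for name, beta in [("PCIe v3, 16 GB/s", 16e9), ("InfiniBand, 56 Gb/s", 56e9 / 8)]:
    for p in (1, 4, 8):
        tc, tm = batch_times(c, w, gamma, beta, p)
        print(f"{name}, p={p}: compute {tc * 1e3:6.1f} ms, comm {tm * 1e3:6.1f} ms")
# With p = 8 compute drops to ~93 ms while communication stays fixed
# (~69 ms over PCIe, ~158 ms over 56 Gb/s InfiniBand) and begins to dominate.
```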

The Deep Learning Cookbook helps to pick the right HW/SW stack:
- Benchmarking suite: benchmarking scripts and a set of benchmarks (for core operations and reference models); a minimal sketch of such a timing loop follows below
- Performance measurements for a subset of applications, models and HW/SW stacks: 11 models, 8 frameworks, 6 hardware systems
- Analytical performance and scalability models: performance prediction for arbitrary models, scalability prediction
- Reference solutions, white papers
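What such benchmarking scripts measure is per-batch time. Below is a minimal, framework-agnostic sketch of a measurement loop; it is an illustration, not the cookbook's actual scripts, and train_step is a placeholder callable:

```python
# Minimal sketch of a batch-time benchmark loop (illustration only, not the
# cookbook's actual scripts): run warm-up iterations, then average timed ones.
import time

def benchmark(train_step, num_warmup=10, num_iters=100):
    """train_step: callable running one forward+backward pass on one batch."""
    for _ in range(num_warmup):            # warm-up: JIT, cuDNN autotune, caches
        train_step()
    start = time.perf_counter()
    for _ in range(num_iters):
        train_step()
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / num_iters    # mean batch time in ms

# Usage: batch_ms = benchmark(lambda: model_train_step(batch))
```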


Selected scalability results on HPE Apollo 6500 (8 x NVIDIA P100): weak-scaling charts of batch time (ms) versus number of GPUs, at batch sizes 32, 64 and 128, for AlexNet, DeepMNIST, EngAcousticModel, GoogleNet, VGG16 and VGG19.

Selected observations and tips:
- Larger models are easier to scale (such as ResNet and VGG)
- A single GPU can hold only small batches, since the rest of its memory is occupied by the model; a rough estimate is sketched below
- Fast interconnect is more important for less compute-intensive models (e.g., fully connected networks)
- A rule of thumb: 1 or 2 CPU cores per GPU
- The PCIe topology of the system is important
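As a very crude illustration of the batch-size limit, the sketch below budgets GPU memory between model state and per-sample activations; the 3x-copies assumption and the activation figure are illustrative, not from the deck:

```python
# Rough estimate of the largest batch that fits on one GPU (illustrative only).
# Training holds weights + gradients + optimizer state, plus activations per sample.
def max_batch(gpu_mem_gb, model_mb, act_mb_per_sample, copies=3):
    """copies=3 ~ weights + gradients + momentum (SGD with momentum)."""
    free_mb = gpu_mem_gb * 1024 - copies * model_mb
    return int(free_mb / act_mb_per_sample)

# VGG-16 (528 MB) on a 12 GB card, assuming ~100 MB of activations per image
print(max_batch(12, 528, 100))   # ~107 in this crude model; real limits are lower
```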

Further into the future: neuromorphic research projects

Neuromorphic computing: the integration of algorithms, architectures, and technologies, informed by neuroscience, to create new computational approaches.
- Memristor Dot-Product Engine (DPE): successfully demonstrated; a memristor crossbar analog vector-matrix multiplication accelerator computing output currents I_j = Σ_i G_ij · V_i
- Hopfield Network (electronic and photonic): in progress
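The crossbar relation I_j = Σ_i G_ij · V_i is an ordinary vector-matrix product; a tiny numerical sketch, with illustrative conductance and voltage values:

```python
# Memristor crossbar analog multiply: column currents I_j = sum_i G_ij * V_i.
import numpy as np

G = np.array([[1.0e-6, 2.0e-6],      # conductances G_ij (siemens), illustrative
              [3.0e-6, 4.0e-6]])
V = np.array([0.5, 1.0])             # input voltages V_i (volts)

I = V @ G                            # column currents, by Kirchhoff's current law
print(I)                             # [3.5e-06 5.0e-06] amperes
```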

Thank you
Natalia Vassilieva, nvassilieva@hpe.com
