Embedded Machine Learning - TU Wien

Transcription

Embedded Machine Learning. Axel Jantsch. TU Wien, Vienna, Austria. December 2020.

Outline (www.ict.tuwien.ac.at)

1 Motivation and Challenges
2 HW Friendly Optimizations
3 CNN Accelerator Architectures
4 MSc Projects
5 Quantization
6 Inference Run Time Estimation

Motivation and Challenges

Why Embedded Machine Learning?

- Machine learning is a powerful method to analyze data;
- Embedded applications produce huge amounts of sensor data;
- The data cannot or should not always be moved to central servers.

Compute Usage Trend

From: https://openai.com/blog/ai-and-compute/

Total Energy Consumption

SIA - SRC. Rebooting the IT Revolution: A Call to Action. Tech. rep. Semiconductor Industry Association and Semiconductor Research Corporation, Sept. 2015.

Deployed Sensors

SIA - SRC. Rebooting the IT Revolution: A Call to Action. Tech. rep. Semiconductor Industry Association and Semiconductor Research Corporation, Sept. 2015.

What is Special About "Embedded"?

- Resource limitation
- Connectivity
- Security
- Privacy

What is Special About "Embedded"?

Resource limitations:

                      Embedded           Server farm
  Computation [flop]  30-1800 · 10^12    86 · 10^18
  Memory [bit]        10^10              10^15
  Power [W]           5-100              10^3-10^6
  Energy [Wh]         48-1000            200 · 10^6

- Computation embedded refers to an Nvidia Jetson Nano running 1 min and 1 hour, respectively.
- Computation server refers to the computation needed for the 40-day experiment with AlphaGo Zero.
- Energy embedded refers to a mobile phone and to a car battery, respectively.
- Energy server refers to the 40-day experiment for AlphaGo Zero.

Case for Embedded ML

- Embedded inference:
  - more energy efficient;
  - bandwidth constraints;
  - latency constraints;
  - not always on-line and connected to a cloud server;
  - security;
  - privacy.
- Embedded continuous learning:
  - customization and specialization;
  - security;
  - privacy.

DNNs: Embedded and the Cloud

[Diagram: combinations of training and inference split between cloud and embedded platforms, including embedded preprocessing with cloud-based inference]

Design Space

- NN Choices: convolutional layers, filter kernels, number of filters, pooling layers, filter shape, stride, fully connected layers, number of layers, regularization, etc.
- Mapping Choices: neuron pruning, connection and weight pruning, data type selection, approximation, retraining.
- Platform Choices: platform selection (Nvidia Turing, ARM NN), reconfiguration, batch processing, deep pipelining, resource reuse, hierarchical control, processing unit selection, memory allocation, memory reuse, etc.

Quiz Time

1 In which year will computing consume all world-wide produced energy, if
a) current trends continue without major innovation in technology;
b) current trends continue with aggressive and major innovations in technology?

Choices: 2027, 2032, 2037, 2042, 2047

HW Friendly Optimizations

Optimization Categories

1 Minimize the number of operations to be performed;
2 Simplify each operation;
3 Execute the operations as efficiently as possible.

HW Friendly Optimizations

- Loop reordering, unrolling, pipelining
- Tiling
- Batching
- Binarized CNNs

Loop Optimizations

Convolution layer algorithm:

```c
for (to = 0; to < M; to++) {             // output feature map
  for (ti = 0; ti < N; ti++) {           // input feature map
    for (row = 0; row < R; row++) {      // row
      for (col = 0; col < C; col++) {    // column
        for (i = 0; i < K; i++) {        // filter
          for (j = 0; j < K; j++) {
            Ofmap[to][row][col] += W[to][ti][i][j]
                                 * Ifmap[ti][S*row + i][S*col + j];
          }
        }
      }
    }
  }
}
```

M: number of output feature maps
N: number of input feature maps
R: number of rows
C: number of columns
K: filter kernel size
S: stride
W: weight matrix

Loop Optimizations

- Loop reordering to improve cache efficiency;
- Loop unrolling to improve parallelism;
- Loop pipelining to improve parallelism.

Loop Tiling

```c
for (to = 0; to < M; to++) {                  // output feature map
  for (ti = 0; ti < N; ti++) {                // input feature map
    for (row = 0; row < R; row += Tr) {       // tiled row loop
      for (col = 0; col < C; col++) {         // column
        for (trr = row; trr < min(R, row + Tr); trr++) {
          for (i = 0; i < K; i++) {           // filter
            for (j = 0; j < K; j++) {
              Ofmap[to][trr][col] += W[to][ti][i][j]
                                   * Ifmap[ti][S*trr + i][S*col + j];
            }
          }
        }
      }
    }
  }
}
```

For efficient use of caches.

Batching

- Reuse of weights;
- Improves throughput;
- Increases latency.

Binarized CNNs (BNN)

- Weights and internal computations are represented as 1-bit binary numbers;
- Instead of MAC operations, BNNs use XOR and bit-count;
- Attractive for HW and FPGAs.

Quiz Time

2 Which optimization method is most promising in improving ML compute efficiency:
a) optimize the network to minimize the number of computations,
b) improve the processing element and processing datapath architecture,
c) improve the memory architecture,
d) optimize the mapping of the network onto the target architecture,
e) minimize the data type?

CNN Accelerator Architectures

CNN Accelerator Architectures

- Systolic array architecture
- In-memory computing

Systolic Arrays

Tensor Processing Unit (TPU)

N. P. Jouppi et al. "In-datacenter performance analysis of a tensor processing unit". In: 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). 2017.

[Three slides with TPU architecture figures]

In-memory Computing

- Storage capacity and memory access bandwidth and latency dominate DNNs.
- Avoid moving data.
- Distribute the MAC units in the memory architecture.

Wire Aware Accelerator (WAX)

Sumanth Gudaparthi et al. "Wire-Aware Architecture and Dataflow for CNN Accelerators". In: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. MICRO '52. Columbus, OH, USA: Association for Computing Machinery, 2019.

Quiz Time

3 For optimizing CNN execution, is it more effective
a) to re-use (and keep on-chip) input data,
b) to re-use weights,
c) to re-use intermediate data?

MSc Projects

MSc Projects

- Embedded High Speed Image Classification/Detection/Segmentation with Quantized Neural Networks
- Evaluation of Quantization Aware Training Methods for INT8 and lower bit-widths for image classification, object detection and segmentation
- Power and Latency Optimization of the Xilinx ZCU102 Platform for a Pedestrian Intention Recognition Use Case
- Optimization of 3D Convolutions
- Optimization of Object Detection Networks on the Google Edge TPU

Questions?

Quantization

Quantization - Regularization

- Using small bit widths for weights saves memory, bandwidth and computation;
- The bit width can be different for different layers of the DNN;
- Quantization schemes: dynamic fixed point, power of 2;
- Retraining after quantization recovers accuracy losses: regularization;
- Not all weights are equal: weighted regularization.

Matthias Wess, Sai Manoj Pudukotai Dinakarrao, and Axel Jantsch. "Weighted Quantization-Regularization in DNNs for Weight Memory Minimization towards HW Implementation". In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37.10 (Oct. 2018).

Quantization - Motivation

- DNN quantization reduces data movement;
- Reduces logic energy;
- Layer-wise bit-width.

[Chart: weight memory per layer (conv1-conv9) of a CIFAR-10 network, full-precision vs. quantized weights, with per-layer bit-widths 7, 7, 7, 4, 4, 3, 3, 7, 7]

Quantization - Motivation

[Figure: weight-value histograms. (a) Full resolution: layer 1 with 600 weights, layer 2 with 900 weights. (b) Quantized: layer 1 with 3-bit DFP, layer 2 with 2-bit DFP]

CIFAR-100

[Plot: CIFAR-100 compression. Accuracy (40-65 %) vs. compression ratio (4.0-8.5) for equal DFP, layer-wise DFP and equal Po2, each with and without retraining, against the baseline]

Quantization - Conclusion

- Dynamic fixed point is an effective alternative to integer or floating point representation;
- Different layers require different precision;
- Retraining adjusts the network to the available weight values;
- A weight memory reduction of 4-8x is common;
- Reduced weight precision reduces weight memory and the cost of operation.

Summary

- Embedded Machine Learning has many applications;
- There are distinct challenges:
  - limited resources;
  - bandwidth limitations;
  - delay constraints;
  - privacy;
  - security;
- Specialized HW platforms;
- Huge design space for optimization and mapping.

Inference Run Time Estimation

Estimation Framework

Inference Run Time Estimation

- Assumption: inference time as a function of problem size is a combination of step and linear functions, due to limited parallel resources.
- Example: sweep over a single convolutional layer, input 32x32x64, with k filters and kernel size 3.

Inference Run Time Estimation

- Assumption: the inference time can be approximated by a combination of linear and step functions for each dimension, such as filters, channels, etc.
- Determining the function based on selected measurements.
- Goal: automatic computation of estimation functions for latency and power consumption on various platforms.

Automatic Estimation Function Generation

Iterative Refinement

Next Point Selection

- Linear function criteria:
  - point furthest away from previous points.
- Step function criteria:
  - point with most unique discrete levels;
  - point with largest range of values;
  - point farthest away from previous points.
- Next point selection: the point with the highest score.

Next Point Selection

- Parameter optimization for step function criteria: grid search.
- Step function criteria:
  - point with most unique discrete levels;
  - point with largest range of values;
  - point farthest away from previous points;
  - interesting areas (clusters of purple values).

Method Evaluation

- Results after 3 iterations (5 measurement points).
- Execution times:
  - full sweep: 3-4 h;
  - proposed approach: 2-5 minutes.

2D Example

- Phase 1: estimate the function in a single dimension: the number of filters.
- Result: step function.

2D Example

- Phase 2: test how d, w and h behave in the next dimension.
- Next dimension: input channels din.
- Result:
  - step function: d0 = 0.1418, w0 = 8, h0 = 0.0106;
  - constant: c = 32;
  - step function: d1 = 0.044, w1 = 8, h1 = 0.0121.

2D Example

Generated model:

f(din, k) = 0.1418 · ⌊(din − 1)/8⌋ + 0.0106 + (0.044 · ⌊(din − 1)/8⌋ + 0.0121) · ⌊(k − 1)/32⌋

- Measurement points: 112
- Execution time: 32 minutes

2D Example - Error

2D Example - Error

Slice through the 2D plane at k = 1024:

f(din, k) = 0.1418 · ⌊(din − 1)/8⌋ + 0.0106 + (0.044 · ⌊(din − 1)/8⌋ + 0.0121) · ⌊(k − 1)/32⌋

f(din, 1024) = 1.5058 · ⌊(din − 1)/8⌋ + 0.3857

2D Example - Error

Slice through the 2D plane at din = 128:

f(din, k) = 0.1418 · ⌊(din − 1)/8⌋ + 0.0106 + (0.044 · ⌊(din − 1)/8⌋ + 0.0121) · ⌊(k − 1)/32⌋

f(128, k) = 0.3008 · ⌊(k − 1)/32⌋ + 0.2255

Latency Estimation Summary

- Exploiting the discrete nature of HW resources;
- Fast estimation function for latency based on linear and step functions;
- Automatic derivation of the estimation for a new platform;
- Results for Nvidia GPU platforms are promising.

References

- Sumanth Gudaparthi et al. "Wire-Aware Architecture and Dataflow for CNN Accelerators". In: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. MICRO '52. Columbus, OH, USA: Association for Computing Machinery, 2019.
- N. P. Jouppi et al. "In-datacenter performance analysis of a tensor processing unit". In: 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). 2017, pp. 1-12.
- SIA - SRC. Rebooting the IT Revolution: A Call to Action. Tech. rep. Semiconductor Industry Association and Semiconductor Research Corporation, Sept. 2015.
- Matthias Wess, Sai Manoj Pudukotai Dinakarrao, and Axel Jantsch. "Weighted Quantization-Regularization in DNNs for Weight Memory Minimization towards HW Implementation". In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37.10 (Oct. 2018).
- Yu Wang, Gu-Yeon Wei, and David Brooks. "Benchmarking TPU, GPU, and CPU Platforms for Deep Learning". In: CoRR abs/1907.10701 (2019). arXiv: 1907.10701.

Questions?
