The Anatomy of an Embedded Machine Learning Accelerator

Transcription

WP515 (v1.0) November 5, 2019

The Anatomy of an Embedded Machine Learning Accelerator

The distributed memories, high compute density, and logic support in FPGAs and SoCs make them ideal as ML accelerator platforms. Designers can leverage the inherent compute power of these devices with no hardware implementation required.

ABSTRACT

As machine learning (ML) has increased in popularity over the last half-decade, a number of accelerators for ML have been introduced, evaluated, and deployed on programmable logic.

With their unique power/performance point, FPGAs are increasingly a popular choice for machine learning applications. In addition, FPGAs can perform sensor fusion and the other operations that typically surround machine learning applications.

This white paper takes a deliberately generic example of such an accelerator, and explains the purpose of its constituent parts and why the FPGA is an ideal implementation vehicle.

Copyright 2019 Xilinx, Inc. Xilinx, the Xilinx logo, Alveo, Artix, Kintex, Spartan, UltraScale, Versal, Virtex, Vivado, Zynq, and other designated brands included herein are trademarks of Xilinx in the United States and other countries. All other trademarks are the property of their respective owners.

Introduction

Machine learning is an incredibly important application, allowing insights to be drawn from all sorts of different data streams. From facial recognition for security applications to anomaly detection in industrial settings, traditional deep algorithmic approaches require very specific domain knowledge.

With their unique power/performance point, FPGAs are increasingly a popular choice for machine learning applications. In addition, FPGAs can perform sensor fusion and the other operations that typically surround machine learning applications. For example, a system for monitoring driver attention in cars might require interfaces to networks, ultrasonic sensors, multiple cameras, and audio, all while working in a setting typically bounded by power and thermal constraints. The system might also require more traditional processing, such as video scaling and cropping, as well as machine learning to detect when the driver's attention strays from the road ahead.

As machine learning (ML) has increased in popularity over the last half-decade, a number of accelerators for ML have been introduced, evaluated, and deployed on programmable logic. This white paper takes a deliberately generic example of such an accelerator, and explains the purpose of its constituent parts, and why the FPGA is an ideal implementation vehicle.

Exploiting Parallelism in a Generic Accelerator

Machine learning (ML) is a very broad topic, covering various approaches that allow general algorithms to perform domain-specific tasks by training the general algorithms using data drawn from the specific domain. In the driver-attention system outlined previously, a general neural network could be trained using labeled images of drivers paying attention and drivers not paying attention to the road ahead. Though the neural network itself is "general," it learns using domain-specific knowledge.

A number of neural network classes have been proposed over the years. One very important class, often used in computer vision applications, is convolutional neural networks (CNNs). CNNs consist of layers that accept three-dimensional (3D) input feature maps (IFMs). For example, a color image coming from a VGA camera is considered to be a 640 x 480 x 3 input feature map. CNNs map such images to 3D output feature maps (OFMs) using a 3D convolution operation. Though operations such as pooling and normalization are executed in CNNs, the bulk of the compute effort lies in performing the actual 3D convolution operation.

Many available online resources seek to create an understanding of the 3D convolution operation. This white paper, however, specifically describes the mapping of these 3D operations onto the FPGA fabric. The basic CNN formula is a 7-deep nested loop, and is shown in simplified form in Figure 1.
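To make the feature map dimensions concrete before looking at the loop nest in Figure 1, the following quick shape check uses a hypothetical layer: a VGA RGB frame as the input feature map and an invented bank of 64 filters of size 3 x 3 (these filter dimensions are illustrative only and are not taken from the white paper).

# Illustrative shape check for one convolutional layer (hypothetical dimensions).
ifm_h, ifm_w, ifm_depth = 480, 640, 3      # VGA RGB frame as a 3D input feature map
filt_h, filt_w, ofm_depth = 3, 3, 64       # 64 filters, each 3 x 3 x ifm_depth
pad, stride = 1, 1                         # "same" padding, unit stride

ofm_h = (ifm_h + 2 * pad - filt_h) // stride + 1
ofm_w = (ifm_w + 2 * pad - filt_w) // stride + 1
macs = ofm_h * ofm_w * ofm_depth * filt_h * filt_w * ifm_depth

print(f"OFM shape: {ofm_w} x {ofm_h} x {ofm_depth}")    # 640 x 480 x 64
print(f"Multiply-accumulates for this layer: {macs:,}") # about 531 million

Even this modest layer requires hundreds of millions of multiply-accumulate operations, which is why the parallelism discussed next matters so much.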

for each layer l
  for each output feature map plane o
    for each output feature map row r
      for each output feature map col c
        for each filter plane p
          for each filter row x
            for each filter col y
              A[l][o][r][c] += A[l-1][p][r-x][c-y] * w[l][o][p][x][y]

Figure 1: Levels of Parallelism Available for Computing DNN Inference

Huge amounts of parallelism are possible. The only data dependency is that the output pixels produced in one layer become the input pixels in the next layer. Figure 2 shows some examples of the levels of parallelism available to an accelerator.

[Figure 2 (diagram): layer parallelism, output pixel parallelism, OFM parallelism, IFM parallelism, and bit parallelism inside the arithmetic (weight bits and activation bits)]
Figure 2: Levels of Parallelism Available for Computing DNN Inference

One option is to perform layers in parallel, while remembering to factor in the one data dependency in the 7-deep loop nest: the output pixels of one layer must form the input pixels of the next layer. In programmable logic, however, it is very common to pipeline independent operations by employing structures such as line buffers. This allows the next stage of a pipeline to start when enough of its input arguments have been produced.

Another option is to produce multiple output pixels in parallel. Figure 1 shows that, within a given layer, output pixels at the same x,y location are produced using the same input pixels but different weight sets. Therefore, a structure that allows input pixels to (a) flow from left to right and (b) be multiplied by different stored weights lets an accelerator exploit output feature map parallelism, as shown in Figure 3. By storing more than one weight in each local memory, it is possible to reuse the same row of multipliers to produce multiple output feature maps. For a row of r multipliers and t time steps, the structure can produce pixels in r · t output feature maps.
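As a purely behavioral sketch of the row structure just described (written for this discussion, not code from the white paper or any Xilinx tool), the following Python model broadcasts one input pixel across a row of r multipliers, each holding t locally stored weights, so a single pixel contributes to r · t output feature maps.

# Behavioral sketch (not RTL): one row of r multipliers exploiting OFM parallelism.
# Each multiplier keeps t locally stored weights and is time-multiplexed across
# t output feature maps, so the row covers r * t OFMs per input pixel.

def ofm_row(input_pixel, local_weights):
    """local_weights[m][s] is the weight held by multiplier m at time step s."""
    r = len(local_weights)          # multipliers in the row (spatial parallelism)
    t = len(local_weights[0])       # weight sets per multiplier (temporal reuse)
    partial_products = [[0] * t for _ in range(r)]
    for m in range(r):
        for s in range(t):
            partial_products[m][s] = input_pixel * local_weights[m][s]
    return partial_products         # contributions to r * t output feature maps

# Example: 4 multipliers with 2 weight sets each -> contributions to 8 OFMs.
weights = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(ofm_row(10, weights))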

Figure 3: Producing Multiple Output Feature Map Pixels from a Single Input Feature Map Pixel

By feeding in input pixels (a) from the same x,y location but (b) on different input feature maps, a structure can be added that weights the contribution of each of the input feature maps. This allows an accelerator to exploit input feature map parallelism. See Figure 4.

Figure 4: Consuming Multiple Input Feature Map Pixels for a Single Output Feature Map Pixel

By using a distinct accumulator at the bottom of a column of multipliers, it is possible to reuse the same column of multipliers to consume multiple input feature maps. For a column of c multipliers and t time steps, the structure can consume pixels in c · t input feature maps.

Putting the two structures together forms the well-known systolic array structure, shown in Figure 5. This structure exploits both input and output feature map parallelism, and is a great fit for FPGA architectures.

Figure 5: Systolic Array
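A minimal functional model of such an array, written in Python for this discussion only, is shown below. It ignores the cycle-by-cycle data skewing and the weight time-multiplexing of a real systolic array, but it captures the two forms of parallelism: rows consume different input feature maps while columns accumulate different output feature map pixels.

# Functional sketch of an R x C weight-stationary array (not cycle-accurate).
# Rows consume R input feature maps in parallel; columns produce C output
# feature map pixels in parallel. weights[i][j] couples IFM i to OFM j.

def systolic_step(ifm_pixels, weights):
    """One accumulation step: R input pixels in, C partial sums out."""
    rows = len(weights)
    cols = len(weights[0])
    assert len(ifm_pixels) == rows
    column_sums = [0] * cols
    for i in range(rows):           # IFM parallelism along the rows
        for j in range(cols):       # OFM parallelism along the columns
            column_sums[j] += ifm_pixels[i] * weights[i][j]
    return column_sums

# Example: 3 IFM pixels feeding 2 OFM columns.
w = [[1, 2],
     [3, 4],
     [5, 6]]
print(systolic_step([10, 20, 30], w))   # [220, 280]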

By increasing the number of columns in an array, the array can produce, in parallel, as many output feature maps as there are columns. By increasing the number of rows in an array, the array can consume, in parallel, as many input feature maps as there are rows. However, as mentioned earlier, the structure produces output maps in multiples of the number of columns. Thus, if the required depth of the output feature map is not an exact multiple of the number of columns, then certain columns are unused for part of the calculation, reducing the compute efficiency.

Similarly, the structure consumes input maps in multiples of the number of rows. Therefore, if the input feature map depth is not an exact multiple of the number of rows, then certain rows are unused for part of the calculation.

This can present a problem for early layers in the network that might have a very small number of input feature maps (1 or 3, for example). In this case, accelerators should be flexible enough to exploit one of the other forms of parallelism shown in Figure 1, such as performing the calculations for all columns of a filter in parallel.
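The efficiency effect described above is easy to quantify. The helper below is an illustrative sketch (not from the white paper): an array dimension of a given size processes a feature map depth in ceiling(depth / size) passes, and any remainder leaves lanes idle during the final pass.

# Utilization of one array dimension (rows or columns) for a given feature map depth.
# The array covers the depth in ceil(depth / size) passes of `size` lanes each,
# so lanes sit idle in the final pass whenever depth % size != 0.

import math

def lane_utilization(depth, size):
    passes = math.ceil(depth / size)
    return depth / (passes * size)

# Example: a 16-lane dimension handling depths typical of early and late layers.
for depth in (3, 16, 100, 512):
    print(f"depth {depth:4d} on 16 lanes -> {lane_utilization(depth, 16):.0%} utilization")
# depth 3 -> 19%, depth 16 -> 100%, depth 100 -> 89%, depth 512 -> 100%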

Generic Embedded Accelerator Architecture

A number of architectures are commercially available, including the Xilinx Deep Learning Processing Unit (DPU) architectures [Ref 1]. Though there are some domain-specific differences between these architectures, the same basic functional blocks are present. See Figure 6.

[Figure 6 (block diagram): image queue, instruction buffer, execution controller, weights DMA controller, spill/restore DMA controller, systolic array, pooling/EWA, and cross bar]
Figure 6: Generic Embedded Accelerator Architecture

The following sections explore in detail the build-up of each functional block.

Activation Memories

The activation memories are used to store the input and output tensors being processed by the systolic array. Data is read out of the activation memory in the order required to move through the 3D input feature map tensor, and is then written back into the activation memory, forming the 3D output feature map tensor. Some operations, such as Eltwise addition [Ref 2], require two 3D tensors and produce a single 3D output. While the systolic array is active, DMAs are used to spill a previously stored output feature map tensor and restore a new input feature map tensor. See Figure 7.

[Figure 7 (block diagram): the generic architecture of Figure 6, showing the activation memories]
Figure 7: Generic Architecture: Activation Memories
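One way to picture the spill/restore mechanism is as a ping-pong schedule over two regions of activation memory: the systolic array computes out of one region while the DMA engines drain the previous output tensor from, and prefetch the next input tensor into, the other. The sketch below is a deliberately simplified illustration invented for this discussion; the two-buffer split and strict per-layer alternation are assumptions, not a description of a specific DPU.

# Illustrative ping-pong schedule: the systolic array works out of one buffer
# while the spill/restore DMAs service the other, then the roles swap each layer.

def spill_restore_schedule(num_layers):
    schedule = []
    compute_buf, dma_buf = 0, 1
    for layer in range(num_layers):
        schedule.append({
            "layer": layer,
            "compute_from_buffer": compute_buf,   # array reads IFM / writes OFM here
            "spill_restore_buffer": dma_buf,      # DMAs drain old OFM, prefetch next IFM here
        })
        compute_buf, dma_buf = dma_buf, compute_buf
    return schedule

for step in spill_restore_schedule(4):
    print(step)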

The activation memory requires a minimum of three read ports and two write ports to avoid queuing up requests. Such memories are easy to implement in programmable logic, using either block RAM or UltraRAM, together with some external multiplexing to steer the five requests to the correct ports. Since both UltraRAM and block RAM can each have one write port and one read port, an activation memory can be made from a minimum of three of these components if the memories are wide enough. If the systolic array has many rows, or if the data width is large, then the activation memory can be made from multiples of three of these components, combining enough of these multiples to feed all the rows of the array.

To avoid memory collisions (for example, two reads from the same physical port), a smart compiler can be used that allocates base addresses and ranges within the activation memory. So, the systolic array could be reading from the first physical memory in a group of three while the spill DMA is reading from the second. Because all the sizes of the tensors to be processed are known at network compile time, it is possible to ensure that the memory accesses are safe by allocating each tensor to a part of memory serviced by a different physical port.

The activation memories can be adapted during accelerator construction over a number of dimensions. The data type stored is typically one parameter (e.g., 8-bit or 16-bit), though it is possible to design an activation memory capable of carrying multiple data types. Another parameter is the type of memory building block used (block RAM or UltraRAM). The final parameter is the number of these memory building blocks used to implement the activation memory, which then defines the storage capacity.

One example is an activation memory capable of storing 8-bit data values, implemented using UltraRAM and capable of producing sixteen values in parallel. Such an activation memory can be made with two sets of three UltraRAMs, each configured as a simple dual-port (SDP) memory. Since each UltraRAM SDP memory can produce eight 8-bit values (64 bits in total), two sets are needed to produce the sixteen 8-bit values required. As described above, the minimum number of memory building blocks required in each set is three; therefore, six UltraRAMs in total are required. This means that even in a device as small and inexpensive as a Zynq UltraScale+ MPSoC XCZU4CG, eight activation memories for eight distinct embedded accelerators can be implemented.
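The sizing arithmetic in this example generalizes with a few lines of code. The helper below is an illustrative sketch rather than Xilinx tooling: it assumes 64 data bits used per UltraRAM port (as in the example above), keeps the minimum of three memories per set for the read and write ports, and uses the XCZU4CG's published count of 48 UltraRAM blocks.

# Illustrative sizing of an UltraRAM-based activation memory (not from Xilinx tools).
import math

MEMS_PER_SET = 3      # minimum per set: services 3 read + 2 write requests with SDP (1R/1W) blocks
URAM_PORT_BITS = 64   # data bits used per UltraRAM port in the example above

def urams_needed(values_per_access, bits_per_value):
    sets = math.ceil(values_per_access * bits_per_value / URAM_PORT_BITS)
    return sets * MEMS_PER_SET

total = urams_needed(values_per_access=16, bits_per_value=8)
print(f"UltraRAMs per activation memory: {total}")            # 2 sets * 3 = 6
print(f"Activation memories in an XCZU4CG: {48 // total}")    # 48 UltraRAM blocks -> 8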

Systolic Array

Contemporary deep learning algorithms are dependent on a large number of multiply-accumulate operations. Architectures that are able to fuse multiply and add operations i
