Deep Learning Binary Neural Network On An FPGA


Deep Learning Binary Neural Network on an FPGA

by

Shrutika Redkar

A Thesis
Submitted to the Faculty
of the
WORCESTER POLYTECHNIC INSTITUTE
In partial fulfillment of the requirements for the
Degree of Master of Science
in
Electrical and Computer Engineering

May 2017

APPROVED:

Professor Xinming Huang, Major Thesis Advisor
Professor Yehia Massoud
Professor Thomas Eisenbarth

Abstract

In recent years, deep neural networks have attracted a lot of attention in the fields of computer vision and artificial intelligence. A convolutional neural network exploits spatial correlations in an input image by performing convolution operations in local receptive fields. Compared with fully connected neural networks, convolutional neural networks have fewer weights and are faster to train. Much research has been conducted to further reduce the computational complexity and memory requirements of convolutional neural networks, to make them applicable to low-power embedded applications. This thesis focuses on a special class of convolutional neural network with only binary weights and activations, referred to as binary neural networks. Weights and activations for the convolutional and fully connected layers are binarized to take only two values, +1 and -1. Therefore, the computation and memory requirements are reduced significantly. The proposed binary neural network architecture has been implemented on an FPGA as a real-time, high-speed, low-power computer vision platform. Only on-chip memories are utilized in the FPGA design. The FPGA implementation is evaluated on the CIFAR-10 benchmark and achieves a processing speed of 332,164 images per second with a classification accuracy of about 86.06%.

Acknowledgments

I would like to acknowledge and thank my advisor, Prof. Xinming Huang, for all of his support, guidance and patience. I would like to express my sincere thanks to him for giving me this opportunity to be a part of the Intelligent Transportation group. I am thankful to Prof. Yehia Massoud and Prof. Thomas Eisenbarth for being my committee members. I would like to express my gratitude to all those who have supported me throughout my work. I would especially like to thank my fellow team member, Yuteng Zhou, for his valuable thoughts and support.

At last, I would like to express my sincere gratitude to my parents for believing in me and for their constant love and encouragement throughout this journey. I would also like to thank all my friends and family members for their wishes and support.

Contents

1 Introduction
  1.1 Background
  1.2 Motivation
  1.3 Thesis Outline

2 Neural Network Background
  2.1 Neural Network Basic Terms and Concepts
  2.2 Convolutional Neural Network
    2.2.1 Convolutional Layer
    2.2.2 Pooling Layer
    2.2.3 ReLU Layer
    2.2.4 Fully Connected Layer
    2.2.5 Backpropagation and Gradient Descent
  2.3 Related Work
    2.3.1 Binary Neural Network [1]

3 Hardware Implementation
  3.1 FPGA Overview
  3.2 ZYNQ ZC706 FPGA Platform
  3.3 Avnet HDMI Input Output FMC Module
  3.4 Binary Neural Network Hardware Implementation
  3.5 Top Level System Design

4 Classification and Result
  4.1 BNN Classification Result

5 Conclusion and Future Work
  5.1 Conclusion
  5.2 Future Work

List of Figures

2.1 Single Neuron Model
2.2 Forward Propagation
2.3 First Convolutional Layer Operation
2.4 Image size reduces with a 3x3 filter, no zero padding, and stride 1
2.5 Image size is retained with a 3x3 filter, one border of zero padding, and stride 1
2.6 Pooling Layer Operation
2.7 Fully Connected Layer Structure
2.8 Gradient Descent Graph of Error vs. Individual Weight
2.9 Example of Binary Neural Network
3.1 ZC706 FPGA Platform [2]
3.2 HDMI Input Output FMC Architecture [3]
3.3 Architecture of Binary Neural Network
3.4 Line buffers to scan 3x3 pixels at a time
3.5 First Convolution Layer
3.6 First Convolution Layer, multiply-accumulate implemented by addition/subtraction
3.7 2nd, 3rd and 4th Convolution Layer Realization
3.8 Fully Connected Layer Realization
3.9 Example of AXI4-Stream Interface Transmission [4]
3.10 Top Level System Design
4.1 CIFAR-10 Dataset [5]

List of Tables

3.1 Resolutions supported by HDMI Input Output FMC Card [3]
4.1 Resource utilization when weights and activations are 8-bit fixed point
4.2 Resource utilization when weights are binarized and activations are 8-bit fixed point
4.3 Resource utilization when weights and activations are binarized
4.4 Parameters required by each layer of the proposed network
4.5 Resource utilization when the number of convolutional and fully connected layers is varied in the network

Chapter 1

Introduction

1.1 Background

In the last few years, deep neural networks have become an active field of research, having achieved outstanding results in areas such as computer vision, voice recognition, natural language processing, regression, and robotics. Deep neural networks were originally designed to model the structure of the human brain. The human brain has a deep architecture of biological neural networks. These biological neural networks can identify complex objects by first detecting simpler features and then combining them to detect more complex ones. In a similar way, an artificial neural network identifies different objects by distinguishing simple patterns in the object and then combining these simpler patterns to recognize more complex patterns.

Until 2006, it was very difficult to train deep neural networks: because of the vanishing gradient problem, training was slow and the error rate was quite high. Hinton, LeCun, and Bengio published three papers [6, 7, 8] which addressed the problem of vanishing gradients. After that, many researchers achieved breakthrough results in various neural network applications.

Today, there are various types of deep neural networks available to handle different applications. For many real-world applications, sufficient labeled data is difficult to find; for identifying patterns in unlabeled data, Restricted Boltzmann Machines (RBMs) [9] and autoencoders [10] are often used. If the patterns in the data change with respect to time, then a recurrent neural network [11] can be a better choice. When the training data is available in the form of images, where spatial patterns have to be recognized, a convolutional neural network [12] can be a great choice.

In recent years, convolutional neural networks have been the most widely used neural networks for deep learning. They provide very good accuracy for image classification problems. The key factors in increasing CNN accuracy over the years have been multiple stacks of convolutional layers and large training sets [12]. Convolutional layers extract spatial patterns from images in a hierarchical manner. The first convolutional layer extracts simple features such as lines, curves, edges and corners. The next convolutional layer extracts more abstract features, such as complex shapes made up from lines, curves, edges and corners. With more convolutional layers added to the stack, more abstract features can be extracted. At the end, using fully connected layers, the classifier gives a score for each class; the highest score indicates the class of the input image.

The drawbacks of convolutional nets are complex computation and large memory requirements as more convolutional layers are added to the stack. Thus, Graphics Processing Units (GPUs) are often used as hardware processors [13] to implement convolutional nets. GPUs can perform complex repetitive operations through massive parallelism, so they can handle large convolutional net models with large datasets. The main drawback of GPUs is that they consume a lot of power, which makes them unsuitable for low-power and real-time embedded applications.

Embedded FPGA platforms have been widely used for real-time embedded systems. However, an FPGA has limited computing resources and limited on-chip memory, which can cause problems for implementing a convolutional neural network. In this thesis, a binary neural network, which uses significantly less memory than a convolutional neural network, is implemented on an FPGA. The binary neural network was proposed by Courbariaux in 2016 [1]. This network is derived from the convolutional neural network by forcing the parameters to be binary numbers. Hence, it is more suitable for hardware implementation than convolutional neural nets.

1.2 Motivation

Recently, there has been a great deal of interest in designing Advanced Driver Assistance Systems (ADAS). An ADAS is developed to assist the driver by providing notifications about probable problems, reducing the chances of vehicle accidents. Vehicle detection is a major task in ADAS; its result can be used in applications such as accident prevention, adaptive cruise control, and automated headlamp dimming. In the field of computer vision, for simple pattern recognition, logistic regression and SVMs can be better choices: they give sufficiently good accuracy and are computationally less expensive than neural networks. However, vehicles can appear with a number of different shapes, angles, colors, and surroundings. This increases pattern complexity, and for such complex pattern problems, a deep neural network performs better than traditional classification models. For implementing classification in real time in an ADAS, the reconfigurable nature of an FPGA provides an advantage over ASIC-based implementations. Additionally, the cost and power consumption of FPGAs are relatively lower than those of CPUs and GPUs. By implementing a binary neural network on an FPGA platform, we can make an efficient vehicle

classification system which has the advantages of reconfigurability and better power efficiency.

1.3 Thesis Outline

This thesis is arranged into chapters as follows.

Chapter 1, Introduction: An introduction to the thesis objective is provided. It explains the motivation behind the research presented. An introduction to neural networks and ADAS is also included for the readers.

Chapter 2, Neural Network Background: Related background information is given to understand neural networks. This chapter walks through the basic mathematical models of artificial neurons and describes how the convolutional neural network evolved for finding patterns in images effectively. It ends with a literature review of state-of-the-art neural networks and their real-time implementations on different hardware platforms such as CPUs, GPUs, FPGAs, and ASICs, and it introduces the idea of the binary neural network along with the theory and mathematical background needed to understand the concepts behind it.

Chapter 3, Hardware Implementation: The proposed hardware architecture of the binary neural network is presented, including the FPGA design for each of the binary neural network layers. This chapter also gives details about the System on Chip (SoC) platform used for the implementation of the proposed design.

Chapter 4, Classification and Result: Information about the dataset used for training and testing the binary neural network is provided. Classification results and resource utilization are also presented in this chapter.

Chapter 5, Conclusion and Future Work: This chapter draws the conclusions of the thesis and explores future work in this research direction.

Chapter 2

Neural Network Background

2.1 Neural Network Basic Terms and Concepts

The neural network is inspired by the structure of the human brain. The human brain has about $10^{11}$ neurons, and these neurons are connected by about $10^{15}$ synapses. Every neuron has two types of branches, the axon and the dendrites. A neuron receives input signals through its dendrites and outputs signals through its axon; branches of the axon are then connected to the dendrites of other neurons. In a similar way, an artificial neural network consists of a large number of interconnected neurons, and it models the biological neuron with the help of a weight, a bias and an activation function. Fig. 2.1 shows a single neuron in an artificial neural network.

Figure 2.1: Single Neuron Model

This neuron receives three inputs $x_1$, $x_2$ and $x_3$ and computes the activation function $f\left(\sum_i w_i x_i + b\right)$, where $w_i$ corresponds to the $i$th weight and $b$ corresponds to the bias of that neuron. One of the common activation functions used in neural networks is the sigmoid function, expressed mathematically in equation (2.1). The sigmoid function takes a real-valued number and converts it to a value within 0 and 1.

$$\sigma(x) = \frac{1}{1 + e^{-x}} \tag{2.1}$$

Another activation function often used in neural networks is the hyperbolic tangent function, shown in equation (2.2). It constrains the input signal to the range of -1 to 1.

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = 2\sigma(2x) - 1 \tag{2.2}$$

In recent years, the Rectified Linear Unit (ReLU) has become one of the most popular activation functions in deep neural networks. The ReLU function is shown in equation (2.3). Unlike the sigmoid and tanh functions, the ReLU function is not continuously differentiable or bounded. It works better in deep networks because it expedites stochastic gradient descent convergence when compared to the sigmoid and tanh functions.

$$f(x) = \max(0, x) \tag{2.3}$$
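As a concrete illustration, the three activation functions above can be written directly in NumPy. This is a minimal reference sketch, not part of the thesis implementation; the function names are illustrative.

    import numpy as np

    def sigmoid(x):
        # Equation (2.1): maps any real value into (0, 1)
        return 1.0 / (1.0 + np.exp(-x))

    def tanh(x):
        # Equation (2.2): maps any real value into (-1, 1)
        return np.tanh(x)

    def relu(x):
        # Equation (2.3): keeps positive values, zeroes out negatives
        return np.maximum(0.0, x)

    x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(sigmoid(x), tanh(x), relu(x), sep="\n")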

A generic neural network model is shown in Fig. 2.2. It consists of an input layer, an output layer and a number of hidden layers. Every neuron receives an input, performs a dot product between the input and its weights, adds the bias, applies an activation function and sends the output to other neurons. The input layer receives the input data; each neuron in the input layer does this processing and sends its output to the first hidden layer. The hidden layer does the same processing and sends its output to the next hidden layer. This process is repeated until the rightmost layer, also called the output layer, is reached. At the output layer, a score for each class is computed and the object is classified, with the highest score representing the class of the input image. The entire process from the input layer to converting signals into output scores is called forward propagation.

Figure 2.2: Forward Propagation

In this example of a forward propagation model, the leftmost layer is the input layer, the rightmost layer is the output layer, and the model has only one hidden layer in between. Weights corresponding to the $i$th hidden layer are denoted as $w^{i}_{mn}$, where $m$ corresponds to the $m$th neuron in the previous $(i-1)$th layer and $n$ corresponds to the $n$th neuron in the current $i$th layer. Every neuron in the $i$th hidden layer has its own bias $b^{i}_{n}$. The activation of every neuron in the hidden layer is calculated as shown in equations (2.4), (2.5), (2.6) and (2.7).

$$y^{1}_{1} = f(w^{1}_{11} x_1 + w^{1}_{21} x_2 + w^{1}_{31} x_3 + b^{1}_{1}) \tag{2.4}$$

$$y^{1}_{2} = f(w^{1}_{12} x_1 + w^{1}_{22} x_2 + w^{1}_{32} x_3 + b^{1}_{2}) \tag{2.5}$$

$$y^{1}_{3} = f(w^{1}_{13} x_1 + w^{1}_{23} x_2 + w^{1}_{33} x_3 + b^{1}_{3}) \tag{2.6}$$

$$y^{1}_{4} = f(w^{1}_{14} x_1 + w^{1}_{24} x_2 + w^{1}_{34} x_3 + b^{1}_{4}) \tag{2.7}$$
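As a sketch, equations (2.4)-(2.7) can be computed for all four hidden neurons at once with a single matrix-vector product. The sigmoid activation and the random example values below are assumptions for illustration only.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(0)
    x = rng.standard_normal(3)        # inputs x1, x2, x3
    W1 = rng.standard_normal((3, 4))  # W1[m, n] = w^1_{mn}, input m to hidden neuron n
    b1 = rng.standard_normal(4)       # bias b^1_n for each hidden neuron

    # Equations (2.4)-(2.7): y^1_n = f(sum_m w^1_{mn} x_m + b^1_n)
    y1 = sigmoid(x @ W1 + b1)
    print(y1)  # the four hidden-layer activations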

In training a neural network, after the forward propagation, the loss or cost is calculated, which measures the difference between the predicted output scores and the ground truth. The next step in the training process is to tweak the weights and biases so that the loss is minimized; this process is called optimization. The gradient of the cost with respect to the weights and biases gives the rate at which the weights and biases should be changed. The process of computing the gradient of the cost with respect to the weights and biases in the entire network is called backpropagation. The gradient is computed repeatedly and the parameters are updated accordingly; this iterative process is known as gradient descent. In backpropagation, the gradient at every neuron is calculated using the chain rule, going from the output layer backward to the input layer.

There is another kind of parameter involved in machine learning, called hyperparameters. They decide higher-level settings of the neural network model, such as its rate of change and complexity. One of the common hyperparameters used in deep neural networks is the learning rate, which decides the step size for the parameter update along the direction of the gradient. The learning rate has to be chosen carefully: if it is too small, convergence toward suitable weights will be slow, and if it is too large, it may give a higher loss because of insufficient precision in the step size.

2.2 Convolutional Neural Network

The convolutional neural network works on the same principles as the neural network. These nets also have an input layer, an output layer, and a number of hidden layers. Similar to any other neural network, every neuron in the convolutional network receives input data, performs a dot product between the input data and its weights, adds the bias, applies the activation function and sends the output to other neurons. The output from one layer is used as input to the next layer. At the end of propagation, the class scores are computed just as in any fundamental neural network, and the concepts of backpropagation also remain the same. The main difference between a convolutional neural net and other neural nets is that the convolutional net takes advantage of the fact that the input consists of images, by arranging its neurons in three dimensions corresponding to the width, height, and depth of the input image. In fact, the convolutional neural net can be used on any data which can be arranged in the form of spatially correlated structures such as images. In other words, if the available data can be represented in an image-like structure, where spatially closer information is more related than spatially farther information, the convolutional neural net can classify patterns in such data with good accuracy.

2.2.1 Convolutional Layer

The convolutional layer is the core building block of the convolutional network. Neurons in this layer are not connected to every neuron in the previous layer; instead, every neuron is connected to a local region of the previous layer. Neurons in the convolutional layer act like a set of filters. Every filter is very small along the width and height, but every filter extends along the full depth of the input activation.

During forward propagation, every filter slides across the entire image, performs convolution between the filter elements and the corresponding local regions of the input activation, and produces a two-dimensional output activation which is the convolutional output of the filter at every spatial position of the input activation. As the filter slides across the image, it gets activated when it is convolved with a certain type of image feature such as a line, curve, edge, corner or a certain combination of colors. Every filter tries to identify a different feature of the given input and stores the result in one 2-D activation map. N filters generate N 2-D feature maps. These N feature maps are joined along the depth to make one 3-D output activation map, which is then used as input for the next layer.

In other words, the convolutional layer receives an input feature map and the weights for that layer, and it performs the 3-D convolution operation described mathematically by equation (2.8):

$$Y[n, i, j] = \sum_{d=0}^{D-1} \sum_{y=0}^{K-1} \sum_{x=0}^{K-1} W[n, d, 2-x, 2-y] \, X[d, i+x, j+y] \tag{2.8}$$

In this expression, the input feature map is of size $D \times W \times H$ and the output feature map is of size $N \times W \times H$, where $N$ is the number of feature maps and $K \times K$ is the kernel size. The expression gives the $(i, j)$ value of the $n$th feature map. For example, in our first layer implementation, the input feature map is an image of size $3 \times 32 \times 32$, the kernel is of size $3 \times 3$, and there are 128 filters of size $3 \times 3$ and depth 3, which generate an output feature map of size $128 \times 32 \times 32$. This convolution operation is shown in Fig. 2.3.

Figure 2.3: First Convolutional Layer Operation
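A direct, unoptimized NumPy sketch of equation (2.8) is given below, using the first-layer shapes from the text. The loop nest mirrors the summation exactly, including the flipped kernel indices; it is meant only to make the arithmetic concrete, not to reflect the FPGA implementation.

    import numpy as np

    def conv3d(X, W):
        # Direct evaluation of equation (2.8) with a KxK kernel.
        # X: zero-padded input feature map of shape (D, H + K - 1, W + K - 1)
        # W: filter weights of shape (N, D, K, K)
        N, D, K, _ = W.shape
        H, Wd = X.shape[1] - (K - 1), X.shape[2] - (K - 1)
        Y = np.zeros((N, H, Wd))
        for n in range(N):
            for i in range(H):
                for j in range(Wd):
                    acc = 0.0
                    for d in range(D):
                        for y in range(K):
                            for x in range(K):
                                # W[n, d, 2-x, 2-y] flips the kernel, as in (2.8)
                                acc += W[n, d, K - 1 - x, K - 1 - y] * X[d, i + x, j + y]
                    Y[n, i, j] = acc
        return Y

    img = np.random.rand(3, 32, 32)            # 3 x 32 x 32 input image
    X = np.pad(img, ((0, 0), (1, 1), (1, 1)))  # one border of zero padding
    W = np.random.rand(128, 3, 3, 3)           # 128 filters of size 3x3, depth 3
    print(conv3d(X, W).shape)                  # (128, 32, 32)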

Important hyperparameters of the convolutional layer are the number of filters, the size of the filter, the stride, and the zero padding. The number of filters corresponds to the depth of the output activation. Each filter looks for a different visual feature in the input, so the number of filters determines the number of features that the convolutional layer extracts. The size of the filter is also called the receptive field of the neuron; it can be $3 \times 3$, $5 \times 5$, $7 \times 7$, etc. The width and height of the filter are always smaller than the width and height of the input activation, but the depth of the receptive field is always the same as that of the input activation. The stride controls how the filter slides across the input volume: if the stride is 1, the filter shifts by one pixel at a time; if the stride is 2, it shifts by 2 pixels at a time. The stride also determines how much the output volume shrinks, and increasing the stride decreases the overlap between two adjacent filter positions. Zero padding is used to control the size of the output activation with respect to the input activation.

For example, for a $32 \times 32 \times 3$ input image, a filter of size $3 \times 3 \times 3$ with a stride of 1 and no zero padding gives an output activation of $30 \times 30 \times 3$; the output volume has shrunk. If it is necessary to keep the output activation the same size as the input activation, zero padding of one border can be used, which keeps the output activation size at $32 \times 32 \times 3$. Figures 2.4 and 2.5 demonstrate the significance of zero padding and stride with 2-D input and output activations.
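The shrinkage in this example follows the standard output-size relation out = (W - K + 2P)/S + 1. The small helper below (hypothetical, for illustration only) makes the two cases concrete.

    def conv_output_size(input_size, kernel_size, padding, stride):
        # Standard relation: out = (W - K + 2P) / S + 1
        return (input_size - kernel_size + 2 * padding) // stride + 1

    print(conv_output_size(32, 3, padding=0, stride=1))  # 30: output shrinks
    print(conv_output_size(32, 3, padding=1, stride=1))  # 32: size is retained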

Figure 2.4: Image size reduces with a 3x3 filter, no zero padding, and stride 1

Figure 2.5: Image size is retained with a 3x3 filter, one border of zero padding, and stride 1

2.2.2 Pooling Layer

Another layer often used between successive convolutional layers is the pooling layer, which reduces the spatial dimensions of the input activation. Different kinds of pooling layers are used, such as average pooling. In this thesis, a max pooling layer is used to reduce the dimensions of the input activation by applying a simple maximum function. An example of max pooling is shown in Fig. 2.6.

Figure 2.6: Pooling Layer Operation

The important hyperparameters of the pooling layer are the window size and the window stride. A window of constant size is applied to each 2-D map of the input activation independently, and the maximum operation is carried out. Figure 2.6 shows an example of max pooling with a window size of $2 \times 2$ and a stride of 2; pooling reduces the spatial dimensions of the input from $4 \times 4$ to $2 \times 2$. The reduction in spatial dimensions reduces the number of parameters required in the following convolutional layers, which in turn reduces their memory requirements and computation cost. Additionally, the pooling layer helps control overfitting. When a trained neural network gives very good accuracy on the training data but much lower accuracy on the test data, that occurrence is referred to as overfitting of the neural network. Applying pooling layers between other neural network layers provides distortion and scale invariance, which helps control overfitting.
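A minimal NumPy sketch of 2x2 max pooling with stride 2, matching the Fig. 2.6 example, might look as follows (the function name is illustrative).

    import numpy as np

    def max_pool_2x2(fmap):
        # Apply a 2x2 window with stride 2 to each 2-D map independently.
        D, H, W = fmap.shape
        return fmap.reshape(D, H // 2, 2, W // 2, 2).max(axis=(2, 4))

    x = np.arange(16, dtype=float).reshape(1, 4, 4)
    print(max_pool_2x2(x))  # 4x4 input reduced to 2x2: [[ 5.  7.] [13. 15.]]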

2.2.3 ReLU Layer

A non-linear activation function is applied after almost every convolutional and fully connected layer in most neural networks. Different types of non-linear functions are used by different convolutional networks; some of the important ones are discussed in Section 2.1. The Rectified Linear Unit has been shown to give better results in neural networks than other non-linear functions [14], because it requires less computational time, and it also gives a performance improvement when used along with a regularization scheme like dropout [15]. Regularization schemes are used to control overfitting of the neural network. In dropout, during the training process, some activation units are randomly set to zero. This breaks up the co-adaptation of units, which helps prevent overfitting.
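As a sketch of the idea, using the common "inverted dropout" variant, which rescales the surviving units during training (the text above does not specify the exact scheme of [15]):

    import numpy as np

    def dropout(activations, p, training=True):
        # During training, zero each unit with probability p and rescale
        # the survivors so the expected activation is unchanged.
        if not training:
            return activations
        mask = np.random.default_rng().random(activations.shape) >= p
        return activations * mask / (1.0 - p)

    a = np.ones((2, 4))
    print(dropout(a, p=0.5))  # about half the units zeroed, survivors scaled by 2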

2.2.4 Fully Connected Layer

Each neuron in the fully connected layer has connections to all the neurons in the previous layer. However, the fully connected layer does not take into account the spatial properties of images; it converts a list of feature maps into a list of class scores. Therefore, there cannot be a convolutional layer after a fully connected layer. Fig. 2.7 shows an example of several fully connected layers connected to each other. The hyperparameter involved in a fully connected layer is the number of its neurons. Generally, a stack of fully connected layers is used at the end of the neural network.

Figure 2.7: Fully Connected Layer Structure
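Computationally, a fully connected layer is just a matrix-vector product plus a bias. The sketch below, with hypothetical sizes, shows how flattened feature maps become class scores.

    import numpy as np

    def fully_connected(x, W, b):
        # Every output neuron sees every input element.
        return x @ W + b

    rng = np.random.default_rng(0)
    features = rng.standard_normal(128)   # hypothetical flattened feature maps
    W = rng.standard_normal((128, 10))    # e.g. 10 output classes
    b = rng.standard_normal(10)
    scores = fully_connected(features, W, b)
    print(int(np.argmax(scores)))         # index of the highest-scoring class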

2.2.5 Backpropagation and Gradient Descent

Backpropagation is a common method [16] used along with an optimization method such as gradient descent to train the neural network. In other words, backpropagation is the means by which the weights and biases of the convolutional and fully connected layers are adjusted so that the neural network is trained to identify a particular object. When training the neural network for the first time, the weights are randomly initialized. In the forward propagation, an image from the training dataset is passed through the neural network to generate the class scores with the randomly initialized weights. Then, the loss function is calculated by comparing the generated output with the targeted output. The loss is usually high for the first few training samples. The aim of backpropagation is to minimize the loss by tweaking the weights and biases.

To find the direction in which a weight should be changed to minimize the cost, the gradient of the loss function with respect to that weight is calculated. Thus, in backpropagation, the gradient of the loss function with respect to every weight is computed using the chain rule. Once the derivatives with respect to every weight are computed, the weights are changed in the direction of gradient descent; this last step is referred to as the parameter update. The gradient descent process is depicted in the graph in Fig. 2.8, where the weight is adjusted down along the direction of the gradient to minimize the error.

Figure 2.8: Gradient Descent Graph of Error vs. Individual Weight
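The parameter update itself is one line. The toy example below uses an assumed quadratic error, not the network's loss, to show a single weight sliding down the gradient exactly as in Fig. 2.8.

    def sgd_step(w, grad, learning_rate):
        # Move the weight a small step against its gradient.
        return w - learning_rate * grad

    # Toy error E(w) = (w - 3)^2 with gradient dE/dw = 2(w - 3).
    w = 0.0
    for _ in range(50):
        w = sgd_step(w, 2.0 * (w - 3.0), learning_rate=0.1)
    print(w)  # converges toward the minimum at w = 3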

The hyperparameter involved in backpropagation is the learning rate ($\eta$). The choice of the learning rate decides how far along the gradient direction a weight should change. The learning rate is a tricky parameter and should be chosen carefully: if it is too high, bigger steps are taken in the parameter update and the network will converge quickly, but this could result in insufficient precision to reach the optimal value of a weight, or it could lead to a higher loss due to overstepping. If the learning rate is too small, weight training will be slower and the network will take more time to reach the optimal values of the weights.

2.3 Related Work

Simonyan and Zisserman investigated the performance of the convolutional network as the depth of the network increases [17]. They showed that by increasing the depth of the convolutional network to 16-19 weight layers, the performance of the network can be increased substantially. But with increasing depth of the convolutional layers, the computational cost and memory requirements of the neural network also increase. Graphics processing units (GPUs) have become the solution for implementing convolutional nets at high speed and meeting such heavy computational requirements [18].

However, for many real-time embedded applications, the high power consumption of GPUs is not feasible. For low-power neural network applications, implementation of a pre-trained convolutional neural network on an embedded FPGA is a promising solution. But FPGAs have limited on-chip memory resources; thus, implementing the entire convolutional network would require external memory to store the pre-trained weights and biases. And even if external memory is used, the limited bandwidth of the FPGA could lower the speed of the neural network.
