Mobile And Embedded Deep Learning - Ocw.snu.ac.kr

Mobile and Embedded Machine Learning
If you have the power to make someone happy, do it. The world needs more of that.

Overview
Objective
- To understand the opportunities to apply machine learning and deep learning techniques for mobile applications
Content
- Machine learning for mobile and IoT applications
- Intro to convolutional neural network
- DeepMon: Mobile GPU-based deep learning framework for continuous vision applications
After this module, you should be able to
- Understand the basics of machine learning and deep learning techniques for mobile and IoT applications

Mobile and IoT Sensing
Continuous sensing and analytics of user activities, location, emotions, and surroundings with mobile/IoT/wearable devices
[Figure: continuous pipelined execution. Sensing devices produce raw sensor data readings, which are summarized into features and then classified into inferred contexts. Labels shown: Accel/Gyro, WiFi signal, heart rate; FFT, Norm; HMM, kNN, CNN/RNN; location, group, stress]

Revisit Activity Recognition

Simple Heuristic
If STDEV(y-axis samples) < C_Threshold1
    If AVG(y-axis samples) > C_Threshold2
        output standing
    Else
        output sitting
Else
    If FFT(y-axis samples) < C_Threshold3
        output walking
    Else
        output jogging
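The heuristic above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three threshold values, the 50 Hz sampling rate, the direction of each comparison, and the reading of FFT(y-axis samples) as "dominant frequency" are all hypothetical choices, since the slide gives no concrete values.

```python
import numpy as np

# Hypothetical thresholds: the slide names C_Threshold1..3 but gives no values.
C_THRESHOLD1 = 0.5   # STDEV below this is treated as "not moving" (m/s^2)
C_THRESHOLD2 = 5.0   # mean y-axis acceleration separating standing from sitting (m/s^2)
C_THRESHOLD3 = 2.5   # dominant frequency separating walking from jogging (Hz)


def dominant_frequency(samples, rate_hz):
    """Return the frequency (Hz) of the strongest non-DC FFT component."""
    samples = np.asarray(samples, dtype=float)
    spectrum = np.abs(np.fft.rfft(samples - samples.mean()))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / rate_hz)
    return freqs[np.argmax(spectrum)]


def classify(y_samples, rate_hz=50):
    """Apply the nested threshold tests to one window of y-axis samples."""
    y = np.asarray(y_samples, dtype=float)
    if np.std(y) < C_THRESHOLD1:                      # assumed direction: low variance = stationary
        return "standing" if np.mean(y) > C_THRESHOLD2 else "sitting"
    if dominant_frequency(y, rate_hz) < C_THRESHOLD3:  # assumed direction: slower rhythm = walking
        return "walking"
    return "jogging"


# Synthetic 4-second windows at 50 Hz (made-up signals, not real sensor data).
t = np.arange(200) / 50.0
standing = np.full_like(t, 9.8)
sitting = np.full_like(t, 1.0)
walking = 9.8 + 2.0 * np.sin(2 * np.pi * 1.5 * t)
jogging = 9.8 + 4.0 * np.sin(2 * np.pi * 3.0 * t)
labels = [classify(s) for s in (standing, sitting, walking, jogging)]
print(labels)
```

Every constant in this sketch had to be picked by hand, which is exactly the weakness the next slide questions.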

Are We Good?
- How do we determine good features and good thresholds?
  - How do we know STDEV is better than MAX?
  - How do we know AVG is better than Median?
  - How do we know the right values for C_Threshold?
- What if a user puts her phone in her bag, not in her front pocket?
  - The y-axis of the phone is no longer the major axis of movement.
- How do we solve these problems? A better heuristic?

Decision Tree
- A simple but effective ML classifier.
- This tree can be built by the C4.5 algorithm.
- Given sufficient training data, the algorithm can automatically determine the important features and their thresholds.
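To make "automatically determine the thresholds" concrete, the sketch below searches for the split point on a single numeric feature that maximizes information gain. It is a simplified illustration, not C4.5 itself: C4.5 additionally uses the gain ratio and builds a full tree. The training values and labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, information_gain) of the best binary split on one feature."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - remainder
        if gain > best[1]:
            best = (threshold, gain)
    return best


# Hypothetical training data: STDEV of y-axis samples per window, with labels.
stdev = [0.1, 0.2, 0.3, 1.8, 2.2, 2.9]
activity = ["still", "still", "still", "moving", "moving", "moving"]
threshold, gain = best_threshold(stdev, activity)
print(threshold, gain)
```

Running the same search over several candidate features (STDEV, MAX, AVG, Median) and keeping the one with the highest gain is how a tree learner would decide which feature matters most at a node.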

Other ML Techniques
- Naïve Bayes classifier
- Decision tree
- Random forest
- Support vector machine
- kNN algorithm
- Linear regression

ML Techniques Flow

ML Techniques: Limitations
- Linear regression? Why is it linear?
- Bayesian? What is the prior?
- SVM? What are the features?
- Decision tree? What are the nodes/variables?
- kNN? Cluster on what features?
These methods are not well suited to very complex models.

Deep Learning

Deep Learning for Activity Recognition
Example of applying a convolutional neural network

Deep Learning Applications
- Self-driving
- Face recognition
- Speech recognition
- Playing Go

Deep Learning for Speech Recognition
[Figure: a spectrogram (time on one axis, frequency on the other) is treated as an image and fed to a CNN]
The filters move in the frequency direction.

Machine Learning vs. Deep Learning
Deep learning: the more data, the higher the accuracy

Introduction to Convolutional Neural Network (CNN)

Convolutional Neural Network
- A CNN is a feed-forward network that can extract topological properties from an image.
- Like almost every other neural network, CNNs are trained with a version of the back-propagation algorithm.
- Convolutional neural networks are designed to recognize visual patterns directly from pixel images with minimal preprocessing.
- They can recognize patterns with extreme variability (such as handwritten characters).

Feed-Forward Networks
- Information flow is unidirectional
  - Data is presented to the input layer
  - Passed on to the hidden layer
  - Passed on to the output layer
- Information is distributed
- Information processing is parallel
- Internal representation (interpretation) of data
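The unidirectional flow from input layer to hidden layer to output layer can be sketched as two matrix products. The weights, layer sizes, and the ReLU activation below are hypothetical choices made only for illustration; the slide does not specify any of them.

```python
import numpy as np


def relu(x):
    """Element-wise rectified linear activation (an assumed choice)."""
    return np.maximum(x, 0.0)


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One unidirectional pass: input layer -> hidden layer -> output layer."""
    hidden = relu(w_hidden @ x + b_hidden)   # internal representation of the input
    return w_out @ hidden + b_out


# Hypothetical weights for a 3-input, 2-hidden-unit, 1-output network.
w_hidden = np.array([[1.0, -1.0, 0.0],
                     [0.5, 0.5, 0.5]])
b_hidden = np.array([0.0, -1.0])
w_out = np.array([[2.0, 1.0]])
b_out = np.array([0.5])

x = np.array([1.0, 2.0, 3.0])
y = forward(x, w_hidden, b_hidden, w_out, b_out)
print(y)
```

Each hidden unit is computed independently of the others, which is why the processing within a layer can run in parallel.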

Identifying a Bird in an Image
- Let’s assume “beak” is unique to birds.
- “beak” exists in a small sub-region of an image.
[Figure: a “beak” detector applied to a small region of the image]

“Beak” in Different Parts of Images
[Figure: an “upper-left beak” detector and a “middle beak” detector]

Convolutional Layer
A Convolutional Neural Network (CNN) is a neural network with “convolutional layers”, each of which has a number of filters that perform the convolution operation.
[Figure: a beak detector is a filter]

Convolution
6 x 6 image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
Filter 1:
 1 -1 -1
-1  1 -1
-1 -1  1
Filter 2:
-1  1 -1
-1  1 -1
-1  1 -1
The filter values are the network parameters to be learned.
Each filter detects a small pattern (3 x 3).

Convolution: stride = 1
Filter 1 is placed on the top-left 3 x 3 patch of the 6 x 6 image and the dot product is taken; the filter then moves one pixel to the right.
First two dot products: 3, -1

Convolution: if stride = 2
With Filter 1 moving two pixels at a time over the 6 x 6 image, the first two dot products are 3, -3

Convolution: stride = 1
[Figure: Filter 1 applied across the whole 6 x 6 image; the first row of results is 3, -1, -3, -1]

Convolution: stride = 1
Repeat this for each filter to obtain a feature map; Filter 2 produces a second one.
Two 4 x 4 images, forming a 2 x 4 x 4 matrix.
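The sliding dot product on these slides can be reproduced with a short NumPy sketch. The 6 x 6 image and the two 3 x 3 filters are reconstructed from the slide residue (an assumption, though the results match the values the slides show: 3 and -1 at stride 1, 3 and -3 at stride 2). The helper name `convolve2d` is illustrative and assumes a square image and kernel with no padding.

```python
import numpy as np

IMAGE = np.array([
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0],
])
FILTER_1 = np.array([[1, -1, -1], [-1, 1, -1], [-1, -1, 1]])   # diagonal pattern
FILTER_2 = np.array([[-1, 1, -1], [-1, 1, -1], [-1, 1, -1]])   # vertical-line pattern


def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image and take a dot product at each position."""
    k = kernel.shape[0]
    out = (image.shape[0] - k) // stride + 1
    result = np.zeros((out, out), dtype=int)
    for i in range(out):
        for j in range(out):
            patch = image[i * stride:i * stride + k, j * stride:j * stride + k]
            result[i, j] = np.sum(patch * kernel)
    return result


# One 4 x 4 feature map per filter, stacked into a 2 x 4 x 4 array.
feature_maps = np.stack([convolve2d(IMAGE, FILTER_1), convolve2d(IMAGE, FILTER_2)])
stride2_map = convolve2d(IMAGE, FILTER_1, stride=2)
print(feature_maps.shape)
print(feature_maps[0])
print(stride2_map)
```

As is usual in CNN libraries, the kernel is not flipped, so strictly speaking this is a cross-correlation.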

Color Image: RGB 3 Channels
[Figure: a 3-channel color image with Filter 1 and Filter 2]

How to Form a Feed Forward Network
[Figure: the 6 x 6 image is flattened into inputs x1, x2, ..., x36 of a fully connected layer]

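Filter 1:
 1 -1 -1
-1  1 -1
-1 -1  1
[Figure: the 6 x 6 image flattened to 36 inputs (1: 1, 2: 0, 3: 0, 4: 0, ..., 7: 0, 8: 1, 9: 0, 10: 0, ..., 13: 0, 14: 0, 15: 1, 16: 1, ...); the neuron with output 3 connects to only some of them]
Only connect to 9 inputs, not fully connected: fewer parameters!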

Filter 1 applied to the flattened 6 x 6 image
[Figure: the neurons with outputs 3 and -1 each connect to only 9 of the 36 inputs and use the same Filter 1 weights]
- Connecting to 9 inputs only: fewer parameters
- Shared weights: even fewer parameters
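The saving can be made concrete with a quick count for the 6 x 6 example. This is back-of-the-envelope arithmetic that ignores bias terms and assumes one 3 x 3 filter applied at stride 1, which yields 4 x 4 = 16 outputs.

```python
inputs = 6 * 6            # flattened 6 x 6 image
outputs = 4 * 4           # one neuron per filter position at stride 1
filter_weights = 3 * 3    # weights in one 3 x 3 filter

fully_connected = inputs * outputs             # every output sees every input
locally_connected = outputs * filter_weights   # each output sees only 9 inputs
shared = filter_weights                        # all outputs reuse the same 9 weights

print(fully_connected, locally_connected, shared)
```

The gap widens quickly with image size, since the shared count does not depend on the number of pixels at all.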

The Whole CNN
[Figure: input image -> Convolution -> Max Pooling -> Convolution -> Max Pooling (can repeat many times) -> Flattened -> Fully Connected Feedforward network -> cat / dog]

Max Pooling
[Figure: the feature maps produced by Filter 1 and Filter 2; the last row of the Filter 2 map is -1, 0, -4, 3]

Why Pooling
- Subsampling pixels will not change the object.
- We can subsample the pixels to make the image smaller, leaving fewer parameters to characterize the image.
[Figure: an image of a bird and its subsampled version, both still showing a bird]

Max Pooling
6 x 6 image -> Conv -> Max Pooling -> 2 x 2 image
- The result is a new image, but smaller.
- Each filter is a channel.
Pooled outputs (one 2 x 2 map per filter):
3 0     -1 1
3 1      0 3
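A minimal NumPy sketch of 2 x 2 max pooling follows. The two 4 x 4 feature maps are the results of sliding the slide's Filter 1 and Filter 2 over its 6 x 6 image; the function name `max_pool` and the reshape-based implementation are illustrative choices, and the map height and width are assumed to be divisible by the pool size.

```python
import numpy as np

FEATURE_MAPS = np.array([
    [[3, -1, -3, -1], [-3, 1, 0, -3], [-3, -3, 0, 1], [3, -2, -2, -1]],    # Filter 1
    [[-1, -1, -1, -1], [-1, -1, -2, 1], [-1, -1, -2, 1], [-1, 0, -4, 3]],  # Filter 2
])


def max_pool(maps, size=2):
    """Keep the maximum of each non-overlapping size x size block, per channel."""
    channels, height, width = maps.shape
    blocks = maps.reshape(channels, height // size, size, width // size, size)
    return blocks.max(axis=(2, 4))


pooled = max_pool(FEATURE_MAPS)
print(pooled)
```

The output keeps one channel per filter, so a 2 x 4 x 4 input becomes a 2 x 2 x 2 "new image".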

The Whole CNN
Convolution -> Max Pooling (can repeat many times)
- The result is a new image, smaller than the original image.
- The number of channels is the number of filters.

The Whole CNN
[Figure: input image -> Convolution -> Max Pooling -> a new image -> Convolution -> Max Pooling -> a new image -> Flattened -> Fully Connected Layer -> cat / dog]

Fully Connected Layer
Conceptually, this can be understood as a voting process to see which input values contribute more to the output.
[Figure: the pooled maps are flattened into a vector and passed to a fully-connected feedforward network that outputs cat or dog]

Tools and APIs
- TensorFlow (https://www.tensorflow.org/)
  - TensorFlow Lite for mobile and IoT
- PyTorch (https://pytorch.org)
- Caffe2 (https://caffe2.ai)
- Keras (https://keras.io/)

CNN in Keras
Only modified the network structure and input format (vector -> 3-D tensor)
- Input shape (28, 28, 1): 28 x 28 pixels; the last value is 1 for black/white, 3 for RGB
- Convolution: there are 25 3x3 filters.
- Max Pooling
- Convolution
- Max Pooling

CNN in Keras
Only modified the network structure and input format (vector -> 3-D array)
Input: 1 x 28 x 28
-> Convolution: 25 x 26 x 26 (How many parameters for each filter? 9)
-> Max Pooling: 25 x 13 x 13
-> Convolution: 50 x 11 x 11 (How many parameters for each filter? 225 = 25 x 9)
-> Max Pooling: 50 x 5 x 5

CNN in Keras
Only modified the network structure and input format (vector -> 3-D array)
Input: 1 x 28 x 28
-> Convolution: 25 x 26 x 26
-> Max Pooling: 25 x 13 x 13
-> Convolution: 50 x 11 x 11
-> Max Pooling: 50 x 5 x 5
-> Flattened: 1250
-> Fully connected feedforward network
-> Output
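The shapes and per-filter parameter counts on these slides can be checked with plain arithmetic. The sketch below deliberately avoids Keras so it runs anywhere. It assumes 3 x 3 filters without padding at stride 1, 2 x 2 pooling, and ignores bias terms; the helper names are illustrative.

```python
def conv_shape(channels, height, width, filters, kernel=3):
    """Output shape and per-filter weight count of an unpadded, stride-1 convolution."""
    params_per_filter = channels * kernel * kernel   # each filter spans all input channels
    return (filters, height - kernel + 1, width - kernel + 1), params_per_filter


def pool_shape(channels, height, width, size=2):
    """Output shape of non-overlapping size x size max pooling (remainder dropped)."""
    return (channels, height // size, width // size)


conv1, params1 = conv_shape(1, 28, 28, filters=25)
pool1 = pool_shape(*conv1)
conv2, params2 = conv_shape(*pool1, filters=50)
pool2 = pool_shape(*conv2)
flattened = pool2[0] * pool2[1] * pool2[2]

print(conv1, pool1, conv2, pool2, flattened)
print(params1, params2)
```

The second convolution needs 225 weights per filter because each of its filters covers all 25 channels produced by the first layer.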

AlphaGo
[Figure: a neural network takes the board as input and outputs the next move (19 x 19 positions)]
- Input: 19 x 19 matrix (black: 1, white: -1, none: 0)
- A fully-connected feedforward network can be used, but a CNN performs much better.

AlphaGo’s Policy Network
[Figure: quotation from their Nature article]
Note: AlphaGo does not use max pooling.

CNN in Speech Recognition
[Figure: a spectrogram (time on one axis, frequency on the other) is treated as an image and fed to a CNN]
The filters move in the frequency direction.

CNN in Text Classification?
Source of image: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.703.6858&rep=rep1&type=pdf

DeepMon: Mobile GPU-based Deep Learning Framework for Continuous Vision Applications
ACM MobiSys 2017

Continuous Vision Applications

Conventional Processing Flow
[Figure: the device captures frames, which are then processed with DNN models such as YOLO; the flow is labeled with privacy and network problems]
Research Question: Can we support fully-disconnected DNN-based inference purely on the mobile device?

DeepMon: Mobile Deep Learning System
- Supports low-latency execution of CNNs on commodity mobile devices using mobile GPUs
  - OpenCL/Vulkan enabled devices
- Supports multiple GPU architectures & mobile OSs
  - Mali, Adreno, PowerVR (to be supported)
- Supports existing trained models
  - Multiple frameworks (Caffe, MatConvNet, YOLO, Darknet)
- Available today: https://github.com/JC1DA/deepmon

Challenge 1: Mobile GPU is Weak!
DNNs run well on desktop GPUs, but latency on mobile GPUs is too high to support continuous vision.

                                          Nvidia GTX 980   Adreno 330 (Samsung Note 4)   Ratio (Nvidia / Adreno)
Number of ALUs (CUDA cores)               2048             128                           16x
Memory bandwidth (GB/s)                   224              12.8 (shared)                 17.5x
Peak performance (Gflops)                 4600             166.5                         27.6x
CNN execution time, VGG-16 (Caffe) (ms)   238.59           6315                          26.5x

Challenge 2: Architecture is Different!
- Existing GPU-based optimizations won’t simply work!
[Figure: CPU and GPU organization on desktop versus mobile. Image source: /03/NVIDIA-Maxwell-Unified-Virtual-Memory.jpg]

Identifying Latency Bottleneck
- Convolutional Neural Network
- Device: Samsung Galaxy Note 4
- Implementation: naïve CPU
[Table: Model, Conv. (ms), FC. (ms), Pooling (ms), Total; extracted cell values: 6213371240888221662]
90% of the time is consumed by convolutional layers.

Problem 1: High Memory Use
- Fast convolution operations on desktop GPU
  - Use optimized general matrix multiplication (GEMM)
  - Build up a new matrix by unfolding the input matrix
- No good GEMM on mobile GPUs
- Matrix building overhead (150 ms per layer with Caffe on Samsung S7)
- Increased memory use (one value in the input matrix is copied 9 times (3x3))
Image source: https://i.stack.imgur.com/GvsBA.jpg
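The unfolding step and its memory cost can be illustrated with a small NumPy sketch. This is a generic im2col-style unfolding written for illustration, not DeepMon's or Caffe's code. The 6 x 6 input, the all-ones kernel, and the function name `im2col` are assumptions.

```python
import numpy as np


def im2col(image, k=3):
    """Unfold every k x k patch of a 2-D image into one row of a new matrix."""
    h, w = image.shape
    rows = [image[i:i + k, j:j + k].ravel()
            for i in range(h - k + 1)
            for j in range(w - k + 1)]
    return np.array(rows)


image = np.arange(36, dtype=np.float32).reshape(6, 6)   # distinct values, easy to track
kernel = np.ones((3, 3), dtype=np.float32)

unfolded = im2col(image)                                 # one row per output position
output = (unfolded @ kernel.ravel()).reshape(4, 4)       # convolution as one matrix product

growth = unfolded.size / image.size                      # how much larger the unfolded matrix is
copies_of_center = int(np.sum(unfolded == image[2, 2]))  # copies of one interior value
print(unfolded.shape, growth, copies_of_center)
```

An interior value appears once for each of the nine 3 x 3 patches that cover it, which is the 9-fold copying noted on the slide.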

Solution 1: mGPU-Aware Optimizations
- Do convolution operations directly on input
  - No matrix building overhead
  - Less memory consumption
- Do mGPU-aware optimizations
  - Leverage local memory (high performance cache inside GPU) to reduce memory reading
    - Store reusable convolutional kernels inside the local memory
    - It will be shared across multiple threads
  - Lay out the input data to enable fast vector addition/multiplication on mGPUs
    - The data in vectors need to be stored consecutively in memory.
  - Use half floating point (32 bits -> 16 bits)
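The trade-off behind half floating point can be seen with NumPy's `float16` type: storage is halved, at the cost of a small rounding error. The weight values below are hypothetical, and this CPU-side sketch only illustrates the numeric format, not how the mobile GPU kernels use it.

```python
import numpy as np

# Hypothetical weights spread over [-1, 1], stored at 32-bit and 16-bit precision.
weights32 = np.linspace(-1.0, 1.0, 1000, dtype=np.float32)
weights16 = weights32.astype(np.float16)

bytes32, bytes16 = weights32.nbytes, weights16.nbytes
max_error = float(np.max(np.abs(weights32 - weights16.astype(np.float32))))
print(bytes32, bytes16, max_error)
```

Halving the bytes per value also halves the memory traffic per convolution. The rounding error is where any accuracy loss from this step would come from.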

Impact of Memory Optimization
Measured on Samsung S7 (Mali T770), VGG-16

Configuration            Latency (ms)   Speedup over Caffe   Accuracy loss
Caffe                    9120
DeepMon (LM + LR)        2976           3.06x                0%
DeepMon (LM + LR + HF)   1344           6.78x                2.6%

Adding HF gives a further 2.21x over DeepMon (LM + LR).
LM: Local Memory, LR: Layout Redesign, HF: Half Floating point

Problem 2: Redundant Computation
- Background in continuous video frames tends to be static.
- Independent processing of each frame is redundant.
[Figure: similar regions across consecutive frames]
Key idea: Can we reuse the intermediate results of previous convolutional layer computation for similar regions?

Solution 2: Convolutional Caching
- Light-weight & accurate comparisons required!
- SIFT-based approach did not …
- 16-bin color histogram
- Chi-square distance
- If distance < 0.005, block is reusable
[Figure: a Cache Manager decides whether each block is reusable; if not, the Conv. Op. is performed]
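The reuse test can be sketched as follows. Only the 16 bins, the chi-square distance, and the 0.005 threshold come from the slide. The per-channel histogram layout, the normalization, the 0.5 factor in the distance, the "less than" direction of the comparison, and all function names are assumptions made for illustration, not DeepMon's actual implementation.

```python
import numpy as np

REUSE_THRESHOLD = 0.005   # from the slide
BINS = 16                 # from the slide


def color_histogram(block):
    """Normalized 16-bin histogram per color channel of an H x W x 3 uint8 block."""
    hists = [np.histogram(block[..., c], bins=BINS, range=(0, 256))[0] for c in range(3)]
    hist = np.concatenate(hists).astype(float)
    return hist / hist.sum()


def chi_square_distance(h1, h2):
    """One common form of the chi-square distance between two histograms."""
    denom = h1 + h2
    mask = denom > 0                      # skip bins that are empty in both histograms
    return 0.5 * np.sum((h1[mask] - h2[mask]) ** 2 / denom[mask])


def is_reusable(prev_block, cur_block):
    """True if the cached convolution result of prev_block may be reused."""
    d = chi_square_distance(color_histogram(prev_block), color_histogram(cur_block))
    return bool(d < REUSE_THRESHOLD)


# Synthetic blocks: an unchanged background block and a block that turned white.
rng = np.random.default_rng(0)
background = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
unchanged = background.copy()
changed = np.full((32, 32, 3), 255, dtype=np.uint8)
print(is_reusable(background, unchanged), is_reusable(background, changed))
```

A histogram comparison is cheap relative to a convolution, which is what makes the check worthwhile on every frame.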

Solution 3: Decomposition
- Decompose a large convolutional layer into a sequence of several smaller ones so that the computation cost can be reduced
- Tucker-2 decomposition: decompose a convolutional layer into 3 parts
  - 2 with filter size (1x1)
    - The 1st layer acts as dimension reduction: reduces computational cost
    - The 2nd layer acts as dimension restoration: guarantees that the output size equals the output size of the original convolutional layer
  - 1 with the original filter size
    - Has a lower number of input/output channels: reduces computational cost

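Tucker-2 Decomposition
- N: number of filters
- C: number of input channels
- D: filter size (D = 3)
- Input: H x W x C

Layer      Filters             Cost
Original   N x C x D x D       H x W x N x C x D x D
Conv-k-1   N1 x C x 1 x 1      H x W x N1 x C
Conv-k-2   N2 x N1 x D x D     H x W x N2 x N1 x D x D
Conv-k-3   N x N2 x 1 x 1      H x W x N x N2

Speedup:
$$\frac{N C D^2}{N_1 C + N_1 N_2 D^2 + N N_2}$$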
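The speedup ratio is easy to evaluate numerically. The sketch below uses the slide's symbols (N, C, D, N1, N2); the common H x W factor cancels out of the ratio. The layer sizes in the example are hypothetical.

```python
def tucker2_speedup(n, c, d, n1, n2):
    """Ratio of multiply-accumulate counts per output position: original / decomposed."""
    original = n * c * d * d
    decomposed = n1 * c + n1 * n2 * d * d + n * n2   # 1x1 reduce + DxD core + 1x1 restore
    return original / decomposed


# Hypothetical layer: 256 input channels, 256 filters of size 3 x 3, reduced to 64 channels.
speedup = tucker2_speedup(n=256, c=256, d=3, n1=64, n2=64)
print(round(speedup, 2))
```

Smaller N1 and N2 raise the theoretical speedup but discard more of the original layer's capacity, so the ranks trade latency against accuracy.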

DeepMon Performance: Latency
[Figure: bar chart of VGG-16 latency for Caffe (9120), DeepMon (MO), DeepMon (MO + HF), DeepMon (MO + HF + CC), and DeepMon (All)]
MO: Memory Opt., HF: Half Floating point, CC: Convolutional Caching
Dataset: UCF-101 (13K short video clips)

DeepMon Performance: Accuracy

                                         Caffe   DeepMon
VGG-16, top-5 recognition accuracy (%)   89.9    83.94
YOLO, mean Average Precision (%)         63.4    58.14

< 6% loss
All optimization techniques are applied for DeepMon.
Datasets: (1) ILSVRC2012 for VGG-VeryDeep-16, (2) Pascal VOC 2007 for YOLO

Conclusion
- DeepMon is an easy-to-use framework
  - Supports existing deep learning models
  - Supports commodity mobile devices & OSs
- Supports various optimizations to reduce latency
  - Memory loading optimizations
  - Convolutional caching
  - Decomposition
- Achieves a speedup of 14x over Caffe with minimal accuracy loss (< 6%)
- DeepMon implementation in OpenCL/Vulkan: https://github.com/JC1DA/deepmon
