Using Machine Learning on FPGAs to Enhance Reconstruction Output

Transcription

Using Machine Learning on FPGAs to Enhance Reconstruction Output
IRIS-HEP, February 13th, 2019
Dylan Rankin [MIT]
On behalf of the hls4ml team

Introduction

Machine learning has become a common tool for a broad spectrum of problems (industry & physics)
– Particle/signal identification
– Image/speech recognition
Meanwhile, field-programmable gate arrays (FPGAs) have been used for decades to provide fast computing solutions
– Development typically requires a large initial investment (learning VHDL/Verilog, hardware cost)
– Complex algorithms can be very difficult to implement
hls4ml is a tool which facilitates implementing machine learning on FPGAs for fast inference [arXiv:1804.06913]
– Provides the possibility for highly customizable solutions to many HEP trigger problems

Machine Learning

Machine learning algorithms, especially deep neural networks, are becoming more and more common in HEP
– Esp. LHC, neutrinos
Provides the capability to analyze very complex problems in a straightforward way
Very good performance even for difficult tasks
Networks can become very large → long inference times

[Plot: top-tagging performance of DeepAK8 (DNN) vs. BDT, CMS-DP-2018-046]

Neural Network

Start with an input vector (x1)
Using a weight matrix (W), bias vector (b), and activation function (g), transform the input vector into an intermediate result vector (xm)
– Can be repeated many times
The last layer provides the output vector
Networks like VGG16 can have 100s of millions of parameters
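In the notation above, each layer applies the same transformation (a standard feed-forward layer, written out here for convenience), with W_{m,m-1} the weight matrix connecting layer m-1 to layer m:

    x_m = g\left( W_{m,m-1}\, x_{m-1} + b_m \right)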

LHC Data Processing

L1 Trigger (hardware: FPGAs)
– O(μs) hard latency. Typically coarse selection; a BDT is used for muon pT assignment
HLT (software: CPUs)
– O(100 ms) soft latency. More complex algorithms (full detector information available); some BDTs and DNNs used
Offline (software: CPUs)
– ~1 s latencies. Full event reconstruction; the bulk of machine learning usage in CMS

LHC Data Processing

DNNs have the potential to greatly improve physics performance in the trigger system
In order to implement an algorithm, need to ensure inference latencies of O(μs) for L1 and O(ms) for the HLT
– For L1, this means we must use FPGAs
How can we run neural network inference quickly on an FPGA?

FPGAs

Field-programmable gate arrays are a common solution for fast computing
– The ability to re-program for target needs is very appealing
Building blocks:
– Multiplier units (DSPs) [arithmetic]
– Look-up tables (LUTs) [logic]
– Flip-flops (FFs) [registers]
– Block RAMs (BRAMs) [memory]
Algorithms are wired onto the chip
Run at high frequency, O(100 MHz)
– Can compute outputs in O(ns)
Programming is traditionally done in Verilog/VHDL (low-level hardware languages)
Possible to translate C to Verilog/VHDL using High Level Synthesis (HLS) tools

Example devices:
– Virtex 7 XC7VX690T: 3600 multipliers, 400K LUTs, 800K FFs, 10 Mb BRAM
– Virtex Ultrascale VU9P: 6800 multipliers, 1M LUTs, 2M FFs, 75 Mb BRAM

Inference on an FPGA

The network's multiplications are mapped onto the FPGA's multiplier units (DSPs)
– Up to 6k parallel operations (# multiplication units)
The remaining logic and storage use LUTs, FFs, and BRAMs
Every clock cycle, all layer operations can be performed simultaneously

hls4ml is a software package for creating HLS implementations of neural networks
– Supports common layer architectures and model software
– Highly customizable output for different latency and size needs
– Simple workflow to allow quick translation to HLS

Project Configuration (Keras)

keras-config.yml:
    KerasJson: example-keras-model-files/KERAS_1layer.json
    KerasH5: example-keras-model-files/KERAS_1layer_weights.h5
    OutputDir: my-hls-test
    ProjectName: myproject
    XilinxPart: xcku115-flvb2104-2-i
    ClockPeriod: 5
    IOType: io_parallel # options: io_serial/io_parallel
    ReuseFactor: 1
    DefaultPrecision: ap_fixed<16,6>

The configuration file takes the model architecture and weights files as input
Main customization options:
– ReuseFactor: calculations per multiplier per layer (parallelization)
– DefaultPrecision: used for inputs, weights, biases

Run with: python keras-to-hls.py -c keras-config.yml
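For reference, the architecture JSON and weights HDF5 files named in the configuration can be produced from a trained Keras model roughly as follows (a minimal sketch; the toy one-layer model is an arbitrary stand-in and the file names simply mirror the config entries above):

    import tensorflow as tf  # or standalone Keras, depending on the setup

    # Toy one-layer model standing in for a trained network
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(5, activation="softmax", input_shape=(16,))
    ])

    # Write the two files referenced by KerasJson / KerasH5 in keras-config.yml
    with open("KERAS_1layer.json", "w") as f:
        f.write(model.to_json())                    # architecture only
    model.save_weights("KERAS_1layer_weights.h5")   # weights/biases only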

Customization: Reuse

For the lowest latency, compute all multiplications for a given layer at once
– Larger reuse implies more serialization
– Reuse = 1 (fully parallel): latency ~ # layers
– Reuse = # weights (fully serialized): latency ~ (# weights) x (# layers)
Allows trading higher latency for lower resource usage (see the sketch below)
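A back-of-the-envelope sketch of this trade-off (not part of hls4ml: it counts only dense-layer multiplications and assumes multiplier usage scales as 1/reuse while latency grows roughly linearly with reuse):

    def dense_multiplications(layer_sizes):
        """Total multiplications for a fully-connected network,
        e.g. [16, 64, 32, 32, 5] -> 16*64 + 64*32 + 32*32 + 32*5."""
        return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

    def rough_resource_estimate(layer_sizes, reuse_factor):
        """Very rough scaling: multipliers shrink and latency grows with reuse."""
        n_mult = dense_multiplications(layer_sizes)
        multipliers_needed = -(-n_mult // reuse_factor)   # ceiling division
        relative_latency = reuse_factor                    # grows ~linearly
        return multipliers_needed, relative_latency

    for reuse in (1, 2, 4, 8):
        dsps, lat = rough_resource_estimate([16, 64, 32, 32, 5], reuse)
        print(f"reuse={reuse}: ~{dsps} multipliers, relative latency x{lat}")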

Customization: Precision

hls4ml uses fixed-point classes for all computations
Precision can be adjusted as needed for the desired accuracy and performance
– Also impacts resource usage
Default behavior is to use the same precision for all layers
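For orientation, the ap_fixed<W,I> notation used in the configuration (e.g. ap_fixed<16,6>) follows standard fixed-point conventions, summarized here for convenience:

    \mathrm{ap\_fixed}\langle W, I \rangle : \; W \text{ total bits},\ I \text{ integer bits (including sign)},\ W - I \text{ fractional bits}
    \text{resolution} = 2^{-(W-I)}, \qquad \text{range} \approx [-2^{I-1},\, 2^{I-1})
    \text{e.g. } \langle 16, 6 \rangle : \ \text{resolution } 2^{-10} \approx 10^{-3},\ \text{range } [-32, 32)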

Design Workflow

Design the model with standard software tools (Keras, TensorFlow, PyTorch)
Pass the network architecture and weights/biases, along with configuration parameters, to hls4ml (creates an HLS project)
Interface the HLS code with the desired project

Jet Classification Example

Perhaps an unrealistic example for the L1 trigger, but the lessons are useful
The problem is certainly a clear candidate for ML usage

Example Network

16 expert inputs
64 nodes (ReLU)
32 nodes (ReLU)
32 nodes (ReLU)
5 outputs (softmax)

~5k weights → ~5k multipliers
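A minimal Keras sketch of this architecture (the layer sizes match the slide; optimizer, loss, and other details are illustrative, and the actual trained model comes from the hls4ml jet-tagging example):

    import tensorflow as tf

    # 16 expert-level jet features in, 5 jet classes out (q, g, W, Z, t in the
    # hls4ml example); sizes match the slide, everything else is illustrative.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(16,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(5, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    model.summary()   # ~4.4k parameters, consistent with the ~5k quoted above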

Reducing Network Size: Compression

Compression: removing nodes or connections from the network
To identify redundant connections, we use a method of successive retraining and weight minimization (pruning), sketched after this list:
– Use L1 regularization: modify the loss function with a penalty term for large weights
– Remove the smallest weights
– Repeat
HLS automatically removes multiplications by 0!

After the 7th iteration, 70% of the initial weights are removed
– Greatly reduced resource usage
– Slightly reduced latency
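A minimal sketch of one prune-and-retrain cycle in Keras (the regularization strength, pruning fraction, and number of iterations here are illustrative, not the exact recipe behind the results above):

    import numpy as np
    import tensorflow as tf

    def build_model(l1_strength=1e-4):
        """Jet-tagger architecture with an L1 penalty on the dense weights."""
        reg = tf.keras.regularizers.l1(l1_strength)
        return tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation="relu", input_shape=(16,),
                                  kernel_regularizer=reg),
            tf.keras.layers.Dense(32, activation="relu", kernel_regularizer=reg),
            tf.keras.layers.Dense(32, activation="relu", kernel_regularizer=reg),
            tf.keras.layers.Dense(5, activation="softmax", kernel_regularizer=reg),
        ])

    def prune_smallest(model, fraction=0.1):
        """Zero out roughly the smallest `fraction` of weights (by absolute value)."""
        for layer in model.layers:
            w, b = layer.get_weights()
            cutoff = np.quantile(np.abs(w), fraction)
            w[np.abs(w) < cutoff] = 0.0
            layer.set_weights([w, b])

    # Iterate: train with the L1 penalty, prune the smallest weights, retrain, repeat.
    # (In a real workflow the pruned weights are kept at zero during retraining.)
    model = build_model()
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    # for iteration in range(7):
    #     model.fit(x_train, y_train, epochs=10)   # x_train/y_train: your dataset
    #     prune_smallest(model, fraction=0.1)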

Reducing Network Size: Quantization

Quantization: reducing the bit precision used for NN arithmetic
– Software assumes all computations are performed with floating-point arithmetic
– Not always necessary for the desired performance
Reduction of precision automatically zeros very small weights (|w| < 2^-(# fractional bits))
Also reduces the resources needed to compute/store multiplications and intermediate layers
Full performance at 8 integer bits, 8 fractional bits
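A small sketch of how such a precision scan can be emulated before synthesis, by rounding weights onto a fixed-point grid; this is an offline approximation of the ap_fixed<W,I> behavior, not the hls4ml implementation itself:

    import numpy as np

    def quantize_fixed(x, total_bits=16, int_bits=6):
        """Round values to an ap_fixed<total_bits, int_bits>-like grid
        (int_bits includes the sign bit), with saturation at the range edges."""
        frac_bits = total_bits - int_bits
        scale = 2.0 ** frac_bits
        lo = -(2.0 ** (int_bits - 1))
        hi = 2.0 ** (int_bits - 1) - 1.0 / scale
        return np.clip(np.round(np.asarray(x) * scale) / scale, lo, hi)

    weights = np.array([0.7, 1.3e-2, 4.0e-4, -3.0e-5])
    print(quantize_fixed(weights, 16, 6))   # the tiny weights round to exactly 0
    # One can scan (total_bits, int_bits), re-evaluate the network accuracy at each
    # setting, and pick the smallest precision that keeps full performance.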

Network Tuning

Compression & quantization can be used together while maintaining full performance

Network Tuning: Reuse

Reuse reduces multiplier usage at the cost of increased latency (and initiation interval)
– Scales as expected
Minimal effect of reuse on LUTs and FFs
For reuse = 1 and <16,6> precision, the total resource usage is well below the available resources for the target FPGA (KU115)

Synthesis vs. Implementation

All previous results come from HLS synthesis estimates
There are known differences between HLS estimates and the final implementation
For a slightly smaller model (this slide):
– FFs/LUTs: overestimated in most cases
– Multipliers: accurate below the maximum width of the multiplier input, overestimated above
Also able to meet timing constraints

Under Development

Large amount of ongoing development with hls4ml
Expanding tool capabilities; working on adding full support for:
– Conv1D and Conv2D layers (partial support)
– LSTM/GRU (testing)
– Graph neural networks (prototyping)
– Binary/ternary dense networks (partial support)
– Pooling (prototyping)
– Boosted decision trees (testing)
Working on the ability to handle larger networks
Stay tuned for updates!
Multiple potential use cases for LHC trigger systems

ML in the CMS Trigger

The CMS endcap muon trigger uses a BDT for muon pT assignment
– Large on-board DRAM bank to implement a LUT
Work is ongoing to investigate replacing the BDT with a DNN
– Developing with hls4ml
– CPAD 2018 Talk
Can use two output nodes to simultaneously do pT assignment and PU discrimination
The implementation fits comfortably in a VU9P
– Algorithm latency is 72 ns
– 41% DSP usage
Work has also been done for the CMS trigger upgrade using hls4ml for tau lepton ID
– Range of additional applications for particle ID

Co-processors

Increasing popularity of co-processor systems
– CPU connected to an FPGA/GPU/TPU
– A common setup connects the FPGA to the CPU through PCI-express
Allows algorithms to run on the most optimized hardware ("acceleration")
FPGA-CPU co-processor machines are available as an offering on Amazon Web Services (AWS)
– F1 instances (connected to a Virtex Ultrascale VU9P) can be used to explore possibilities for accelerated inference

Acceleration with AWS

Development of the FPGA kernel and CPU host code is done with the SDAccel environment
– Invokes Vivado HLS under the hood, produces traditional synthesis reports, etc.
The host code runs on the CPU and manages data transfer and FPGA kernel execution
The hls4ml project only needs to be wrapped to provide specific inputs/outputs for SDAccel to interface properly
– Can be done generically
– Have accelerated a variety of hls4ml projects on AWS F1
Limited in speed by I/O bandwidth

[Diagram: inputs → CPU (C driver code) → FPGA kernel over PCI-express → post-processing → outputs]
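For illustration only, the generic host-side pattern (transfer inputs over PCI-express, launch the FPGA kernel, read back outputs) looks roughly like the sketch below; the real SDAccel host code is typically C++/OpenCL, and the binary path, kernel name, and I/O shapes here are hypothetical placeholders:

    import numpy as np
    import pyopencl as cl

    # Select the Xilinx platform exposed by the FPGA runtime
    platform = next(p for p in cl.get_platforms() if "Xilinx" in p.name)
    device = platform.get_devices()[0]
    ctx = cl.Context([device])
    queue = cl.CommandQueue(ctx)

    # Load a pre-built FPGA binary (hypothetical file and kernel names)
    with open("myproject.xclbin", "rb") as f:
        prg = cl.Program(ctx, [device], [f.read()]).build()

    inputs = np.random.rand(16).astype(np.float32)   # 16 input features
    outputs = np.empty(5, dtype=np.float32)          # 5 output scores

    mf = cl.mem_flags
    in_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=inputs)
    out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, outputs.nbytes)

    # Enqueue the kernel and copy the result back over PCI-express
    prg.myproject_kernel(queue, (1,), None, in_buf, out_buf)
    cl.enqueue_copy(queue, outputs, out_buf)
    queue.finish()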

An Acceleration Case Study (1)

HCAL reconstruction at CMS currently uses an iterative fit to extract the in-time energy
– 15% of HLT processing time
A similar procedure is used for ECAL reconstruction
– 10% of HLT processing time
The situation is expected to worsen in more intense conditions
We have begun investigating machine learning alternatives
– Similar inputs to the current algorithm (energies, η/φ/depth)
– Comparable performance with a simple network (3 layers; 15/10/5 nodes)

[Figure: input pulse shapes (TS3, TS4, TS5) vs. time and energy]

An Acceleration Case Study (2)

Have successfully implemented and run the network inference on an AWS FPGA
Including data transfer to/from the CPU, the whole FPGA inference process takes 2 ms for all 16k HCAL channels (vs. 50 ms/event for the iterative fit on CPU with the same inputs)
– Has been tested inside the standard CMS software environment, using a high-level trigger job
– Inference alone takes 80 μs (70 ns for one inference)
Every event sends its input features to the FPGA and waits for a callback
FPGA inference is a fixed-latency procedure; the iterative fit is not
Inference on CPU or GPU is also significantly faster than the iterative fit
– FPGA inference is fastest

hls4ml/SDAccel Workshop

Organized a workshop on hls4ml and acceleration last week
– "How to do ultrafast DNN inference on FPGAs"
– https://indico.cern.ch/event/769727/
Lots of interest across many HEP experiments and industry
– 90 participants
– By the end of the course, all participants were able to actually run inference on FPGAs
Used AWS to provide machines for development/acceleration
Interested in replicating the workshop at other locations

Workshop Program

First part:
– Understand the hls4ml package, its functionalities, and design synthesis by running with one of the provided trained NNs
– Learn how to read out an estimate of FPGA resources and latency for a NN after synthesis
– Learn how to optimize the design with quantization and parallelization
Second part:
– Learn how to do model compression and its effect on the FPGA resources/latency
Third part:
– Learn how to export the HLS design to firmware with SDAccel
Fourth part:
– Learn how to accelerate NN inference firmware on a real FPGA (provided on the Amazon cloud) with SDAccel
– Timing and resource studies after running on a real FPGA

Microsoft Brainwave

Microsoft Brainwave is a CPU-FPGA server farm
ML is offered "as a service"
– Send a preprocessed image to Brainwave, get a prediction back from ResNet50
Extremely interesting option for longer-latency (≳10 ms) systems
Stay tuned for another talk in the future

Outlook (1)

Machine learning is becoming increasingly widely used for complex problems
– Not only in physics but also in industry
– Some of the most challenging problems in HEP exist in the trigger
Even with a machine learning solution, we still need to be able to perform inference very fast → FPGAs
hls4ml provides the ability to implement very low latency machine learning solutions on FPGAs with a high degree of customization
– Can adjust the configuration of the implementation for the desired resource usage, latency, and accuracy

Outlook (2)

Have already utilized hls4ml to enable fast machine learning solutions in the CMS trigger
– E.g. muon pT assignment, tau identification, others
Improvements in fast inference need not be limited to traditionally FPGA-based systems
– E.g. large potential improvement in processing time for HCAL
– Could envision using accelerator cards during offline processing or the HLT
Usage is not restricted to the CMS trigger; many other possibilities for the use of fast ML
Growing list of collaborators:
– Jennifer Ngadiuba, Vladimir Loncar, Maurizio Pierini [CERN]; Giuseppe Di Guglielmo [Columbia University]; Javier Duarte, Burt Holzman, Sergo Jindariani, Ben Kreis, Mia Liu, Kevin Pedro, Ryan Rivera, Nhan Tran, Aristeidis Tsaris [Fermilab]; Edward Kreinar [HawkEye 360]; Sioni Summers [Imperial College London]; Song Han, Phil Harris, Dylan Rankin [MIT]; Zhenbin Wu [UIC]; Scott Hauck, Shih-Chieh Hsu, Dustin Warren, Risha Rao [UW]; Mark Neubauer, Markus Atkinson [UIUC]

BACKUP

Power Usage

Reuse also improves power consumption

Example Tuning: Reuse

Can tune the reuse factor in the hls4ml configuration

SDAccel Dataflow

Multiple different memory management options (DRAM, BRAM, registers)

Acceleration with Xilinx xDNN

Have also investigated the Xilinx xDNN package for acceleration of large convolutional networks
– Connection between CPU and FPGA with PCI-express, as before
– The major latency comes from the xDNN setup (loading weights)
– Can batch inputs: allows reuse of the loaded weights, only costs an additional few ms per image
Some similarities with Microsoft Brainwave
– The major difference is that xDNN/AWS lacks an "as a service" offering

Timing breakdown using GoogLeNet v1:
– Image preprocessing (~10 ms; depends on image size)
– Full xDNN execution (~400 ms; includes setup: load weights, etc.)
– Data transfer (~0.1 ms)
– Inference (~3 ms)
– Data transfer (~0.1 ms)
– Fully connected layer (~2 ms)
– Softmax output layer (~15 ms)
