Vivado HLS Tutorial - Cornell University

ECE 5775
High-Level Digital Design Automation
Fall 2018
Vivado HLS Tutorial
Steve Dai, Sean Lai, Hanchen Jin, Zhiru Zhang
School of Electrical and Computer Engineering

Agenda
Logistics and questions
Introduction to high-level synthesis
  – C-based synthesis
  – Common HLS optimizations
Case study: FIR filter

High-Level Synthesis (HLS)
What
  – Automated design process that transforms a high-level functional specification to optimized register-transfer level (RTL) descriptions for efficient hardware implementation
Why
  – Productivity: lower design complexity and faster simulation speed
  – Portability: single source -> multiple implementations
  – Permutability: rapid design space exploration -> higher quality of result (QoR)

Permutability: Faster Design Space Exploration
[Figure: control-data flow graph of the untimed function out1 = f(in1, in2, in3, in4), implemented three ways and compared by throughput and area.]

Implementation   Clock period set by   Throughput            Area
Combinational    3 * d_add             T1 = 1 / tclk         A1 = 3 * Aadd
Sequential       d_add + d_setup       T2 = 1 / (3 * tclk)   A2 = Aadd + 2 * Areg
Pipelined        d_add + d_setup       T3 = 1 / tclk         A3 = 3 * Aadd + 6 * Areg

Hardware Specialization with HLS
Data type specialization
  – Arbitrary-precision fixed-point, custom floating-point
Communication/interface specialization
  – Streaming, memory-mapped I/O, etc.
Memory specialization
  – Array partitioning, data reuse, etc.
Compute specialization
  – Unrolling (ILP/DLP), pipelining (ILP/DLP/TLP), dataflow (TLP), multithreading (DLP/TLP)
ILP/DLP/TLP: instruction-/data-/task-level parallelism

Typical C/C++ Synthesizable Subset
Data types:
  – Primitive types: (u)char, (u)short, (u)int, (u)long, float, double
  – Arbitrary-precision integer or fixed-point types
  – Composite types: array, struct, class
  – Templated types: template
  – Statically determinable pointers
No support for dynamic memory allocations
No support for recursive function calls
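The last two restrictions usually mean rewriting recursion as a bounded loop and replacing heap buffers with fixed-size arrays. The following is a minimal sketch of that rewrite in plain C; the function names and the MAX_LEN bound are hypothetical and do not come from the tutorial.

```c
#include <assert.h>

#define MAX_LEN 16 /* hypothetical compile-time bound on the array length */

/* Software-style version: recursion, which falls outside the typical
 * synthesizable subset. */
int sum_recursive(const int *data, int len) {
  if (len == 0)
    return 0;
  return data[len - 1] + sum_recursive(data, len - 1);
}

/* Synthesis-friendly version: fixed-size array argument, loop with a
 * constant trip count, no recursion and no dynamic allocation. */
int sum_iterative(const int data[MAX_LEN], int len) {
  assert(len >= 0 && len <= MAX_LEN);
  int acc = 0;
  for (int i = 0; i < MAX_LEN; i++) {
    if (i < len)
      acc += data[i];
  }
  return acc;
}
```

Both functions return the same sum for any len between 0 and MAX_LEN. Only the second has a statically known loop bound and storage size.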

Typical C/C++ Constructs to RTL Mapping

C Constructs        HW Components
Functions        -> Modules
Arguments        -> Input/output ports
Operators        -> Functional units
Scalars          -> Wires or registers
Arrays           -> Memories
Control flows    -> Control logic

Function Hierarchy
Each function is usually translated into an RTL module
  – Functions may be inlined to dissolve their hierarchy

Source code:
    void A() { ... body A ... }
    void C() { ... body C ... }
    void B() {
      C();
    }
    void TOP() {
      A();
      B();
    }

RTL hierarchy: TOP contains A and B; B contains C.

Function Arguments
Function arguments become ports on the RTL blocks

    void TOP(int* in1, int* in2, int* out1)
    {
      *out1 = *in1 + *in2;
    }

[Figure: TOP block containing a datapath and an FSM, with ports in1, in2, in1_vld, in2_vld, out1, and out1_vld.]

Additional control ports are added to the design
Input/output (I/O) protocols
  – Allow RTL blocks to automatically synchronize data exchange

Expressions
HLS generates datapath circuits mostly from expressions
  – Timing constraints influence the degree of registering

    char A, B, C, D;
    int P;
    P = (A + B) * C + D;

[Figure: datapath with inputs A, B, C, D and output P.]

Arrays
By default, an array in C code is typically implemented by a memory block in the RTL
  – Read & write array -> RAM; constant array -> ROM

    void TOP(int x)
    {
      int A[N];
      for (i = 0; i < N; i++)
        A[i+x] = A[i] + i;
    }

[Figure: array A[N] (elements N-1 down to 0) implemented as a RAM inside TOP, with signals A_in, A_out, DIN, DOUT, ADDR, CE, and WE.]

An array can be partitioned and mapped to multiple RAMs
Multiple arrays can be merged and mapped to one RAM
An array can be partitioned into individual elements and mapped to registers

Loops
By default, loops are rolled
  – Each loop iteration corresponds to a "sequence" of states (possibly a DAG)
  – This state sequence will be repeated multiple times based on the loop trip count

    void TOP() {
      ...
      for (i = 0; i < N; i++)
        b += a[i];
    }

[Figure: state diagram for TOP with state S1 (LD a[i]) and state S2 (update b).]

Loop Unrolling
Loop unrolling exposes higher parallelism and achieves shorter latency
  – Pros
      Decrease loop overhead
      Increase parallelism for scheduling
  – Cons
      Increase operation count, which may negatively impact area, power, and timing

Rolled loop:
    for (int i = 0; i < N; i++)
      A[i] = C[i] + D[i];

Unrolled:
    A[0] = C[0] + D[0];
    A[1] = C[1] + D[1];
    A[2] = C[2] + D[2];
    ...
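Between the fully rolled and fully unrolled forms sits partial unrolling, which trades a smaller increase in operation count for part of the parallelism. The sketch below unrolls the slide's loop by hand with a factor of 2. The trip count N = 8 and the function names are assumptions made for illustration; an HLS tool would normally perform this transformation itself when asked to unroll.

```c
#include <assert.h>

#define N 8 /* hypothetical trip count, a multiple of the unroll factor */

/* Rolled loop, as written on the slide: one addition per iteration. */
void add_rolled(int A[N], const int C[N], const int D[N]) {
  for (int i = 0; i < N; i++)
    A[i] = C[i] + D[i];
}

/* Same computation unrolled by a factor of 2: half as many iterations,
 * two independent additions per iteration that can be scheduled together. */
void add_unrolled2(int A[N], const int C[N], const int D[N]) {
  assert(N % 2 == 0); /* this simple form has no clean-up loop */
  for (int i = 0; i < N; i += 2) {
    A[i]     = C[i]     + D[i];
    A[i + 1] = C[i + 1] + D[i + 1];
  }
}
```

The two functions write identical results. The unrolled body contains twice the operations, which is where the area and power cost listed under Cons comes from.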

Loop Pipelining
Loop pipelining is one of the most important optimizations for high-level synthesis
  – Allows a new iteration to begin processing before the previous iteration is complete
  – Key metric: Initiation Interval (II) in # cycles

    for (i = 0; i < N; ++i)
      p[i] = x[i] * y[i];

[Figure: pipeline schedule with II = 1. Each iteration (i = 0, 1, 2, 3) loads x[i] and y[i], multiplies them, and stores p[i]; successive iterations start one cycle apart.]
ld – Load
st – Store
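The effect of the initiation interval on total loop latency can be estimated with a simple idealized model: a rolled loop takes roughly trip count times iteration latency, while a pipelined loop takes roughly iteration latency plus (trip count - 1) times II. The sketch below encodes that model; it ignores stalls and loop entry/exit overhead, and the function names are hypothetical.

```c
#include <assert.h>

/* Idealized cycle count when each iteration must finish before the next
 * one starts. */
long cycles_sequential(long trip_count, long iter_latency) {
  assert(trip_count >= 0 && iter_latency > 0);
  return trip_count * iter_latency;
}

/* Idealized cycle count when a new iteration starts every `ii` cycles:
 * the first iteration takes the full latency, and each later one
 * finishes `ii` cycles after its predecessor. */
long cycles_pipelined(long trip_count, long iter_latency, long ii) {
  assert(trip_count >= 0 && iter_latency > 0 && ii > 0);
  if (trip_count == 0)
    return 0;
  return iter_latency + (trip_count - 1) * ii;
}
```

With assumed values of 4 iterations and 2 cycles per iteration, the model gives 8 cycles for the rolled loop and 5 cycles for the pipelined loop at II = 1.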

Case Study: Finite Impulse Response (FIR) Filter

Finite Impulse Response (FIR) Filter
[Figure: FIR formula annotated with the input signal, output signal, filter order, and ith filter coefficient.]

    // original, non-optimized version of FIR
    #define SIZE 128
    #define N 10

    void fir(int input[SIZE], int output[SIZE]) {
      // FIR coefficients
      int coeff[N] = {13, -2, 9, 11, 26, 18, 95, -43, 6, 74};

      // exact translation from FIR formula above
      for (int n = 0; n < SIZE; n++) {
        int acc = 0;
        for (int i = 0; i < N; i++) {
          if (n - i >= 0)
            acc += coeff[i] * input[n - i];
        }
        output[n] = acc;
      }
    }

Server Setup
Log into ece-linux server
  – Host name: ecelinux.ece.cornell.edu
  – User name and password: [Your NetID credentials]
Set up tools for this class
  – Source class setup script to set up Vivado HLS
      source /classes/ece5775/setup-ece5775.sh
Test Vivado HLS
  – Open Vivado HLS interactive environment
      vivado_hls -i
  – List the available commands
      help

Copy FIR Example to Your Home Directory
    cd
    cp -r /classes/ece5775/FIR_tutorial/ .
    ls

Design files
  – fir.h: function prototypes
  – fir_*.c: function definitions
Testbench files
  – fir-top.c: function used to test the design
Synthesis configuration files
  – run.tcl: script for configuring and running Vivado HLS

Project Tcl Script
    # run.tcl for FIR

    # open the HLS project fir.prj
    open_project fir.prj -reset
    # set the top-level function of the design to be fir
    set_top fir
    # add design and testbench files
    add_files fir_initial.c
    add_files -tb fir-top.c

    open_solution "solution1"
    # use Zynq device
    set_part xc7z020clg484-1
    # target clock period is 10 ns
    create_clock -period 10

    # do a c simulation
    csim_design
    # synthesize the design
    csynth_design
    # do a co-simulation
    cosim_design
    # close project and quit
    close_project
    # exit Vivado HLS
    quit

You can use multiple Tcl scripts to automate different runs with different configurations.

Synthesize and Simulate the Design
    vivado_hls -f run.tcl

C simulation (SW simulation only; same as simply running a software program):
    Generating csim.exe
    128/128 correct values!
    INFO: [SIM 211-1] CSim done with 0 errors.

HLS (synthesizing C to RTL):
    INFO: [HLS 200-10] -------------
    INFO: [HLS 200-10] -- Scheduling module 'fir'
    INFO: [HLS 200-10] -------------
    INFO: [HLS 200-10] -------------
    INFO: [HLS 200-10] -- Exploring micro-architecture for module 'fir'
    INFO: [HLS 200-10] --------------
    INFO: [HLS 200-10] -------------
    INFO: [HLS 200-10] -- Generating RTL for module 'fir'
    INFO: [HLS 200-10] -------------

HW-SW co-simulation (SW test bench invokes RTL simulation):
    INFO: [COSIM 212-47] Using XSIM for RTL simulation.
    INFO: [COSIM 212-14] Instrumenting C test bench ...
    INFO: [COSIM 212-12] Generating RTL test bench ...
    INFO: [COSIM 212-323] Starting verilog simulation.
    INFO: [COSIM 212-15] Starting XSIM ...
    INFO: [COSIM 212-316] Starting C post checking ...
    128/128 correct values!
    INFO: [COSIM 212-1000] *** C/RTL co-simulation finished: PASS ***

Synthesis Directory Structure
[Figure: project directory tree. RTL files are generated in the systemc, verilog, and vhdl directories, alongside synthesis reports of each function in the design, except those inlined.]

Default Microarchitecture
    void fir(int input[SIZE], int output[SIZE]) {
      // FIR coefficients
      int coeff[N] = {13, -2, 9, 11, 26, 18, 95, -43, 6, 74};
      // Shift registers
      int shift_reg[N] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
      // loop through each output
      for (int i = 0; i < SIZE; i++) {
        int acc = 0;
        // shift registers
        for (int j = N - 1; j > 0; j--) {
          shift_reg[j] = shift_reg[j - 1];
        }
        // put the new input value into the first register
        shift_reg[0] = input[i];
        // do multiply-accumulate operation
        for (int j = 0; j < N; j++) {
          acc += shift_reg[j] * coeff[j];
        }
        output[i] = acc;
      }
    }

[Figure: default datapath with xn, shift_reg[0..9], coeff[0..9], acc, and yn.]

Possible optimizations
  – Loop unrolling
  – Array partitioning
  – Pipelining
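Before optimizing, it helps to confirm in software that the shift-register version computes the same outputs as the direct translation of the FIR formula. The following sketch does that in plain C. The names fir_direct, fir_shift, and check_fir_equivalence, as well as the test signal, are hypothetical and are not part of the tutorial's files.

```c
#include <assert.h>

#define SIZE 128
#define N 10

static const int coeff[N] = {13, -2, 9, 11, 26, 18, 95, -43, 6, 74};

/* Direct translation of the FIR formula (the non-optimized version). */
void fir_direct(const int input[SIZE], int output[SIZE]) {
  for (int n = 0; n < SIZE; n++) {
    int acc = 0;
    for (int i = 0; i < N; i++) {
      if (n - i >= 0)
        acc += coeff[i] * input[n - i];
    }
    output[n] = acc;
  }
}

/* Shift-register version (the default microarchitecture). */
void fir_shift(const int input[SIZE], int output[SIZE]) {
  int shift_reg[N] = {0};
  for (int i = 0; i < SIZE; i++) {
    int acc = 0;
    for (int j = N - 1; j > 0; j--)
      shift_reg[j] = shift_reg[j - 1];
    shift_reg[0] = input[i];
    for (int j = 0; j < N; j++)
      acc += shift_reg[j] * coeff[j];
    output[i] = acc;
  }
}

/* Hypothetical helper: feeds both versions the same made-up signal and
 * asserts that every output sample matches. */
void check_fir_equivalence(void) {
  int input[SIZE], out_direct[SIZE], out_shift[SIZE];
  for (int n = 0; n < SIZE; n++)
    input[n] = (n * 7) % 23 - 11; /* arbitrary small test values */
  fir_direct(input, out_direct);
  fir_shift(input, out_shift);
  for (int n = 0; n < SIZE; n++)
    assert(out_direct[n] == out_shift[n]);
}
```

The two versions agree because shift_reg[j] holds input[i - j] once i >= j and zero before that, which is exactly the term the direct version skips with its n - i >= 0 test.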

Unroll Loops
    void fir(int input[SIZE], int output[SIZE]) {
      ...
      // loop through each output
      for (int i = 0; i < SIZE; i++) {
        int acc = 0;
        // shift the registers
        for (int j = N - 1; j > 0; j--) {
    #pragma HLS unroll
          shift_reg[j] = shift_reg[j - 1];
        }
        ...
        // do multiply-accumulate operation
        for (int j = 0; j < N; j++) {
    #pragma HLS unroll
          acc += shift_reg[j] * coeff[j];
        }
        ...
      }
    }

Equivalent unrolled code:
    // unrolled shift registers
    shift_reg[9] = shift_reg[8];
    shift_reg[8] = shift_reg[7];
    shift_reg[7] = shift_reg[6];
    ...
    shift_reg[1] = shift_reg[0];

    // unrolled multiply-accumulate
    acc += shift_reg[0] * coeff[0];
    acc += shift_reg[1] * coeff[1];
    acc += shift_reg[2] * coeff[2];
    ...
    acc += shift_reg[9] * coeff[9];

Microarchitecture after Unrolling
[Figure: Default datapath (xn, shift_reg[0..9], coeff[0..9], acc, yn) compared with the Unrolled datapath (xn, shift_reg[0], coeff[0] through coeff[9], yn).]

Partition Arrays
    void fir(int input[SIZE], int output[SIZE]) {
      // FIR coefficients
      int coeff[N] = {13, -2, 9, 11, 26, 18, 95, -43, 6, 74};
      // Shift registers
      int shift_reg[N] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
    #pragma HLS ARRAY_PARTITION variable=shift_reg complete dim=0
      ...
    }

Complete array partitioning:
    // Shift registers
    int shift_reg_0 = 0;
    int shift_reg_1 = 0;
    int shift_reg_2 = 0;
    ...
    int shift_reg_9 = 0;

Microarchitecture after Partitioning
[Figure: Unrolled datapath (xn, shift_reg[0], coeff[0] through coeff[9], yn) compared with the Unrolled + Partitioned datapath (xn, shift_reg[0] through shift_reg[9], coeff[0] through coeff[9], yn).]

Pipeline Outer Loop
    void fir(int input[SIZE], int output[SIZE]) {
      ...
      // loop through each output
      for (int i = 0; i < SIZE; i++) {
    #pragma HLS pipeline II=1
        int acc = 0;
        // shift the registers
        for (int j = N - 1; j > 0; j--) {
    #pragma HLS unroll
          shift_reg[j] = shift_reg[j - 1];
        }
        // put the new input value into the first register
        shift_reg[0] = input[i];
        // do multiply-accumulate operation
        for (int j = 0; j < N; j++) {
    #pragma HLS unroll
          acc += shift_reg[j] * coeff[j];
        }
        ...
      }
    }

Pipeline the entire outer loop

Fully Pipelined Implementation
[Figure: fully pipelined implementation over time. The datapath for the previous sample xn-1 (shift_reg[0] through shift_reg[9], coeff[0] through coeff[9]) overlaps in time with the datapath for the current sample xn.]

Pipeline Outer Loop
    void fir(int input[SIZE], int output[SIZE]) {
      ...
      // loop through each output
      for (int i = 0; i < SIZE; i++) {
    #pragma HLS pipeline II=1
        int acc = 0;
        // shift the registers
        for (int j = N - 1; j > 0; j--) {
    #pragma HLS unroll
          shift_reg[j] = shift_reg[j - 1];
        }
        ...
        // do multiply-accumulate operation
        for (int j = 0; j < N; j++) {
    #pragma HLS unroll
          acc += shift_reg[j] * coeff[j];
        }
        ...
      }
    }

Pipeline the entire outer loop
Inner loops automatically unrolled when pipelining the outer loop
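Putting the pieces together, the design might be written with only the array-partition and pipeline pragmas, relying on the tool to unroll the inner loops as the slide notes. This is a sketch rather than the tutorial's own final source. A plain C compiler ignores the HLS pragmas (possibly with a warning), so the code can still be checked in software. Whether II = 1 is actually achieved depends on the tool and target and is not verified here. The helper check_fir_impulse is hypothetical.

```c
#include <assert.h>

#define SIZE 128
#define N 10

void fir(int input[SIZE], int output[SIZE]) {
  // FIR coefficients
  int coeff[N] = {13, -2, 9, 11, 26, 18, 95, -43, 6, 74};
  // Shift registers, split into individual registers
  int shift_reg[N] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
#pragma HLS ARRAY_PARTITION variable=shift_reg complete dim=0

  // loop through each output
  for (int i = 0; i < SIZE; i++) {
#pragma HLS pipeline II=1
    int acc = 0;
    // no unroll pragmas here: per the slide, inner loops are unrolled
    // automatically when the outer loop is pipelined
    for (int j = N - 1; j > 0; j--)
      shift_reg[j] = shift_reg[j - 1];
    // put the new input value into the first register
    shift_reg[0] = input[i];
    // do multiply-accumulate operation
    for (int j = 0; j < N; j++)
      acc += shift_reg[j] * coeff[j];
    output[i] = acc;
  }
}

/* Hypothetical software-only check: a unit impulse at the input should
 * reproduce the coefficients at the output, followed by zeros. */
void check_fir_impulse(void) {
  const int expected[N] = {13, -2, 9, 11, 26, 18, 95, -43, 6, 74};
  int input[SIZE] = {1}; /* input[0] = 1, all other samples 0 */
  int output[SIZE];
  fir(input, output);
  for (int n = 0; n < SIZE; n++)
    assert(output[n] == (n < N ? expected[n] : 0));
}
```

The impulse check only validates functional behavior. Latency, II, and resource usage have to be read from the synthesis reports.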
