Lecture 9: Digital Signal Processors: Applications and Architectures

Transcription

Lecture 9: Digital Signal Processors: Applications and Architectures
Prepared by: Professor Kurt Keutzer
Computer Science 252, Spring 2000
With contributions from: Dr. Jeff Bier, BDTI; Dr. Brock Barton, TI; Prof. Bob Brodersen, Prof. David Patterson

Processor Applications
[Figure axes: increasing cost toward the top of the list, increasing volume toward the bottom]
General Purpose - high performance
- Pentiums, Alphas, SPARC
- Used for general purpose software
- Heavyweight OS - UNIX, NT
- Workstations, PCs
Embedded processors and processor cores
- ARM, 486SX, Hitachi SH7000, NEC V800
- Single program
- Lightweight, often real-time OS
- DSP support
- Cellular phones, consumer electronics (e.g. CD players)
Microcontrollers
- Extremely cost sensitive
- Small word size - 8 bit common
- Highest volume processors by far
- Automobiles, toasters, thermostats, ...

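Processor Markets
[Figure: pie chart of a 30B processor market. Segment labels: 32-bit micro, 32-bit DSP, DSP, 16-bit micro, 8-bit micro. Slice values in the order they survive in the transcript: 1.2B/4%, 5.2B/17%, 10B/33%, 5.7B/19%, 9.3B/31%.]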

The Processor Design Space
[Figure: performance (vertical axis) versus cost (horizontal axis). Surviving region labels: "Application specific architectures", "...mance is everything & Software rules", and "Microcontrollers: Cost is everything".]

Market for DSP Products
[Figure: market chart with series Mixed/Signal, Analog, DSP]
DSP is the fastest growing segment of the semiconductor market

DSP Applications
Audio applications
- MPEG Audio
- Portable audio
Digital cameras
Wireless
- Cellular telephones
- Base station
Networking
- Cable modems
- ADSL
- VDSL

Another Look at DSP Applications
[Figure axes: increasing cost toward the high end, increasing volume toward the low end]
High-end
- Wireless Base Station - TMS320C6000
- Cable modem gateways
Mid-end
- Cellular phone - TMS320C540
- Fax/voice server
Low end
- Storage products - TMS320C27
- Digital camera - TMS320C5000
- Portable phones
- Wireless headsets
- Consumer audio
- Automobiles, toasters, thermostats, ...

Serving a range of applications

World’s Cellular Subscribers
[Figure: subscribers in millions (0 to 700) by year, 1993 to 2001, split into analog and digital]
Will provide a ubiquitous infrastructure for wireless data as well as voice
Source: Ericsson Radio Systems, Inc.

CELLULAR TELEPHONE SYSTEM
[Figure: block diagram. Surviving labels: keypad digits 1234567890, physical layer processing, A/D, converter, speech decoder, modem, DAC]

HW/SW/IC
[Figure: block diagram. Surviving labels: baseband converter, speech decoder, modem, DAC, DSP, analog IC]

Mapping onto a system on a chip
[Figure: block diagram. Surviving labels: S/P, DMA, RAM, control, de-intl & decoder, RPE-LTP speech decoder, demodulator and synchronizer]

Example Wireless Phone Organization
[Figure: block diagram. Surviving labels: C540, ARM7]

Multimedia I/O Architecture
[Figure: block diagram. Surviving labels: Radio Modem, Embedded Processor, Sched, ECC, Pact Interface, Low Power Bus, FB, Fifo, Video]

Multimedia System on a Chip
E.g. Multimedia terminal electronics
[Figure: chip block diagram. Surviving labels: Graphics Out, Uplink Radio, Video I/O, Downlink Radio, Voice I/O, Pen In, µP, Video Unit, Memory, Coms, custom, DSP]
Future chips will be a mix of processors, memory and dedicated hardware for specific algorithms and I/O

Requirements of the Embedded Processors
- Optimized for a single program - code often in on-chip ROM or off-chip EPROM
- Minimum code size (one of the motivations initially for Java)
- Performance obtained by optimizing datapath
- Low cost
  - Lowest possible area
  - Technology behind the leading edge
  - High level of integration of peripherals (reduces system cost)
- Fast time to market
  - Compatible architectures (e.g. ARM) allow reusable code
  - Customizable core
- Low power if application requires portability

Area of processor cores = Cost
[Figure: surviving labels: Nintendo processor, Cellular phones]

Another figure of merit
Computation per unit area?
[Figure: surviving labels: Nintendo processor, Cellular phones]

Code size
- If a majority of the chip is the program stored in ROM, then code size is a critical issue
- The Piranha has 3 sized instructions - basic 2 byte, and 2 byte plus 16 or 32 bit immediate

BENCHMARKS - DSPstone
ZIVOJNOVIC, VERLADE, SCHLAGER: UNIVERSITY OF AACHEN
APPLICATION BENCHMARKS
- ADPCM TRANSCODER - CCITT G.721
- REAL UPDATE
- COMPLEX UPDATES
- DOT PRODUCT
- MATRIX 1X3
- CONVOLUTION
- FIR
- FIR2DIM
- IIR ONE BIQUAD
- LMS
- FFT INPUT SCALED

Evolution of GP and DSP
- General Purpose Microprocessor traces roots back to Eckert, Mauchly, Von Neumann (ENIAC)
- DSP evolved from Analog Signal Processors, using analog hardware to transform physical signals (classical electrical engineering)
- ASP to DSP because
  - DSP is insensitive to environment (e.g., same response in snow or desert if it works at all)
  - DSP performance is identical even with variations in components; the behavior of 2 analog systems varies even if built with the same components with 1% variation
- Different history and different applications led to different terms, different metrics, some new inventions
- Convergence of markets will lead to architectural showdown

Embedded Systems vs. General Purpose Computing - 1
Embedded System
- Runs a few applications often known at design time
- Not end-user programmable
- Operates in fixed run-time constraints, additional performance may not be useful/valuable
General purpose computing
- Intended to run a fully general set of applications
- End-user programmable
- Faster is always better

Embedded Systems vs. General Purpose Computing - 2
Embedded System - differentiating features:
- power
- cost
- speed (must be predictable)
General purpose computing - differentiating features:
- speed (need not be fully predictable)
- speed
- did we mention speed?
- cost (largest component power)

DSP vs. General Purpose MPU
- DSPs tend to run 1 program, not many programs
  - Hence OSes are much simpler, there is no virtual memory or protection, ...
- DSPs sometimes run hard real-time apps
  - You must account for anything that could happen in a time slot
  - All possible interrupts or exceptions must be accounted for and their collective time subtracted from the time interval
  - Therefore, exceptions are BAD!
- DSPs have an infinite continuous data stream

DSP vs. General Purpose MPU
- The “MIPS/MFLOPS” of DSPs is the speed of Multiply-Accumulate (MAC)
  - DSPs are judged by whether they can keep the multipliers busy 100% of the time
- The "SPEC" of DSPs is 4 algorithms:
  - Infinite Impulse Response (IIR) filters
  - Finite Impulse Response (FIR) filters
  - FFT, and
  - convolvers
- In DSPs, algorithms are king!
  - Binary compatibility is not an issue
- Software is not (yet) king in DSPs
  - People still write in assembly language for a product to minimize the die area for ROM in the DSP chip

TYPES OF DSP PROCESSORS
DSP Multiprocessors on a die
- TMS320C80
- TMS320C6000
32-BIT FLOATING POINT
- TI TMS320C4X
- MOTOROLA 96000
- AT&T DSP32C
- ANALOG DEVICES ADSP21000
16-BIT FIXED POINT
- TI TMS320C2X
- MOTOROLA 56000
- AT&T DSP16
- ANALOG DEVICES ADSP2100

Note of Caution on DSP Architectures
Successful DSP architectures have two aspects:
- Key architectural and micro-architectural features that enabled product success in key parameters
  - Speed
  - Code density
  - Low power
- Architectural and micro-architectural features that are artifacts of the era in which they were designed
We will focus on the former!

Architectural Features of DSPs
- Data path configured for DSP
  - Fixed-point arithmetic
  - MAC - Multiply-accumulate
- Multiple memory banks and buses
  - Harvard Architecture
  - Multiple data memories
- Specialized addressing modes
  - Bit-reversed addressing
  - Circular buffers
- Specialized instruction set and execution control
  - Zero-overhead loops
  - Support for MAC
- Specialized peripherals for DSP
THE ULTIMATE IN BENCHMARK DRIVEN ARCHITECTURE DESIGN!!!

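DSP Data Path: Arithmetic
- DSPs dealing with numbers representing the real world: want “reals”/fractions
- DSPs dealing with numbers for addresses: want integers
- Support “fixed point” as well as integers
  - Fixed point: sign bit S with the radix point immediately after it, $-1 \le x < 1$
  - Integer: sign bit S with the radix point at the right end of the word, $-2^{N-1} \le x < 2^{N-1}$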
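The fractional fixed-point format on this slide can be illustrated with a short sketch. It assumes a 16-bit word with 1 sign bit and 15 fraction bits (commonly called Q15); the word size and the helper names `to_q15`, `from_q15`, and `q15_mul` are illustrative choices, not taken from the lecture.

```python
Q = 15            # assumed format: 1 sign bit + 15 fraction bits (Q15)
SCALE = 1 << Q    # 32768, the integer that represents 1.0


def to_q15(x):
    """Quantize a real number to Q15, clamping to the range -1 <= x < 1."""
    n = int(round(x * SCALE))
    return max(-SCALE, min(SCALE - 1, n))


def from_q15(n):
    """Interpret a Q15 integer as a fraction."""
    return n / SCALE


def q15_mul(a, b):
    """Multiply two Q15 values: the Q30 product is shifted back to Q15.

    Overflow for (-1) * (-1) is deliberately not handled in this sketch.
    """
    return (a * b) >> Q


half = to_q15(0.5)
quarter = q15_mul(half, half)
print(half, quarter, from_q15(quarter))
```

The same integer hardware serves both formats; only the position of the implied radix point, and therefore the shift after a multiply, differs.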

DSP Data Path: Precision
- Word size affects precision of fixed point numbers
- DSPs have 16-bit, 20-bit, or 24-bit data words
- Floating Point DSPs cost 2X - 4X vs. fixed point, slower than fixed point
- DSP programmers will scale values inside code
  - SW Libraries
  - Separate explicit exponent
- “Blocked Floating Point”: single exponent for a group of fractions
- Floating point support simplifies development
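A rough sketch of the "blocked floating point" idea follows: one shared exponent is chosen from the largest magnitude in a block, and every value is stored as a fixed-point fraction relative to it. The 15 fraction bits and the names `bfp_encode` and `bfp_decode` are assumptions made for illustration only.

```python
import math


def bfp_encode(block, frac_bits=15):
    """Return (mantissas, exponent) with one exponent shared by the block."""
    peak = max(abs(v) for v in block)
    _, exp = math.frexp(peak)          # peak = m * 2**exp with 0.5 <= m < 1
    scale = 1 << frac_bits
    mantissas = []
    for v in block:
        m = int(round(math.ldexp(v, -exp) * scale))    # v / 2**exp as a fraction
        mantissas.append(max(-scale, min(scale - 1, m)))  # clamp to word range
    return mantissas, exp


def bfp_decode(mantissas, exp, frac_bits=15):
    """Rebuild approximate real values from the shared-exponent block."""
    return [math.ldexp(m, exp - frac_bits) for m in mantissas]


mantissas, exp = bfp_encode([0.5, 1.5, -3.0])
print(mantissas, exp, bfp_decode(mantissas, exp))
```

Small values in a block that also holds a large value lose precision, which is the trade-off against per-value exponents.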

DSP Data Path: Overflow?
- DSPs are descended from analog: what should happen to the output when you “peg” an input? (e.g., turn up volume control knob on stereo)
  - Modulo arithmetic?
- Set to most positive ($2^{N-1}-1$) or most negative value ($-2^{N-1}$): “saturation”
- Many algorithms were developed in this model
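The difference between modulo (wrap-around) arithmetic and saturation can be seen in a small sketch. It assumes a 16-bit word (N = 16); `wrap16` and `sat16` are hypothetical helper names.

```python
def wrap16(v):
    """Modulo arithmetic: keep only the low 16 bits, as two's complement."""
    return ((v + 0x8000) & 0xFFFF) - 0x8000


def sat16(v):
    """Saturation: clamp to the most positive or most negative 16-bit value."""
    return max(-0x8000, min(0x7FFF, v))


total = 30000 + 10000      # exceeds the largest 16-bit value, 32767
print(wrap16(total))       # wraps to a large negative number
print(sat16(total))        # stays pinned at the most positive value
```

Wrap-around turns a loud signal into one of the opposite sign, while saturation behaves like an analog circuit that clips.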

DSP Data Path: Multiplier
- Specialized hardware performs all key arithmetic operations in 1 cycle
- 50% of instructions can involve the multiplier => single cycle latency multiplier
- Need to perform multiply-accumulate (MAC)
- n-bit multiplier => 2n-bit product

DSP Data Path: Accumulator
- Don’t want overflow or have to scale the accumulator
- Option 1: accumulator wider than product: “guard bits”
  - Motorola DSP: 24b x 24b => 48b product, 56b accumulator
- Option 2: shift right and round product before adder
[Figure: two datapaths. Option 1: Multiplier, ALU, Accumulator with guard bits (G). Option 2: Multiplier, Shift, ALU, Accumulator.]
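To see why guard bits matter, the sketch below accumulates 256 near-maximal 24-bit products into an accumulator of a chosen width. The 48-bit and 56-bit widths follow the Motorola example on the slide; the wrap-around model and the names `wrap` and `mac` are simplifying assumptions, not a description of any specific chip.

```python
def wrap(value, bits):
    """Model a two's-complement register of the given width."""
    half = 1 << (bits - 1)
    return ((value + half) & ((1 << bits) - 1)) - half


def mac(samples, coeffs, acc_bits):
    """Multiply-accumulate into an accumulator that is acc_bits wide."""
    acc = 0
    for x, c in zip(samples, coeffs):
        acc = wrap(acc + x * c, acc_bits)
    return acc


big = (1 << 23) - 1                 # largest positive 24-bit value
samples = [big] * 256
coeffs = [big] * 256
exact = sum(x * c for x, c in zip(samples, coeffs))

print(mac(samples, coeffs, 56) == exact)   # 8 guard bits: sum still fits
print(mac(samples, coeffs, 48) == exact)   # no guard bits: sum has wrapped
```

Eight guard bits give headroom for 256 worst-case accumulations, so no scaling is needed inside the loop.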

DSP Data Path: Rounding
- Even with guard bits, will need to round when storing the accumulator into memory
- 3 DSP standard options
  - Truncation: chop results => biases results (toward more negative values for two's-complement numbers)
  - Round to nearest: < 1/2 round down, >= 1/2 round up (more positive) => smaller bias
  - Convergent: < 1/2 round down, > 1/2 round up (more positive), = 1/2 round to make lsb a zero (+1 if 1, +0 if 0) => no bias
    - IEEE 754 calls this round to nearest even
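The three rounding options can be compared on integers by dropping the low `shift` bits of an accumulator value. This is a minimal sketch; the function names are illustrative, and Python's arithmetic right shift stands in for the hardware shifter.

```python
def truncate(v, shift):
    """Chop the low bits (arithmetic shift rounds toward minus infinity)."""
    return v >> shift


def round_nearest(v, shift):
    """Add one half, then chop: exact halves round up (more positive)."""
    return (v + (1 << (shift - 1))) >> shift


def round_convergent(v, shift):
    """Exact halves round so that the resulting lsb is zero (nearest even)."""
    half = 1 << (shift - 1)
    q = v >> shift
    r = v & ((1 << shift) - 1)        # the discarded low bits
    if r > half or (r == half and (q & 1)):
        q += 1
    return q


# With shift=1 the values 5 and 7 stand for 2.5 and 3.5.
for v in (5, 7, -5):
    print(v / 2, truncate(v, 1), round_nearest(v, 1), round_convergent(v, 1))
```

Only the exact-half cases differ between round to nearest and convergent rounding, which is where the residual bias comes from.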

Data Path
DSP Processor
- Specialized hardware performs all key arithmetic operations in 1 cycle
- Hardware support for managing numeric fidelity:
  - Shifters
  - Guard bits
  - Saturation
General-Purpose Processor
- Multiplies often take >1 cycle
- Shifts often take >1 cycle
- Other operations (e.g., saturation, rounding) typically take multiple cycles

320C54x DSP Functional Block Diagram

FIR Filtering: A Motivating Problem
- M most recent samples in the delay line (Xi)
- New sample moves data down delay line
- “Tap” is a multiply-add
- Each tap (M+1 taps total) nominally requires:
  - Two data fetches
  - Multiply
  - Accumulate
  - Memory write-back to update delay line
- Goal: 1 FIR Tap / DSP instruction cycle
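The per-tap work listed on this slide can be sketched in a few lines. This is an illustrative software model only: `fir_step` is a hypothetical name, the delay line is a plain list with the newest sample first, and a real DSP would avoid physically moving the data.

```python
def fir_step(delay, coeffs, x_new):
    """Process one input sample and return one filter output."""
    delay.insert(0, x_new)    # write-back: the new sample enters the delay line
    delay.pop()               # the oldest sample falls off the end
    acc = 0
    for h, x in zip(coeffs, delay):
        acc += h * x          # one tap: two data fetches, multiply, accumulate
    return acc


coeffs = [1, 2, 3]
delay = [0, 0, 0]
# Feeding a unit impulse returns the coefficients, then zero.
outputs = [fir_step(delay, coeffs, x) for x in [1, 0, 0, 0]]
print(outputs)
```

Each loop iteration corresponds to the work a DSP aims to finish in a single instruction cycle.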

BENCHMARKS - FIR FILTER
FINITE-IMPULSE RESPONSE FILTER
[Figure: tapped delay line of $z^{-1}$ elements with coefficients $C_1, C_2, \ldots, C_{N-1}, C_N$]

Micro-architectural impact - MAC
$y(n) = \sum_{m=0}^{N-1} h(m)\, x(n-m)$
element of finite-impulse response filter computation
[Figure: X and Y operands feed MPY, followed by ADD/SUB and ACC REG]

Mapping of the filter onto a DSP execution unit
[Figure: filter signal-flow graph with numbered nodes mapped onto the execution unit; node labels are not recoverable]
- The critical hardware unit in a DSP is the multiplier - much of the architecture is organized around allowing use of the multiplier on every cycle
- This means providing two operands on every cycle, through multiple data and address busses, multiple address units and local accumulator feedback

MAC Eg. - 320C54x DSP Functional Block Diagram

DSP Memory
- FIR Tap implies multiple memory accesses
- DSPs want multiple data ports
- Some DSPs have ad hoc techniques to reduce memory bandwidth demand
  - Instruction repeat buffer: do 1 instruction 256 times
  - Often disables interrupts, thereby increasing interrupt response time
- Some recent DSPs have instruction caches
  - Even then may allow programmer to “lock in” instructions into cache
  - Option to turn cache into fast program memory
- No DSPs have data caches
- May have multiple data memories

Conventional “Von Neumann” memory

HARVARD ARCHITECTURE in DSP
[Figure: block diagram. Labels: PROGRAM MEMORY, X MEMORY, Y MEMORY, GLOBAL, P DATA, X DATA, Y DATA]

Memory Architecture
DSP Processor
- Harvard architecture
- 2-4 memory accesses/cycle
- No caches - on-chip SRAM
General-Purpose Processor
- Von Neumann architecture
- Typically 1 access/cycle
- May use [...]

Eg. TMS320C3x MEMORY BLOCK DIAGRAM - Harvard Architecture

Eg. 320C62x/67x DSP

DSP Addressing
- Have standard addressing modes: immediate, displacement, register indirect
- Want to keep MAC datapath busy
- Assumption: any extra instructions imply clock cycles of overhead in inner loop
  => complex addressing is good
  => don’t use datapath to calculate fancy address
- Autoincrement/Autodecrement register indirect
  - lw r1,0(r2) with autoincrement: r1 <- M[r2]; r2 <- r2 + 1
  - Option to do it before addressing, positive or negative

DSP Addressing: FFT
- FFTs start or end with data in weird butterfly order
  0 (000) => 0 (000)
  1 (001) => 4 (100)
  2 (010) => 2 (010)
  3 (011) => 6 (110)
  4 (100) => 1 (001)
  5 (101) => 5 (101)
  6 (110) => 3 (011)
  7 (111) => 7 (111)
- What can we do to avoid the overhead of address checking instructions for FFT?
- Have an optional “bit reverse” addressing mode for use with autoincrement addressing
- Many DSPs have “bit reverse” addressing for radix-2 FFT
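The index permutation in the table on this slide can be reproduced in software, which is the work a "bit reverse" addressing mode removes from the inner loop. The helper name `bit_reverse` is an illustrative choice.

```python
def bit_reverse(index, bits):
    """Reverse the low `bits` bits of index (e.g. 001 becomes 100)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)   # move the lsb to the other end
        index >>= 1
    return result


# Access order for an 8-point radix-2 FFT (3 address bits).
order = [bit_reverse(i, 3) for i in range(8)]
print(order)
```

With hardware support, the address unit produces this sequence as a side effect of autoincrement, so no extra instructions run per sample.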

BIT REVERSED
[Figure: four 2-point DFTs, two 4-point DFTs, one 8-point DFT. Data flow in the radix-2 decimation-in-time FFT algorithm.]

DSP Addressing: Buffers
- DSPs deal with continuous I/O
- Often interact with an I/O buffer (delay lines)
- To save memory, the buffer is often organized as a circular buffer
- What can we do to avoid the overhead of address checking instructions for a circular buffer?
- Option 1: Keep a start register and end register per address register for use with autoincrement addressing, reset to start when reaching the end of the buffer
- Option 2: Keep a buffer length register, assuming the buffer starts on an aligned address, reset to start when reaching the end
- Every DSP has “modulo” or “circular” addressing
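Both options for circular addressing can be modeled as an address-update function. The sketch below rests on assumptions that go beyond the slide: Option 1 handles only positive steps no larger than the buffer, and Option 2 assumes a power-of-two length with the buffer base aligned to that length. The function names are hypothetical.

```python
def next_addr_start_end(addr, step, start, end):
    """Option 1: start and end registers; wrap to start after passing end."""
    addr += step
    if addr > end:
        addr = start + (addr - end - 1)
    return addr


def next_addr_length(addr, step, length):
    """Option 2: length register; base is implied by address alignment."""
    mask = length - 1                 # valid only for power-of-two lengths
    base = addr & ~mask               # high bits select the buffer
    return base | ((addr + step) & mask)   # low bits wrap inside it


addr = 0x106
for _ in range(4):                    # walk an 8-entry buffer at 0x100
    addr = next_addr_start_end(addr, 1, 0x100, 0x107)
    print(hex(addr))
```

In hardware this update happens in the address unit alongside the data access, so the wrap costs no extra instructions.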

CIRCULAR BUFFERS
Instructions accommodate three elements:
- buffer address
- buffer size
- increment
Allows for cycling through:
- delay elements
- coefficients in data memory

AddressingDSP ProcessorGenera
