CS 211: Introduction to Explicitly Parallel Instruction Computing (EPIC) Architectures – History and Overview of Key EPIC Concepts

Transcription

EPIC Architectures Overview (CS 211, Bhagi Narahari)
- Introduction to Explicitly Parallel Instruction Computing (EPIC) architectures
- History
- Overview of key EPIC concepts – speculation, predication, register files
- Introduction to compiler optimization

The EPIC Concept – A Little History
- The great CISC vs. RISC debate – who won?
- VLIW and EPIC: VLIW architectures progressed to EPIC
- A quick look at the "pure" VLIW approach
- Intel's "combination approach"
  – CISC instructions running on a RISC core (486)
  – adding superscalar features (Pentium)
  – the RISC processor had CISC-like features for special units
- 64-bit RISC instructions
  – HP PA-RISC project, early Intel efforts
  – iterative improvements, but no revolution
- To make programs run faster we need to look at new architectures
  – something radical had to be done for more performance

Historical Perspective: Microcoding, Nanocoding (and RISC)
- A generation of high-performance, application-specific computers relied on horizontally microprogrammed computing engines.
- Aggressive (but tedious) hand programming at the microcode level provided performance well above sequential processors.
- (diagram: macroinstructions feed a microsequencer (2910); a microcode store and nanocode store drive the datapath control of bit-slice ALUs)

What Is VLIW?
- VLIW hardware is simple and straightforward, like SIMD machines.
- While SIMD broadcasts one instruction to all functional units, VLIW separately directs each functional unit.
- (diagram: one wide instruction issuing add r1,r2,r3 / load r4,r5+4 / mov r6,r2 / mul r7,r8,r9 to separate functional units)

Principles of VLIW Operation
- Statically scheduled ILP architecture.
- Wide instructions (100 - 1000 bits) specify many independent simple operations (a sketch of such an instruction word follows below).
- Multiple functional units execute all of the operations in an instruction concurrently, providing fine-grain parallelism within each instruction.
- Instructions directly control the hardware with no interpretation and minimal decoding.
- A powerful optimizing compiler is responsible for locating and extracting ILP from the program and for scheduling operations to exploit the available parallel resources.
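To make the wide-instruction idea concrete, here is a minimal C sketch of a 4-slot VLIW instruction word, using the four operations from the diagram above. The opcode set, field widths, and slot layout are illustrative assumptions, not any real machine's encoding.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical 4-slot VLIW instruction word: one operation per
       functional unit, all issued together in a single cycle.        */
    typedef enum { OP_NOP, OP_ADD, OP_MUL, OP_LOAD, OP_MOV } Opcode;

    typedef struct {
        Opcode  op;
        uint8_t dst, src1, src2;    /* register specifiers; src2 doubles
                                       as a small immediate for loads   */
    } Operation;

    typedef struct {
        Operation slot[4];          /* slot[i] drives functional unit i */
    } VLIWInstruction;

    int main(void) {
        VLIWInstruction w = {{
            { OP_ADD,  1, 2, 3 },   /* FU0: add  r1,r2,r3  */
            { OP_LOAD, 4, 5, 4 },   /* FU1: load r4,r5+4   */
            { OP_MOV,  6, 2, 0 },   /* FU2: mov  r6,r2     */
            { OP_MUL,  7, 8, 9 },   /* FU3: mul  r7,r8,r9  */
        }};
        printf("FU0 opcode: %d\n", w.slot[0].op);
        return 0;
    }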

Formal VLIW Models
- Josh Fisher proposed the first VLIW machine at Yale (1983).
- Fisher's Trace Scheduling algorithm for microcode compaction could exploit more ILP than any existing processor could provide.
- The ELI-512 was to provide massive resources to a single instruction stream:
  – 16 processing clusters, multiple functional units per cluster
  – partial crossbar interconnect
  – multiple memory banks
  – attached processor – no I/O, no operating system.
- Later VLIW models became increasingly more regular.
- Compiler complexity was a greater issue than originally envisioned.

Ideal Models for VLIW Machines
- Almost all VLIW research has been based upon an ideal processor model.
- This is primarily motivated by compiler algorithm developers, to simplify scheduling algorithms and compiler data structures.
- This model includes:
  – multiple universal functional units
  – single-cycle global register file
  and often:
  – single-cycle execution
  – unrestricted, multi-ported memory
  – multi-way branching
  and sometimes:
  – unlimited resources (functional units, registers, etc.)

VLIW Execution Characteristics
- (diagram: a global multi-ported register file feeding several functional units, with instruction memory, a sequencer, and condition codes)
- Basic VLIW architectures are a generalized form of horizontally microprogrammed machines.

VLIW Design Issues
- Unresolved design issues:
  – the best functional unit mix
  – register file and interconnect topology
  – memory system design
  – the best instruction format.
- Many questions could be answered through experimental research
  – difficult – it needs effective retargetable compilers.
- Compatibility issues still limit interest in general-purpose VLIW technology.
- However, VLIW may be the only way to build 8-16 operation/cycle machines.

Realistic VLIW Datapath
- No bypass! No stall!
- (diagram: a multi-ported register file feeding an FAdd (1 cycle), an unpipelined FMul (4 cycles), a pipelined FMul (4 cycles), and an FDiv (16 cycles), with instruction memory, a sequencer, and condition codes)

Scheduling for Fine-Grain Parallelism
- The program is translated into primitive RISC-style (three-address) operations.
- Dataflow analysis is used to derive an operation precedence graph from a portion of the original program.
- Operations which are independent can be scheduled to execute concurrently, contingent upon the availability of resources.
- The compiler manipulates the precedence graph through a variety of semantics-preserving transformations to expose additional parallelism.

Example
- Original program:  e = (a + b) * (c + d);  b = b + 1
- Three-address code:
      A:  r1 = a + b
      B:  r2 = c + d
      C:  e = r1 * r2
      D:  b = b + 1
- Dependency graph: A, B, and D are independent; C depends on A and B.

VLIW List Scheduling
- Assign priorities.
- Compute the data-ready list (DRL) – all operations whose predecessors have been scheduled.
- Select from the DRL in priority order while checking resource constraints.
- Add newly ready operations to the DRL and repeat for the next instruction (a runnable sketch of this algorithm follows below).
- Resulting schedule on a 4-wide VLIW:
      00:  add a,b,r1 | add c,d,r2 | add b,1,b | nop
      01:  mul r1,r2,e | nop | nop | nop
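A minimal, runnable C sketch of the greedy list-scheduling loop, assuming unit latency, universal functional units, and the 4-wide machine of the example; the four operations and their dependence edges are taken from the slide, while the data structures and the fixed priority (program order) are our own simplification.

    #include <stdio.h>

    #define NOPS  4
    #define WIDTH 4

    static const char *name[NOPS] = { "add a,b,r1", "add c,d,r2",
                                      "mul r1,r2,e", "add b,1,b" };
    /* preds[i][j] != 0 means operation i depends on operation j */
    static const int preds[NOPS][NOPS] = {
        {0,0,0,0}, {0,0,0,0}, {1,1,0,0}, {0,0,0,0}
    };

    int main(void) {
        int done[NOPS] = {0}, ndone = 0, cycle = 0;
        while (ndone < NOPS) {
            int issued = 0, picked[NOPS] = {0};
            printf("%02d:", cycle);
            for (int i = 0; i < NOPS && issued < WIDTH; i++) {
                if (done[i]) continue;
                int ready = 1;                    /* data-ready test */
                for (int j = 0; j < NOPS; j++)
                    if (preds[i][j] && !done[j]) ready = 0;
                if (ready) { printf("  %s", name[i]); picked[i] = 1; issued++; }
            }
            /* retire at end of cycle so same-cycle ops stay independent */
            for (int i = 0; i < NOPS; i++)
                if (picked[i]) { done[i] = 1; ndone++; }
            for (; issued < WIDTH; issued++) printf("  nop");
            printf("\n");
            cycle++;
        }
        return 0;
    }

Running it reproduces the slide's schedule: cycle 00 issues A, B, and D with one NOP, and cycle 01 issues C.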

Enabling Technologies for VLIW
- VLIW architectures achieve high performance through the combination of a number of key enabling hardware and software technologies:
  – optimizing schedulers (compilers)
  – static branch prediction
  – symbolic memory disambiguation
  – predicated execution
  – (software) speculative execution
  – program compression.

Strengths of VLIW Technology
- Parallelism can be exploited at the instruction level
  – available in both vectorizable and sequential programs.
- Hardware is regular and straightforward
  – most hardware is in the datapath performing useful computations
  – instruction issue costs scale approximately linearly
  – potentially very high clock rate.
- The architecture is "compiler friendly"
  – the implementation is completely exposed – zero layers of interpretation
  – compile-time information is easily propagated to run time.
- Exceptions and interrupts are easily managed.
- Run-time behavior is highly predictable
  – allows real-time applications
  – greater potential for code optimization.

Weaknesses of VLIW Technology
- No object-code compatibility between generations.
- Program size is large (explicit NOPs)
  – Multiflow machines predated "dynamic memory compression" by encoding NOPs in the instruction memory.
- Compilers are extremely complex
  – assembly code is almost impossible.
- Philosophically incompatible with caching techniques.
- VLIW memory systems can be very complex
  – simple memory systems may provide very low performance
  – program-controlled multi-layer, multi-banked memory.
- Parallelism is underutilized for some algorithms.

VLIW vs. Superscalar [Bob Rau, HP]
- Operations per instruction: many for VLIW, one for a superscalar.
- Instruction-stream parsing: done by the compiler in VLIW, by hardware at run time in a superscalar.
- Analysis of register dependencies: compile time in VLIW, run time in a superscalar.
- Analysis of memory dependencies: compile time in VLIW, run time in a superscalar.
- Instruction reordering: compile time in VLIW, run time in a superscalar.
- Register allocation: compile time in VLIW (iteration frames / rotating registers), run time in a superscalar (renaming).

Real VLIW Machines
- Multiflow TRACE 7/300, 14/300, 28/300 [Josh Fisher]
- Multiflow TRACE /500 [Bob Colwell]
- Cydrome Cydra 5 [Bob Rau]
- IBM Yorktown VLIW Computer (research machine)
- Single-chip VLIW processors: Intel iWarp, Philips LIFE chips (research)
- Single-chip VLIW media (throughput) processors: Trimedia, Chromatic, MicroUnity
- DSP processors (TI TMS320C6x)
- Intel/HP EPIC IA-64 (Explicitly Parallel Instruction Computing)
- Transmeta Crusoe (x86 on VLIW?)
- Sun MAJC (Microarchitecture for Java Computing)

Why VLIW Now?
- Nonscalability of superscalar designs – ILP and complexity.
- Better compilation technology.
- Very expensive to implement wide dynamic superscalars.
- (diagram: with a 1-billion-transistor budget, a superscalar processor spends its area on dynamic-issue machinery and caches, while a 16-IPC VLIW CPU devotes it to functional units plus a 1 MB data cache and a 1.5 MB instruction cache)

Performance Obstacles of Superscalars
- Branches
  – branch prediction helps, but the penalty is still significant
  – limits the scope of dynamic and static ILP analysis and code motion.
- Memory load latency
  – CPU speed increases at 60% per year; memory speed increases at only 5% per year.
- Memory dependence
  – disambiguation is hard, both in hardware and in software.
- Sequential-execution-semantics ISAs
  – total ordering of all the instructions
  – implicit inter-instruction dependences.

Intel/HP EPIC/IA-64 Architecture
- EPIC (Explicitly Parallel Instruction Computing)
  – an ISA philosophy/approach, e.g. CISC, RISC, VLIW
  – very closely related to, but not the same as, VLIW.
- IA-64
  – an ISA definition, e.g. IA-32 (was called x86), PA-RISC
  – Intel's new 64-bit ISA
  – an EPIC-type ISA.
- Itanium (was code-named Merced)
  – a processor implementation of an ISA, e.g. P6, PA-8500
  – the first implementation of the IA-64 ISA.

IA-64 EPIC vs. Classic VLIW
- Similarities:
  – compiler-generated wide instructions
  – static detection of dependencies
  – ILP encoded in the binary (a group)
  – large number of architected registers.
- Differences:
  – instructions in a bundle can have dependencies
  – hardware interlocks between dependent instructions
  – accommodates varying numbers of functional units and latencies
  – allows dynamic scheduling and functional-unit binding; static scheduling is "suggestive" rather than absolute
  – code compatibility across generations, but software won't run at top speed until it is recompiled, so a "shrink-wrap binary" might need to include multiple builds.

EPIC Concepts
- Explicitly Parallel Instruction Computing
  – unlike early VLIW designs, EPIC does not use fixed-width instructions: as many parallel operations as possible!
- Programs must be written using sequential semantics
  – parallel semantics are not supported
  – the compiler must explicitly lay out the parallelism
  – e.g. swapping of operands.

EPIC – Philosophy
- How do we increase performance: work harder, or work smarter?
- To make the computer run faster:
  – Principle 1: make it do each thing faster
  – Principle 2: make it do more things at once (the key to the success of EPIC).
- Neither running fast nor juggling lots of things at once is simple!

EPIC: Key Concepts
- Speculation
- Predication (and parallel compares)
- Large (rotating) register files

EPIC Concepts: Speculation
- What do you do with all the parallelism, and how?
  – The traditional problem has been that we never have enough work ready to keep a machine fully busy.
- What happens when you stop worrying about only doing things we must?
  – If we have the power of parallelism, the key is to not throw it away.
  – Any time the processor is ready to do six things, do not give it only two things to do and ignore the ability to do more.
  – How?
- Speculatively ask the machine to do more things: pick tasks that might be needed in the future
  – we just aren't sure whether they will be needed at the time
  – make sure you can determine whether they were in fact needed
  – the extra tasks cost no extra time (due to the parallel units)
  – even if they are useful only 50% of the time, we have completed 50% of those tasks ahead of time!
- The promise of EPIC is based on speculation
  – the goal is to compute things before they are needed, so when the program needs a result it is already there!

EPIC Concepts: Predication
- Branching is generally bad because it interferes with the ideal pipeline model of reading instructions while an earlier instruction executes.
- Ideally, if we eliminate branches, this problem disappears.
- Predication is the process by which branches are eliminated.
- Predication allows instructions to execute in parallel, with some "on" and some "off", but without the need for branches
  – every instruction is written with a specified predicate register that controls whether the instruction executes at run time.
- Ability to do "parallel compares"
  – the ability to compute and combine comparison operations in parallel
  – two conditions such as (A > B) and (B > 0) can be computed in parallel using parallel-compare instructions.

EPIC Concepts: Predication (continued)
- EPIC provides predicated instructions
  – every instruction can be executed in a predicated manner
  – instruction execution is tied to the result of a predicate register
  – one predicate register is hardwired to 1; use it for instructions that should always execute.
- Example:
      if (a > b) { x = a; z = 1 } else { x = b; z = z + 1 }
  becomes, with a test in place of the branch:
      test = TRUE if (a > b), else FALSE
      if (test is TRUE)  tmp1 = a
      if (test is FALSE) tmp1 = b
      x = tmp1
      if (test is TRUE)  tmp2 = 1
      if (test is FALSE) tmp2 = z + 1
      z = tmp2
  (a C rendering of this if-conversion follows below)
- Branch form vs. predicated form:
      branch form:
        pr1 = cmp ...
        br X if pr1
        r1 = r2 + r3        // fall-through (F) path
        ...
      X:
        r1 = r2 - r3        // taken (T) path
      predicated form:
        pr1 = cmp ...
        r1 = r2 + r3 if !pr1
        r1 = r2 - r3 if pr1
  Note: no branches above!
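A C rendering of the predicated version, assuming the comparison is (a > b) as read above; the predicate p annuls one arm of each pair, so both arms can be computed unconditionally, and in parallel, with no branch.

    #include <stdio.h>

    int main(void) {
        int a = 7, b = 3, x, z = 10;

        int p = (a > b);              /* one compare sets the predicate */
        int tmp1_t = a;               /* "then" arm, computed regardless */
        int tmp1_f = b;               /* "else" arm, computed regardless */
        int tmp2_t = 1;
        int tmp2_f = z + 1;

        x = p ? tmp1_t : tmp1_f;      /* the predicate selects which     */
        z = p ? tmp2_t : tmp2_f;      /* result survives; both arms ran  */
                                      /* with no branch in between       */
        printf("x=%d z=%d\n", x, z);  /* prints x=7 z=1 */
        return 0;
    }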

Predicated Execution
- Each instruction can be separately predicated.
- There are 64 one-bit predicate registers in IA-64.
- An instruction is effectively a NOP if its predicate is false.
- Predication converts control flow into dataflow.
- (diagram: the branch diamond
      cmp; br; then1, then2 / else1, else2; join1, join2
  becomes the straight-line predicated sequence
      p1, p2 = cmp
      (p1) then1
      (p2) else1
      join1
      (p1) then2
      (p2) else2
      join2 )

Predication Is Not Control Speculation
- Two types of speculation:
  – data speculation
  – control speculation.
- In data speculation, loads are moved ahead of stores (discussed later).
- In control speculation, instructions are moved from below a branch to above the branch
  – control speculation is not the same as predication.

Differences
- In predication:
  – a compare instruction sets predicate registers
  – the predicate registers are used to confirm some instructions and annul others
  – exceptions take place where they occur.
- In control speculation:
  – instructions below a branch are moved above it and marked as speculative
  – exceptions are postponed via register tagging
  – if the speculation turns out to be false, the result is discarded (the register is overwritten)
  – if the speculation turns out to be true, we must check whether the speculative instructions caused exceptions.

Control Speculation
- Original code:
      pr1 = cmp ...
      br X if pr1
      r1 = r2 + r3        // fall-through path computes r1
      ... = r1            // and uses it
  X:
      r1 = r2 - r3        // taken path overwrites r1
      ...
- After hoisting the computation above the branch (control speculation):
      r1 = r2 + r3 (sp)   // moved up, marked speculative
      pr1 = cmp ...
      br X if pr1
      ... = r1            // non-speculative use of r1 causes a tag check
  X:
      r1 = r2 - r3        // speculation unwanted: r1 is overwritten
      ...
- Note: if the speculation caused a delayed exception, r1's speculation tag is set. On the fall-through path, where the speculation is wanted, r1 is a source, and a source of a non-speculative instruction with its speculation tag set will cause an exception. In X, where the speculation is unwanted, r1 is simply overwritten; r2 and r3, being non-speculative, will cause the speculation tag of r1 to be cleared.

Speculative, Non-Faulting Load
- Hoisting a load above a branch is unsafe code motion; the ld.s/chk.s pair makes it safe (a toy C model follows below):
      before:
        inst 1
        inst 2
        ...
        br X if pr1
        ld r1 = [a]
        use r1
      after:
        ld.s r1 = [a]
        inst 1
        inst 2
        ...
        br X if pr1
        chk.s r1
        use r1
- ld.s fetches speculatively from memory, i.e. any exception due to ld.s is suppressed.
- If ld.s r1 did not cause an exception, then chk.s r1 is a NOP; otherwise a branch is taken (to some compensation code).
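A toy C model of the speculative, non-faulting load, assuming a NaT-style tag bit per register; the names ld_s and chk_s echo the ld.s/chk.s instructions, but the Reg struct and the recovery path are our own illustrative scaffolding.

    #include <stdio.h>
    #include <stddef.h>

    typedef struct { int value; int nat; } Reg;  /* nat = deferred-fault tag */

    static void ld_s(Reg *r, const int *addr) {
        if (addr == NULL) { r->nat = 1; return; }  /* defer the would-be fault */
        r->value = *addr;
        r->nat = 0;
    }

    static void chk_s(Reg *r, const int *addr) {
        if (r->nat)          /* deferred exception: branch to compensation */
            ld_s(r, addr);   /* code, e.g. redo the load at the original site */
    }

    int main(void) {
        int x = 42;
        Reg r1;

        ld_s(&r1, &x);       /* the load, hoisted above its branch */
        /* ... other useful work fills the load latency here ... */
        chk_s(&r1, &x);      /* NOP-like when no exception was deferred */
        printf("r1 = %d (NaT=%d)\n", r1.value, r1.nat);
        return 0;
    }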

Speculative, Non-Faulting Load (continued)
- Speculatively loaded data can be consumed prior to the check; the "speculation" status is propagated with the speculated data.
- Any instruction that uses a speculative result also becomes speculative itself (i.e. its exceptions are suppressed).
- chk.s checks the entire dataflow sequence for exceptions:
      before:
        inst 1
        inst 2
        ...
        br ...
        ld r1 = [a]
        use r1
      after (both the load and its use are hoisted):
        ld.s r1 = [a]
        inst 1
        inst 2
        use r1          // this use is speculative too
        ...
        br ...
        chk.s r1        // on a deferred exception, compensation code
        ...             // re-executes  ld r1 = [a]; use r1

Speculative "Advanced" Load
- Problem: the compiler wants to move a load above a store that may alias it (a toy C model appears at the end of this page):
      before:
        inst 1
        inst 2
        ...
        st [?]           // store through a possibly aliasing address
        ...
        ld r1 = [x]
        use r1
      after:
        ld.a r1 = [x]    // advanced load
        inst 1
        inst 2
        ...
        st [?]
        ...
        ld.c r1 = [x]    // check load at the original site
        use r1
- ld.a starts the monitoring of any store to the same address as the advanced load.
- If no aliasing has occurred since the ld.a, ld.c is a NOP.
- If aliasing has occurred, ld.c re-loads from memory.

Using Speculative Load Results
- The advanced load's result can also be consumed before the check:
      ld.a r1 = [x]
      inst 1
      inst 2
      use r1             // consume the speculative value early
      ...
      st [?]
      ...
      chk.a r1           // if a store aliased [x], branch to compensation
      ...                // code that re-executes  ld r1 = [x]; use r1

Compare Operations
- Comparison operations may:
  – set predicate registers (up to two simultaneously)
  – compare to a GPR.
- Comparison operations can themselves be predicated.
- Predicate registers may be combined by logical operations.
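A toy C model of the advanced load, assuming a one-entry stand-in for the ALAT (the real structure is associative, with many entries); ld_a, st, and ld_c mimic the roles of ld.a, an ordinary store, and ld.c.

    #include <stdio.h>

    static const int *alat_addr = 0;           /* address being monitored */

    static int ld_a(const int *addr) {         /* advanced load */
        alat_addr = addr;
        return *addr;
    }

    static void st(int *addr, int v) {         /* store: may invalidate ALAT */
        *addr = v;
        if (addr == alat_addr) alat_addr = 0;  /* aliased: drop the entry */
    }

    static int ld_c(const int *addr, int r1) { /* check load */
        if (alat_addr == addr) return r1;      /* no alias: acts as a NOP */
        return *addr;                          /* alias: re-load from memory */
    }

    int main(void) {
        int x = 1;
        int *maybe_alias = &x;                 /* the "st [?]" of the slide */

        int r1 = ld_a(&x);                     /* load moved above the store */
        st(maybe_alias, 99);                   /* store turns out to alias x */
        r1 = ld_c(&x, r1);                     /* so the check re-loads */
        printf("r1 = %d\n", r1);               /* prints 99, not a stale 1 */
        return 0;
    }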

If-Conversion for Predication
- Identify a region of basic blocks based on resource requirements and profitability (branch misprediction rate, misprediction cost, and parallelism).
- Result: a predicated block with parallel compares.
      if (a < b) s = s + a; else s = s + b;  *p = s
  becomes
      cmp.lt p1,p2 = a,b ;;
      (p1) s = s + a
      (p2) s = s + b ;;
      *p = s

Reducing Control Height
- Convert nested ifs into a single predicate.
- Result: a shorter control path, by reducing the number of branches (a C rendering of the parallel compare appears at the end of this page).
      if (a < b || b < c) goto s1; else goto s2
  becomes
      p1 = 0 ;;
      cmp.lt.or p1,p0 = a,b
      cmp.lt.or p1,p0 = b,c
      (p1) br s1 ;;
      s2: ...

Multiway Branch Example
- Use multiway branches
  – speculate the compare (i.e. move it above the branch)
  – this does not reduce the number of branches.
      before:
        cmp.lt p1,p0 = a,b
        (p1) br X ;;
        cmp.lt p2,p0 = b,c
        (p2) br Z ;;
        Y: ...
      after:
        cmp.lt p1,p0 = a,b
        cmp.lt p2,p0 = b,c
        (p1) br X
        (p2) br Z ;;
        Y: ...

EPIC Concepts: Rotating Register Sets and Software Pipelining
- Speculation and predication require large sets of registers in EPIC.
- In addition, EPIC adds the concept of rotating register sets to support software pipelining
  – to help with the execution of loops
  – extending the pipelining concept to loop iterations.
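A C rendering of the OR-form parallel compare above: the predicate starts at 0 and each compare ORs its result in, so the two compares are independent of each other and a wide machine can issue them in the same cycle. The values of a, b, and c are arbitrary test inputs.

    #include <stdio.h>

    int main(void) {
        int a = 4, b = 2, c = 9;

        int p1 = 0;          /* p1 = 0 ;;                              */
        p1 |= (a < b);       /* cmp.lt.or p1,p0 = a,b   (independent)  */
        p1 |= (b < c);       /* cmp.lt.or p1,p0 = b,c   (independent)  */

        if (p1)              /* (p1) br s1 ;;                          */
            printf("took s1: a<b or b<c holds\n");
        else
            printf("took s2\n");
        return 0;
    }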

Software Pipelining: Illustration
- A loop body contains four independent instructions A, B, C, D, executed over n iterations.
- Conventional execution runs the iterations sequentially, one after another.
- A software pipeline overlaps the iterations: a prologue fills the pipeline, a new loop body (the kernel) executes A, B, C, and D drawn from different iterations at once (ILP = 4), and an epilogue drains it.
- Less time overall.

EPIC Concepts: Software Pipelining
- Software pipelining (also known as modulo scheduling) requires 'register reuse' across iterations
  – instead, EPIC provides rotating register sets.
- In IA-64, the registers are organized into a set of 32 static registers and 96 rotating registers
  – R[1], R[2], ..., R[32]: we have four sets of the above, allowing the same register references to be used in four different iterations.

Register Files: Static/Rotating
- Each register file may have a static and a rotating portion.
- The i-th static register in file F is named Fi; the i-th rotating register in file F is named F[i].
- F[i] = FR[(RRB + i) % size(FR)]
- (a C version of a software-pipelined loop follows below)
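A runnable C version of a software-pipelined loop, using the three-stage a[i] = a[i]/10 example that appears on the next page: stage A loads, stage B divides, stage C stores, and the kernel holds C(i), B(i+1), and A(i+2), which are independent statements a wide machine could issue together. The array size and contents are arbitrary test data.

    #include <stdio.h>

    #define N 8

    int main(void) {
        int a[N] = { 10, 20, 30, 40, 50, 60, 70, 80 };
        int rx, ry;                 /* pipeline registers between stages */

        /* prologue: fill the pipeline */
        rx = a[0];                  /* A(0)           */
        ry = rx / 10; rx = a[1];    /* B(0)  A(1)     */

        /* kernel: one store, one divide, one load per "cycle" */
        for (int i = 0; i < N - 2; i++) {
            a[i] = ry;              /* C(i)           */
            ry = rx / 10;           /* B(i+1)         */
            rx = a[i + 2];          /* A(i+2)         */
        }

        /* epilogue: drain the pipeline */
        a[N - 2] = ry;              /* C(N-2)         */
        a[N - 1] = rx / 10;         /* B(N-1), C(N-1) */

        for (int i = 0; i < N; i++) printf("%d ", a[i]);  /* 1 2 ... 8 */
        printf("\n");
        return 0;
    }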

Rotating Loop Frames for Loop Pipelining
- (diagram: prolog A; A,B – kernel A,B,C – epilog B,C; C)
- Suppose Bi is only data dependent (through data stored in registers) on Ai, and Ci only on Bi.
- The "pipelined" kernel block (containing independent computation from Ci, Bi+1 and Ai+2) potentially has better ILP.
- What happens if Ci is also data dependent on Ai? The result placed in a register by A gets clobbered by the next execution of A (in the next cycle) before C can use it two cycles from now.

Nice Loop Pipelining Example
      i = 0
      while (i < 99) {        ;; a[i] = a[i]/10
        Rx = a[i]             // A
        Ry = Rx / 10          // B
        a[i] = Ry             // C
        i++
      }
  pipelined:
      i = 0
      Ry = a[0] / 10
      Rx = a[1]
      while (i < 97) {
        a[i] = Ry             // C(i)
        Ry = Rx / 10          // B(i+1)
        Rx = a[i+2]           // A(i+2)
        i++
      }
      a[97] = Ry
      a[98] = Rx / 10

Loop Pipelining Requiring Renaming
      i = 0
      while (i < 99) {        ;; a[i] = a[i]/10 + a[i]
        Rx = a[i]             // A
        Ry = Rx / 10          // B
        a[i] = Ry + Rx        // C – now also reads A(i)'s result
        i++
      }
- Naively pipelining as before creates a WAR hazard: in the kernel, Rx is overwritten by A(i+2) before C(i) reads it. An extra copy Rx' fixes it:
      i = 0
      Rx = a[0]
      Ry = Rx / 10
      Rx' = Rx
      Rx = a[1]
      while (i < 97) {
        a[i] = Ry + Rx'
        Ry = Rx / 10
        Rx' = Rx
        Rx = a[i+2]
        i++
      }
      a[97] = Ry + Rx'
      a[98] = Rx / 10 + Rx

Renaming with Rotating Registers
- With rotating registers, the explicit copy disappears, and predicates merge the prologue, kernel, and epilogue into one loop (a runnable C model follows below):
      i = -2
      while (i < 99) {
        pred(i > -1):            a[i] = Ry + RR(x-2)
        pred(i > -2 && i < 98):  Ry = RR(x-1) / 10
        pred(i < 97):            RR(x) = a[i+2]
        increase the RR offset by 1
        i++
      }
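A runnable C model of the rotating-register loop above, assuming a small circular buffer rr[] as the rotating registers, rrb as the rotating register base, and if guards in place of the predicates; DEPTH = 4 and the test data are our own choices.

    #include <stdio.h>

    #define N 99
    #define DEPTH 4     /* enough rotating names for 3 in-flight iterations */

    int main(void) {
        int a[N], rr[DEPTH], ry = 0, rrb = 0;
        for (int i = 0; i < N; i++) a[i] = (i + 1) * 10;

        for (int i = -2; i < N; i++) {
            /* pred(i > -1):  C(i)  store, reading the value loaded 2 rotations ago */
            if (i > -1)              a[i] = ry + rr[(rrb - 2 + 2 * DEPTH) % DEPTH];
            /* pred(i > -2 && i < 98):  B(i+1)  divide, reading last rotation's load */
            if (i > -2 && i < N - 1) ry = rr[(rrb - 1 + DEPTH) % DEPTH] / 10;
            /* pred(i < 97):  A(i+2)  load into the current rotating name RR(x) */
            if (i < N - 2)           rr[rrb % DEPTH] = a[i + 2];
            rrb++;                   /* rotate: the RR offset increases by 1 */
        }
        printf("a[0]=%d a[98]=%d\n", a[0], a[98]);  /* 10/10+10=11, 990/10+990=1089 */
        return 0;
    }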

Register Renaming
- 128 general-purpose physical integer registers.
- Register names R0 to R31 are static and refer to the first 32 physical GPRs.
- Register names R32 to R127 are known as "rotating registers" and are renamed onto the remaining 96 physical registers by an offset.
- The remapping wraps around the rotating registers, such that when the offset is non-zero the physical location of R127 is just below that of R32 (the arithmetic is sketched in C at the end of this page).

IA-64 Architecture
- 128 general-purpose registers.
- 128 floating-point registers.
- Arbitrary number of functional units.
- Arbitrary latencies on the functional units.
- Arbitrary number of memory ports.
- Arbitrary implementation of the memory hierarchy.
- Needs a retargetable compiler, and recompilation, to achieve maximum program performance on different IA-64 implementations.

IA-64 Instruction Format
- IA-64 "bundle"
  – 128 bits in total
  – contains three IA-64 instructions (aka syllables)
  – template bits in each bundle specify dependencies, both within a bundle and between sequential bundles
  – a collection of independent bundles forms a "group"
  – a more efficient and flexible way to encode ILP than a fixed VLIW format.
- IA-64 instruction
  – fixed length, 41 bits long (three 41-bit syllables plus a 5-bit template make up the 128-bit bundle)
  – contains three 7-bit register specifiers
  – contains a 6-bit field for specifying one of the 64 one-bit predicate registers.
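A small C sketch of the renaming arithmetic described above, assuming the offset is simply added modulo the 96 rotating registers; real hardware folds this into the rename stage.

    #include <stdio.h>

    enum { NSTATIC = 32, NROT = 96 };

    static int physical(int name, int offset) {
        if (name < NSTATIC)                       /* R0..R31: identity map */
            return name;
        /* R32..R127 rotate within the 96-register window, wrapping so
           that with a non-zero offset R127 lands just below R32's view */
        return NSTATIC + (name - NSTATIC + offset) % NROT;
    }

    int main(void) {
        printf("R5   -> P%d\n", physical(5, 7));    /* static: P5 */
        printf("R32  -> P%d\n", physical(32, 7));   /* P39 */
        printf("R127 -> P%d\n", physical(127, 7));  /* wraps around to P38 */
        return 0;
    }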

Cool Features of IA-64
- Predicated execution.
- Speculative, non-faulting load instruction.
- Software-assisted branch prediction.
- Register stack.
- Rotating register frame.
- Software-assisted memory hierarchy.
- Mostly adapted from mechanisms that had existed for VLIWs.

Itanium Specifics
- 6-wide, 10-stage pipeline.
- Fetches 2 bundles per cycle, with the help of branch prediction, into an 8-bundle-deep fetch queue.
- 512-entry 2-level BPT, 64-entry BTAC, 4 TARs, and an RSB.
- Issues up to 2 bundles per cycle for some mixes of 6 instructions, e.g. (MFI, MFI) or (MIB, MIB).
- Can issue as little as one syllable per cycle on a RAW-hazard interlock or a structural hazard (a scoreboard handles RAW detection).
- 8-read, 6-write 128-entry integer GPR file; 128 82-bit FPRs; 64 predicate registers.
- 4 globally bypassed single-cycle integer ALUs with MMX, 2 FMACs, 2 LSUs, 3 BUs.
- Can execute IA-32 software directly.
- Intended for high-end servers and workstations.
- You can buy one now, finally.

EPIC and Compiler Optimization
- EPIC requires dependency-free "scheduled code".
- The burden of extracting parallelism falls on the compiler.
- The success of EPIC architectures depends on the efficiency of compilers!!
- We next provide an overview of compiler optimization techniques (as they apply to EPIC/ILP).

Introduction to Optimizing Compilers

Structure of Optimizing Compilers
- Multiple front ends (front-end #1, front-end #2, ...) translate each source program into a common high-level intermediate language (HIL).
- A high-level optimizer – the middle end, with access to a program database and tools – transforms the HIL into optimized HIL.
- Lowering of the IL produces a low-level intermediate language (LIL), which a low-level optimizer turns into optimized LIL.
- Separate code generators and linkers for Target-1, Target-2, Target-3, ... turn the optimized LIL into a Target-1 executable, a Target-2 executable, and so on.
