EECS 252 Graduate Computer Architecture - Lec 7: Instruction Level Parallelism

Transcription

EECS 252 Graduate Computer Architecture
Lec 7 - Instruction Level Parallelism
David Patterson, Electrical Engineering and Computer Sciences, University of California, Berkeley
http://www.eecs.berkeley.edu/
2/8/2006

Review from Last Time
- 4 papers: all about where to draw the line between HW and SW
- IBM set the foundations for ISAs since the 1960s:
  - 8-bit byte
  - Byte-addressable memory (as opposed to word-addressable memory)
  - 32-bit words
  - Two's complement arithmetic (but not the first processor to use it)
  - 32-bit (SP) / 64-bit (DP) floating-point format and registers
  - Commercial use of microcoded CPUs
  - Binary compatibility / computer family
- B5000 was a very different model: HLL only, stack, segmented VM
- The IBM paper made the case for ISAs good for microcoded processors, leading to CISC
- The Berkeley paper made the case for ISAs for pipelined, cached microprocessors (VLSI), leading to RISC
- Who won, RISC vs. CISC? VAX is dead; Intel 80x86 on the desktop; RISC in embedded; servers are x86 and RISC

Outline
- Recall from pipelining review
- ILP
- Compiler techniques to increase ILP
- Loop unrolling
- Static branch prediction
- Dynamic branch prediction
- Overcoming data hazards with dynamic scheduling (start)
- Tomasulo algorithm
- Conclusion

Recall from Pipelining Review
Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
- Ideal pipeline CPI: measure of the maximum performance attainable by the implementation
- Structural hazards: HW cannot support this combination of instructions
- Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
- Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
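The CPI equation above is additive, so effective CPI is easy to compute. A minimal sketch; the per-instruction stall counts below are hypothetical, chosen only to illustrate the arithmetic:

```python
# Pipeline CPI = Ideal CPI + structural + data-hazard + control stalls,
# all expressed as average stall cycles per instruction.
def pipeline_cpi(ideal_cpi, structural, data_hazard, control):
    return ideal_cpi + structural + data_hazard + control

# Hypothetical numbers: ideal CPI of 1.0, plus 0.05 structural, 0.30 data,
# and 0.20 control stalls per instruction.
cpi = pipeline_cpi(1.0, 0.05, 0.30, 0.20)
print(cpi)   # ~1.55: the machine runs at about 65% of its ideal rate
```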

Instruction Level Parallelism
- Instruction-Level Parallelism (ILP): overlap the execution of instructions to improve performance
- 2 approaches to exploit ILP:
  1) Rely on hardware to help discover and exploit the parallelism dynamically (e.g., Pentium 4, AMD Opteron, IBM Power), and
  2) Rely on software technology to find parallelism, statically at compile time (e.g., Itanium 2)
- Next 4 lectures on this topic

ILP within a Basic Block is quite small
- Basic Block (BB): a straight-line code sequence with no branches in except to the entry and no branches out except at the exit
- Average dynamic branch frequency of 15% to 25% means 4 to 7 instructions execute between a pair of branches
- Plus, instructions in a BB are likely to depend on each other
- To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks
- Simplest: loop-level parallelism, exploiting parallelism among iterations of a loop. E.g.,
      for (i = 1; i <= 1000; i = i + 1)
          x[i] = x[i] + y[i];

Loop-Level Parallelism
- Exploit loop-level parallelism by "unrolling the loop," either
  1. dynamically, via branch prediction, or
  2. statically, via loop unrolling by the compiler
  (Another way is vectors, to be covered later)
- Determining instruction dependence is critical to loop-level parallelism
- If 2 instructions are:
  - parallel, they can execute simultaneously in a pipeline of arbitrary depth without causing any stalls (assuming no structural hazards)
  - dependent, they are not parallel and must be executed in order, although they may often be partially overlapped

Data Dependence and Hazards
- InstrJ is data dependent (aka true dependent) on InstrI if:
  1. InstrJ tries to read an operand before InstrI writes it
         I: add r1,r2,r3
         J: sub r4,r1,r3
  2.
     or InstrJ is data dependent on InstrK, which is dependent on InstrI
- If two instructions are data dependent, they cannot execute simultaneously or be completely overlapped
- Data dependence in the instruction sequence => data dependence in the source code => the effect of the original data dependence must be preserved
- If a data dependence causes a hazard in the pipeline, it is called a Read After Write (RAW) hazard
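These dependence definitions can be mechanized from each instruction's register read/write sets. A minimal sketch (not from the lecture); it also flags the WAR/WAW name dependences defined later in this lecture, and the register names are illustrative:

```python
# Classify the dependences from instruction I to a later instruction J,
# given each instruction's destination register and set of source registers.
def dependence(i_dest, i_srcs, j_dest, j_srcs):
    deps = []
    if i_dest in j_srcs:
        deps.append("RAW")   # true data dependence: J reads what I writes
    if j_dest in i_srcs:
        deps.append("WAR")   # anti-dependence: J writes what I reads
    if j_dest == i_dest:
        deps.append("WAW")   # output dependence: both write the same name
    return deps

# I: add r1,r2,r3   J: sub r4,r1,r3  -> J reads r1 before I's write is old
print(dependence("r1", {"r2", "r3"}, "r4", {"r1", "r3"}))  # ['RAW']
```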

ILP and Data Dependencies, Hazards
- HW/SW must preserve program order: the order instructions would execute in if executed sequentially, as determined by the original source program
  - Dependences are a property of programs
- The presence of a dependence indicates the potential for a hazard, but the actual hazard and the length of any stall are properties of the pipeline
- Importance of the data dependencies:
  1) indicates the possibility of a hazard
  2) determines the order in which results must be calculated
  3) sets an upper bound on how much parallelism can possibly be exploited
- HW/SW goal: exploit parallelism by preserving program order only where it affects the outcome of the program

Name Dependence #1: Anti-dependence
- Name dependence: when 2 instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name; there are 2 versions of name dependence
- InstrJ writes an operand before InstrI reads it:
      I: sub r4,r1,r3
      J: add r1,r2,r3
      K: mul r6,r1,r7
- Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1"
- If an anti-dependence causes a hazard in the pipeline, it is called a Write After Read (WAR) hazard

Name Dependence #2: Output dependence
- InstrJ writes an operand before InstrI writes it:
      I: sub r1,r4,r3
      J: add r1,r2,r3
      K: mul r6,r1,r7
- Called an "output dependence" by compiler writers. This also results from the reuse of the name "r1"
- If an output dependence causes a hazard in the pipeline, it is called a Write After Write (WAW) hazard
- Instructions involved in a name dependence can execute simultaneously if the name used in the instructions is changed so the instructions do not conflict
  - Register renaming resolves name dependence for registers
  - Either by the compiler or by HW

Control Dependencies
- Every instruction is control dependent on some set of branches, and, in general, these control dependencies must be preserved to preserve program order
      if p1 {S1;};
      if p2 {S2;}
- S1 is control dependent on p1, and S2 is control
dependent on p2 but not on p1.
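The register-renaming fix for name dependences described above can be sketched in a few lines. A compiler-style sketch, not the lecture's hardware scheme; the instruction encoding (destination, source list) and physical register names `p0, p1, ...` are made up for illustration:

```python
# Rename each architectural destination to a fresh physical name so that
# only true (RAW) dependences remain; WAR and WAW name dependences vanish.
def rename(program):
    mapping = {}                                   # arch name -> current phys name
    fresh = iter(f"p{n}" for n in range(100))
    out = []
    for dest, srcs in program:
        srcs = [mapping.get(s, s) for s in srcs]   # sources read the current mapping
        mapping[dest] = next(fresh)                # every write gets a new name
        out.append((mapping[dest], srcs))
    return out

# I: sub r4,r1,r3 ; J: add r1,r2,r3 ; K: mul r6,r1,r7
# J's write of r1 anti-depends on I's read; after renaming, J writes p1
# while I still reads the old r1, so the WAR conflict is gone.
prog = [("r4", ["r1", "r3"]), ("r1", ["r2", "r3"]), ("r6", ["r1", "r7"])]
print(rename(prog))
# [('p0', ['r1', 'r3']), ('p1', ['r2', 'r3']), ('p2', ['p1', 'r7'])]
```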

Control Dependence Ignored
- Control dependence need not be preserved
  - We are willing to execute instructions that should not have been executed, thereby violating the control dependences, if we can do so without affecting the correctness of the program
- Instead, the 2 properties critical to program correctness are:
  1) exception behavior, and
  2) data flow

Exception Behavior
- Preserving exception behavior: any changes in instruction execution order must not change how exceptions are raised in the program (=> no new exceptions)
- Example:
      DADDU R2,R3,R4
      BEQZ  R2,L1
      LW    R1,0(R2)
      L1:
  (Assume branches are not delayed)
- Problem with moving LW before BEQZ? If the branch is taken (R2 = 0), the LW was never meant to execute; hoisting it above the branch could raise a memory exception the original program never raises.

Data Flow
- Data flow: the actual flow of data values among instructions that produce results and those that consume them
  - Branches make the flow dynamic, determining which instruction is the supplier of data
- Example:
      DADDU R1,R2,R3
      BEQZ  R4,L
      DSUBU R1,R5,R6
      L: OR R7,R1,R8
- Does OR depend on DADDU or DSUBU? Must preserve data flow on execution

CS 252 Administrivia
- 1-page project writeups due this Sunday
  - Students working on the RAMP project should go to 253 Cory or 387 Soda to update their cardkey access for 125 Cory
  - RAMP Blue meeting today at 3:30 in the 6th-floor Soda alcove
- Reading assignment: Chapter 2 today, Chapter 3 next Wednesday
  - Try a 30-minute discussion after a one-hour lecture (similar to the ISA discussion)
  - Send email to the TA by Friday; it will be posted on Saturday; review before the discussion on Monday
- Paper: "Limits of instruction-level parallelism," by David Wall, Nov 1993
  - Read pages 1-35 (about ½ of the paper is figures)
  - In your comments, rank in order of importance: alias analysis, branch prediction, jump prediction, register renaming, and speculative execution
  - In your comments, mention: what are the limits to this study of the limits of ILP?

Computers in the News
Who said this?
  A. Jimmy Carter, 1979
  B. Bill Clinton, 1996
  C. Al Gore, 2000
  D. George W. Bush, 2006
"Again, I'd repeat to you that if we can remain the most competitive nation in the world, it will benefit the worker here in America. People have got to understand, when we talk about spending your taxpayers' money on research and development, there is a correlating benefit, particularly to your children. See, it takes a while for some of the investments that are being made with government dollars to come to market. I don't know if people realize this, but the Internet began as the Defense Department project to improve military communications. In other words, we were trying to figure out how to better communicate, here was research money spent, and as a result of this sound investment, the Internet came to be. The Internet has changed us. It's changed the whole world."

Outline
- ILP
- Compiler techniques to increase ILP
- Loop Unrolling
- Static Branch Prediction
- Dynamic Branch Prediction
- Overcoming Data Hazards with Dynamic Scheduling (Start)
- Tomasulo Algorithm
- Conclusion

Software Techniques - Example
FP Loop: Where are the Hazards?
- This code adds a scalar to a vector:
      for (i = 1000; i > 0; i = i - 1)
          x[i] = x[i] + s;
- First translate into MIPS code (to simplify, assume 8 is the lowest address):
      Loop: L.D    F0,0(R1)    ;F0 = vector element
            ADD.D  F4,F0,F2    ;add scalar from F2
            S.D    0(R1),F4    ;store result
            DADDUI R1,R1,#-8   ;decrement pointer 8B (DW)
            BNEZ   R1,Loop     ;branch if R1 != zero
- Assume the following latencies for all examples (ignore the delayed branch in these examples):

      Instruction producing result | Instruction using result | Latency in cycles | Stalls between, in cycles
      FP ALU op                    | Another FP ALU op        | 4                 | 3
      FP ALU op                    | Store double             | 3                 | 2
      Load double                  | FP ALU op                | 1                 | 1
      Load double                  | Store double             | 1                 | 0
      Integer op                   | Integer op               | 1                 | 0
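The stall counting implied by the latency table can be sketched mechanically. A simplified sketch, not the lecture's method: it only checks dependences between adjacent instructions (sufficient for this loop) and assumes, as the lecture does for this example, that an integer result feeding the branch costs 1 stall:

```python
# Count cycles for the FP loop body using the stall table above.
# stalls[(producer class, consumer class)] = stall cycles between them.
stalls = {
    ("FP ALU", "FP ALU"): 3,
    ("FP ALU", "store"): 2,
    ("load", "FP ALU"): 1,
    ("load", "store"): 0,
}

# (text, class, registers written, registers read) for each instruction.
loop = [
    ("L.D F0,0(R1)",     "load",    {"F0"}, {"R1"}),
    ("ADD.D F4,F0,F2",   "FP ALU",  {"F4"}, {"F0", "F2"}),
    ("S.D 0(R1),F4",     "store",   set(),  {"F4", "R1"}),
    ("DADDUI R1,R1,#-8", "integer", {"R1"}, {"R1"}),
    ("BNEZ R1,Loop",     "branch",  set(),  {"R1"}),
]

cycles = 0
for idx, (text, kind, writes, reads) in enumerate(loop):
    stall = 0
    if idx > 0:
        _, prev_kind, prev_writes, _ = loop[idx - 1]
        if prev_writes & reads:
            # pairs missing from the table (integer -> branch) cost 1 stall
            stall = stalls.get((prev_kind, kind), 1)
    cycles += 1 + stall
print(cycles)  # 9 clock cycles per iteration, matching the stalled schedule
```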

FP Loop Showing Stalls
      1 Loop: L.D    F0,0(R1)   ;F0 = vector element
      2       stall
      3       ADD.D  F4,F0,F2   ;add scalar in F2
      4       stall
      5       stall
      6       S.D    0(R1),F4   ;store result
      7       DADDUI R1,R1,#-8  ;decrement pointer 8B (DW)
      8       stall             ;assumes can't forward to branch
      9       BNEZ   R1,Loop    ;branch if R1 != zero
- 9 clock cycles: rewrite the code to minimize stalls?

Revised FP Loop Minimizing Stalls
      1 Loop: L.D    F0,0(R1)
      2       DADDUI R1,R1,#-8
      3       ADD.D  F4,F0,F2
      4       stall
      5       stall
      6       S.D    8(R1),F4   ;altered offset when we move DADDUI
      7       BNEZ   R1,Loop
- Swap DADDUI and S.D by changing the address of S.D
- 7 clock cycles, but just 3 for execution (L.D, ADD.D, S.D) and 4 for loop overhead; how to make it faster?

Unroll Loop Four Times (straightforward way)
      Loop: L.D    F0,0(R1)
            ADD.D  F4,F0,F2
            S.D    0(R1),F4     ;drop DSUBUI & BNEZ
            L.D    F6,-8(R1)
            ADD.D  F8,F6,F2
            S.D    -8(R1),F8    ;drop DSUBUI & BNEZ
            L.D    F10,-16(R1)
            ADD.D  F12,F10,F2
            S.D    -16(R1),F12  ;drop DSUBUI & BNEZ
            L.D    F14,-24(R1)
            ADD.D  F16,F14,F2
            S.D    -24(R1),F16
            DADDUI R1,R1,#-32   ;alter to 4*8
            BNEZ   R1,LOOP
- With a 1-cycle stall after each L.D, 2-cycle stalls after each ADD.D, and 1 stall before the branch: 27 clock cycles, or 6.75 per iteration (assumes R1 is a multiple of 4)
- Rewrite the loop to minimize stalls?

Unrolled Loop Detail
- We do not usually know the upper bound of the loop
- Suppose it is n, and we would like to unroll the loop to make k copies of the body
- Instead of a single unrolled loop, we generate a pair of consecutive loops:
  - The 1st executes (n mod k) times and has a body that is the original loop
  - The 2nd is the unrolled body surrounded by an outer loop that iterates (n/k) times
- For large values of n, most of the execution time will be spent in the unrolled loop
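The two-loop split described above (strip mining) can be sketched at the source level. A hypothetical scalar-add loop in Python, not the lecture's MIPS code, with the unroll factor fixed at k = 4:

```python
# Strip mining: a cleanup loop of (n mod k) original bodies, then an
# unrolled loop of n//k passes with k copies of the body and no
# intervening test/branch.
def add_scalar_unrolled(x, s):
    k = 4                          # unroll factor; body below has k copies
    n = len(x)
    i = 0
    for _ in range(n % k):         # 1st loop: n mod k original iterations
        x[i] += s
        i += 1
    for _ in range(n // k):        # 2nd loop: body unrolled k times
        x[i]     += s
        x[i + 1] += s
        x[i + 2] += s
        x[i + 3] += s
        i += k
    return x

print(add_scalar_unrolled([1.0] * 10, 2.0))  # same result as the rolled loop
```

For n = 10 and k = 4, the cleanup loop runs 2 iterations and the unrolled loop runs 2 passes of 4, covering all 10 elements.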

Unrolled Loop That Minimizes Stalls
      1  Loop: L.D    F0,0(R1)
      2        L.D    F6,-8(R1)
      3        L.D    F10,-16(R1)
      4        L.D    F14,-24(R1)
      5        ADD.D  F4,F0,F2
      6        ADD.D  F8,F6,F2
      7        ADD.D  F12,F10,F2
      8        ADD.D  F16,F14,F2
      9        S.D    0(R1),F4
      10       S.D    -8(R1),F8
      11       S.D    -16(R1),F12
      12       DSUBUI R1,R1,#32
      13       S.D    8(R1),F16   ; 8-32 = -24
      14       BNEZ   R1,LOOP
- 14 clock cycles, or 3.5 per iteration

5 Loop Unrolling Decisions
Requires understanding how one instruction depends on another and how the instructions can be changed or reordered given the dependences:
1. Determine that unrolling the loop would be useful by finding that the loop iterations are independent (except for the loop maintenance code)
2. Use different registers to avoid unnecessary constraints forced by using the same registers for different computations
3. Eliminate the extra test and branch instructions and adjust the loop termination and iteration code
4. Determine that the loads and stores in the unrolled loop can be interchanged by observing that loads and stores from different iterations are independent
   - This transformation requires analyzing the memory addresses and finding that they do not refer to the same address
5. Schedule the code, preserving any dependences needed to yield the same result as the original code

3 Limits to Loop Unrolling
1. Decrease in the amount of overhead amortized with each extra unrolling (Amdahl's Law)
2. Growth in code size
   - For larger loops, the concern is that it increases the instruction cache miss rate
3. Register pressure: potential shortfall in registers created by aggressive unrolling and scheduling
   - If it is not possible to allocate all live values to registers, the code may lose some or all of the advantage of loop unrolling
- Loop unrolling reduces the impact of branches on the pipeline; another way is branch prediction

Static Branch Prediction
- Lecture 3 showed scheduling code around a delayed branch
- To reorder code around branches, we need to predict the branch statically at compile time
- The simplest scheme is to predict a branch as taken
  - Average misprediction = untaken branch frequency = 34% for SPEC
- A more accurate scheme predicts branches using profile information collected from earlier runs, and modifies the prediction based on the last run
[Figure: misprediction rates with profile-based static prediction across SPEC benchmarks, roughly 4% to 25%; the integer programs (compress, eqntott, espresso, gcc, li) average a higher misprediction rate (about 15%) than the floating-point programs (doduc, ear, hydro2d, mdljdp, su2cor, about 9%)]

Dynamic Branch Prediction
- Why does prediction work?
  - The underlying algorithm has regularities
  - The data that is being operated on has regularities
  - The instruction sequence has redundancies that are artifacts of the way that humans/compilers think about problems
- Performance = f(accuracy, cost of misprediction)
- Is dynamic branch prediction better than static branch prediction?
  - Seems to be
  - There are a small number of important branches in programs which have dynamic behavior

Dynamic Branch Prediction: Branch History Table
- Branch History Table (BHT): the lower bits of the PC address index a table of 1-bit values
  - Says whether or not the branch was taken last time
  - No address check
- Problem: in a loop, a 1-bit BHT will cause two mispredictions per loop execution (the average loop runs 9 iterations before exit):
  - The end-of-loop case, when it exits instead of looping as before
  - The first time through the loop on the next pass through the code, when it predicts exit instead of looping

Dynamic Branch Prediction: 2-bit scheme
- Mispredict because either:
  - Wrong guess for that branch, or
  - Got the branch history of the wrong branch when indexing the table
- Solution: 2-bit scheme where the prediction is changed only after mispredicting twice
[State diagram: two "Predict Taken" states and two "Predict Not Taken" states with T/NT transitions between them; green = go (taken), red = stop (not taken)]
- Adds hysteresis to the decision-making process

BHT Accuracy
- 4096-entry table:
[Figure: misprediction rates for a 4096-entry 2-bit BHT on SPEC89, from 18% (eqntott) down to about 1% or 0% (e.g., nasa7, matrix300); the integer programs (gcc, espresso, eqntott, li) average higher rates than the floating-point programs (spice, doduc, fpppp, tomcatv)]
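The 1-bit vs. 2-bit behavior on a loop branch can be checked directly. A minimal sketch of both predictors on a made-up loop branch that is taken 4 times and then falls through, executed 3 times:

```python
# 1-bit BHT entry: remember only the last outcome.
def mispredicts_1bit(outcomes):
    state, miss = True, 0
    for taken in outcomes:
        miss += state != taken
        state = taken
    return miss

# 2-bit saturating counter: 0..3, predict taken when >= 2; the prediction
# only flips after two consecutive mispredictions (hysteresis).
def mispredicts_2bit(outcomes):
    ctr, miss = 3, 0
    for taken in outcomes:
        miss += (ctr >= 2) != taken
        ctr = min(3, ctr + 1) if taken else max(0, ctr - 1)
    return ctr and miss or miss

loop_branch = ([True] * 4 + [False]) * 3   # three trips through the loop
print(mispredicts_1bit(loop_branch))       # 5: misses each exit AND each re-entry
print(mispredicts_2bit(loop_branch))       # 3: misses only the loop exits
```

The 1-bit table pays twice per loop execution after the first (exit plus re-entry), exactly the problem described above; the 2-bit counter pays only once.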

Correlating Branches
- Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper n-bit branch history table
- In general, an (m,n) predictor means recording the last m branches to select between 2^m history tables, each with n-bit counters
  - Thus, the old 2-bit BHT is a (0,2) predictor
- Global branch history: an m-bit shift register keeping the T/NT status of the last m branches
- Each entry in the table has 2^m n-bit predictors

Correlated Branch Prediction: (2,2) predictor
[Diagram: the branch address (4 bits) selects a row of 2-bit per-branch predictors; the 2-bit global branch history selects among the four predictions of the next branch, updating just that prediction]

Accuracy of Different Schemes
[Figure: frequency of mispredictions for a 4,096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1,024-entry (2,2) BHT across SPEC89; rates range from 0% (e.g., tomcatv) up to 18%, and the 1,024-entry (2,2) predictor generally outperforms even the unlimited-entry 2-bit BHT]

Tournament Predictors
- Multilevel branch predictor
- Use an n-bit saturating counter to choose between predictors
- The usual choice is between global and local predictors
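An (m,2) predictor for a single branch can be sketched to show why history helps. A minimal sketch on a made-up alternating branch, which defeats a plain 2-bit BHT but is trivially predictable with one history bit:

```python
# (m,2) correlating predictor for one branch: an m-bit global history
# shift register selects one of 2**m 2-bit saturating counters.
def mispredicts_correlating(outcomes, m):
    hist, miss = 0, 0
    tables = [2] * (2 ** m)                 # all counters start weakly taken
    for taken in outcomes:
        ctr = tables[hist]
        miss += (ctr >= 2) != taken
        tables[hist] = min(3, ctr + 1) if taken else max(0, ctr - 1)
        hist = ((hist << 1) | taken) & (2 ** m - 1)   # shift in the outcome
    return miss

alternating = [True, False] * 8             # branch goes T, NT, T, NT, ...
print(mispredicts_correlating(alternating, 0))  # 8: a (0,2) BHT misses every NT
print(mispredicts_correlating(alternating, 1))  # 1: one history bit suffices
```

With m = 0 this is exactly the plain 2-bit BHT; with m = 1, the counter selected after a taken branch learns "not taken next" and vice versa, so only the warm-up misses.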

Comparing Predictors (Fig. 2.8)
- The advantage of a tournament predictor is the ability to select the right predictor for a particular branch
  - Particularly crucial for the integer benchmarks
  - A typical tournament predictor will select the global predictor almost 40% of the time for the SPEC integer benchmarks and less than 15% of the time for the SPEC FP benchmarks
- Tournament predictor using, say, 4K 2-bit counters indexed by the local branch address. It chooses between:
  - Global predictor
    - 4K entries indexed by the history of the last 12 branches (2^12 = 4K)
    - Each entry is a standard 2-bit predictor
  - Local predictor
    - Local history table: 1024 10-bit entries recording the last 10 branches, indexed by branch address
    - The pattern of the last 10 occurrences of that particular branch is used to index a table of 1K entries with 3-bit saturating counters

Pentium 4 Misprediction Rate (per 1000 instructions, not per branch)
- About 6% misprediction rate per branch on SPECint (19% of INT instructions are branches)
- About 2% misprediction rate per branch on SPECfp (5% of FP instructions are branches)
[Figure: branch mispredictions per 1000 instructions, around 11-14 for SPECint2000 programs such as 164.gzip, 175.vpr, and 176.gcc, and much lower for SPECfp2000]

Branch Target Buffers (BTB)
- Branch target calculation is costly and stalls the instruction fetch
- A BTB stores PCs the same way as caches
- The PC of a branch is sent to the BTB
- When a match is found, the corresponding predicted PC is returned
- If the branch was predicted taken, instruction fetch continues at the returned predicted PC
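The selector at the heart of a tournament predictor can be sketched on its own. A minimal sketch, not the Alpha/Pentium design: the two component predictors are deliberately trivial stand-ins passed in as functions, and the 2-bit chooser is trained toward whichever component was right:

```python
# Tournament selection: a 2-bit saturating chooser (>= 2 means "trust the
# global predictor") picks which component predictor's guess to use, and
# moves toward whichever component predicted correctly when they disagree.
def tournament(outcomes, global_pred, local_pred):
    chooser, miss = 2, 0
    for i, taken in enumerate(outcomes):
        g, l = global_pred(i), local_pred(i)
        miss += (g if chooser >= 2 else l) != taken
        if g == taken and l != taken:
            chooser = min(3, chooser + 1)    # global was right, local wrong
        elif l == taken and g != taken:
            chooser = max(0, chooser - 1)    # local was right, global wrong
    return miss

outcomes = [True, False] * 8
# Stand-ins: "global" happens to know the alternating pattern; "local"
# always predicts taken. The chooser keeps selecting the correct one.
print(tournament(outcomes, lambda i: i % 2 == 0, lambda i: True))  # 0
```

Swapping the scenario (all-taken outcomes, so the "local" stand-in is right and the "global" one is wrong half the time) shows the chooser migrating to the local predictor after a single miss.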

Dynamic Branch Prediction Summary
- Prediction is becoming an important part of execution
- Branch History Table: 2 bits for loop accuracy
- Correlation: recently executed branches are correlated with the next branch
  - Either different branches (GA)
  - Or different executions of the same branches (PA)
- Tournament predictors take the insight to the next level by using multiple predictors
  - Usually one based on global information and one based on local information, combined with a selector
  - In 2006, tournament predictors using 30K bits are in processors like the Power5 and Pentium 4
- Branch Target Buffer: include branch address & prediction

Outline
- ILP
- Compiler techniques to increase ILP
- Loop Unrolling
- Static Branch Prediction
- Dynamic Branch Prediction
- Overcoming Data Hazards with Dynamic Scheduling (Start)
- Tomasulo Algorithm
- Conclusion

Advantages of Dynamic Scheduling
- Dynamic scheduling: hardware rearranges the instruction execution to reduce stalls while maintaining data flow and exception behavior
- It handles cases where dependences are unknown at compile time
  - It allows the processor to tolerate unpredictable delays, such as cache misses, by executing other code while waiting for the miss to resolve
- It allows code compiled for one pipeline to run efficiently on a different pipeline
- It simplifies the compiler
- Hardware speculation, a technique with significant performance advantages, builds on dynamic scheduling (next lecture)

HW Schemes: Instruction Parallelism
- Key idea: allow instructions behind a stall to proceed
      DIVD F0,F2,F4
      ADDD F10,F0,F8
      SUBD F12,F8,F14
- Enables out-of-order execution and allows out-of-order completion (e.g., SUBD)
  - In a dynamically scheduled pipeline, all instructions still pass through the issue stage in order (in-order issue)
- We will distinguish when an instruction begins execution and when it completes execution; between the 2 times, the instruction is in execution
- Note: dynamic execution creates WAR and WAW hazards and makes exceptions harder

Dynamic Scheduling Step 1
- The simple pipeline had 1 stage to check both structural and data hazards: Instruction Decode (ID), also called Instruction Issue
- Split the ID pipe stage of the simple 5-stage pipeline into 2 stages:
  - Issue: decode instructions, check for structural hazards
  - Read operands: wait until no data hazards, then read operands

A Dynamic Algorithm: Tomasulo's
- For the IBM 360/91 (before caches!)
  - Long memory latency
- Goal: high performance without special compilers
- A small number of floating-point registers (4 in the 360) prevented interesting compiler scheduling of operations
  - This led Tomasulo to try to figure out how to get more effective registers: renaming in hardware!
- Why study a 1966 computer? The descendants of this have flourished!
  - Alpha 21264, Pentium 4, AMD Opteron, Power 5, ...

Tomasulo Algorithm
- Control & buffers are distributed with the Function Units (FUs)
  - FU buffers are called "reservation stations"; they hold pending operands
- Registers in instructions are replaced by values or pointers to reservation stations (RS); this is called register renaming
  - Renaming avoids WAR and WAW hazards
  - There are more reservation stations than registers, so it can do optimizations compilers can't
- Results go to the FUs from the RSs, not through registers, over the Common Data Bus that broadcasts results to all FUs
  - Avoids RAW hazards by executing an instruction only when its operands are available
- Loads and stores are treated as FUs with RSs as well
- Integer instructions can go past branches (predict taken), allowing FP ops beyond the basic block in the FP queue

Tomasulo Organization
[Diagram: instructions enter from the FP Op Queue; Load Buffers (Load1-Load6) bring values from memory and Store Buffers send results to memory; reservation stations Add1-Add3 feed the FP adders and Mult1-Mult2 feed the FP multipliers; the FP Registers supply operands; the Common Data Bus (CDB) broadcasts every result to the registers, store buffers, and all reservation stations]

Reservation Station Components
- Op: the operation to perform in the unit (e.g., + or -)
- Vj, Vk: the values of the source operands
  - Store buffers have a V field: the result to be stored
- Qj, Qk: the reservation stations producing the source registers (the value to be written)
  - Note: Qj,Qk = 0 means ready
  - Store buffers only have Qi, for the RS producing the result
- Busy: indicates that the reservation station and its FU are busy
- Register result status: indicates which functional unit will write each register, if one exists; blank when no pending instructions will write that register

Three Stages of Tomasulo Algorithm
1. Issue: get an instruction from the FP Op Queue
   - If a reservation station is free (no structural hazard), control issues the instruction & sends the operands (renames registers)
2. Execute: operate on operands (EX)
   - When both operands are ready, execute; if not ready, watch the Common Data Bus for the result
3. Write result: finish execution (WB)
   - Write on the Common Data Bus to all awaiting units; mark the reservation station available
- Normal data bus: data + destination (a "go to" bus)
- Common Data Bus: data + source (a "come from" bus)
  - 64 bits of data + 4 bits of Functional Unit source address
  - Write if it matches the expected Functional Unit (the producer of the result)
  - Does the broadcast
- Example speeds: 3 clocks for FP +,-; 10 for *; 40 clocks for /

Tomasulo Example
- Instruction stream (clock cycle 0; 3 load buffers, 3 FP adder reservation stations, and 2 FP multiplier reservation stations, all initially empty; a clock cycle counter and per-RS timers count down execution):
      LD    F6,34(R2)
      LD    F2,45(R3)
      MULTD F0,F2,F4
      SUBD  F8,F6,F2
      DIVD  F10,F0,F6
      ADDD  F6,F8,F2
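The three stages above can be sketched as a small dataflow simulation. A sketch, not the real hardware: cycle timing and structural limits are ignored (all six instructions issue up front, which preserves the same operand capture for this example), only renaming, readiness, and CDB broadcast are modeled, and the loaded memory values and the initial F4 are made-up numbers:

```python
from operator import add, sub, mul, truediv

# Issue renames each destination to a reservation-station tag; a station
# fires once both operands are values; each result is broadcast CDB-style
# to every waiting station, and only the newest writer updates a register.
def tomasulo(program, regs):
    stations = {}                        # tag -> [op, Vj-or-Qj, Vk-or-Qk]
    result_status = {}                   # register -> tag of last producer
    def operand(src):
        if isinstance(src, float):       # a loaded/immediate value
            return src
        return result_status.get(src, regs[src])
    for n, (op, dest, s1, s2) in enumerate(program):   # 1. Issue, in order
        tag = f"RS{n}"
        stations[tag] = [op, operand(s1), operand(s2)]
        result_status[dest] = tag        # rename: dest now comes from this RS
    while stations:                      # 2. Execute + 3. Write result
        ready = [t for t, (_, j, k) in stations.items()
                 if isinstance(j, float) and isinstance(k, float)]
        for tag in ready:
            op, j, k = stations.pop(tag)
            value = op(j, k)
            for s in stations.values():  # CDB broadcast to waiting stations
                if s[1] == tag: s[1] = value
                if s[2] == tag: s[2] = value
            for reg, producer in result_status.items():
                if producer == tag:      # avoids the WAW on F6 automatically
                    regs[reg] = value
    return regs

program = [                              # the six instructions above
    (add, "F6", 2.0, 0.0),               # LD F6,34(R2)  (assume M(A1) = 2.0)
    (add, "F2", 3.0, 0.0),               # LD F2,45(R3)  (assume M(A2) = 3.0)
    (mul, "F0", "F2", "F4"),             # MULTD F0,F2,F4
    (sub, "F8", "F6", "F2"),             # SUBD  F8,F6,F2
    (truediv, "F10", "F0", "F6"),        # DIVD  F10,F0,F6 (uses the LOADED F6)
    (add, "F6", "F8", "F2"),             # ADDD  F6,F8,F2
]
regs = {"F0": 0.0, "F2": 0.0, "F4": 4.0, "F6": 0.0, "F8": 0.0, "F10": 0.0}
print(tomasulo(program, regs))
# {'F0': 12.0, 'F2': 3.0, 'F4': 4.0, 'F6': 2.0, 'F8': -1.0, 'F10': 6.0}
```

Note how DIVD captures the load's F6 (via tag RS0) at issue, so the later ADDD can write F6 without a WAR hazard, and the load never clobbers ADDD's final F6 (no WAW): exactly the renaming guarantees listed above.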

Tomasulo Example Cycle 1
- LD F6,34(R2) issues to load buffer Load1 (Busy, address 34+R2); register result status: F6 will be written by Load1

Tomasulo Example Cycle 2
- LD F2,45(R3) issues to Load2; F2 will be written by Load2
- Note: we can have multiple loads outstanding

Tomasulo Example Cycle 3
- MULTD F0,F2,F4 issues to Mult1 with Vk = R(F4) and Qj = Load2; F0 will be written by Mult1
- Note: the register names are removed ("renamed") in the reservation stations; MULT issued
- Load1 completes execution; what is waiting for Load1?

Tomasulo Example Cycle 4
- Load1 writes its result M(A1) on the CDB; F6 = M(A1)
- SUBD F8,F6,F2 issues to Add1 with Vj = M(A1) and Qk = Load2; F8 will be written by Add1
- Load2 completes execution; what is waiting for Load2?

Tomasulo Example Cycle 5
- Load2 writes its result M(A2) on the CDB; F2 = M(A2); Add1 and Mult1 now have both operands
- DIVD F10,F0,F6 issues to Mult2 with Vk = M(A1) and Qj = Mult1; F10 will be written by Mult2
- The timers start counting down for Add1 (2 cycles for SUBD) and Mult1 (10 cycles for MULTD)

Tomasulo Example Cycle 6
- ADDD F6,F8,F2 issues to Add2 with Vk = M(A2) and Qj = Add1; F6 will be written by Add2
- Issue ADDD here despite the name (WAR) dependency on F6? Yes: DIVD already captured its F6 value at issue

Tomasulo Example Cycle 7
- Add1 (SUBD) completes execution; what is waiting for it?

Tomasulo Example Cycle 8
- Add1 writes its result (M-M) on the CDB; F8 = (M-M); Add2 now has both operands and its timer starts

Tomasulo Example Cycle 9
- Add2 (ADDD) is executing; Mult1 (MULTD) continues counting down

Tomasulo Example Cycle 10
- Add2 (ADDD) completes execution; what is waiting for it?

Tomasulo Example Cycle 11
- Write the result of ADDD here? Yes: DIVD captured the old value of F6 at issue, so writing F6 = (M-M)+M(A2) is safe
- All quick instructions complete in this cycle!

Tomasulo Example Cycle 12
- Only Mult1 (MULTD) and Mult2 (DIVD) remain busy; Mult2 still waits on Qj = Mult1

Tomasulo Example Cycle 13
- Mult1 (MULTD) continues executing (2 cycles remain on its timer); Mult2 (DIVD) still waits for Mult1's result; the tables are otherwise unchanged
