Computer Architecture J. Daniel García Sánchez .

Transcription

Introduction to instruction level parallelismIntroduction to instruction level parallelismComputer ArchitectureJ. Daniel García Sánchez (coordinator)David Expósito SinghFrancisco Javier García BlasARCOS GroupComputer Science and Engineering DepartmentUniversity Carlos III of Madridcb e d –Computer Architecture–ARCOS Group–http://www.arcos.inf.uc3m.es1/66

Introduction to instruction level parallelismIntroduction to pipelining1Introduction to pipelining2Hazards3Multi-cycle operations4Conclusioncb e d –Computer Architecture–ARCOS Group–http://www.arcos.inf.uc3m.es2/66

Introduction to instruction level parallelismIntroduction to pipeliningPipelineImplementation technique to execute multiple instructionsoverlapped in time.A costly operation is divided into simple sub-operations.Sub-operations are executed into stages.Effects:Increases throughput.Latency is not decreased.cb e d –Computer Architecture–ARCOS Group–http://www.arcos.inf.uc3m.es3/66

Introduction to instruction level parallelismIntroduction to pipeliningPipelineIF1Cycle 2IF2ID1IF3ID2EX1Cycle 4IF4ID3EX2M1Cycle 5IF5ID4EX3M2WB1Cycle 6IF6ID5EX4M3WB2IF7ID6EX5M4WB3IF8ID7EX6M5WB4Cycle 7Cycle 8cb e d –NormalCycle 3Pipeline fillingCycle 1Latency:Computer Architecture–One instructionrequires 5 stages.5 cycles.Throughput:ARCOS Group–One instructionfinalized per cycle(once pipeline isfull).1 instruction percycle.http://www.arcos.inf.uc3m.es4/66

Introduction to instruction level parallelismIntroduction to pipeliningPipeline stagesSimplified model (MIPS):5 stages.More realistic models require more stagesStages:Instruction Fetch.Instruction Decode.Execution.Memory.Write-back.cb e d –Computer Architecture–ARCOS Group–http://www.arcos.inf.uc3m.es5/66

Introduction to instruction level parallelismIntroduction to pipeliningInstruction FetchNext PC4MUXSumSend PC to memory.cb e d –Update PC.IRMemoryPC ADDRRead instruction.Computer Architecture–ARCOS Group–http://www.arcos.inf.uc3m.es6/66

Introduction to instruction level parallelismIntroduction to pipeliningInstruction DecodeNext PCR1RDRegistersIRR2Decode instruction.MUXRead registers.ALUSign extend offsets.MUXInmcb e d –Compute possible branchaddress.SignExtComputer Architecture–ARCOS Group–http://www.arcos.inf.uc3m.es7/66

Introduction to instruction level parallelismIntroduction to pipeliningExecutionNext PCNext PCMUXR1CMPZeroNext PC or R1ALU operation onregisters.Alternatively, computeeffective address.Dir orDataALUInm or R2cb e d –Computer Architecture–ARCOS Group–http://www.arcos.inf.uc3m.es8/66

Introduction to instruction level parallelismIntroduction to pipeliningMemoryDirDataMemoryRead from or write intomemory.DataDatacb e d –Computer Architecture–ARCOS Group–http://www.arcos.inf.uc3m.es9/66

Introduction to instruction level parallelismIntroduction to pipeliningWrite backData Mem.MUXWrite result into registerfile.RegsALUcb e d –Computer Architecture–ARCOS Group–http://www.arcos.inf.uc3m.es10/66

Introduction to instruction level parallelismIntroduction to pipeliningGeneral architectureMUX4Next PCSumCMPZeroR1RDMUXRegistersIRMemoryPC cb e d –Computer ArchitectureDecode–ARCOS sWrite-back11/66

Introduction to instruction level parallelismIntroduction to pipeliningPipeline effectsAn n depth pipeline multiplies by n the needed bandwidthcompared to a non-pipelined version with the same clockrate.Caching, caching, . . .Separation among data and instructions cachessuppresses some memory conflictsInstructions in the pipeline should not try to use the sameresource at the same time.Pipelining registers between successive stages.cb e d –Computer Architecture–ARCOS Group–http://www.arcos.inf.uc3m.es12/66

Introduction to instruction level parallelismIntroduction to pipeliningStages communicationNext PCNext 2RegistersIF/IDMemoryPC ADDRR1MUXMUXInm SignExtRDRD RDWB datacb e d –Computer Architecture–ARCOS Group–http://www.arcos.inf.uc3m.es13/66

Introduction to instruction level parallelismIntroduction to pipeliningPipeline over timeTime (cycles)Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle LUDMInstrIMemRegRegister read in second half of cycle.Register write in first half of cycle.cb e d –Computer Architecture–ARCOS Group–http://www.arcos.inf.uc3m.es14/66

Introduction to instruction level parallelismIntroduction to pipeliningExampleNon-pipelined processor.Clock cycle: 1 ns.40% ALU operations 4 cycles.20% branch operations 4 cycles.40% memory operations 5 cycles.Pipeline overhead 0.2 ns.Which is the pipeline speedup?torig cycleclock CPI 1ns (0.6 4 0.4 5) 4.4nstnew 1ns 0.2ns 1.2nsS cb e d –4.4ns 3.71.2nsComputer Architecture–ARCOS Group–http://www.arcos.inf.uc3m.es15/66

Introduction to instruction level parallelismHazards1Introduction to pipelining2Hazards3Multi-cycle operations4Conclusioncb e d –Computer Architecture–ARCOS Group–http://www.arcos.inf.uc3m.es16/66

Introduction to instruction level parallelismHazardsHazardA hazard is a situation preventing next instruction to startat the expected clock cycle.Hazards reduce performance in pipelined architectures.Types of hazards:Structural hazard.Data hazard.Control hazard.Simple approach to hazards:Stall the instruction flow.Already started instructions will continue.cb e d –Computer Architecture–ARCOS Group–http://www.arcos.inf.uc3m.es17/66

Introduction to instruction level parallelismHazardsStructural hazards2HazardsStructural hazardsData hazardsControl hazardscb e d –Computer Architecture–ARCOS Group–http://www.arcos.inf.uc3m.es18/66

Introduction to instruction level parallelismHazardsStructural hazardsStructural hazardHappens when hardware cannot support all possibleinstruction sequences.In the same cycle two pipeline stages need to use thesame resource.Reasons:Functional units that are not fully pipelined.Functional units which are not duplicated.These hazards can be avoided at the cost of a moreexpensive hardware.cb e d –Computer Architecture–ARCOS Group–http://www.arcos.inf.uc3m.es19/66

Introduction to instruction level parallelismHazardsStructural hazardsPipeline speedupSpeedup:tnonpipelined : Average instruction time in non-pipelinedarchitecture.tpipelined : Average instruction time in pipelined architecture.S tnonpipelinedCPInonpipelined cyclenonpipelined tpipelinedCPIpipelined cyclepipelinedIdeal case: pipelined CPI is 1.Need to add stall cycles per instruction.cb e d –Computer Architecture–ARCOS Group–http://www.arcos.inf.uc3m.es20/66

Introduction to instruction level parallelismHazardsStructural hazardsPipeline speedupIn case of non-pipelined processor:CPI 1, with cyclenonpipelined cyclepipelined .cyclenonpipelined N cyclepipelined .N Pipeline depth.SpeedupS cb e d –N1 stalls per instructionComputer Architecture–ARCOS Group–http://www.arcos.inf.uc3m.es21/66

Introduction to instruction level parallelismHazardsStructural hazardsStructural hazards: exampleTime (cycles)Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8LOADInstr egALUMemInstr 2Instr 3RegAssuming single port memory.cb e d –Computer Architecture–ARCOS Group–http://www.arcos.inf.uc3m.es22/66

Introduction to instruction level parallelismHazardsStructural hazardsStructural hazards: exampleTime (cycles)Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle llstallstallstallstallMemRegInstrInstr 1RegInstr 2Instr 3Instr 4Assuming single port memory.cb e d –Computer Architecture–ARCOS Group–http://www.arcos.inf.uc3m.es23/66

Introduction to instruction level parallelismHazardsStructural hazardsExampleTwo alternative designs:A: No structural hazards.Clock cycle 1nsB: With structural hazards.Clock cycle 0.9nsData access instructions with hazards 30%.Which one is the fastest alternative?tinst (A) CPI cycle 1 1ns 1nstinst (B) CPI cycle (0.7 1 0.3 (1 1)) 0.9ns 1.17nscb e d –Computer Architecture–ARCOS Group–http://www.arcos.inf.uc3m.es24/66

Introduction to instruction level parallelismHazardsData hazards2HazardsStructural hazardsData hazardsControl hazardscb e d –Computer Architecture–ARCOS Group–http://www.arcos.inf.uc3m.es25/66

Introduction to instruction level parallelismHazardsData hazardsData hazardA data hazard happens when the pipeline changes theread/write access to operands ordering.I2 reads R1 before than I1modifies it.ExampleI1 :I2 :I3 :I4 :I5 :dadddsubandorxorI3 reads R1 before than I1modifies it.I4 gets the right value. 1 , 2 , 3 4 , 1 , 5 6 , 1 , 7 8 , 1 , 9 10 , 1 , 11Register file is read insecond half of cycle.I5 gets the right value.cb e d –Computer Architecture–ARCOS Group–http://www.arcos.inf.uc3m.es26/66

Introduction to instruction level parallelismHazardsData hazardsData hazardsTime (cycles)Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8dadd 1, 2, 3dsub 4, 1, egALUMemRegMemRegALUMemand 6, 1, 7or 8, 1, 9xor 10, 1, 11cb e d –Computer Architecture–ARCOS Group–http://www.arcos.inf.uc3m.esReg27/66

Introduction to instruction level parallelismHazardsData hazardsStalls in data hazardsTime (cycles)Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8dadd 1, 2, 3dsub 4, 1, MemMemRegALUMemRegand 6, 1, 7or 8, 1, 9xor 10, 1, 11cb e d –Computer Architecture–ARCOS Group–http://www.arcos.inf.uc3m.es28/66

Introduction to instruction level parallelismHazardsData hazardsData hazards: RAWRead After Write.Instruction i 1 tries to read a datum before instruction iwrites it.If there is a data dependency, instructions:ExampleCan neither be executed in parallel noroverlap.Instruction sub needs value from 1produced by instruction add.i:add 1, 2, 3i 1: sub 4, 1, 3Solutions:Hardware detection.Compiler control.cb e d –Computer Architecture–ARCOS Group–http://www.arcos.inf.uc3m.es29/66

Introduction to instruction level parallelismHazardsData hazardsData hazards: WARWrite After Read.Instruction i 1 modifies operand before instruction i readsit.Also known as anti-dependence incompiler technology.Examplei : sub 4, 1, 3i 1: add 1, 2, 3i 2: mul 6, 1, 7Name reuse.Instruction add modifies 1 before subreads it.Cannot happen in a MIPS with 5-stages pipeline.All instructions with 5 stages.Reads always happen in stage 2.Writes always happen in stage 5.cb e d –Computer Architecture–ARCOS Group–http://www.arcos.inf.uc3m.es30/66

Introduction to instruction level parallelismHazardsData hazardsData hazards: WAWWrite After Write.Instruction i 1 modifies operand before instruction imodifies it.Also known as output dependency incompiler technologyExamplei : sub 1, 4, 3i 1: add 1, 2, 3i 2: mul 6, 1, 7Name reuse.Instruction add modifies 1 before submodifies it.Cannot happen in a MIPS with 5-stages pipeline.All instructions with 5 stages.Writes always happen in stage 5.cb e d –Computer Architecture–ARCOS Group–http://www.arcos.inf.uc3m.es31/66

Introduction to instruction level parallelismHazardsData hazardsSolutions to data hazardsRAW dependencies:Forwarding.WAR and WAW dependencies:Register renaming.Done statically by compiler.Done dynamically by hardware.cb e d –Computer Architecture–ARCOS Group–http://www.arcos.inf.uc3m.es32/66

Introduction to instruction level parallelismHazardsData hazardsForwardingTechnique to avoid some data stalls.Basic idea:No need to wait until result is written into register file.Result is already in pipeline registers.Use that value instead of the one from the register file.Implementation:Results from EX and MEM stages written into ALU inputregisters.Forwarding logic selects between real input and forwardingregister.cb e d –Computer Architecture–ARCOS Group–http://www.arcos.inf.uc3m.es33/66

Introduction to instruction level parallelismHazardsData hazardsForwardingdadd 1, 2, RegIMRegALUDMdsub 4, 1, 5and 6, 1, 7or 8, 1, 9xor 10, 1, 11cb e d –Computer Architecture–ARCOS Group–http://www.arcos.inf.uc3m.esReg34/66

Introduction to instruction level parallelismHazardsData hazardsforwarding limitationsNot every hazard can be avoided with forwarding.You cannot travel backwards in time!ExampleI1 :I2 :I3 :I4 :I5 :lwdsubandorxorIM 1 , ( 0 ) 2 4 , 1 , 5 6 , 1 , 7 8 , 1 , 9 10 , 1 , 11RegALUDMRegIMRegALUDMRegIf hazard cannot be avoided, a stall must be introduced.cb e d –Computer Architecture–ARCOS Group–http://www.arcos.inf.uc3m.es35/66

Introduction to instruction level parallelismHazardsData hazardsStalls in memory accesseslw 1, 0( gstallIMRegALUDMRegIMRegALUDMdsub 4, 1, 5and 6, 1, 7or 8, 1, 9xor 10, 1, 11cb e d –Computer Architecture–ARCOS Group–http://www.arcos.inf.uc3m.esReg36/66

Introduction to instruction level parallelismHazardsControl hazards2HazardsStructural hazardsData hazardsControl hazardscb e d –Computer Architecture–ARCOS Group–http://www.arcos.inf.uc3m.es37/66

Introduction to instruction level parallelismHazardsControl hazardsControl hazardA control hazard happens in a PC modification instruction.Terminology:Taken branch: If PC is modified.Not-taken branch: if PC is not modified.Problem:Pipelining assumes that branch will not be taken.What if, after ID, we find out that branch needs to be taken?cb e d –Computer Architecture–ARCOS Group–http://www.arcos.inf.uc3m.es38/66

Introduction to instruction level parallelismHazardsControl hazardsAlternatives in control hazardsCompile time: Fixed asumption for the full programexecution.Software ma

Instr Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 LOAD Instr 1 Instr 2 Instr 3 Instr 4 Mem Reg ALU Mem Reg Mem Reg ALU Mem Reg Mem Reg ALU Mem Reg stall stall stall stall stall Mem Reg Assuming single port memory. – Computer Architecture – ARCOS Group – http://www.arcos.inf.uc3m.es 23/66