CS533: Dataflow Architectures (Tom Shull Presenting) - UIUC

Transcription

CS533: Dataflow Architectures (Tom Shull Presenting)
Josep Torrellas, University of Illinois at Urbana-Champaign
April 28, 2015

Motivation

- What is a Von Neumann architecture?
  - the traditional processor with registers, PC (control unit), memory, and ALU
- Why is this design problematic for parallel systems?
  - instructions can stall the pipeline, yet there still might be some non-dependent instructions ready to execute
  - it is hard to context switch (with a large number of threads), since information specific to a certain context is stored in the registers and control unit
  - is this more of a problem in parallel machines?

Motivation

- Ideally, in what scenarios does the processor have to stall?
  - the program has finished executing
  - communication latency
  - ALU operation latency
- In other words, a processor should only stall when, given the current data state, it is impossible to execute any program instructions

Dataflow Graphs

- In compilers, dataflow graphs are often used to represent programs
- A dataflow graph is a graph where instructions are nodes and edges represent dependences between instructions
- Why do compilers use this representation? Why is this representation advantageous for parallel programs?
  - it clearly shows which instructions can be executed concurrently
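The "which instructions can run concurrently" point can be sketched in code. This is a minimal illustration only: the four-instruction program and the `ready` helper are invented, not from the slides.

```python
# Hypothetical sketch: a program as a dependence graph, asking which
# instructions could execute concurrently (no unresolved producers).

# Each instruction maps to the instructions whose results it consumes.
deps = {
    "a = x + y": [],             # needs only program inputs
    "b = x * 2": [],             # independent of a
    "c = a - b": ["a = x + y", "b = x * 2"],
    "d = c * c": ["c = a - b"],
}

def ready(deps, done):
    """Instructions whose producers have all completed."""
    return [i for i in deps
            if i not in done and all(p in done for p in deps[i])]

# Initially both independent instructions are ready at once.
print(ready(deps, done=set()))
```

Reading the graph this way is exactly what makes the representation attractive for parallel execution: the ready set at each step is the exposed concurrency.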

Fine Grain Parallelism

- Is there a fundamental reason to have a few threads each executing a large number of instructions?
- Are there practical reasons for having a few threads?
  - context switching/management overhead
  - synchronization
- Instead, use a data-driven architecture!

Data-Driven Execution

- What is data-driven execution?
  - instead of executing instructions according to a program counter, execute instructions as soon as their operands are available
- This is nicely related to the dataflow graph technique mentioned earlier
- What are the advantages of this data-driven approach?
  - more concurrency is exposed: any "ready" instruction can be scheduled to execute next
  - synchronization is handled implicitly through the data dependences
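The firing discipline described above can be sketched as a toy interpreter. Everything here (the two-node graph, the `send` helper, the port encoding) is invented for illustration; a real machine routes tokens in hardware.

```python
# Data-driven execution sketch: each node fires the moment all of its
# operand tokens have arrived -- there is no program counter.
import operator

# node id -> (function, number of input ports, destination (node, port))
nodes = {
    "add": (operator.add, 2, ("mul", 0)),
    "mul": (operator.mul, 2, None),        # None marks the final result
}
tokens = {}    # (node, port) -> arrived operand value
result = {}

def send(node, port, value):
    tokens[(node, port)] = value
    fn, arity, dest = nodes[node]
    inputs = [tokens.get((node, p)) for p in range(arity)]
    if all(v is not None for v in inputs):    # all operands present: fire
        out = fn(*inputs)
        if dest is None:
            result["value"] = out
        else:
            send(*dest, out)

# Operands may arrive in any order; execution is driven by arrival.
send("add", 1, 4)
send("add", 0, 3)      # add fires: 3 + 4, forwarded to mul port 0
send("mul", 1, 10)     # mul fires: 7 * 10
```

Note that no explicit synchronization appears anywhere: the arc structure of the graph is the synchronization, which is the slide's second bullet.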

Data-Driven Execution

- What types of workloads benefit most from a data-driven approach? Are there other approaches to achieve speedup on these workloads?
  - while SIMD machines are effective for regular parallelism, data-driven machines can also theoretically work well with irregular parallelism
- How does the data-driven design compare with ASICs/accelerators?
  - many overheads of the Von Neumann design, such as instruction fetching and centralized memory structures, can be reduced

Introduction

- First articulated by Jack Dennis at MIT in 1969
- In dataflow machines, hardware is optimized for fine-grain data-driven parallel computation
- Data-driven computation is intuitively appealing because:
  - only data dependences constrain parallelism
  - programs are usually represented as graphs
- [Figure: example dataflow graph with inputs x and y; nodes compute a = x + y, b = a * a, c = 4 - a, and a - c]
- Although around for 40 years, little impact on mainstream computing
  - however, ideas used in OoO superscalars

Terminology

- Instructions are nodes (e.g. +, -, ...)
- Data items are tokens (e.g. 8.2, true, 4, ...)
- Producers of tokens are connected to consumers by arcs
- Entry points to nodes are input ports
- A node is enabled by an enabling rule. Typically this is when ALL input arcs have tokens (strict enabling rule)
  - non-strict allows firing before all arcs have tokens
- A node fires after it is enabled, and in this process consumes tokens on its input arcs and produces tokens on its output arcs

Node Types

- Functional: output depends only on inputs and node type (e.g. +, -, *)
- Conditional: depending on the control token, the token picked up from the value input is passed onto the true-output arc or the false-output arc
- Merge: whenever there is a token on any of the input arcs, it is immediately copied to the output arc
  - non-strict firing rule
  - acts as a serializer/merger
- [Figure: a Conditional node with a value input, a control input, and true/false outputs; a Merge node with inputs value 1 and value 2 and a single value out]
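As a rough sketch of the two control node types, under the obvious encoding (function names are invented; `None` stands for "no token on this arc"):

```python
# Conditional: route the value token by a boolean control token.
def conditional(value, control):
    """Return (true_output, false_output); exactly one carries the token."""
    return (value, None) if control else (None, value)

# Merge: non-strict -- fire on the first token present on any input arc.
def merge(*inputs):
    for tok in inputs:
        if tok is not None:
            return tok
    return None

t, f = conditional(42, control=True)    # token goes to the true arc
out = merge(None, t)                    # merge forwards it immediately
```

The asymmetry matters: a functional node waits for all operands (strict), while merge deliberately does not, which is what lets it serialize tokens arriving from either branch of a conditional.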

Example

- If statement
- Do-While loop

Programming Dataflow Machines

- Dataflow machines need a different ISA than traditional architectures
- Whose responsibility is it to create machine code for these machines?
  - open research topic(?)
  - many projects about using high-level languages to generate RTL, and about how to make programmable accelerators

Iteration, Recursion, Reuse and Reentrancy

- Using the same graph to perform computation on different sets of data can cause problems
- It is easy to imagine scenarios where data is lost, or where computation between many iterations is jumbled so that corrupted data is generated
- In the following, we assume that things mess up if a token arrives at an arc while another token is still pending. This happens if each node has a token buffer that is only one deep: when the second token arrives, the old one gets overwritten, or it is unclear which token to use while executing

Iteration, Recursion, Reuse and Reentrancy

- Consider the loop x := y := 0; while x < 10 do x := x + 1; y := y + h(x) od, where h is some arbitrary subgraph
- A second token can arrive at the input of function h if h takes a long time to compute compared to the + node
- Two general ways of handling the problem give rise to the classification of dataflow machines into:
  - Static
  - Dynamic
- [Figure 9 (Veen): an unsafe way to implement a loop; a new token may arrive at the input of subgraph h before the previous one is absorbed]

Static Dataflow Machines

- Use locks: compound BRANCH & MERGE nodes
- [Figure: loop graph with an x < 10 test and loop/exit branch and merge nodes routing the x and y tokens]
- The strict firing rule prevents x from going ahead until y arrives
- Downside: reduces concurrency

Static Dataflow Machines (II)

- Alternative: use the Acknowledge method: add extra ack arcs from consumer to producer nodes
  - like handshake protocols in asynchronous systems
  - guarantees that only one token is in an arc at a time
- Allows greater concurrency than the lock method, but at the cost of at least doubling the number of arcs and tokens
  - detailed analysis can help reduce the number of extra arcs needed
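A sketch of the acknowledge discipline, under the one-token-per-arc assumption above (the `AckArc` class is invented; in a real machine the ack is simply an extra arc in the graph):

```python
# One arc with an acknowledge path: the producer may only fire again
# after the consumer has acked, so a token can never be overwritten.
class AckArc:
    def __init__(self):
        self.token = None    # at most one token in flight
        self.acked = True    # producer may fire initially

    def produce(self, value):
        if not self.acked:
            raise RuntimeError("producer fired before ack: token overrun")
        self.token, self.acked = value, False

    def consume(self):
        value, self.token = self.token, None
        self.acked = True    # ack travels back to the producer
        return value

arc = AckArc()
arc.produce(1)
v = arc.consume()    # consuming sends the ack
arc.produce(2)       # legal now; without the ack this would raise
```

The cost noted on the slide shows up directly: every data arc acquires a matching ack path, roughly doubling the traffic.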

Dynamic Dataflow Machines

Allow each iteration to be executed in a separate instance of the graph. Two ways:

1. Code copying: a new instance of the subgraph is created per iteration; need a mechanism to direct tokens to the right instance
   - this helps since different nodes will handle the execution for each iteration
   - is a little like loop unrolling in that more parallelism is exposed
   - have to consider the overhead of having extra nodes
2. Tagged tokens: attach a tag to each token saying what iteration it belongs to
   - enabling rule: fire when tokens with the same tag are present at each of the inputs of the node
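The tagged-token enabling rule can be sketched as a small matching table (the `arrive` helper and the two-operand restriction are illustrative, not from the slides):

```python
# Tagged-token matching sketch: a two-input node fires only when both
# operands carrying the SAME iteration tag are present, so interleaved
# iterations cannot get jumbled.
waiting = {}   # (node, tag) -> first operand seen

def arrive(node, tag, value):
    """Return a fired (tag, left, right) triple, or None if still waiting."""
    key = (node, tag)
    if key in waiting:
        return (tag, waiting.pop(key), value)
    waiting[key] = value
    return None

# Tokens from two loop iterations arrive interleaved:
assert arrive("add", 0, 10) is None
assert arrive("add", 1, 30) is None      # different tag: no match
fired0 = arrive("add", 0, 5)             # completes iteration 0
fired1 = arrive("add", 1, 7)             # completes iteration 1
```

The `waiting` table is the seed of the hardware matching store discussed later; its size, and the bits spent on tags, are exactly the overheads the next slides worry about.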

Tag Management

- How to generate distinct tags in a distributed environment (without using too many bits)?
- Extra HW complexity, since tags have to be matched... tags also have storage overhead
- The flexibility may overexpose parallelism, possibly causing deadlocks or storage problems
- [Figure 10 (Veen): an implementation of the loop x := y := 0; while x < 10 do x := x + 1; y := y + h(x) od using tagged tokens and a new-tag area]

Handling Procedures

- Extra facility required to direct the output token of the procedure to the proper calling site; done by sending a special token containing the return node address
- [Figure 11 (Veen): interface for a procedure call. On the left, a call of procedure P whose graph is on the right. P has one parameter and one return value. The actual parameter receives a new tag and is sent to the input node of P; concurrently a token containing address A is sent to the output node. This SEND-TO-DESTINATION node transmits the other input token to the node whose address is contained in the first token. The effect is that, when the return value of the procedure becomes available, the output node sends the result to node A, which restores the tag belonging to the calling expression.]
- Summary:
  - Static: simpler, good for pipelining
  - Dynamic: more concurrency, but memory and complexity overhead

Architecture of Dataflow Machines

- Composed of several PEs that communicate with each other
- [Figure: example PE with an input (A) feeding an enabling unit, a functional unit, and a memory for tokens and nodes]
- Enabling unit: accepts tokens from (A) and stores them at the addressed node. If the node is enabled, an executable packet is sent to the FU to be processed
- Output tokens, with destination addresses, are sent back to the enabling unit
- Token values are stored alongside the nodes
  - poses problems when tagging is used

Tagged Machines

- [Figure: PE pipeline with an input (A) feeding a matching unit (memory for tokens), a fetching unit (memory for nodes), and a functional unit]
- When a token arrives in (A), check the token memory to see if all other inputs with the same tag are here. If so, send all tokens to the fetching unit, which combines them with a copy of the node description into an executable packet to be passed on to the functional unit

Machine Architecture

- One-level: deliver tokens produced by a functional unit to the enabling unit of the correct PE
  - destination address allocation policy is tricky
  - just like message passing, except with dataflow processors
  - e.g. Arvind
- Two-level: each FU consists of several functional elements which concurrently process executable packets
  - e.g. Gurd (Manchester)
- Two-stage: each enabling unit can send executable packets to each FU (there are more functional elements available)
  - heterogeneous PEs
  - scheduling flexibility
  - more cost

The Manchester Dataflow Machine

- Gurd and Watson; University of Manchester, UK. Started 1976; working 1981
- Two-level machine, although only 1 macroprocessor implemented
- Meant to be 1 PE in a multiprocessor
- [Figure: pipelined ring of I/O switch, token queue, matching unit (with overflow unit), instruction store, and processing unit]
- Tokens carried in packets around the pipelined ring
- Matching unit: pairs together tokens destined for the same instruction
  - programs with a large data set overflow into the overflow unit

The Manchester Dataflow Machine

- Paired tokens fetch the appropriate instruction from the instruction store
  - contains the machine code for the dataflow program
  - instruction and inputs forwarded to the PU
- Token queue: buffer to smooth traffic
- I/O switch: connection to host

Matching Unit

- Each token tag indicates whether the destination is expecting one or two tokens
- If only zero or one token is needed, then the token is immediately sent to the fetching unit (instruction store) and linked with the node
- If the destination node is expecting two tokens, the matching unit is searched for the other port
  - the matching unit is very similar to a cache: each row has 16 ways
  - if a companion token is not found, then the token is stored in the matching unit

Performance

- Poor: 5-10 times slower than a VAX-11/780 on RSIM (switch-level simulation)
  - with better technology: claim 5x better
- Average parallelism AP = S1 / S∞, where S1 = time steps with 1 FU and S∞ = time steps with unlimited FUs

  Program    AP
  Laplace    134-210
  Matmult    236
  Rsim       15-20
  Simple     63

- Rsim and Simple are SISAL programs
- Found that one needs AP > 40 to get good speedup
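The AP metric above is a simple ratio; here is a worked example with invented step counts (944 total firings over a 4-step critical path, chosen so the result lands at a Matmult-like value):

```python
# Average parallelism as defined on the slide: AP = S1 / S_inf, where
# S1 is the step count with one FU and S_inf with unlimited FUs.
def average_parallelism(s1, s_inf):
    return s1 / s_inf

# Hypothetical program: 944 instruction firings, critical path of 4 steps.
ap = average_parallelism(s1=944, s_inf=4)

# A program like Rsim (AP of 15-20) falls below the ~40 threshold the
# slide gives for good speedup on this machine.
needs_speedup = average_parallelism(s1=80, s_inf=4) > 40
```

With one FU every firing takes a step (S1 counts total work), while with unlimited FUs only the dependence chain matters (S∞ is the critical path), so AP measures how much concurrency the program exposes on average.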

Possible Drawbacks

- Hard to properly balance the system (#FUs, memory sizes, etc.)
- There is a lot of overhead with using branch/merge nodes, resulting in low instruction efficiency
- Too much parallelism can negatively affect the system
  - the matching store unit has to be very large
  - a lot of bits have to be used for a tag
