Multi-core Architectures - University Of California, San Diego

Transcription

Multi-core Architectures
Rakesh Kumar
rakumar@cs.ucsd.edu

Progress of processor ha
[Figure: log-scale plot (1.00 to 1000.00) by year, 1985 to 2005; legend: Sparc, Mips, HP PA, Power PC, AMD]

Price being paid
[Figure: Watts/Spec (log scale, 0.01 to 1) versus Spec2000 (log scale, 1 to 10000); legend: Intel, Alpha, Sparc, Mips, HP PA, Power PC, AMD]

Lessons learned
- Marginal utility of transistors is decreasing
  - If n is the number of transistors:
    - Power and area are O(n)
    - Performance is O(sqrt(n))
    - We are on the wrong side of the square law
- Increasingly difficult to squeeze out performance
  - Not enough exploitable ILP in programs
  - Easy ILP already extracted
- More transistors are available than we know how to make use of when applied to a single processor
- Clearly, we have a problem!
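The square law above can be made concrete with a little arithmetic. The sketch below takes the slide's asymptotic claims literally (performance grows as sqrt(n), power and area grow as n) and ignores constant factors; the function names are illustrative only.

```python
import math


def relative_performance(transistor_ratio: float) -> float:
    """Performance relative to a baseline core, assuming performance ~ sqrt(n)."""
    return math.sqrt(transistor_ratio)


def performance_per_cost(transistor_ratio: float) -> float:
    """Performance per unit of power or area, assuming both costs grow as n."""
    return relative_performance(transistor_ratio) / transistor_ratio


# Each row: a single core built from `ratio` times the baseline transistor count.
for ratio in (1, 2, 4, 16):
    print(
        f"{ratio:>2}x transistors -> "
        f"{relative_performance(ratio):.2f}x performance, "
        f"{performance_per_cost(ratio):.2f}x performance per watt or per mm^2"
    )
```

Under these assumptions, quadrupling the transistor budget of one core only doubles its performance, so performance per watt and per unit area halve.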

One way of handling a problem is ...
instead of confronting the problem, try skipping to a simpler one
- Change the focus from single-thread performance to throughput
- Don't have increasingly complex uniprocessors
- Have multiple simple processors on the same die instead [Olukotun et al, ASPLOS96]
- Each on-chip processor (called a core) can execute a program

We can now jump to the right side of the square law
- If n is the number of transistors on a die:
  - Area: O(n)
  - Performance: O(n^(1-x))
  - Roughly O(sqrt(n))
- More aggregate performance (throughput) can be had from a large number of small cores than from a small number of large cores
  - At the expense of single-thread performance
- For example:
  - In terms of area: 1 EV6 ≈ 5 EV5 cores
  - In terms of throughput: 1 EV6 ≈ 2.0-2.2 EV5 cores
  - So 5 EV5 cores ≈ 2 EV6 cores
- Performance doubled just by having multiple cores!
- This is the main motivation for multi-core architectures
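The EV6/EV5 example can be checked with a short calculation. This sketch assumes the slide's figures read as follows: one EV6 occupies the area of about 5 EV5 cores and delivers the throughput of 2.0 to 2.2 EV5 cores. It also assumes there are enough independent threads to keep every EV5 busy. The names are illustrative.

```python
# Assumed reading of the slide's figures (the relation symbols are lost in the transcript).
AREA_EV6_IN_EV5 = 5.0                 # one EV6 takes the area of ~5 EV5 cores
THROUGHPUT_EV6_IN_EV5 = (2.0, 2.2)    # one EV6 delivers the throughput of 2.0-2.2 EV5 cores


def ev5_array_speedup(throughput_ev6_in_ev5: float,
                      area_ev6_in_ev5: float = AREA_EV6_IN_EV5) -> float:
    """Throughput of the EV5 cores that fit in one EV6's area, measured in EV6 units."""
    return area_ev6_in_ev5 / throughput_ev6_in_ev5


for ratio in THROUGHPUT_EV6_IN_EV5:
    print(
        f"If 1 EV6 = {ratio} EV5 in throughput, "
        f"5 EV5 cores = {ev5_array_speedup(ratio):.2f} EV6 in the same area"
    )
```

Both ends of the range land a little above 2, which matches the claim that throughput roughly doubles in the same area.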

Multi-core Architecture: Definition
A multi-core architecture (or a chip multiprocessor) is a general-purpose processor that consists of multiple cores on the same die and can execute programs simultaneously.

Multi-core architecture: Advantages
- (Relatively) high performance/watt
- (Relatively) high performance/area
- Simpler core
  - Possibility of lower cycle time, better optimisation, etc.
  - Ease of design, verification, etc.

So, the next question to ask obviously is ...
How should one design a multi-core architecture?
This is the question I address in my thesis research

A Naive Methodology for Multi-core Design

Goals of my thesis research
- Demonstrate that the prior methodology is highly inefficient in terms of area and power
- Demonstrate the need to do holistic design of multi-core architectures
  - Subsystem design should be aware of the multi-core architecture it is going to be a part of
- Propose and evaluate novel and efficient multi-core architecture design methodologies that follow a holistic approach

Assumptions inherent to the naïve approach
- All cores have to be the same
- Each core is distinct
- Core/memory and interconnect can be designed in isolation
I will talk about the first assumption today

Before scrutinizing the “identical cores” assumption, let’s consider the characteristics of typical workloads

There is enormous diversity among applications

Implication of diversity on multi-core design
- If all cores are to be identical, the design can't address diverse workload demands
  - E.g. one needs to decide beforehand whether the core targets gcc or mcf
  - Either way, one application loses
  - Underutilization or low performance

An example multi-core architecture

[Figure: An example multi-core architecture]

Processors and Program diversity
- Some applications will run much faster on an EV6 than on an EV5
- Others will take little advantage of the larger processor and run at the same speed on either
- With a homogeneous architecture,
  - you either have the former running very slowly on small processors,
  - or the latter unnecessarily wasting the capabilities of the large processor.

[Figure: An alternate multi-core architecture]

Single-ISA Heterogeneous Multi-core Architectures
- Have multiple heterogeneous cores on the same die
- Each core type represents a different point in the power-performance space
  - i.e. while one core type might be small, low-performance and low-power, another might be big, high-performance and high-power
- Each core is capable of executing the same ISA
  - Unlike SoCs/embedded heterogeneous multi-core architectures
- Such an architecture will be highly efficient on workloads with diverse applications

Another Performance Advantage: Adjusts to varying TLP

[Figure: Another Performance Advantage: Adjusts to varying TLP]

Comparing Single-ISA Heterogeneous Architectures against Conventional CMPs
[Figure: weighted speedup (0 to 8) versus number of threads (1 to 20); series: 4EV6]

[Figure: Comparing Single-ISA Heterogeneous Architectures against Conventional CMPs; weighted speedup (0 to 8) versus number of threads (1 to 20); series: 4EV6, 20EV5]
A choice has to be made between throughput and single-thread (ST) performance

[Figure: same axes; series: 4EV6, 3EV6 & 5EV5 (static best), 20EV5]
Best of both worlds!
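The trade-off between the three configurations compared above (4EV6, 20EV5, and 3EV6 & 5EV5) can be illustrated with a deliberately crude toy model. It is not the simulation behind the plotted results. It assumes every thread is identical, that an EV5 runs a thread at 1/2.1 the speed of an EV6 (the midpoint of the 2.0-2.2 throughput ratio quoted earlier), that threads are placed on the fastest free cores first, and that threads beyond the core count add nothing (no multithreading).

```python
# Toy model only: identical threads, one thread per core, fastest cores filled first.
EV5_RELATIVE_SPEED = 1 / 2.1  # assumption: midpoint of the 2.0-2.2 ratio quoted in the talk

# (number of EV6 cores, number of EV5 cores); all three use about the same area
# if one EV6 is taken to occupy the area of five EV5 cores.
CONFIGS = {
    "4EV6": (4, 0),
    "20EV5": (0, 20),
    "3EV6 & 5EV5": (3, 5),
}


def throughput(n_threads: int, n_ev6: int, n_ev5: int) -> float:
    """Aggregate throughput in EV6 units under the toy model."""
    on_ev6 = min(n_threads, n_ev6)
    on_ev5 = min(n_threads - on_ev6, n_ev5)
    return on_ev6 * 1.0 + on_ev5 * EV5_RELATIVE_SPEED


for n_threads in (1, 4, 8):
    row = ", ".join(
        f"{name}: {throughput(n_threads, *cores):.2f}"
        for name, cores in CONFIGS.items()
    )
    print(f"{n_threads} thread(s) -> {row}")
```

Even in this simplified setting the heterogeneous configuration matches 4EV6 at one thread and overtakes both homogeneous designs at eight threads, which is the qualitative point of the comparison.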

Then there is intra-program diversity as well
[Figure: x-axis: committed instructions (in millions)]

Dynamic scheduling results
[Figure: y-axis 0 to 7 versus number of threads (1 to 8); series: 4EV6, 3EV6 & 5EV5 (random), 3EV6 & 5EV5 (static best), 3EV6 & 5EV5 (bounded-global-event)]

To sum up ...
Single-ISA heterogeneous architectures are a good design point for throughput as well as performance:
- Efficient use of die area for a given thread-level parallelism
  - Provides low latency for a few applications on powerful cores
  - A large number of applications can be hosted at once on simple cores
- Efficient adaptation to application diversity
  - Enables it to approach the performance of an architecture with a large number of complex cores
- Provides higher performance in the same area than a conventional chip multiprocessor

Talk Outline
- All cores have to be the same
  - Single-ISA heterogeneous multi-core architectures
    - Performance Benefits
    - Power Benefits

Reducing power for a conventional multi-core architecture
- Done at the core level
  - Each core is optimised for power and then replicated multiple times
  - Multi-core oblivious
- Processor power reduction typically involves V/f scaling, gating, etc. for the core
- Power reduction techniques applied at the single-core level have limited effectiveness


- Have multiple heterogeneous cores on the same die
- Match each workload (or workload phase) to the core that achieves the best efficiency according to some objective function
- Power down the unused cores completely
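A minimal sketch of this matching step, assuming the objective is lowest energy per instruction subject to a bound on performance loss (a 10% bound appears later in the talk). The per-core IPC and power figures are hypothetical placeholders, not measurements from the talk, and all cores are assumed to run at the same clock frequency so that energy per instruction is proportional to power divided by IPC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreProfile:
    name: str
    ipc: float      # instructions per cycle for this phase (hypothetical)
    power_w: float  # average power for this phase in watts (hypothetical)


def pick_core(profiles, max_perf_loss=0.10):
    """Return the lowest-energy core whose IPC is within max_perf_loss of the best IPC."""
    best_ipc = max(p.ipc for p in profiles)
    eligible = [p for p in profiles if p.ipc >= (1.0 - max_perf_loss) * best_ipc]
    # At a fixed clock frequency, energy per instruction is proportional to power / IPC.
    return min(eligible, key=lambda p: p.power_w / p.ipc)


# Hypothetical phase profiles: a memory-bound phase gains little from big cores,
# while a compute-bound phase gains a lot.
MEMORY_BOUND = [
    CoreProfile("EV4", 0.30, 3.0),
    CoreProfile("EV5", 0.33, 6.0),
    CoreProfile("EV6", 0.34, 12.0),
    CoreProfile("EV8-", 0.35, 60.0),
]
COMPUTE_BOUND = [
    CoreProfile("EV4", 0.5, 4.0),
    CoreProfile("EV5", 0.9, 8.0),
    CoreProfile("EV6", 1.6, 15.0),
    CoreProfile("EV8-", 2.0, 80.0),
]

print("memory-bound phase runs on", pick_core(MEMORY_BOUND).name)
print("compute-bound phase runs on", pick_core(COMPUTE_BOUND).name)
```

With these placeholder numbers the memory-bound phase moves to a small core while the compute-bound phase stays on the largest one; the unselected cores would be powered down.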

An example Single-ISA heterogeneous multi-core architecture

Processor   Peak power (W)   Core area (mm^2)
EV4         4.97             3
EV5         9.83             5
EV6         17.80            24
EV8-        92.88            260

The processor is only marginally bigger than EV8-
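The "only marginally bigger" claim can be checked from the core areas. This sketch assumes the flattened table reads as EV4 = 3, EV5 = 5, EV6 = 24 and EV8- = 260 mm^2, that the example chip holds one core of each type, and that only core area is counted (caches and interconnect are left out).

```python
# Core areas in mm^2, as read from the table (assumed parsing of the flattened values).
CORE_AREA_MM2 = {"EV4": 3, "EV5": 5, "EV6": 24, "EV8-": 260}

# Assumption: the heterogeneous chip contains exactly one core of each type.
total_area = sum(CORE_AREA_MM2.values())
overhead = total_area / CORE_AREA_MM2["EV8-"] - 1.0

print(f"Total core area: {total_area} mm^2")
print(f"Increase over a lone EV8-: {overhead:.1%}")
```

Under this reading, the three smaller cores together add roughly an eighth to the area of the EV8- core alone.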

Choosing Dynamically the Core with Least Energy (perf. loss ...)
[Figure: x-axis: committed instructions (in millions)]

Choosing Dynamically the Core with Least Energy (perf. loss 10%)
[Summary of results]
[Table: Energy ...; rows: Maximum, Minimum, Mean]
Results “verified” by other researchers using real prototypes [Grochowski ICCD2004, Ghiasi CF2005]

Realistic ...
[Figure: normalized value (wrt EV8-), 0 to 0.8, including energy-delay; series include neighbour, global, random and a dynamic oracle]

To sum up
- A single-ISA heterogeneous multi-core architecture offers enormous potential for power savings as well
- Realistic heuristics can achieve much of the savings potential
- Beats chip-wide voltage scaling handsomely (50.6% ED2 improvement)
- Subsequent research has shown this technique to be better than dynamic V/f scaling, gating, adaptive optimizations, etc. [Grochowski et al ICCD2004]

Bottom line
- All cores do not have to be the same
- In fact, they should not be the same

Summary of talk
- The decreasing marginal utility of transistors is leading us to multi-core architectures
- Conventional multi-core architectures have identical cores
- Heterogeneous architectures lead to higher performance and lower power
