The Future Of AMD

Transcription

The Future of AMDProcessors: ZenThomas Guerin & Jason Skrien

Agenda AMD’s Market Position Bulldozer ArchitectureBenefitsFlaws Zen ArchitectureEnhancement GoalsNew Memory HierarchySchedulingImproved Branch PredictionMicro-op CacheSimultaneous Multi-threading (SMT)Power SavingsNew Instructions Release TimelineConclusions & The Future of AMD

AMD’s Market Position AMD’s market share has declined from nearly 50% in Q12006 to a mere 20% in Q1 2016In May 2015, Kerrisdale Capital Investment claimed thatAMD would be bankrupt by the year 2020

AMD Bulldozer: The Good 40-entry scheduler gives good instructionlevel parallelism Inexpensive relative to current Intelproducts

AMD Bulldozer: The Bad Designed to increase frequencies whilemaintaining IPC of previous architecture Long pipeline resulted in high latencyBranch Prediction 5% miss rate (13% on 7zip Benchmark) 20 cycle penalty 2-way cache associativity is low Power inefficient

Bulldozer Architecture AMD implemented a 2 core per module scheme Two integer schedulersOne floating point schedulerL1 cache shared across “cores”L2 and L3 cache shared across modules

Bulldozer Flaws Software scheduling disaster Windows sees the load of eight modules rather than thefour cores Tasks may be scheduled to run on free cores of busymodules - not ideal CPU clocked higher in 4M/4C often faster than normalclock in 8C/4M

Introducing: Zen Extracting instruction-level parallelismPromises of 40% IPC improvement (28.6%reduction in CPI) Improved Instruction Level ParallelismLower powerNew ISA ExtensionScales from low-power notebooks toservers PCI Express 3.0: Rumored 36 lanes Up to 24 Lanes on Intel Kaby LakeUp to 38 lanes of 2.0 on Bulldozer

General Architecture Cores grouped per CPU Complex (CCX) 4 cores per CCXPrivate L1/L2 cachesShared L3 cache

New Memory Hierarchy L1: (private) 64k 4-way I-Cache32k 8-way D-CacheSwitched to write-backL2: 512k 8-way (private)L3: (shared) 8MB per CCX16-wayHolds blocks evicted from L1 & L2No redundant data from L2High associativity: reduce conflictsClaims of 5x L3 bandwidth

Scheduling Dual schedulers - one for int,one for floating pointInteger Rename Space 168registers6x14 scheduling queuesIncreased size of schedulerregister files 160 floating point entries192 integer entries

Improved Branch Prediction Two branches per Branch Target BufferNeural network machine learning methodologyStrided Sampling Hashed Perceptron Predictor In existing microprocessors Oracle SPARC T4 AMD Bobcat APUs S a msung Exynos 8890 (Galaxy S7)Keeps a history of instructionsSamples bits within that historyTraining finds correlations between history sample and outcome

Hashed Perceptron Branch Predictor1.2.3.Hash segments of branch history intomultiple tablesApply a threshold to the sum of theweights selected by the hash functions. Athreshold is applied to predict the branch.Update weights of neural network using“perceptron learning”

Hashed Perceptron MethodsNaïve Sampling:Strided Sampling:Geometric Sampling:

Micro-op Cache x86 is CISC not RISC Complex variable-length instructions Instructions decoded into micro-ops Cache offloads the complex decode hardware Decreases power consumption

Simultaneous Multi-threading (SMT) Intel Hyper-Threading is an example Two threads per physical core Keeps the FUs busy, increases IPC If one thread is blocked, the other can continue touse the core

Power Savings “Aggressive” clock gating Disables unused portions of the circuitry Stop components from switching statesChanging from 28 nm planar to 14 nm FinFET Improved transconductanceReduces power consumptionReduces required die size

New InstructionsADX(Add extended) Add two unsigned multiprecision integers plus carryRDSEEDComplements RDRAND instruction, generates seed for PRNGSMAPSupervisor Mode Access ProtectionSHA1SHA1 encodingSHA256SHA256 encodingCLFLUSHOPTCache Line FlushXSAVECSave CompactXSAVESSave SupervisorXRSTORSSave RestoreCLZEROClear line of cachePTE CoalescingMerge 4K page tables into 32K page tables

Zen Release Timeline “Summit Ridge” expected Late 2016/Early2017Low-end 8 cores/16 threads 200 - 300MSRPHIgh-end 8 cores/16 threads 500

Conclusions & The Future of AMD Hopefully Zen is the release AMD needs to pull itback from the edge of bankruptcy #MakeAMDGreatAgain AMD processors in the latest generation of g a meconsole promise to provide revenue in themeantime

References noloies/zen-cput ismt erarcy-revealedg c ch ft tp://www.pcgamesn.com/amd/amd-zen-release-date-spe chs-prices-rumourst asc ahl-gpu-has-17-billion-transistorst tp://www.extremetech.com/computing/100583-analyzin ghbulldozers-scaling-single-thread-performancet er mhath-delving-even-deeper/ nologym Yt tp://www.anandtech.com/show/10391/amd-briefly-show sh-off-zen-summit-ridge-silicont tp://www.theregister.co.uk/2016/08/22/samsung m1 co rhe/t tp://www.anandtech.com/show/10585/unpacking -amds- wellt tp://www.guru3d.com/news-story/amd-zen-8-core-summ iht-ridge-to-launch-january-17th-2017.htmlt tp://www.pcgamesn.com/amd/amd -zen-release-date-spe chs-prices-rumours

Zen Architecture Enhancement Goals New Memory Hierarchy Scheduling Improved Branch Prediction Micro-op Cache . 2-way cache associativity is low Power inefficient . Bulldoze