Memory Architectures - Virginia Tech

Transcription

Memory Architectures
Week 2, Lecture 1
Copyright 2009 by W. Feng. Based on material from Matthew Sottile.

Directory-Based Coherence

- Idea
  – Maintain pointers instead of simple states with each cache block.
- Ingredients
  – Data owners ("home" nodes) keeping track of nodes with cached copies ("sharers").
  – Coherence decisions implemented via distributed linked-list update algorithms.
  – Accomplished via a doubly linked list formed between caches, with sharers linking to other sharers, and the home node acting as the list head.
- Issues?

Directory-Based Coherence

- Read Miss
  – Transaction from requestor to home node.
  – Home node replies with data or head of sharer list.
  – Head pointer is changed to requestor.
  – Requestor sends transaction to first sharer to insert itself into the list.
- Write Miss
  – Requestor obtains head information from home.
  – Requestor is inserted as new head.
  – If already on the list, it is deleted and re-inserted as head.
  – List is traversed and shared copies are invalidated.
  (A code sketch of these list operations follows.)
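To make the list manipulation concrete, here is a minimal C sketch of such a sharer list (the names and structure are hypothetical; real directory protocols of this style, e.g., SCI, implement this in hardware with many more corner cases):

    #include <stddef.h>

    /* Hypothetical sketch: each cache keeps a doubly linked node per
     * block; the home directory entry points at the head of the list. */
    struct sharer {
        int            node_id;   /* which cache holds the copy */
        struct sharer *next;      /* toward the tail of the sharer list */
        struct sharer *prev;      /* toward the head (home side) */
    };

    struct directory_entry {
        struct sharer *head;      /* home node's pointer to first sharer */
    };

    /* Read miss: requestor becomes the new head and links to the old head. */
    static void read_miss(struct directory_entry *dir, struct sharer *req) {
        req->prev = NULL;
        req->next = dir->head;        /* home replies with old head */
        if (dir->head)
            dir->head->prev = req;    /* transaction to the first sharer */
        dir->head = req;              /* head pointer changed to requestor */
    }

    /* Write miss: requestor becomes head, then walks the list invalidating
     * every other copy so it holds the block exclusively. (Sketch assumes
     * the requestor is not already linked; if it were, it would first be
     * deleted and then re-inserted as head, as the slide describes.) */
    static void write_miss(struct directory_entry *dir, struct sharer *req) {
        read_miss(dir, req);          /* insert as new head */
        struct sharer *s = req->next;
        while (s) {                   /* traverse and invalidate sharers */
            struct sharer *next = s->next;
            s->next = s->prev = NULL; /* copy invalidated, node unlinked */
            s = next;
        }
        req->next = NULL;             /* requestor is now the only copy */
    }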

Recall: Cache Coherence & Performance

- Unlike details of pipelining that only concern compiler writers, you the programmer really need to acknowledge that this is going on under the covers.
- The coherence protocol can impact your performance.

Cache Coherence: Performance Demo

"Abusing" Cache Coherence

- Why the difference in performance?
- For the same overall computation, change the interleaving of memory accesses between the processing cores.

Traversal Patterns

    /* data[], N, and P are globals shared by all threads in the demo */
    void worker_unfriendly(void *arg) {
      int rank = (int)(size_t)arg;   /* thread rank: 0 .. P-1 */
      int start = rank;
      int end = N;
      int stride = P;                /* P threads interleave element-by-element */
      int i, iter;
      int numiters = 256;
      for (iter = 0; iter < numiters; iter++) {
        for (i = start; i < end; i += stride) {
          if (i > 0 && i < N-1)
            data[i] = (data[i-1] + data[i] + data[i+1]) / 3.0;
        }
      }
    }
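For contrast, here is a hypothetical cache-friendly counterpart (not from the lecture): each thread works on a contiguous block, so different cores rarely write to the same cache line, except at block boundaries.

    extern double data[];   /* same shared array as the demo */
    extern int N, P;        /* problem size and thread count */

    void worker_friendly(void *arg) {
      int rank = (int)(size_t)arg;
      int chunk = N / P;
      int start = rank * chunk;
      int end = (rank == P - 1) ? N : start + chunk;
      int i, iter;
      int numiters = 256;
      for (iter = 0; iter < numiters; iter++) {
        /* contiguous range: only the lines at block edges are shared */
        for (i = start; i < end; i++) {
          if (i > 0 && i < N - 1)
            data[i] = (data[i-1] + data[i] + data[i+1]) / 3.0;
        }
      }
    }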

Explanation of "Abusing" Cache Coherence

- Each block is shared across multiple caches.
- Each iteration, all threads read from the same block (→ Shared) and try to write to this shared block (→ Modified).
- One thread "wins" and forces all the other processors to invalidate their block (→ Invalid).
- When one of the others executes a read, the writer is forced into the Shared state and must write back (→ Shared).
- In the unfriendly case, many transitions to the Invalid state mean blocks need to be re-read over and over by some cores.
- Result: serialized code. So, we observe significantly worse than sequential performance.
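The transitions above can be summarized as a tiny MSI-style state machine. The following C sketch is illustrative only, not the exact protocol of any particular processor:

    /* Illustrative MSI-style states for one cache's copy of one block. */
    typedef enum { MODIFIED, SHARED, INVALID } msi_t;

    /* Local write: this cache "wins" ownership; the bus/directory forces
     * every other cache's copy to INVALID. */
    static msi_t on_local_write(msi_t s)  { (void)s; return MODIFIED; }

    /* Remote write observed: our copy is invalidated. */
    static msi_t on_remote_write(msi_t s) { (void)s; return INVALID; }

    /* Remote read observed: a MODIFIED copy is written back to memory
     * and demoted to SHARED so the reader gets fresh data. */
    static msi_t on_remote_read(msi_t s)  { return s == MODIFIED ? SHARED : s; }

    /* Local read: an INVALID copy must be re-read from memory -> SHARED.
     * The "unfriendly" traversal pays this cost over and over. */
    static msi_t on_local_read(msi_t s)   { return s == INVALID ? SHARED : s; }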

Philosophical Perspective

- Cache Coherence Protocols
  – Rarely addressed in the "popular" literature on multicore.
  – What is the focus? Libraries and languages.
  – Dangerous? Why or why not?
- Confucius say: "Understand the hardware consequences of code structure → achieve good performance." :-)
  – Just like you need to understand caches to write good sequential code.

Software-Based Cache Coherency (CC)?

- Question
  – Can one build a shared-memory software layer above a distributed-memory system (such as a cluster) that takes concepts from cache coherence protocols to implement a large-scale shared-memory view of a big cluster?
- Answer: Possible but slow → Distributed Shared Memory (DSM)
  – Distributed at the hardware level; shared at the programming abstraction.
  – Why slow? Bus transactions are needed to make the CC protocol work.
    - Hardware: bit level; low bandwidth; time scales of a single clock cycle.
    - Software: bus level; high bandwidth required.

To Date: Architectural Enhancements

- Coarse-Grained Parallelism: Core Replication
- Instruction-Level Parallelism (ILP)
  – Deep pipelines, superscalar, and VLIW
- Sophisticated Memory Hierarchy
- Lesson? Do sophisticated activities in hardware such that the abstraction layer seen by programmers is as similar as possible to writing plain single-threaded code.
  – Assuming that enough parallelism exists, the key is memory performance.

Key: Memory Performance

- Hide it or control it.
  – ILP and Caches: hide it.
- Coherence
  – Control memory consistency with respect to multiple viewpoints.
  – Hide how this is achieved in hardware.

Memory Access Timing

- Increase the number of processors:
  – Scaling of a bus interconnect becomes an issue.
  – More complexity is needed to avoid excessive bus transfers.
    - Example: avoid flushing or broadcasting to caches that are not affected by a cache coherence conflict.
  – Longer-term solution?

Memory Access Timing

- Sequential Computer
  – Every location in memory takes the same amount of time to access.
- Small Shared-Memory Systems
  – Ditto. Also known as Uniform Memory Access (UMA).
- Larger Shared-Memory Systems
  – A more complex memory subsystem makes constant-time access prohibitively difficult.
  – Memory access takes different amounts of time based on where the physical memory exists in the machine relative to the accessor.
  – Known as Non-Uniform Memory Access (NUMA).
    - Cache-coherent NUMA → ccNUMA.

ccNUMA Diagram

NUMA

- A model often found on parallel systems a decade ago.
  – Example: SGI Origin 2000, composed of 128 CPUs with a hierarchy of "bricks" from which the entire machine is built.
- Multicore
  – UMA: Intel. NUMA: AMD.
  – The world is moving to NUMA. Why?

UMA vs. NUMA

- UMA
  – Do not care where data actually lies relative to the computations that work on it.
- NUMA
  – Do care where data is.
  – Better performance if data is near the computation (see the sketch below).
    - Problem: hard to enforce without OS and run-time assistance. Why? A combination of resource scheduling, load balancing, and coupling memory management with process management.
- (Outside the scope of this course; just wanted you to be aware of this, as more and more, "not all memory is treated equally from a timing perspective.")
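As an illustration of placing data near the computation, here is a minimal sketch using Linux's libnuma (this assumes a Linux system with libnuma installed; link with -lnuma):

    /* Minimal sketch of NUMA-aware placement with Linux's libnuma. */
    #include <numa.h>
    #include <stdio.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA not available on this system\n");
            return 1;
        }
        int node = 0;                 /* place both thread and data on node 0 */
        size_t n = 1 << 20;

        numa_run_on_node(node);       /* pin this thread to the node */
        double *data = numa_alloc_onnode(n * sizeof(double), node);
        if (!data) return 1;

        for (size_t i = 0; i < n; i++)   /* touches node-local memory only */
            data[i] = (double)i;

        numa_free(data, n * sizeof(double));
        return 0;
    }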

NUMA: Another Perspective

- Think of NUMA as simply adding more layers to the cache hierarchy.
  – Memory near a computation is faster, much like a local cache.
  – Memory far away from a computation is slower, much like main memory (or a cache deeper in the hierarchy).
- But why continue to add all these layers?
  – Remember: memory and bus logic consume an increasingly large amount of real estate on a die; e.g., see the Intel Nehalem die (previous lecture).
  – Memory performance becomes critical to code performance.

Time for a New Approach? Part I

- Let's start over. "KISS: Keep It Simple, Stupid."
  – Simplify the hardware.
    - Get rid of speculative fetching logic.
    - Get rid of cache coherence hardware ... get rid of all cache logic ...
  – Hmm ... let's get rid of caches altogether (at least for most processor cores).
  – Create a processor that is fast in terms of functional execution, with no fancy memory hierarchy.
  – Use spare logic real estate to add high-performance functional units and a reasonable, but small, amount of RAM-like memory right next to the processor.
    - Much smaller than a cache, but much larger than a register set, and addressable in a RAM fashion.

Welcome ... The Cell

Initial Reactions

- Blasphemy!
  – The cache and memory subsystem help me write code productively. Why does the Cell want to take this away?
  – How can I survive without a memory hierarchy that just "works"?
    - I want to focus on the functional logic of my program, not memory traversal.
  – Hmm ... maybe not ...

Re-Think Programming?

- Current View
  – Create sequences of instructions, with memory as a secondary concern.
- Challenging Conventional Wisdom
  – How does my world as a programmer change if data (memory) movement is viewed on an equal footing with computation?
  – Can I shift my way of thinking to exploit this new memory model and exceptional computational capability?
    - Answer: Probably, but with effort (which runs counter to code productivity, maybe).

How We Currently Program

- A perspective from the designer(s) of the Cell.
- Plumbing project at home.
  – If we repaired plumbing the way we programmed, we'd ... and so on.


How We Currently Program

- Example from a Code Perspective

    x = a(1)
    y = b(2)
    z = c(14)
    d(1) = x + y + z

- A good compiler optimizes the above away.
  – But it's not difficult to build algorithms that are complex enough that the compiler cannot optimize it all away, thus resulting in going back and forth to main memory over and over.
- Possible causes?

How We Currently Program

- Wasteful and inefficient? (See previous slides.)
  – We ask for memory when we need it, and it happens magically, but with potential performance impact.
- Current Solution
  – Performance tuning, e.g., fetch multiple items to the CPU at once by packing into cache-friendly blocks.
  – Not always feasible.
    - Mesh-based data structures in finite element method (FEM) codes, where one struggles with "structs of arrays" versus "arrays of structs" (see the sketch below).
    - Arbitrary data layouts and structures. "Plumbing" example.
- New Solution
  – Instead of a back-and-forth, on-demand memory model, why not plan ahead and hand the system a shopping list?
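As a concrete illustration of that layout struggle, here is a small hypothetical sketch of the two layouts (field names invented for the example):

    /* Illustrative "array of structs" vs. "struct of arrays" layouts. */
    #define N 100000

    /* Array of structs: one node's fields are adjacent, so a loop that
     * touches only x drags y and z through the cache too. */
    struct node_aos { double x, y, z; };
    struct node_aos mesh_aos[N];

    /* Struct of arrays: each field is contiguous, so a loop over x
     * uses every byte of every cache line it fetches. */
    struct mesh_soa { double x[N], y[N], z[N]; };
    struct mesh_soa mesh;

    double sum_x_aos(void) {
        double s = 0.0;
        for (int i = 0; i < N; i++) s += mesh_aos[i].x;  /* strided accesses */
        return s;
    }

    double sum_x_soa(void) {
        double s = 0.0;
        for (int i = 0; i < N; i++) s += mesh.x[i];      /* unit stride */
        return s;
    }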

"Shopping List" Model

- More efficient performance. (Obvious.)
  – But it requires programmers to plan ahead.
- Oh, the pain of planning ahead ...
  – The "free ride" of cache coherency in this model does not exist.
  – Programmers must translate algorithms previously developed assuming lots of hardware on your CPU to support an on-demand memory model. (Remember the pictures of dies?)
- Not so bad if the algorithm already "computes" the memory traversal pattern, e.g., the butterfly pattern in the Fast Fourier Transform. (A generic sketch of the gather idea follows.)
- (Are we sacrificing programmer productivity for performance?)
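A minimal sketch of the "shopping list" idea in plain C (all names hypothetical): compute the list of items a phase will need, gather them into a contiguous buffer up front, and then compute out of that buffer instead of fetching on demand. On the Cell, the gather step would become a DMA into an SPE's local store.

    #include <stddef.h>

    /* Fetch everything on the list into one contiguous buffer. */
    void gather(double *dst, const double *src,
                const size_t *list, size_t n) {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[list[i]];
    }

    /* Compute out of the dense buffer: predictable, unit-stride traversal. */
    double compute_phase(const double *buf, size_t n) {
        double acc = 0.0;
        for (size_t i = 0; i < n; i++)
            acc += buf[i] * buf[i];
        return acc;
    }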


"Shopping List" Model

- Plumbing project at home.
  – Using the "shopping list" model ...


Support for the "Shopping List" Model

- Sophisticated memory access hardware and a high-bandwidth, low-latency internal bus on the chip.
- Difficulty in programming to this model?
  – Not due to a shortcoming of the chip, but due to a lack of programming tools to support it.
- Historical perspective
  – Back in 2007, one programmed the Cell via assembly.
  – IBM released an SDK with libraries to "help out."
  – Now compiler vendors are helping too: Gedae, RapidMind, and so on.
- Here to stay?!

The Cell

- A collaboration between Sony, IBM, and Toshiba.
- Formal name: Cell Broadband Engine Architecture.

The Cell

- PPE: dual-threaded (SMT) PowerPC processing element.
- SPE: single-threaded, high-performance "synergistic processing element."
- EIB: circular data bus connecting the PPE and SPEs (and I/O and memory).

PPE: PowerPC Processing Element

- Multi-Purpose
  – General-purpose CPU.
  – Assists the SPEs by performing computations that help keep the SPEs running with minimal idle time.
- Anatomy
  – Registers
    - 64-bit general-purpose and floating-point registers.
    - 128-bit registers for AltiVec SIMD operations.
  – "Caches"
    - L1: 32-KB instruction, 32-KB data.
    - L2: 512 KB.

SPE: Synergistic Processing Element

- A high-performance specialized processor element that can be replicated within the Cell architecture.
- Structure
  – A high-performance execution core called the Synergistic Processing Unit (SPU), which interacts with the outside world via the Memory Flow Controller (MFC).
    - MFC: MMU, DMA, and bus interface.
  – SPE resources:
    - 128-bit registers.
    - 256-KB local store: SRAM that is visible from the PPE. (NOT a cache!)
      – Instruction and data memory (32-bit addressing).
  (A sketch of feeding the local store via the MFC follows.)
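As a hedged illustration of the "shopping list" in practice, here is a minimal SPE-side double-buffering sketch using the Cell SDK's MFC intrinsics (assuming spu_mfcio.h from the IBM SDK; CHUNK, NCHUNKS, in_ea, and process are placeholders for a real program's parameters):

    /* Sketch of SPE double buffering with the Cell SDK's MFC intrinsics. */
    #include <spu_mfcio.h>

    #define CHUNK   4096              /* bytes per DMA transfer */
    #define NCHUNKS 16

    static volatile char buf[2][CHUNK] __attribute__((aligned(128)));

    void process(char *data, int n);  /* the actual computation (elided) */

    void stream_in(unsigned long long in_ea) {
        int cur = 0;
        /* Put the first item on the "shopping list". */
        mfc_get(buf[cur], in_ea, CHUNK, cur, 0, 0);

        for (int i = 0; i < NCHUNKS; i++) {
            int nxt = cur ^ 1;
            if (i + 1 < NCHUNKS)      /* request the next chunk early */
                mfc_get(buf[nxt],
                        in_ea + (unsigned long long)(i + 1) * CHUNK,
                        CHUNK, nxt, 0, 0);

            mfc_write_tag_mask(1 << cur);  /* wait only on current tag */
            mfc_read_tag_status_all();     /* block until that DMA completes */

            process((char *)buf[cur], CHUNK); /* compute while next chunk loads */
            cur = nxt;
        }
    }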

EIB: Element Interconnect Bus

- A bi-directional circular bus.
  – Four 16-byte channels, two in each direction.
  – 12 devices on the bus: 8 SPEs, 1 PPE, 1 interface to system memory, and 2 I/O units.
    - Each device has a single 16-B read port and a single 16-B write port onto the bus.
  – Maximum (realizable) EIB bandwidth: 200 GB/s. Ideal EIB bandwidth: 300 GB/s.

Time for an Old "New" Approach? Part II

- The Cell was not the first architecture to re-think how memory is managed.
- Tera (now Cray) created the MTA (MultiThreaded Architecture) in the 1990s.
  – Throw out the concept of a cache.
  – Basic idea: exploit massive levels of threading to hide the high latency to main memory. (A back-of-the-envelope version of the arithmetic follows.)
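As a rough, illustrative calculation (the numbers here are assumptions, not from the lecture): if a memory reference costs $t_{\text{mem}}$ cycles and each thread issues $t_{\text{work}}$ cycles of useful work before stalling on memory, the processor stays busy once roughly

    \[
      N_{\text{threads}} \;\gtrsim\; \frac{t_{\text{mem}}}{t_{\text{work}}},
      \qquad \text{e.g., } \frac{100~\text{cycles}}{1~\text{cycle}} = 100~\text{threads per processor}
    \]

are resident. With only six threads, as in the example below, round-robin issue can hide at most a six-cycle latency; the Tera MTA kept on the order of a hundred hardware thread streams per processor for exactly this reason.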

MTA: Example with 6 Threads

[Animated diagram spanning several build slides; images not transcribed.]


MTA Overview

- A great deal of processor real estate to support the context for many threads.
  – That is, remove the cache(s) and hide latency by servicing many threads. Dedicate hardware, otherwise used for caching, to servicing threads.
  – Remember this when we reach GPGPUs.
- Good performance requires a high degree of threading, such that memory latency can be hidden by servicing many threads.
  – ("Plumbing" project only had three threads.)
  – Disclaimer: in reality, the MTA is capable of more than just a round-robin schedule. Just wanted to get the basic idea across.

Time for an Old "New" Approach? Part III

- "Combining Ideas" from Part I and Part II
- "Plug-In" Architecture: GPGPU
  – Explicit data movement between the CPU and the GPGPU card out on PCIe (a la FPGA acceleration).
    - The "shopping list" model of the Cell, outside the chip (see the sketch below).
  – Massive parallelism via threading.
    - The memory model of the MTA, within the chip.
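To illustrate the explicit movement across PCIe, here is a minimal sketch against the CUDA runtime API (this assumes an NVIDIA GPU and the CUDA toolkit; the kernel itself is elided):

    /* Minimal sketch of explicit host<->device data movement with the
     * CUDA runtime API. The computation (a kernel launch) is elided. */
    #include <cuda_runtime.h>
    #include <stdlib.h>

    int main(void) {
        size_t n = 1 << 20;
        size_t bytes = n * sizeof(float);

        float *h_data = (float *)malloc(bytes);   /* host copy */
        float *d_data = NULL;                     /* device copy */
        for (size_t i = 0; i < n; i++) h_data[i] = 1.0f;

        cudaMalloc((void **)&d_data, bytes);

        /* The "shopping list": ship the whole working set across PCIe
         * before computing, rather than faulting it over on demand. */
        cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

        /* ... kernel launch would go here ... */

        cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

        cudaFree(d_data);
        free(h_data);
        return 0;
    }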

Welcome ... The GPGPU

Next Time? Interconnection Networks

- Shallow coverage of interconnection networks.
- Important to parallel computing.
  – Fallen out of fashion over the past decade or so.
- The return of its importance:
  – Tilera 64-core, Intel 80-core, and so on.
- Thursday should be the last "architecturally-driven" lecture.
