High-Performance Computing & Simulations in Quantum Many-Body Systems

Transcription

High-Performance Computing & Simulations in Quantum Many-Body Systems – PART I
Thomas Schulthess
schulthess@phys.ethz.ch

What exactly is high-performance computing?

Why even bother about computer performance?
[Figure: relative performance and computer speed vs. year, 1970–2000. Source: David P. Landau]


Application performance seems to keep up with supercomputing systems performance (!)
- 1988: Cray Y-MP, 8 processors – 1 Gigaflop/s; first sustained GFlop/s, Gordon Bell Prize 1989
- 1998: Cray T3E, 1,500 processors, ~100 kilowatts – 1.02 Teraflop/s; first sustained TFlop/s, Gordon Bell Prize 1998
- 2008: Cray XT5, 150,000 processors, ~5 megawatts – 1.35 Petaflop/s; first sustained PFlop/s, Gordon Bell Prize 2008
- 2018: ~1 Exaflop/s at 20–30 MW, with 100 million or a billion processing cores (!) – another 1,000x increase in sustained performance (?)

Plan for this lecture
- What is HPC and why worry?
- Historic background of scientific computing – how we came to where we are today
- Bottlenecks and complexities of today's processors
- Parallel computers and parallel programming models
- Extrapolating Moore's Law into the future – why condensed matter physicists could be interested in computing

Electronic computing: the beginnings
- 1938: Konrad Zuse's Z1 – Germany (the Z3 followed in 1941; the Z4 ran at ETH 1950–54)
- 1939–42: Atanasoff-Berry Computer – Iowa State Univ.
- 1943/44: Colossus Mark 1 & 2 – Britain
- 1945: John von Neumann's report that defines the "von Neumann" architecture
- 1945–51: UNIVAC I, Eckert & Mauchly – the "first commercial computer"

Von Neumann architecture:
- Invented by Eckert and Mauchly; discussed in the report by von Neumann (1945)
- Components: memory, control unit, arithmetic logic unit (with accumulator), input, output
- Stored-program concept: a general-purpose computing machine

Since the dawn of high-performance computing: supercomputing at U.S. Dept. of Energy laboratories
- 1946: ENIAC
- 1952: MANIAC I; 1957: MANIAC II – Nicholas Metropolis, physicist and leader of the group in LANL's T Division that designed MANIAC I & II
- 1974: Cray 1 – vector architecture
- 1987: nCUBE 10 (SNL) – MPP architecture
- 1993: Intel Paragon (SNL); 1993: Cray T3D
- 2002: Japanese Earth Simulator – the "Sputnik shock" of HPC
- 2004: IBM BG/L (LLNL)
- 2005: Cray Red Storm/XT3 (SNL)
- 2007: IBM BG/P (ANL)
- 2008: IBM "Roadrunner"
- 2008: Cray XT5 (ORNL) – peak 1.382 PFlop/s, quad-core AMD at 2.3 GHz, 150,176 compute cores, 300 TB memory

Today's important types of processor architectures
- Scalar processor: processes one data item (integer / floating point number) at a time
- Vector processor: a single instruction operates on many data items simultaneously
- Typical processor today: "pipelined superscalar"
  - Superscalar: simultaneously dispatches multiple instructions to redundant functional units (multipliers or adders)
  - Pipeline: a set of processing elements connected in series
  - Example: 2 multiplies and 2 adds per cycle (4 floating point operations per cycle) – see the sketch below
The good news: by and large, compiler-level optimization will take care of this complexity.
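To make the superscalar point concrete, here is a minimal C sketch (not from the lecture; array size and the factor-of-4 unrolling are arbitrary illustrative choices): a dot product with a single accumulator chains every add behind the previous one, whereas independent accumulators expose enough independent multiplies and adds to keep several functional units busy each cycle.

```c
/* Hypothetical illustration (not from the slides): exposing instruction-level
 * parallelism for a pipelined superscalar core.  Build with, e.g.:  cc -O2 dot.c */
#include <stdio.h>

#define N 1000000

/* One accumulator: every add depends on the previous one, so the adds
 * march through the pipeline strictly one after another. */
double dot_serial(const double *x, const double *y, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}

/* Four independent accumulators: multiplies and adds belonging to different
 * partial sums can be issued in the same cycle.  Optimizing compilers can
 * often perform this transformation themselves, which is the slide's point
 * that compiler-level optimization largely handles this complexity. */
double dot_unrolled(const double *x, const double *y, int n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; i++)          /* remainder loop */
        s0 += x[i] * y[i];
    return s0 + s1 + s2 + s3;
}

int main(void) {
    static double x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }
    printf("%f %f\n", dot_serial(x, y, N), dot_unrolled(x, y, N));
    return 0;
}
```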

Von Neumann architecture:
- Memory
- CPU: control unit, arithmetic logic unit (with accumulator)
- I/O unit(s): input, output
- Stored-program concept: a general-purpose computing machine

Computers in the past and today

                   1970s (*)     my laptop    improvement
  clock (CPU)      6 MHz         2 GHz        ~300x
  Flop/s           6 MFlop/s     8 GFlop/s    ~10^3 x
  RAM              128 kB        8 GB         ~10^5 x
  memory latency   850 ns        100 ns       ~10x

(*) Charles Thacker's computer in the 1970s

Memory hierarchy to work around latency and bandwidth problems
- CPU / functional units and registers – expensive, fast, small
- Internal cache: ~100 GB/s, 6–10 ns
- External cache: ~50 GB/s
- Main memory: ~10 GB/s, 75 ns – cheap, slow, large
(see the blocking sketch below)
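As a hedged illustration of why the hierarchy matters (a sketch, not from the lecture; the matrix size N and block size B are made-up values), a blocked matrix multiply keeps small tiles in cache so that each value fetched from slow main memory is reused many times:

```c
/* Hypothetical blocking sketch (not from the slides): tiling a matrix multiply
 * so the working set stays in cache, trading slow main-memory traffic
 * (~10 GB/s, ~75 ns above) for fast cache hits (~100 GB/s, 6-10 ns). */
#include <stdio.h>

#define N 256   /* matrix dimension (illustrative) */
#define B 32    /* block size: three B*B tiles of doubles (~24 kB) fit in cache */

static double A[N][N], Bm[N][N], C[N][N];

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0; Bm[i][j] = 1.0; C[i][j] = 0.0; }

    /* Blocked loops: every element brought into cache is reused ~B times
     * before it is evicted, instead of once per load from main memory. */
    for (int ii = 0; ii < N; ii += B)
        for (int kk = 0; kk < N; kk += B)
            for (int jj = 0; jj < N; jj += B)
                for (int i = ii; i < ii + B; i++)
                    for (int k = kk; k < kk + B; k++)
                        for (int j = jj; j < jj + B; j++)
                            C[i][j] += A[i][k] * Bm[k][j];

    printf("C[0][0] = %f (expect %d)\n", C[0][0], N);
    return 0;
}
```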

Moore's Law is still alive and well
(illustration: A. Tovey, source: D. Patterson, UC Berkeley)

Single processor performance is no longer tracking Moore's Law

Performance increase due to exploding number of processing cores
(illustration: A. Tovey, source: D. Patterson, UC Berkeley)

Multi-core processors (since the middle of the decade)

Distributed vs. shared memory
[Figure: diagram contrasting distributed-memory nodes with their own RAM and processors sharing one memory]

Interconnect types on massively parallel processing (MPP) systems – distributed memory
[Figure: two layouts – nodes (CPU + RAM) with integrated NIC & router chips linked directly to one another, vs. nodes with plain NICs attached to external switch(es)/router(s)]

InfiniBand (or Ethernet) networks – separating the compute partition from the routers/switches
- Open / commodity network
- More flexibility with topology (usually fat tree, but hypercube, dragonfly, etc. are also possible)
- Scales only up to ~10^4 nodes (ideal for small clusters)
- Latency can be as low as a microsecond
- Bandwidth not as high as proprietary networks

Proprietary networks – integrated router/NIC (network interconnect chip) and compute node
- Proprietary networks (today):
  - IBM BG/P – torus + tree
  - Cray SeaStar (XT5) – torus
  - Cray Gemini (XE6) – torus
- Reliable, and scales to 100k nodes
- Higher bandwidth (similar to PCIe)
- Latency slightly higher than InfiniBand

Complexity of the interconnect
[Figure: an InfiniBand-style fabric with a faulty link]
- Best case: the error is detected and corrected at the offending link.
- Otherwise: the error is only detected at the destination; the packet is discarded and resent after a timeout. The source node must then retain copies of all potential in-flight messages – an O(n^2) problem.

Interconnects in the TOP500 systems
[Figure: interconnect-family share of TOP500 systems, LCI 2007]

Programming models (I): message passing
- Concurrent sequential processes cooperating on the same task
- Each process has its own private address space
- Communication is two-sided, through send and receive – large overhead!
- Lots of flexibility in decomposing large problems, but only a fragmented view of the problem
- All the burden of maintaining the global view is placed on the programmer
- Examples are message-passing libraries like MPI or PVM (see the sketch below)
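A minimal message-passing sketch in C (not part of the lecture; the ring exchange is just an illustrative pattern): each MPI process owns private data, and communication is explicit and two-sided.

```c
/* Message-passing sketch: private data per process, explicit two-sided
 * communication.  Typical build/run:  mpicc ring.c && mpirun -np 4 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double mine = (double)rank;   /* this process's private data */
    double from_left;

    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;

    /* Two-sided: every send must be matched by a receive on the other side. */
    MPI_Sendrecv(&mine, 1, MPI_DOUBLE, right, 0,
                 &from_left, 1, MPI_DOUBLE, left, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d received %f from rank %d\n", rank, from_left, left);

    MPI_Finalize();
    return 0;
}
```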

Programming models (II): shared memory
- Multiple independent threads operate on the same shared address space
- Easy to use, since there is only one type of memory access
- One-sided remote access (low overhead)
- The application view remains integrated (global view)
- Shared-memory hardware doesn't scale (local vs. remote memory access)
- It is difficult to exploit inherent data locality – degradation of performance!
- Examples are OpenMP or Pthreads; compiler directives used with C, Fortran, ... (see the sketch below)
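A minimal shared-memory sketch with OpenMP (again illustrative, not from the lecture): all threads see the same array, and the parallelism is expressed with a compiler directive.

```c
/* Shared-memory sketch: one shared address space, parallelism via directives.
 * Build (gcc):  cc -fopenmp sum.c */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double x[N];
    for (int i = 0; i < N; i++) x[i] = 1.0;

    double sum = 0.0;
    /* All threads access the same array x; the reduction clause handles the
     * single shared accumulator safely. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += x[i];

    printf("sum = %f using up to %d threads\n", sum, omp_get_max_threads());
    return 0;
}
```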

Programming models (III): data parallel
- Concurrent processing of many data elements in the same manner
- Executing only one program (on many processors)
- Major drawback: does not permit independent branching
- Not good for problems that are rich in functional parallelism
- Popular examples are C* and HPF
- Revived today with GPGPU & CUDA (see the sketch below)
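A rough data-parallel sketch in plain C (illustrative only; C* and HPF have their own syntax, and on a GPU this loop body would become a kernel with one thread per element): every element is treated the same way, and "branching" is replaced by a per-element select.

```c
/* Data-parallel sketch: the same operation applied to every element.
 * Elements cannot take independent control paths, so both outcomes are
 * computed and one is selected per element. */
void clamp_negative(const float *x, float *y, int n) {
    for (int i = 0; i < n; i++) {
        float keep = x[i];     /* outcome if x[i] > 0 */
        float zero = 0.0f;     /* outcome otherwise    */
        y[i] = (x[i] > 0.0f) ? keep : zero;   /* element-wise select */
    }
}
```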

Programming models (IV): distributed shared memory
- Also called the partitioned global address space (PGAS) model
- Independent threads operate in a shared memory space – preserves the global view of the program
- The shared space is locally partitioned among threads – allows exploiting data locality
- "Single program, multiple data streams" (SPMD) execution – independent forking (functional parallelism)
- Popular examples: UPC and Co-Array Fortran, or the Global Arrays library
- May still not have the same flexibility as the message-passing model (see the sketch below)
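UPC and Co-Array Fortran need their own compilers, so as a stand-in here is the PGAS idea – locally owned partitions plus one-sided remote access – sketched with MPI-2 remote memory access in C (an analogy, not actual PGAS syntax):

```c
/* PGAS-flavored sketch: each rank exposes a locally owned window of memory;
 * any rank can read it with a one-sided MPI_Get, with no matching receive
 * on the owner's side.  Build/run:  mpicc pgas.c && mpirun -np 4 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* The locally owned partition of the "global" address space. */
    double local = 100.0 + rank;
    MPI_Win win;
    MPI_Win_create(&local, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* One-sided read of the right neighbor's partition. */
    double remote;
    int right = (rank + 1) % size;
    MPI_Win_fence(0, win);
    MPI_Get(&remote, 1, MPI_DOUBLE, right, 0, 1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);

    printf("rank %d read %f from rank %d\n", rank, remote, right);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```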

Distributed shared memory or PGAS: keeping the best from all other models
Hardware-optimized PGAS:
- Cray XE6 with Gemini interconnect – fall 2010 at NERSC and Edinburgh (first XE6 cabinet at CSCS since June 2010)
- IBM BG/Q with new interconnect – late 2011 at LLNL; 2012 at ANL & Jülich

Aspects of performance – typical values in 2009/10 (Cray XT5 node)
- Floating point (integer) performance: 2 or 4 operations per cycle; Flop/s = floating point operations per second; a 2.4 GHz processor delivers 9.6 GFlop/s per core
- Memory latency: 50 ns
- Memory bandwidth: 10 GB/s per core
- Interconnect latency: 1–10 µs
- Network bandwidth: 5–10 GB/s
- Disk access time: milliseconds
- I/O bandwidth (disk): Gigabit/s
(a back-of-envelope comparison follows below)
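A quick back-of-envelope check of these numbers (a sketch using the illustrative values above, not measurements): at 9.6 GFlop/s per core, one 50 ns memory latency costs roughly 480 floating point operations, and 10 GB/s of bandwidth delivers only about one double per 7–8 flops of peak.

```c
/* Back-of-envelope arithmetic with the typical 2009/10 values quoted above. */
#include <stdio.h>

int main(void) {
    double freq_hz       = 2.4e9;   /* clock frequency              */
    double flops_per_cyc = 4.0;     /* 2 multiplies + 2 adds/cycle  */
    double mem_latency_s = 50e-9;   /* main-memory latency          */
    double mem_bw_Bps    = 10e9;    /* memory bandwidth per core    */

    double peak = freq_hz * flops_per_cyc;               /* 9.6 GFlop/s        */
    double flops_per_latency = peak * mem_latency_s;     /* ~480 flops         */
    double flops_per_double  = peak / (mem_bw_Bps / 8.0);/* ~7.7 flops needed
                                                            per double loaded  */
    printf("peak %.1f GFlop/s, %.0f flops per memory latency, "
           "%.1f flops per double of bandwidth\n",
           peak / 1e9, flops_per_latency, flops_per_double);
    return 0;
}
```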

Summary: brutal facts of modern HPC
- Mind-boggling numbers of processing units
- Processor complexity (multi-core, heterogeneous, memory)
- The interconnect is a non-trivial part of the HPC system
- Parallel programming models are characterized by their memory model:
  - shared memory (OpenMP), distributed memory (MPI), data parallel (HPF)
  - distributed shared memory (PGAS such as UPC, CAF, Global Arrays)
- Accessing memory is prohibitively expensive compared to the cost of floating point operations:
  - 1960s: transistors were expensive, memory access was cheap
  - today: transistors are cheap, memory access is expensive
Key aspect of programming HPC systems: it is all about managing resources

1 EFlop/s – 2019

A 1000-fold increase in performance in 10 years:
- previously: transistor density doubled every 18 months → 100x in 10 years, and frequency increased as well
- now: "only" 1.75x transistor density every 2 years → 16x in 10 years, and frequency stays almost the same
- so a factor of roughly 60 needs to be made up somewhere else
Source: Rajeeb Hazra's (HPC@Intel) talk at SOS14, March 2010
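The factor of 60 follows directly from the numbers on the slide; a quick check of the arithmetic:

```latex
% previously: density doubles every 18 months, over 10 years (120 months)
2^{120/18} = 2^{6.7} \approx 100\times
% now: 1.75x every 2 years, over 10 years
1.75^{10/2} = 1.75^{5} \approx 16\times
% remaining factor needed to reach 1000x in sustained performance
1000 / 16 \approx 60
```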

[Figure: slide from Rajeeb Hazra's (HPC@Intel) talk at SOS14, March 2010]

Limits of CMOS scaling (device parameters scaled by a factor α):

  Voltage:                  V/α
  Oxide thickness (t_ox ~ 1 nm): t_ox/α
  Wire width:               W/α
  Gate length:              L/α
  Diffusion depth:          x_d/α
  Substrate doping:         α·N_A

Consequences:
  Higher density:     α^2
  Higher speed:       α
  Power per circuit:  1/α^2
  Power density:      constant

The power challenge today is a precursor of more physical limitations in scaling – the atomic limit!
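A short sketch of why power density stays constant under this (Dennard) scaling, using the standard dynamic-power estimate P ∝ C V² f (my notation, chosen to match the table above):

```latex
% Linear dimensions and voltage shrink by 1/\alpha, frequency grows by \alpha:
%   C \to C/\alpha, \qquad V \to V/\alpha, \qquad f \to \alpha f
P \;\propto\; C V^{2} f
  \;\longrightarrow\;
  \frac{C}{\alpha}\cdot\frac{V^{2}}{\alpha^{2}}\cdot\alpha f
  \;=\; \frac{C V^{2} f}{\alpha^{2}}
% Power per circuit falls as 1/\alpha^{2} while circuit density grows as \alpha^{2},
% so power per unit area (power density) stays constant.
```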

Opportunity for outsiders: a major disruption must happen before 2025
- Current CMOS technology will reach its physical limits by the end of the decade and will cease to scale
- The other physical limitation is the speed of light: light travels 30 cm in 1 ns – that is several clock cycles! Supercomputers can't simply keep growing larger as they did over the past decade
- The enthusiasm for GPGPU and hybrid systems indicates that a change in architecture is happening – but this is still within the current thinking and paradigm of digital computers
Huge opportunities for new materials, devices, and condensed matter physics!
