A History Of Supercomputing - University Of Washington

Transcription

A History of Supercomputing
Burton Smith
Microsoft

Outline

- Definitions
- Early days: Colossus and Eniac
- Mainframes: Univac LARC to CDC 7600
- SIMD systems
  - Processor arrays: Illiac IV, PEPE, and BSP
  - One-bit arrays: Staran, DAP, and the Goodyear MPP
- MIMD systems
  - Vector pipelining: TI-ASC, CDC Star-100, and Cray-1
  - Wide instructions: FPS, Multiflow, and Cydrome
  - Shared memory: Cray X-MP to SGI Altix
  - Distributed memory: Cosmic Cube to Blue Gene
- Japan-US competition
- What next?

Definitions

- Supercomputer: the world's fastest computer at the time
- SIMD: Single Instruction Stream, Multiple Data Stream
  - One instruction at a time, each operating on an array of data
- MIMD: Multiple Instruction Stream, Multiple Data Stream
  - Multiple processors asynchronously issuing instructions (see the sketch below)
- Shared Memory: MIMD computer in which a common memory is accessible by all processors
  - UMA: Uniform memory access by all processors
  - NUMA: Non-uniform memory access, based on placement
- Distributed Memory: MIMD computer in which the memory is partitioned into processor-private regions
- RISC: Reduced Instruction Set Computer
  - Uses fast and simple instructions to do complex things
- Pipelining: Processing in assembly-line style
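The SIMD/MIMD distinction above is the hinge of the whole talk, so here is a minimal sketch of it (an illustration added to this transcription, not part of the talk): NumPy array arithmetic plays the role of one instruction operating on an array of data, and threads play the role of asynchronous instruction streams sharing a common, UMA-style memory.

```python
# Minimal sketch (not from the talk): SIMD vs. MIMD in miniature.
import threading
import numpy as np

# SIMD flavor: one logical instruction applied to an array of data.
a = np.arange(8)
simd_result = a * 2          # eight data elements, one operation

# MIMD flavor: independent instruction streams issuing asynchronously,
# here sharing a common memory (a plain Python list), UMA-style.
mimd_result = [0] * 8

def worker(i):
    mimd_result[i] = i * 2   # each thread runs its own instruction stream

threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert list(simd_result) == mimd_result
```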

Names and Places

- BMD/ATC: Ballistic Missile Defense Advanced Technology Center, Huntsville, AL
- BRL: Ballistics (now Army) Research Laboratory, Aberdeen, MD
- DARPA: Defense Advanced Research Projects Agency, VA
- NASA-Ames: Ames Research Center, Mountain View, CA
- NASA-Goddard: Goddard Space Flight Center, Greenbelt, MD
- Livermore: Lawrence Livermore National Laboratory, Livermore, CA
- Los Alamos: Los Alamos National Laboratory, Los Alamos, NM
- NSA: National Security Agency, Fort Meade, MD
- Sandia: Sandia National Laboratories, Albuquerque, NM

Early days: Colossus

- Used at Bletchley Park, England during WW II
  - For cryptanalysis of Lorenz SZ40/42 rotor systems ("Fish")
  - Only recently declassified by Her Majesty's Government
- Concepts by Max Newman et al.
- Designed and built by Tommy Flowers at the Post Office Research Station, Dollis Hill
- Features:
  - Not a general-purpose computer
  - Programmed by switches
  - 1500 vacuum tubes ("valves") in the Mark I, 2400 in Mark II
  - Paper tape loop data input at 30 mph: 5000 characters/sec

Colossus

Early days: Eniac

- Built at Penn's Moore School for BRL, Aberdeen, MD
  - The motivating application was artillery firing tables
  - Its first use was evaluating H-bomb feasibility
- Designed and built by J. Presper Eckert and John Mauchly
- Unveiled at the Moore School on February 15, 1946
- It had 17,468 vacuum tubes and consumed 174 kW
  - Reliability was a concern, so it was left on continuously
- It was 80 feet long and weighed 30 tons
- Programming was via switches and patch cables
- Not a general-purpose computer

Eniac

Other early supercomputers

- Zuse Z3 (1941)
- Univac 1 (1951)
- Manchester/Ferranti Mark I (1951)
- The IAS machines (1952)

Mainframes

- The 50's and 60's saw fast progress in computer technology
  - Most of it was first used in "mainframe" supercomputers
- Examples of new technology included:
  - Magnetic core memory
  - Transistor logic circuits
  - Floating point hardware
  - Pipelining
  - High level languages and compilers
- These systems were used for both business and science
  - Later, supercomputers became much more science-oriented

Mainframes: LARC

- Begun in 1955 for Livermore and delivered in 1960
- Had dual processors and decimal arithmetic
- Employed surface-barrier transistors and core memory

Mainframes: Stretch and Harvest

- IBM 7030 (STRETCH)
  - Delivered to Los Alamos 4/61
  - Pioneered in both architecture and implementation at IBM
- IBM 7950 (HARVEST)
  - Delivered to NSA 2/62
  - Was STRETCH plus 4 boxes:
    - IBM 7951 Stream unit
    - IBM 7952 Core storage
    - IBM 7955 Tape unit
    - IBM 7959 I/O Exchange

Mainframes: CDC 6600

- Seymour Cray and Bill Norris left Univac to start CDC
- Cray's CDC 1604 was the first successful transistor system
- He wanted to build a scientific supercomputer
  - No decimal arithmetic needed or wanted
- The 6600, built with silicon transistors, shipped Sept. 1964
- The chief architects were Seymour Cray and Jim Thornton
- The 6600 had several notable features:
  - A very simple instruction set, lending itself to more speed
    - Some say RISC means "Really Invented by Seymour Cray"
  - 10 arithmetic function units able to process in parallel
    - This overlapped execution style is now called superscalar
    - It was coordinated by a famous circuit called the scoreboard
  - Parallel I/O processors using a new idea called multithreading
  - A futuristic operator's console was provided at no extra charge

CDC 6600 Console

Mainframes: IBM 360/91

- IBM's Tom Watson was angry:
  "Last week, Control Data ... announced the 6600 system. I understand that in the laboratory ... there are only 34 people including the janitor. Of these, 14 are engineers and 4 are programmers. Contrasting this modest effort with our vast development activities, I fail to understand why we have lost our industry leadership position by letting someone else offer the world's most powerful computer."
- The 360/91 (AKA 370/195) was IBM's answer
  - It was delivered to NASA-Goddard in October 1967
  - It was very close to the CDC 7600 in performance
- Killer App: Passenger Airline Reservation Systems (1964)
  - Jointly developed by IBM, American, and United Airlines
  - Originally written in IBM assembly (i.e. machine) language
- The 360/91 also had a pretty impressive console

IBM 360/91 Console

Mainframes: CDC 7600

- The 7600 was delivered to Livermore in early 1969
- It was highly compatible with the 6600
- Besides being faster, its arithmetic was pipelined
  - This meant the arithmetic units were never "busy": each could accept a new operation every cycle
- It was so compact it had to be cooled with Freon
- It was also one of the world's most beautiful computers

(photo: Seymour Cray)

Two CDC 7600s

SIMD arrays: Illiac IV

- By the late 60's, it was clear mainframes weren't enough
- To improve performance, SIMD array machines were built or proposed with many arithmetic processing units
- Solomon was an early Westinghouse SIMD array prototype
- The Illiac IV was a University of Illinois/Burroughs project
  - The chief architect, Dan Slotnick, came from Westinghouse
  - It was to have 256 arithmetic units, later cut back to 64
  - The thin-film memory system was a major headache
- Funded by DARPA from 1964 onward, it was usable in 1975
- After student demonstrations at Illinois in May 1970, the project was moved to NASA-Ames
- Much compiler expertise was developed working on Illiac IV
  - Languages like IVtran (not pronounced four-tran) aimed to use parallel loops to express the necessary parallelism

ILLIAC IV

SIMD Arrays: PEPE and BSP

- Both were Burroughs designs derived from Illiac IV
- PEPE had 288 processing elements and was designed to track ballistic missiles
  - It was delivered to BMD/ATC in Huntsville in 1976
- BSP had 16 processing elements and was aimed at commercial use
  - Richard Stokes was chief architect
  - The first customer was Marathon Oil Co. in Denver, CO
  - The project was cancelled in 1980, before its first delivery

One-bit Arrays: STARAN, DAP, MPP

- These were arrays of processors that were only one bit wide
  - Arithmetic was serial (like grade-school, only binary; see the sketch below)
    - It could use whatever precision made sense for the problem
  - They were inexpensive to implement, especially on one chip
- Goodyear Aerospace developed STARAN in 1972
  - Ken Batcher, an Illinois graduate, was chief architect
  - It was great at image processing, though not intended for it
  - The ASPRO version was used on Navy AWACS aircraft
- ICL (UK) developed the Distributed Array Processor
  - Stuart Reddaway was the architect
- The Goodyear MPP was a single-chip STARAN successor
  - It was to be a 256 by 256 array with lots of data permuting power
  - The array size was reduced to 128 by 128
  - Delivered to NASA-Goddard in 1984
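As a gloss on the bit-serial arithmetic described above, here is a minimal sketch (added to this transcription, not from the talk) of grade-school binary addition done one bit per step, with the precision chosen to suit the problem, as on these one-bit machines.

```python
# Minimal sketch (not from the talk): bit-serial addition, LSB first.
# A one-bit ALU needs only a full adder and a carry flip-flop; precision
# is set by how many steps you run, not by the hardware width.
def bit_serial_add(x, y, bits):
    carry = 0
    total = 0
    for i in range(bits):
        a = (x >> i) & 1
        b = (y >> i) & 1
        s = a ^ b ^ carry                       # sum bit of a full adder
        carry = (a & b) | (carry & (a ^ b))     # carry out
        total |= s << i
    return total

assert bit_serial_add(13, 29, 8) == 42          # 8-bit precision
assert bit_serial_add(13, 29, 16) == 42         # same answer, more steps
```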

One-bit Arrays: Connection Machine

- Danny Hillis and his colleagues initially based the design on a "connectionist" approach to artificial intelligence
  - The design came out of Danny's PhD thesis at MIT
- The CM-1 was a hypercube-connected one-bit array
- The CM-2/CM-200 added some 32-bit floating point hardware
  - Not one per processor, sadly
- The machine proved useful for scientific computation
  - High-level language support was excellent
  - Many HPC users like NSA and Los Alamos bought them
- The CM-2 had an array of light-emitting diodes (right)
- The CM-5 was MIMD, not SIMD

(photo: CM-2)

Vector pipelining: TI ASC

- The ASC was designed to exploit Fortran loops
  - This motivated the vector pipelining decision (see the sketch below)
  - The ASC's Fortran compiler was state-of-the-art
- Several systems shipped, beginning in 1974
  - Geophysical Fluid Dynamics Laboratory, Princeton, NJ
  - Ballistic Missile Defense, Huntsville
  - Naval Research Laboratory, Anacostia, MD
- The GFDL machine had an interesting performance tool: a strip-chart recorder perched on top that plotted vector length
  - The recording was returned to the user with the job output
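To illustrate what "exploiting Fortran loops" means, here is a minimal sketch (in Python with NumPy, added to this transcription): the element-at-a-time loop below is the shape a vectorizing compiler recognizes and turns into pipelined vector instructions, shown here in its array-at-a-time form.

```python
# Minimal sketch (not from the talk): a vectorizable inner loop.
import numpy as np

n = 1000
a = np.random.rand(n)
b = np.random.rand(n)
c = np.empty(n)

# Scalar form: one element per loop trip, the way Fortran wrote it.
for i in range(n):
    c[i] = a[i] + 2.0 * b[i]

# Vector form: the whole loop collapses into array operations that
# vector hardware can stream through its arithmetic pipelines.
c_vec = a + 2.0 * b

assert np.allclose(c, c_vec)
```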

Vector Pipelining: CDC Star-100

- CDC used APL rather than Fortran loops as a vector model
  - APL is a language invented by Ken Iverson of IBM Research
- Neil Lincoln and Jim Thornton were the architects
  - Sidney Fernbach at Livermore was the target customer
- The Star-100 had poor scalar (non-vector) performance
  - The result was highly limited applicability
  - This effect is called Amdahl's law
- Both Livermore and Los Alamos were scheduled to get one
  - Los Alamos backed out in favor of the Cray-1
  - Livermore got both Star-100 systems as a result

(photo: Star-100)

Amdahl's law

- If w1 work is done at speed s1 and w2 at speed s2, the average speed s is (w1 + w2)/(w1/s1 + w2/s2)
  - This is just the total work divided by the total time
  - This is obviously not the average of s1 and s2
- For example, if w1 = 9, w2 = 1, s1 = 100, and s2 = 1, then s = 10/1.09 ≈ 9 (checked in the sketch below)
- Amdahl, Gene M., "Validity of the single processor approach to achieving large scale computing capabilities", Proc. SJCC, AFIPS Press, 1967
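The slide's formula and its worked example, checked in a few lines of Python (added to this transcription):

```python
# Amdahl's law as stated above: total work divided by total time.
def average_speed(w1, s1, w2, s2):
    return (w1 + w2) / (w1 / s1 + w2 / s2)

# The slide's example: 9 units of work at speed 100, 1 unit at speed 1.
print(average_speed(9, 100, 1, 1))   # 10/1.09 = 9.17..., roughly 9
```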

Vector Pipelining: Cray-1

- Unlike the CDC Star-100, there was no development contract for the Cray-1
  - Mr. Cray disliked the government looking over his shoulder
- Instead, Cray gave Los Alamos a one-year free trial
  - After the year was up, Los Alamos leased the system
  - The lease was financed by a New Mexico petroleum person
- Almost no software was provided by Cray Research
  - Los Alamos developed or adapted existing software
- Its scalar performance was twice that of the 7600
- The Cray-1 definitely did not suffer from Amdahl's law
  - Once vector software matured, 2x became 8x or more
- When people say "supercomputer", they think Cray-1

Cray-1

VLIW: FPS, Multiflow, and Cydrome

- VLIW stands for Very Long Instruction Word
  - It is a kind of SIMD system in which each arithmetic unit can be doing something different (thus the need for width; see the sketch below)
- Floating Point Systems led in 1976 with the AP-120B
  - Glenn Culler was the chief architect
  - These systems were embedded in GE CAT scanners
  - They were also used widely for seismic data processing
- Multiflow was founded in 1984 by Josh Fisher to exploit some compiler technology he had been pursuing at Yale
  - The first shipments of the Trace 7/200 were in 1987
- Bob Rau's company, Cydrome, was also founded in 1984 and demonstrated the Cydra-5 machine in 1987
- Both systems had instructions with seven fields and 256 bits
- Technically, none of these machines was a supercomputer
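As a rough picture of a seven-field, 256-bit instruction word, here is a sketch (added to this transcription; the 36-bit field width is invented for illustration, since the slide gives only the field count and total width): each field independently directs one function unit.

```python
# Minimal sketch (not from the talk): packing one operation per function
# unit into a single wide VLIW instruction word. FIELD_BITS is a
# hypothetical width; the slide says only "seven fields and 256 bits".
FIELDS = 7
FIELD_BITS = 36                      # 7 * 36 = 252 bits of a 256-bit word

def pack_vliw(ops):
    assert len(ops) == FIELDS
    word = 0
    for i, op in enumerate(ops):
        assert 0 <= op < (1 << FIELD_BITS)
        word |= op << (i * FIELD_BITS)
    return word

def unpack_vliw(word):
    mask = (1 << FIELD_BITS) - 1
    return [(word >> (i * FIELD_BITS)) & mask for i in range(FIELDS)]

ops = [10, 20, 30, 40, 50, 60, 70]   # one encoded op per arithmetic unit
assert unpack_vliw(pack_vliw(ops)) == ops
```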

Shared Memory: Cray Vector Systems

- Cray Research, by Seymour Cray
  - Cray-1 (1976): 1 processor
  - Cray-2 (1985): up to 4 processors*
- Cray Research, not by Seymour Cray
  - Cray X-MP (1982): up to 4 procs
  - Cray Y-MP (1988): up to 8 procs
  - Cray C90 (1991?): up to 16 procs
  - Cray T90 (1994): up to 32 procs
  - Cray X1 (2003): up to 8192 procs
- Cray Computer, by Seymour Cray
  - Cray-3 (1993): up to 16 procs
  - Cray-4 (unfinished): up to 64 procs
- All are UMA systems except the X1, which is NUMA

*One 8-processor Cray-2 was built

(photo: Cray-2)

Shared Memory: Multithreading

- These systems pipeline MIMD architecture in the same way vector systems pipeline SIMD architecture
- Like vector multiprocessors, they can be UMA or NUMA
- Examples include:
  - Denelcor HEP (1980): up to 16 procs
  - Tera MTA-1 (1998): up to 16 procs
  - Cray MTA-2 (2002): up to 256 procs
- A variant form of multithreading is Simultaneous Multithreading (SMT)
  - Intel calls SMT "Hyper-Threading"

(photo: Cray MTA-2)

Shared Memory: CC-NUMA

- Cache-coherent NUMA systems were researched at Stanford and developed as products by Silicon Graphics
  - These systems automatically move data among caches at each processor to minimize processor data access times
- SGI CC-NUMA systems:
  - Origin 2000 (1997): up to 128 procs
  - Origin 3000 (2003): up to 512 procs
  - Altix (2003): up to 512 procs*
- Kendall Square Research used a variant called COMA, which stands for Cache-Only Memory Architecture
- Kendall Square COMA systems:
  - KSR1 (1991): up to 1088 procs
  - KSR2 (1994): up to 5000 procs

* "Columbia" at NASA-Ames clusters 20 of these

(photo: SGI Altix)

Distributed Memory: Cosmic Cube

- Chuck Seitz and Geoffrey Fox pioneered the idea at Caltech
  - The first system (1982) was called the Cosmic Cube
  - It had 64 Intel 8086 processors, each with an 8087 attached
  - The processors were connected in a 6-dimensional hypercube (see the sketch below)
  - They communicated by sending messages to each other
- Early commercial systems included:
  - Intel iPSC/1 (1985): 80286, up to 64 procs
  - Ametek/S14 (1986): 68020, over 1000 procs
  - nCUBE/10 (1985): custom, up to 1024 procs
- This system architecture now dominates supercomputing
  - Variations since then are legion

(photo: nCUBE/10)
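The hypercube wiring mentioned above has a pleasantly simple arithmetic description: node k is linked to every node whose number differs from k in exactly one bit. A minimal sketch (added to this transcription):

```python
# Minimal sketch (not from the talk): hypercube neighbors by bit-flipping.
# In a 6-dimensional hypercube (2**6 = 64 nodes, as in the Cosmic Cube),
# every node has exactly 6 message-passing links.
def hypercube_neighbors(node, dims):
    return [node ^ (1 << d) for d in range(dims)]

print(hypercube_neighbors(0, 6))     # [1, 2, 4, 8, 16, 32]
print(hypercube_neighbors(21, 6))    # [20, 23, 17, 29, 5, 53]
```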

!!""""Cray XT3IBM Blue Gene/LCray T3D (1994), T3E (1996), XT3Intel Touchstone series, Paragon (1992)IBM SP-1 (1992), SP-2, Deep Blue, Blue GeneAnd many othersA classic MPP is a distributed memory system whoseprocessing elements are individual (micro)processorsExamples include:Distributed Memory: Classic MPPs34

!!"""Earth Simulator (below)Usually based on Intel or AMD processorsEven Government Labs have done thisEarth Simulator (above)##IBM eServer p575NEC Earth Simulator (2002), SX-8A vast multitude of home-made “Beowulf” clustersA cluster MPP is a distributed memory system whoseprocessing elements are shared memory multiprocessorsExamples include:Distributed Memory: Clusters35

The Top 500 List

- This is a list of the fastest computers in the world
  - www.top500.org
  - By definition, it's a great place for supercomputer spotting
- Be very careful about the term "fastest", though
  - Speed is measured by one benchmark program, LINPACK (see the sketch below)
  - LINPACK doesn't care how well the system is connected
  - Shared memory isn't useful for good LINPACK results
- With that caveat, it's well worth a look
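For concreteness, here is a sketch (added to this transcription) of what LINPACK fundamentally measures: the rate of a dense linear solve, counted as roughly 2n^3/3 floating point operations. The real benchmark (HPL) distributes this work across the whole machine; the toy below just times NumPy's LU-based solver on one node.

```python
# Minimal sketch (not from the talk): the flavor of a LINPACK measurement.
import time
import numpy as np

n = 2000
A = np.random.rand(n, n)
b = np.random.rand(n)

t0 = time.time()
x = np.linalg.solve(A, b)            # dense LU factor-and-solve
elapsed = time.time() - t0

flops = 2.0 * n**3 / 3.0             # standard LU operation count
print(f"about {flops / elapsed / 1e9:.2f} Gflop/s")
```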

Japanese Systems: Fujitsu, NEC, Hitachi

- Japanese supercomputers mostly use vector processors with shared or (most recently) distributed memory architecture
- Fujitsu shipped its first vector system (Facom 230) in 1976:
  - This was followed by the VP200 (1982), VP400 (1985), VP2000 (1989), VP2600 (1990), and the distributed memory VPP500 (1992), VPP700 (1996), and VPP5000 (1999)
  - Recently, Fujitsu has been selling scalar NUMA systems
- NEC got a late start on vectors, but caught up very quickly:
  - SX-1 (1983), SX-2 (1985), SX-3 (1990), SX-4 (1994), SX-5 (2000), and the distributed memory SX-6 and SX-7 (2002), Earth Simulator (2002), and SX-8 (2006)
  - Tadashi Watanabe is the chief architect of these systems
- Hitachi has sold few supercomputers outside Japan
  - An exception is the distributed memory SR8000 (2000)

Japanese National Projects

- The National Superspeed Computer Project (1981)
  - Intended to produce a supercomputer
  - Two projects: architecture and technology
  - Companies: Fujitsu, NEC, Hitachi, Mitsubishi, Oki, Toshiba
  - About ¥23 billion over 9 years
- The Fifth Generation Project (1982)
  - About ¥55 billion over 10 years
  - Centered at a new laboratory: ICOT
  - Focused on artificial intelligence: inference, vision, speech
  - MCC in Austin, TX was a US industrial response

DARPA Strategic Computing Program

- This program was intended to accelerate the development and application of parallel computing
- It funded several industrial development projects over the period 1983-1993:
  - Thinking Machines: Connection Machine (1-bit SIMD array)
  - Bolt, Beranek, and Newman: Butterfly (shared memory)
  - Intel: Touchstone series (distributed memory)
  - Tera: MTA (shared memory)
  - Cray: T3D (distributed memory)
- Government agencies were encouraged to try these systems

The DOE ASCI Program

- The Department of Energy was directed to use computer simulation to replace nuclear weapons testing
- It has procured a series of distributed memory systems:

  System Name    Date  Laboratory   Vendor
  Red            1995  Sandia       Intel
  Blue Pacific   1996  Livermore    IBM
  Blue Mountain  1998  Los Alamos   SGI
  White          2000  Livermore    IBM
  Q              2002  Los Alamos   Compaq
  Red Storm      2004  Sandia       Cray
  Purple         2006  Livermore    IBM

- The performance range is about 100-fold

What Next?

- Parallelism is becoming mainstream
  - Intel and AMD are putting multiple processors on a chip
  - It's only a few now, but just wait!
- This will change the supercomputer playing field, again
  - Parallel application software will emerge
  - Parallel languages and compilers will enable it
  - Parallel architecture will evolve to match
- The world's computers become one giant cluster
  - The cloud links them
  - We must make them safe
- What an exciting field it is with such challenges!

Apologies: Companies Not Mentioned

- …sys
- Parsytec
- Quadrics
- Siemens-Nixdorf
- Sun
