Designer Seymour Cray And The Cray-3 Supercomputer, 1993

Transcription

Introducing the CRAY-3 Supercomputer Systems

The CRAY-3 is the first supercomputer to use gallium arsenide (GaAs) integrated circuits for all of its logic circuitry. The development of GaAs digital circuits was a fundamental step in enabling the CRAY-3 to attain the fastest clock cycle time available in a computer system (two nanoseconds).

The CRAY-3 offers a balanced combination of high-speed vector processing, very fast scalar processing and the largest directly addressable memory available in a general purpose scientific computer (up to two gigawords). These features, combined with a highly parallel architecture, make the CRAY-3 the most powerful system available to the scientific and engineering communities.

The performance features of the CRAY-3 are a result of a unique synthesis of system architecture and hardware technology. The CRAY-3 architecture is an evolutionary extension of the CRAY-2 architecture. The system cabinet illustrates the hardware technology required. All the logic and memory circuitry for the machine resides in the top eight inches of an octagon-shaped cabinet only 42 inches wide and 50 inches high. These top eight inches of the system cabinet contain one to 16 computational processors, one system management processor, up to two gigawords of common memory and up to 15 I/O modules.

The logic and memory circuitry are contained in three-dimensional modules only four inches square by one-quarter of an inch thick. Packaging the architecture in this small space allows for short signal paths throughout the system.

The combination of compact packaging and high-performance components was essential to the development of a balanced, powerful and high-speed system required by today's customers. It is unique to the CRAY-3.
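The headline figures above can be cross-checked with a little arithmetic. The Python sketch below converts the two-nanosecond clock period to a frequency and sizes a two-gigaword memory in bytes. It assumes 64-bit words, and because the brochure does not say whether "gigaword" is meant as a binary or a decimal quantity, it shows both readings. All names in the code are illustrative.

```python
# Cross-check of the headline CRAY-3 figures quoted above.
# Assumptions (not stated in the introduction): 64-bit words, and
# "gigaword" read either as 2**30 words (binary) or 10**9 words (decimal).

CLOCK_PERIOD_NS = 2          # clock cycle time from the brochure
WORD_BITS = 64               # assumed word size
MAX_GIGAWORDS = 2            # largest common memory offered


def clock_frequency_mhz(period_ns):
    """Convert a clock period in nanoseconds to a frequency in MHz."""
    return 1000 / period_ns


def memory_bytes(gigawords, words_per_gigaword):
    """Size in bytes of a memory holding `gigawords` gigawords."""
    return gigawords * words_per_gigaword * WORD_BITS // 8


frequency_mhz = clock_frequency_mhz(CLOCK_PERIOD_NS)
binary_bytes = memory_bytes(MAX_GIGAWORDS, 2**30)
decimal_bytes = memory_bytes(MAX_GIGAWORDS, 10**9)

print(f"clock frequency: {frequency_mhz:.0f} MHz")
print(f"memory, binary gigawords: {binary_bytes / 2**30:.0f} GiB")
print(f"memory, decimal gigawords: {decimal_bytes / 10**9:.0f} GB")
```

Under either reading the common memory comes to roughly 16 gigabytes of data, not counting error correction bits.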

CRAY-3 Design

The CRAY-3 achieves its high-performance processing capabilities with the use of GaAs logic circuitry, efficient packaging, liquid immersion cooling, multiple processors and very large common memory. These hardware technologies are then complemented and maximized by the elegant architecture and functional design of the CRAY-3.

Background Processors

The parallelism in the CRAY-3 extends beyond the multiprocessor features of the system. Each of the background processors consists of three sections: the computation section, the control section and high-speed local memory. A broad mixture of scalar and vector arithmetic and logical operations can take place at the same time in the computation section. Instructions can issue every clock period. Computation instructions execute register-to-register to allow them to operate at the maximum rate possible. The control section supports the parallel operation of the multiple functional units in the computation section. The high-speed local memory is used to temporarily store scalar and vector data during computations. The peak performance of each CRAY-3 background processor is one gigaflop.

The computation section of the background processors contains registers and functional units associated with address, scalar and vector processing. Two integer arithmetic functional units are employed in address processing. Three functional units are dedicated solely to scalar processing, and two floating-point functional units are shared with vector operations. Two additional functional units are dedicated to vector operations, allowing CRAY-3 systems to issue one result per clock period in vector mode.

Features of the Computation Section

- Twos complement integer and signed magnitude floating-point arithmetic
- Address and arithmetic registers
  - Eight 32-bit address (A) registers
  - Eight 64-bit scalar (S) registers
  - Eight 64-element vector (V) registers with 64 bits per element
- Address functional units
  - Add/subtract
  - Multiply
- Scalar functional units
  - Logical
  - Shift
  - Integer (add/subtract, population/parity, leading zero count)
- Vector functional units
  - Logical
  - Shift
  - Integer (add/subtract, population/parity, leading zero count, compressed iota)
- Floating-point functional units
  - Add/subtract
  - Multiply/reciprocal/square root
- Scatter and gather vector operations to and from common memory
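Several of the operations named in the feature list are easy to state precisely in software. The Python sketch below models population count, parity and leading zero count on a 64-bit word, plus gather and scatter on a plain list standing in for common memory. It is an illustration of what the operations compute, not a description of how the CRAY-3 hardware implements them, and every function name is hypothetical.

```python
# Software model of operations named in the feature list above.
# Illustrative only: the hardware operates on 64-bit registers directly.

WORD_BITS = 64
WORD_MASK = (1 << WORD_BITS) - 1


def population_count(word):
    """Number of one bits in a 64-bit word."""
    return bin(word & WORD_MASK).count("1")


def parity(word):
    """1 if the word holds an odd number of one bits, else 0."""
    return population_count(word) & 1


def leading_zero_count(word):
    """Number of zero bits above the most significant one bit."""
    return WORD_BITS - (word & WORD_MASK).bit_length()


def gather(memory, indices):
    """Gather: load memory[i] for each index i into a vector."""
    return [memory[i] for i in indices]


def scatter(memory, indices, vector):
    """Scatter: store vector[k] at memory[indices[k]]."""
    for index, value in zip(indices, vector):
        memory[index] = value


word = 0x00F0
print(population_count(word), parity(word), leading_zero_count(word))

memory = [10, 20, 30, 40, 50]
picked = gather(memory, [4, 0, 2])
scatter(memory, [1, 3], [-1, -2])
print(picked, memory)
```

Gather and scatter matter for vector machines because they let a vector register be filled from, or written to, memory locations that are not evenly spaced.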

[Figure: The Background Processors. Legible diagram labels: Eight Octants of Common Memory (Up to Two Gigawords); Background Processor 0 (Typical), 1 of up to 16; Vector Registers; Address Functional Units; Channel Loop 0, 1 of 4; 8-Bit Channel to Console.]

Each background processor employs an identical, independent control section of registers and instruction buffers for instruction issue and control. Each control section has memory base and limit registers for program relocation and protection. The instruction buffers (eight segments with 16 words per segment) contain the instructions to be executed. A 64-bit real-time clock is synchronized with the foreground processor's 32-bit real-time clock at system start-up. Each clock is advanced by one count each clock period. Semaphore flags are used to support application of multiple processors to a single program. These flags are one-bit registers which provide interlocks for common access to shared memory fields. A background processor is assigned access to one semaphore flag by a field in the status register. The background processor has instructions to test, branch, set and clear a semaphore flag.

In addition to the registers and functional units, each background processor incorporates 16,384 64-bit words of high-speed local memory which is assigned by the compilers and used as fast scratch memory during computations. This design minimizes the number of common memory access calls required and improves overall performance of the system.
Local memory accesses take four clock periods and can overlap accesses to common memory.

Features of the Local Memory Section

- 16,384 64-bit words
- Access time of four clock periods
- Used as fast scratch memory during computations
- Register accesses can overlap common memory accesses
- Provides temporary storage of vector segments
- Reduces the number of common memory access calls required, improving overall performance

Features of the Control Section

- 136 basic instruction codes
- Eight instruction buffers
- 32-bit Program Address register
- 32-bit Base Address register
- 32-bit Limit Address register
- 32-bit Status register
- 64-bit Real-Time Clock
- Multiple semaphore flags which provide interlocks for multi-tasking

Common Memory

The CRAY-3 has the largest directly addressable high-speed memory available in a general purpose scientific computer (up to two gigawords). This vast memory resource is directly addressable by an application program. Such a large common memory allows the individual user to run programs that would be impractical to run on any other computer system.

Common memory is arranged in octants of 64 banks each, providing up to 512 interleaved banks for an eight-octant machine. Each word consists of 64 data bits and eight error correction bits. A memory bank can utilize 16 background processor memory access ports. High-speed, silicon SRAM CMOS chips are used in all versions of the CRAY-3. Total memory bandwidth is 128 gigabytes per second, with a peak burst transfer rate of one gigaword per second per processor using two ports to memory, for a total peak burst rate of 16 gigawords per second.

Features of Common Memory

- Up to two gigawords available
- 72-bit words (64 data bits, eight correction bits)
- Up to 512 memory banks
- Each memory bank can utilize up to 16 background processor memory access ports
- Bidirectional memory access ports
- Bandwidth of 128 gigabytes per second
- High-speed SRAM CMOS memory technology used on all machines

[Figure caption: Memory assignments for each of the 16 processors.]
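The bank count and bandwidth figures quoted for common memory follow from a few multiplications, sketched below in Python. Two things in the sketch are assumptions, not statements from the brochure: that each port moves one word per clock period, and that consecutive word addresses fall in consecutive banks (low-order interleaving). The functions `bank_of` and `octant_of` are hypothetical and exist only to show what interleaving means.

```python
# Sketch of bank interleaving and the peak-bandwidth arithmetic.
# Assumptions: one word per port per clock period, and low-order
# interleaving of word addresses across banks. Neither is confirmed
# by the brochure; the real address mapping is not given there.

OCTANTS = 8
BANKS_PER_OCTANT = 64
TOTAL_BANKS = OCTANTS * BANKS_PER_OCTANT      # 512 for a full machine

PROCESSORS = 16
PORTS_PER_PROCESSOR = 2
CLOCK_PERIOD_NS = 2
DATA_BYTES_PER_WORD = 8                        # 64 data bits


def bank_of(word_address):
    """Hypothetical mapping from a word address to a bank index."""
    return word_address % TOTAL_BANKS


def octant_of(word_address):
    """Hypothetical octant holding the bank selected by bank_of()."""
    return bank_of(word_address) // BANKS_PER_OCTANT


# One word per clock period per port, expressed in gigawords per second.
gigawords_per_port = 1 / CLOCK_PERIOD_NS
gigawords_per_processor = PORTS_PER_PROCESSOR * gigawords_per_port
gigawords_total = PROCESSORS * gigawords_per_processor
gigabytes_total = gigawords_total * DATA_BYTES_PER_WORD

print(TOTAL_BANKS, gigawords_per_processor, gigawords_total, gigabytes_total)
print([bank_of(address) for address in (0, 1, 511, 512)], octant_of(130))
```

With these assumptions the arithmetic reproduces the brochure's one gigaword per second per processor, 16 gigawords per second in total, and 128 gigabytes per second. Interleaving is what makes such rates plausible: successive accesses land in different banks, so one bank's cycle time does not stall the next access.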

Foreground System

The foreground processing system provides input, output and overall system management. The foreground system includes a 32-bit, two nanosecond CPU, with its own registers and memory. It operates in parallel with the background processors.

The foreground system monitors the system components via the foreground communication channels. It handles all I/O interrupts and transfers, freeing the background processors to asynchronously perform the computations associated with the user applications.

System communication occurs through four high-speed synchronous data channels (one gigabyte per second each). These channels interconnect the background processors, foreground processors, disk control units and host interfaces. Each foreground communication channel connects to four background processors and one group of I/O controllers.

The majority of foreground processor activity involves data transfer between common memory and external devices. The system provides a mixture of 12 megabytes per second low-speed interfaces and 100 megabytes per second High Performance Parallel Interfaces (HIPPI) to accommodate the data transfer needs of the user.

[Figure: foreground system diagram. Legible labels: Immediate Address; Channel Loop (1 of 4); Function Pulse; Response Pulse; Disk Interface.]
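The channel layout described above (four channels, each serving four background processors) can be written out as a small table. The Python sketch below does so under one assumption that the brochure does not confirm: processors are numbered 0 to 15 and grouped consecutively, four per channel loop. The function `channel_for` is hypothetical.

```python
# Sketch of the foreground channel layout described above.
# Assumption: processors are numbered 0-15 and grouped consecutively,
# four per channel loop. The brochure does not give the numbering.

CHANNELS = 4
PROCESSORS_PER_CHANNEL = 4
CHANNEL_RATE_GB_PER_S = 1


def channel_for(processor):
    """Hypothetical channel loop serving a given background processor."""
    return processor // PROCESSORS_PER_CHANNEL


layout = {}
for processor in range(CHANNELS * PROCESSORS_PER_CHANNEL):
    layout.setdefault(channel_for(processor), []).append(processor)

aggregate_rate = CHANNELS * CHANNEL_RATE_GB_PER_S

for channel, processors in layout.items():
    print(f"channel loop {channel}: background processors {processors}")
print(f"aggregate channel rate: {aggregate_rate} GB/s")
```

Four loops of four processors account for all 16 background processors of a full system, with four gigabytes per second of channel capacity in aggregate.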

The Foreground System, I/O and Disk Subsystems

External I/O Interfaces

A fully-configured CRAY-3 system can have up to 15 interface modules. Control circuitry in the interface modules provides for the use of low-speed devices (six megabytes per second), high-speed devices (12 megabytes per second) and HIPPI channels (100 megabytes per second). One interface module can provide all three types of data transfer modes, or single modules can be dedicated to several of one kind. A single HIPPI interface module can provide dynamic switching (software controlled) between four 32-bit channel pairs, or two 64-bit channel pairs, or two 32-bit and one 64-bit channel pairs.

Disk Subsystems

The CRAY-3 system supports Redundant Arrays of Inexpensive Disks (RAID) for high volume and very fast sequential transfer rates. RAID units are connected via high-speed, IEEE standard, HIPPI channels with 32-bit and 64-bit HIPPI controllers for both source and destination. The HIPPI channels use a 40 nanosecond clock for a burst bandwidth of 100 megabytes per second.
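The HIPPI figures above fit together arithmetically, as the Python sketch below shows. It assumes one full-width transfer per 40-nanosecond clock. It also assumes that an interface module offers 128 bits of total channel width. That number is not in the brochure; it is inferred from the three switching options listed, each of which adds up to 128 bits.

```python
# Sketch of the HIPPI burst-rate arithmetic and the channel-pair options.
# Assumptions: one full-width transfer per 40 ns clock, and 128 bits of
# channel width per interface module (inferred, not stated in the brochure).

HIPPI_CLOCK_NS = 40
MODULE_WIDTH_BITS = 128


def burst_rate_mb_per_s(width_bits, clock_ns=HIPPI_CLOCK_NS):
    """Burst rate in megabytes per second for a channel of given width."""
    bytes_per_transfer = width_bits // 8
    transfers_per_second = 1_000_000_000 // clock_ns
    return bytes_per_transfer * transfers_per_second // 1_000_000


def channel_pair_options(total_bits=MODULE_WIDTH_BITS):
    """Mixes of 32-bit and 64-bit channel pairs that use the full width.

    Each entry is (number of 32-bit pairs, number of 64-bit pairs).
    """
    options = []
    for wide_pairs in range(total_bits // 64 + 1):
        remaining_bits = total_bits - 64 * wide_pairs
        options.append((remaining_bits // 32, wide_pairs))
    return options


print(burst_rate_mb_per_s(32), burst_rate_mb_per_s(64))
print(channel_pair_options())
```

A 32-bit channel clocked every 40 nanoseconds gives the quoted 100 megabytes per second, and the enumeration yields exactly the three configurations the brochure lists.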

CRAY-3 Technology

The technological innovations in the CRAY-3 include gallium arsenide (GaAs) digital logic; 69-layer, three-dimensional modules; and direct contact liquid immersion cooling using a clear, odorless, inert fluorocarbon.

GaAs Logic

The CRAY-3 is the first supercomputer to use gallium arsenide integrated circuits for all of its logic circuitry. The use of GaAs die was a key factor in enabling the CRAY-3 to achieve the fastest clock cycle time of any computer system currently in existence.

Component Packaging

The laws of physics demand that very high speed electronic circuits must have short path lengths. With a clock speed of 500 MHz, the CRAY-3 required greater creativity and efficiency of packaging than had ever before been attempted.

The CRAY-3 logic and memory circuitry is packaged in up to 336 removable modules, each containing up to 1,024 GaAs integrated circuit die. Total integrated circuit population in a 16-processor CRAY-3 is over 142,000 die, of which 36,864 are for common memory. This packaging results in a GaAs gate density of approximately 96,000 gates per cubic inch.

The modules are three-dimensional structures measuring 121 mm by 107 mm by 7 mm (about four inches square by one-quarter of an inch thick). Nine printed circuit boards make up the module sandwich and contain a total of 69 electrical layers. Circuit connections are made in all three dimensions within the module. X-y traces are as small as 0.048 mm (a human hair averages 0.070 mm). Z-axis connections are made with approximately 14,000 gold-plated, beryllium-copper twist pin jumpers per module. The logic signal jumpers, which make up the bulk of the z-axis connections, are only 0.122 mm in diameter.

[Figure caption: GaAs die used in the CRAY-3 are only 3.835 mm square. They are mounted unpackaged to the printed circuit boards using 0.076 mm gold leads ultrasonically welded to 52 die bonding pads.]

Cooling

The CRAY-3 is cooled by direct contact immersion in an inert liquid fluorocarbon, technology similar to that employed in the CRAY-2. However, the high-power density of the CRAY-3 (up to 640 watts per cubic inch) required further development of this cooling mechanism to provide adequate heat removal.

A key factor in the design of the cooling mechanism for the CRAY-3 was the narrow channels between the module layers through which the cooling fluid must flow. The channels are a maximum of 300 microns from the back surface of a die to the adjacent printed circuit board.

The system tank is carefully engineered to ensure that all of the coolant flows between the module layers, rather than between modules, where it comes in direct contact with the die and gold jumpers for maximum heat transfer. Operating temperatures throughout the computer circuits average 30° Celsius with a total temperature rise of the coolant across the cooling loop of only 5° Celsius. This allows for low thermal shock and hence long-term reliability of the modules.

[Figure caption: The CRAY-3 modules are continually bathed in a precisely controlled flow of a clear, inert fluorocarbon. The presence of this liquid can only be visually detected by an occasional bubble, or when the tank is being filled.]

[Figure caption: The CRAY-3 modules are a multi-layer sandwich of printed circuit boards containing 69 electrical layers and four layers of GaAs die in a vertical space of only one-quarter of an inch.]
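The packaging figures quoted under Component Packaging can be put on a common footing with simple unit conversions. The Python sketch below uses only numbers from the brochure and derives the clock period, the module dimensions and volume in inches, the ratio of trace width to hair width, and the die capacity if every module were fully populated. The last value is an upper bound implied by the two "up to" figures, not a count the brochure states.

```python
# Cross-check of the packaging figures quoted in this section.
# Only unit conversions and ratios of the brochure's own numbers.

MM_PER_INCH = 25.4

CLOCK_MHZ = 500
MODULE_MM = (121, 107, 7)          # module dimensions
TRACE_MM = 0.048                   # smallest x-y trace
HAIR_MM = 0.070                    # brochure's figure for a human hair
MAX_MODULES = 336
MAX_DIE_PER_MODULE = 1024

clock_period_ns = 1000 / CLOCK_MHZ
module_inches = [side / MM_PER_INCH for side in MODULE_MM]
module_volume_cubic_inches = (
    module_inches[0] * module_inches[1] * module_inches[2]
)
trace_to_hair_ratio = TRACE_MM / HAIR_MM
die_capacity = MAX_MODULES * MAX_DIE_PER_MODULE   # upper bound only

print(f"clock period: {clock_period_ns:.0f} ns")
print("module (inches):", [round(side, 2) for side in module_inches])
print(f"module volume: {module_volume_cubic_inches:.2f} cubic inches")
print(f"trace width / hair width: {trace_to_hair_ratio:.2f}")
print(f"die capacity if every module were full: {die_capacity}")
```

The 500 MHz clock corresponds to the two-nanosecond cycle time quoted elsewhere in the brochure, and the finest traces come out at roughly two-thirds the width of the quoted human hair.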

CRAY-3 System Configurations

The modular design of the CRAY-3 cabinet allows considerable flexibility in configuring systems to customers' needs. This design also allows CRAY-3 systems to be upgraded in the field. The accompanying table lists some of the available configurations.

Actual configurations of CPUs and memory can be specified by the customer. This allows customers to tailor the machine to their budget and problem-solving requirements.

I/O configurations will also be determined by the needs of each specific customer. Each I/O module can accommodate multiple channels and speeds. HIPPI interfaces are available for network and RAID disk connections. In addition, for customers with installed Cray Research, Inc. systems, interfaces are available to support the following network connections and disk systems: Computer Network Technology equipment, Network Systems Corporation equipment, and Cray Research, Inc. DD-49 and DD-40 disk modules.

[Figure caption: CRAY-3 Four-Processor system cabinet.]

Our mission is to design, manufacture, sell and support high-performance, general purpose scientific computers.

CRAY-3 Applications Environment

Top-end supercomputers are used to solve scientific problems that are not computationally tractable on less powerful machines. But to do this the machines and their users need the applications software and support which can effectively utilize the inherent power and performance of the machine.

Application Environment

Scientists developing codes in such diverse areas as energy research, image processing, seismic modeling, computational chemistry and structural analysis require tools that enable them to apply all the computing power of the CRAY-3 to their particular problems. Cray Computer Corporation is committed to providing the application environment they need.

Users connect to the CRAY-3 by using standard protocols. Familiar UNIX editors are used to create or modify their programs. Powerful compiler and multiprocessing tools are used to prepare executable programs that extract all the power of their CRAY-3. Finally, program output can be displayed using many connectivity options. When debugging is required a rich visual debugger speeds the task.

Applications

User Fortran and C applications that meet existing ANSI standards will easily convert to the CRAY-3. In most cases users will only need to move the source codes to the CRAY-3, recompile and execute code.

The large memory of the CRAY-3 (up to two gigawords) gives users the opportunity to attack demanding problems much more efficiently than on a machine with a small memory.

Scaling up a prototype solution to provide a production program need not involve elaborate programming to do work-arounds because of limited memory. With the large memory of the CRAY-3 the user will most likely need to simply increase the parameters or dimensions of his code, proceed directly to debugging and run the production program. The production program remains flexible and is easily modified since there is no complex I/O scheme obscuring the simplicity of the original prototype program.
Similarly, production programs and third-party codes can be easily ported without massive rewriting, speeding the entire process and increasing productivity.

CRAY-3 software gives users the tools needed to prepare and run an application. Cray Computer Corporation provides the performance tools to enable aggressive users to capture the top performance potential in their applications. Flow-tracing tools locate the most time-consuming areas of a code. These tools also help the user analyze the multiprocessing performance and data flow. Accounting tools are also available to profile the system resources required by an application. With these tools the user can locate the high-leverage opportunities for optimization, identify the kind of change required and then go on to increase vectorization, parallelism or I/O effectiveness. With a more effectively optimized code the user gets improved turnaround: the scientific job gets done more quickly.

Third-Party Applications

CRAY-3 software, with the standard UNIX environment and language support, provides an atmosphere that facilitates moving third-party applications from other supercomputer systems, particularly from the CRAY-2. The CRAY-3 is a natural extension of the CRAY-2 architecture, building upon the library of instructions previously available.

A large repertoire of highly portable third-party applications is available for supercomputers and includes codes for reactor safety, computational chemistry, structural analysis, computational fluid dynamics and much more. Third-party investments that made their codes available on the CRAY-2 will provide a simplified path for bringing codes to the CRAY-3. As the CRAY-3 product matures, Cray Computer Corporation will further respond to customer application needs by working with the vendors of third-party codes.

[Figure caption: A CRAY-3 Four-Processor system solving large-scale climatology models at the National Center for Atmospheric Research.]

[Figure caption: The mixing ratio in parts per trillion at 10 millibars (mb) of the radicals OH and HO2 at 0000 UTC over the southern hemisphere, as solved by a global tropospheric chemical tracer model on the CRAY-3 at the National Center for Atmospheric Research.]

CRAY-3 System Software

The operating system for the CRAY-3 is based on UNIX. The familiar UNIX environment gives users a head start as they attack their specific problems. The CRAY-3 provides an enhanced environment to support the diverse demands of top-end users: from high-performance I/O, when the user must process a great deal of data efficiently, to a standard batch environment to meet the needs of production codes; from job recovery to ensure the completion of long-running codes, to multiprocessing tools to give the user access to all the power of the CRAY-3.

The system software package for the CRAY-3 also includes effective system administration and tunable scheduling tools.

The Fortran compiler complies with the ANSI 77 standards and has extensions that include compatibility with Cray Research, Inc. extensions, as well as parts of the Fortran 90 vector syntax. The compiling system automatically detects opportunities for parallel execution and generates code to take advantage of them.

The C compiler complies with the ANSI standard for the C language. It has extensions for vectorization and to allow use of multiple processors on a single application.

Network support is provided via TCP/IP, NFS and other standard protocols.

CRAY-3 Physical Specifications

The CRAY-3 system cabinet is relatively small considering the amount of computing power packed inside. The cabinet for one-, two- and four-processor machines is 106.68 cm (42 inches) wide and 71.12 cm (28 inches) deep. The cabinet for eight- and 16-processor machines is a 106.68 cm (42 inch) octagon. All system cabinets extend 127 cm (50 inches) above the computer room raised floor. The cabinets are elegant in appearance with charcoal gray, matte-textured, gold-trimmed skins, and bronzed acrylic, see-through tops covering the modules.

The system control pod (C-Pod) is 133.35 cm (52.5 inches) square and extends 140.46 cm (55.3 inches) above the computer room raised floor. Frequently used controls are hidden behind a bronzed acrylic panel. The entire top lid of the C-Pod can be raised electrically, giving access to further controls as well as the C-Pod electronics and system cooling components. All electrical and cooling system connections between the C-Pod and the system cabinet are hidden beneath the computer room raised floor. The C-Pod must be within eight to 15 feet of the system cabinet.
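For site planning it can help to confirm that the paired metric and imperial figures above agree. The Python sketch below checks each inch and centimeter pair from the brochure and converts the C-Pod distance limits to meters. The 0.01 cm tolerance and the function names are choices made for this sketch only.

```python
# Check of the inch/centimeter pairs quoted in the physical specifications.

CM_PER_INCH = 2.54

# (description, inches, centimeters as printed in the brochure)
QUOTED = [
    ("system cabinet width", 42.0, 106.68),
    ("system cabinet depth", 28.0, 71.12),
    ("system cabinet height", 50.0, 127.0),
    ("C-Pod side", 52.5, 133.35),
    ("C-Pod height", 55.3, 140.46),
]


def inches_to_cm(inches):
    """Convert inches to centimeters."""
    return inches * CM_PER_INCH


def mismatches(rows, tolerance_cm=0.01):
    """Names of rows whose printed centimeter value is off by more
    than the tolerance."""
    return [name for name, inches, cm in rows
            if abs(inches_to_cm(inches) - cm) > tolerance_cm]


inconsistent = mismatches(QUOTED)
distance_m = [round(feet * 12 * CM_PER_INCH / 100, 2) for feet in (8, 15)]

print("inconsistent pairs:", inconsistent)
print("C-Pod distance limits in meters:", distance_m)
```

All five pairs agree to within the tolerance, and the eight to 15 foot limit corresponds to roughly 2.4 to 4.6 meters.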

CRAY-3 Support and Maintenance

Few things we buy today are worth more than the support and service received after the sale. This becomes even more important with greater initial investment and product sophistication. Cray Computer Corporation is committed to offering the support our customers would expect when purchasing one of the most powerful supercomputer systems available today.

Hardware Support

Our support begins with comprehensive pre-installation site planning. After installation a field support team takes the responsibility for maintaining proper operation of the machine. This support team will ensure correct operation through the use of on-line and off-line diagnostic tests. Preventive maintenance procedures are followed judiciously to maintain a high level of system performance. Any repairs that may become necessary are expedited by using a set of spare modules located right at the site, keeping downtime to a minimum.

Software Support

Cray Computer Corporation also provides on-site software support to assist customers in obtaining the maximum utilization of their CRAY-3. This software support team assists the customer in installing, debugging and tuning Cray Computer Corporation software products. They will also consult with the customer's application programmers in debugging, porting and optimizing customer applications. An on-line software problem data base is also maintained to help in the timely resolution of software problems.

Continuing Engineering

Another important aspect of support for a sophisticated machine like the CRAY-3 is that of continuing engineering. The continuing engineering group at Cray Computer Corporation provides the necessary technical engineering support for customizing each CRAY-3 to the specific requirements of individual customers. This includes ensuring that the machine will properly interface with the customer's workstation consoles, networks and data storage systems. The continuing engineering group also provides customers with product improvements and design enhancements over the life of the machine.

Cray Computer Corporation
1110 Bayfield Drive
Colorado Springs, CO 80906
