Tsubame 2.0 Experiences-Petascale Computing With GPUs

Transcription

Tsubame 2.0 Experiences - Petascale Computing with GPUs
Satoshi Matsuoka, Tokyo Institute of Technology
GTC 2011, Beijing, China, 2011/12/14

GPUs as Modern-Day Vector Engines

(Chart: computational density vs. memory access frequency, placing traditional accelerators such as Grape for N-body and ClearSpeed for MatMul, vector SCs such as Cray and NEC SX for CFD, cache-based scalar CPUs, and GPUs.)

Two types of HPC workloads:
- High computational density: traditionally served by accelerators
- High memory access frequency: traditionally served by vector supercomputers
- Scalar CPUs are so-so at both, requiring a massive system for speedup
- GPUs are both modern-day vector engines and high-compute-density accelerators: an efficient element of next-generation supercomputers

Remaining issues (our research at Tokyo Tech): small memory, limited CPU-GPU bandwidth, high-overhead communication with CPUs and other GPUs, and lack of system-level SW support including fault tolerance, programming, and FFT.

GPU vs. CPU Performance

(Roofline model plot, after Williams and Patterson 2008, Communications of the ACM: Performance [GFlops] vs. arithmetic intensity [Flop/Byte] for a GeForce GTX 285 (708 GFlops peak) and a Core i7 Extreme. The memory-bound and compute-bound kernels plotted include 3-D diffusion, LBM (RIKEN BMT), FFT (Nukada), SGEMM, N-body, and MD (N log N), annotated with GPU-over-CPU speedups of roughly x5-10 for memory-bound kernels and x10-20 for compute-bound ones.)
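For reference, the roofline bound behind this plot can be written as below; the ~160 GB/s memory bandwidth used for the ridge-point estimate is an assumption for illustration, not a figure from the slide.

\[ P(I) = \min\bigl(P_{\text{peak}},\; B_{\text{mem}} \cdot I\bigr), \qquad I_{\text{ridge}} = \frac{P_{\text{peak}}}{B_{\text{mem}}} \approx \frac{708\ \text{GFlops}}{160\ \text{GB/s}} \approx 4.4\ \text{Flop/Byte} \]

Kernels with arithmetic intensity below the ridge point are memory-bound; those above it are compute-bound.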

TSUBAME 2.0, Nov. 1, 2011: "The Greenest Production Supercomputer in the World" (new development)

Highlights of the TSUBAME 2.0 Design (Oct. 2010, w/ NEC-HP)

- 2.4 PF: next-gen multi-core x86 + next-gen GPGPU
  - 1432 nodes, Intel Westmere / Nehalem-EX
  - 4224 NVIDIA Tesla (Fermi) M2050 GPUs
  - ~100,000 total CPU and GPU "cores"; ~1.9 million "CUDA cores"; ~130 million CUDA threads (32K x ~4K)(!)
- High bandwidth
  - 0.72 Petabyte/s aggregate memory BW, effective 0.3-0.5 Bytes/Flop, restrained memory capacity (~100 TB)
  - Optical dual-rail IB-QDR, full bisection BW (fat tree), ~200 Tbits/s; likely the fastest in the world, still scalable
  - Flash per node, 200 TB total (1 PB in future), 660 GB/s I/O BW
  - 7 PB IB-attached HDDs, ~15 PB total HFS incl. LTO tape
- Low power & efficient cooling, comparable to TSUBAME 1.0 (~1 MW); PUE 1.28 (60% better c.f. TSUBAME 1)
- Virtualization and dynamic provisioning of Windows HPC and Linux, job migration, etc.

TSUBAME 2.0 System Overview (2.4 PFlops / ~15 PB)

Petascale storage, total 7.13 PB (Lustre + accelerated NFS home):
- Lustre partition, 5.93 PB (5 filesystems): MDS/OSS on HP DL360 G6 x30 nodes (OSS x20, MDS x10); storage DDN SFA10000 x5 (10 enclosures each); OST 5.9 PB, MDT 30 TB
- Home (NFS/CIFS/iSCSI): storage servers HP DL380 G6 x4, BlueArc Mercury 100 x2; storage DDN SFA10000 x1 (10 enclosures)
- Tape system: Sun SL8500

Node interconnect: optical, full-bisection, non-blocking, dual-rail QDR InfiniBand
- Core switches: Voltaire Grid Director 4700 (324-port IB QDR) x12
- Edge switches: Voltaire Grid Director 4036 (36-port IB QDR) x179
- Edge switches w/ 10GbE: Voltaire Grid Director 4036E (34-port IB QDR + 2-port 10GbE) x6
- Management servers

Compute nodes: 2.4 PFlops (CPU + GPU), 224.69 TFlops (CPU only); 4224 NVIDIA "Fermi" GPUs; memory total 80.55 TB; SSD total 173.88 TB
- "Thin" nodes x1408 (32 nodes x 44 racks): new design, Hewlett-Packard CPU+GPU high-BW compute node; 2x Intel Westmere-EP 2.93 GHz (turbo 3.196 GHz), 12 cores/node; memory 55.8 GB (52 GiB) or 103 GB (96 GiB); 3x NVIDIA M2050 GPU (515 GFlops each) on PCI-E gen2 x16 x2 slots/node; SSD 60 GB x2 (120 GB) on 55.8 GB nodes, 120 GB x2 (240 GB) on 103 GB nodes; OS: SUSE Linux Enterprise + Windows HPC
- "Medium" nodes x24 (6.14 TFlops): HP 4-socket server, Nehalem-EX 2.0 GHz, 32 cores/node; memory 137 GB (128 GiB); SSD 120 GB x4 (480 GB); SUSE Linux Enterprise
- "Fat" nodes x10 (2.56 TFlops): HP 4-socket server, Nehalem-EX 2.0 GHz, 32 cores/node; memory 274 GB (256 GiB) on 8 nodes, 549 GB (512 GiB) on 2 nodes; SSD 120 GB x4 (480 GB); SUSE Linux Enterprise
- Plus existing GSIC NVIDIA Tesla S1070 GPUs (34 units)

Tsubame 2.0 (2010-14): x30 speedup c.f. Tsubame 1 (2006-2010)
2.4 Petaflops, 1408 nodes, 50 compute racks + 6 switch racks
Two rooms, total 160 m2; 1.4 MW (max, Linpack), 0.48 MW (idle)

TSUBAME 2.0 Storage

Multi-petabyte storage consisting of a Lustre parallel filesystem partition and an NFS/CIFS/iSCSI home partition, plus node SSD acceleration. Totals: 7.13 PB HDD + 200 TB SSD + 8 PB tape.

Lustre parallel filesystem partition, 5.93 PB:
- MDS: HP DL360 G6 x10 (2x Intel Westmere-EP sockets, 12 cores; 51 GB (48 GiB) memory; IB 4X QDR PCI-e G2 HCA, 2 ports)
- OSS: HP DL360 G6 x20 (2x Intel Westmere-EP sockets, 12 cores; 25 GB (24 GiB) memory; IB 4X QDR PCI-e G2 HCA, 2 ports)
- Storage: DDN SFA10000 x5, total capacity 5.93 PB (2 TB SATA x 2950 disks + 600 GB SAS x 50 disks); each SFA10K unit holds 600 disks serving OSTs and MDTs

Home partition, 1.2 PB:
- NFS/CIFS: HP DL380 G6 x4 (2x Intel Westmere-EP sockets, 12 cores; 51 GB (48 GiB) memory; IB 4X QDR PCI-e G2 HCA, 1 port)
- NFS/CIFS/iSCSI: BlueArc Mercury 100 x2 (10GbE x2)
- Storage: DDN SFA10000 x1 (2 TB SATA x 600 disks) plus SFA6620 (100 disks), total capacity 1.2 PB

- 200 TB SSD (SLC; 1 PB MLC in the future)
- 7.1 PB HDD (highly redundant)
- 4-8 PB tape (HFS backup)
- All InfiniBand / 10GbE connected
- Filesystems and services: Lustre, GPFS (GridScaler), NFS/CIFS/WebDAV (BlueArc), Tivoli HFS

TSUBAME 2.0 Compute Nodes (total 2.4 PFlops; memory ~100 TB; SSD ~200 TB)

Thin nodes: 1408x HP SL390G7 (developed for TSUBAME 2.0), total 2391.35 TFlops
- GPU: 3x NVIDIA Fermi M2050 per node (515 GFlops, 3 GB memory each) on PCI-e Gen2 x16 x2; 4224 GPUs = 59,136 SIMD vector cores, 2175.36 TFlops (double precision)
- CPU: 2x Intel Westmere-EP 2.93 GHz (12 cores/node); 2816 CPUs = 16,896 scalar cores, 215.99 TFlops
- Memory: 54 or 96 GB DDR3-1333; SSD: 60 GB x2 or 120 GB x2
- Interconnect: InfiniBand QDR x2 (80 Gbps)

Medium/Fat nodes: 34x HP 4-socket servers, 8.7 TFlops
- CPU: 4x Intel Nehalem-EX 2.0 GHz (32 cores/node)
- Memory: 128, 256, or 512 GB DDR3-1066; SSD: 120 GB x4 (480 GB/node)
- Totals: memory 6.0 TB; SSD 16 TB

Plus existing NVIDIA Tesla S1070 GPUs.
System totals: memory 80.6 TB (CPU) + 12.7 TB (GPU); SSD 173.9 TB.
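The headline peak numbers follow directly from the per-node figures:

\[ 4224 \times 515\ \text{GFlops} = 2175.36\ \text{TFlops (GPU)}, \qquad 2175.36 + 215.99 \approx 2391.35\ \text{TFlops} \approx 2.4\ \text{PFlops} \]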

SL390 Compute Node (collaborative development w/ HP)
3 GPUs, 2 CPUs, 50-100 GB memory, 120-240 GB SSD, QDR-IB x2

3500 fiber cables (~100 km) w/ DFB silicon photonics
End-to-end 6.5 GB/s, ~2 us; non-blocking, 200 Tbps bisection

TSUBAME 2.0: Full-Bisection Fat Tree, Optical, Dual-Rail QDR InfiniBand

(Overall TSUBAME 2.0 network diagram: full bisection, non-blocking, optical fiber connections.)
- Core: Voltaire Grid Director 4700 x12, forming two rails (first rail / second rail), with uplinks to SuperTitanet and SuperSINET3
- Edge (storage/management side): Voltaire Grid Director 4036E x6 + Grid Director 4036 x1, plus 10Gb Ethernet x2 / x10
- Thin nodes x1408: 44 racks, 2 edge switches per rack, 16 nodes per edge switch
- Medium nodes x24 and Fat nodes x10 on their own edge switches
- Lustre filesystem: OSS x20 and MDS x10 over 5 SFA10K units; home/iSCSI side: 4 storage servers + 2 BlueArc Mercury servers over 1 SFA10K; 8 PB Sun SL8500 tape; management nodes x51 on IB and Ethernet
- Pre-existing system provided by the university also attached

Tsubame 2.0 Efficient Cooling Infrastructure: HP's water-cooled rack

- Completely closed racks with their own heat exchanger (1.5x the width of a normal rack, rear extension); cooling for high-density deployments
- 35 kW of cooling capacity per rack, the highest rack heat density ever (one rack ~50 TF, comparable to the entire Earth Simulator)
- 3000 CFM intake airflow with 7C chiller water; up to 2000 lbs of IT equipment
- Uniform airflow across the front of the servers; polycarbonate front door
- Automatic door-opening mechanism controlling both racks; adjustable temperature set point
- Removes 95% to 97% of the heat inside the racks

TSUBAME 2.0 Software Stack (red: R&D at Tokyo Tech)

- GPU-enabled OSS and ISV SW: Amber, Gaussian (2011), BLAST, GHOSTM, MATLAB, Mathematica
- Programming environment (GPU): CUDA 4.0, OpenCL, PGI C/Fortran, (CAPS C/Fortran), (YYYY C/Fortran), Physis
- Grid middleware: NAREGI, Globus, Gfarm2
- Resource scheduler and fault tolerance: PBS Professional (w/ GPU extensions), Windows HPC Server
- GPU libraries: CUDA libraries, CULA, NUFFT
- Message passing: OpenMPI, MVAPICH2 w/ GPUDirect
- x86 compilers: PGI, Intel; TotalView debugger (GPU/CPU)
- Filesystems: Lustre, GPFS, Gfarm2, NFS, CIFS, iSCSI
- Fault tolerance: BLCR, NVCR, FTI checkpointing
- System management: user management, accounting, data backup, autonomic operation, power management, system monitoring
- Operating systems / virtual machine: SUSE Linux Enterprise Server, Windows HPC Server, KVM
- Drivers: Voltaire OFED/InfiniBand, CUDA 4.0 driver
- Server and storage platform: HP ProLiant SL390s G7, DL580 G7, NVIDIA Tesla M2050/2070, Voltaire InfiniBand, DDN SFA10000, Oracle SL8500

TSUBAME 2.0 as of Dec. 12, 2011: ~2000 SC users, 87% system utilization, ~50% GPU utilization

(Node block diagram: two 6-core CPUs (76.8 GFlops each) connected by QPI at ~18 GB/s to the IOHs; three GPUs (515 GFlops each, ~120 GB/s memory BW w/ ECC) and two IB HCAs on PCIe x16; dual-rail QDR IB at ~8 GB/s; KVM virtualization for the S, H/X, V, and G queues.)

Tsubame 2.0's Achievements

- ASUCA weather: 145 TeraFlops, world record
- Dendrite crystallization: 2.0 PetaFlops, 2011 Gordon Bell Award!!
- FMM turbulence: 1 PetaFlops
- Blood flow: 600 TeraFlops, 2011 Gordon Bell Award Honorable Mention
- Over 10 petascale applications
- 4th fastest supercomputer in the world (Nov. 2010 Top500)
- x66,000 faster and x3 more power-efficient than a laptop
- Fruit of years of collaborative research: Info-Plosion, JST CREST, Ultra Low Power HPC

TSUBAME 2.0 World Rankings (Nov. 2010 announcement, incl. Green500!!!)

The Top500 (absolute performance):
- #1: 2.5 PetaFlops: China, NUDT (Defense Univ.) Tianhe-1A
- #2: 1.76 PetaFlops: US, ORNL Cray XT5 Jaguar
- #3: 1.27 PetaFlops: China, Shenzhen SC, Dawning Nebulae
- #4: 1.19 PetaFlops: Japan, Tokyo Tech, HP/NEC TSUBAME 2.0
- #5: 1.054 PetaFlops: US, LBNL Cray XE6 Hopper
- #33 (#2 in Japan): 0.191 PetaFlops: JAEA, Fujitsu

The Green500 (performance/power efficiency, MFlops/W; Top500 rank in parentheses):
- #1: 1684.20: US, IBM Research BG/Q prototype (116)
- #2: 958.35: Japan, Tokyo Tech/HP/NEC TSUBAME 2.0 (4)
- #3: 933.06: US, NCSA hybrid cluster prototype (403)
- #4: 828.67: Japan, RIKEN "K" supercomputer prototype (170)
- #5-7: 773.38: Germany, Juelich etc., IBM QPACE SFB TR (207-209)
- (#2': 1448.03: Japan, NAO GRAPE-DR prototype (383), added in Dec.)

TSUBAME 2.0: "Greenest Production Supercomputer in the World", Nov. 2010 and June 2011 (two in a row!)

Petaflops? Gigaflops/W? (x66,000 faster, x3 more power-efficient, x44,000 the data)

Laptop: SONY Vaio type Z (VPCZ1)
- CPU: Intel Core i7 620M (2.66 GHz); memory: DDR3-1066 4 GB x2; 256 GB HDD
- OS: Microsoft Windows 7 Ultimate 64-bit
- HPL: Intel Optimized LINPACK Benchmark for Windows (10.2.6.015)
- Result: 18.1 GigaFlops Linpack, 369 MegaFlops/W

Supercomputer: TSUBAME 2.0
- CPU: 2714x Intel Westmere 2.93 GHz; GPU: 4071x NVIDIA Fermi M2050
- Memory: DDR3-1333 80 TB + GDDR5 12 TB; 11 PB hierarchical storage
- OS: SUSE Linux 11 + Windows HPC Server R2
- HPL: Tokyo Tech heterogeneous HPL
- Result: 1.192 PetaFlops Linpack, 1043 MegaFlops/W
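The headline ratios follow from the two Linpack results:

\[ \frac{1.192\ \text{PFlops}}{18.1\ \text{GFlops}} \approx 6.6 \times 10^{4}\ (\text{x66,000 faster}), \qquad \frac{1043\ \text{MFlops/W}}{369\ \text{MFlops/W}} \approx 2.8\ (\text{roughly x3 more power-efficient}) \]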

Example Grand Challenge, Petascale Applications on TSUBAME 2.0 in 2011 (~10 apps)

- PetaFLOPS phase-field simulation (Aoki): metal dendritic solidification simulated with the phase-field model (time integration of the phase field with mobility, interface anisotropy, entropy of fusion, and undercooling; time integration of solute concentration with diffusion coefficients in solid and liquid). Dendritic growth in binary alloy solidification (mesh: 768 x 1632 x 3264). 2.00 PFlops in SP using 4,000 GPUs; SC11 Gordon Bell Winner; shows a stencil code can hit 1 PF.
- Turbulence simulation using FMM (Yasuoka): Q criteria in isotropic turbulence; vortex method with the fast multipole method (FMM); efficiency shown in weak scaling (4x10^6 particles per process). 1.0 PFlops with 4,000 GPUs; fastest FMM ever? (SC11 tech paper)
- Multiphysics biofluidics simulation (Bernaschi): simulation of blood flows accounting for everything from red blood cells to endothelial stress; multiphysics simulation with the MUPHY software (fluid: blood plasma via lattice Boltzmann (LB); body: red blood cells as ellipsoidal particles via extended MD); 450M RBCs simulated; 0.6 PFlops with 4,000 GPUs. SC11 Gordon Bell Honorable Mention; fastest LBM ever?
- BLASTX for millions of DNA sequences, metagenome analysis for bacteria in soil (Akiyama): data of 224 million DNA reads (75 b) per set; pre-filtering reduces this to 71 million reads; BLASTX of 71 million DNA reads vs. the nr-aa DB (4.2 GB). Throughput: BLASTX reaches 24.4 million reads/hour with 16,008 CPU cores; GHOSTM, our original CUDA application almost compatible with BLASTX, reaches 60.6 million reads/hour with 2,520 GPUs. Can cope with next-gen giga-sequencers.

Background

Toward a low-carbon society, fuel efficiency is improved by reducing the weight of transportation and mechanical structures. This requires developing lightweight, high-strength materials by controlling the material microstructure, i.e., dendritic growth during solidification.

Impact of Peta-scale Simulation on Materials Science

- Previous research: 2D simulations, or 3D simulations of a single dendrite with a simple shape
- Peta-scale simulation on a GPU-rich supercomputer, with optimization for peta-scale computing, enables scientifically meaningful 3D simulation: the distribution of multiple dendrites, which is important for the design of solidified products

Large-scale Phase-field Simulation: 4096 x 1024 x 4096 mesh (periodic boundary)
(Special thanks to Mr. Kuroki for 3D rendering.)

Weak Scaling Results on TSUBAME 2.0

Three implementations compared: GPU-only (no overlapping), Hybrid-Y (y boundary by CPU), and Hybrid-YZ (y, z boundaries by CPU).
- Overall mesh: 4096 x 6400 x 12800 on 4000 (40 x 100) GPUs and 16,000 CPU cores; mesh size per GPU: 4096 x 160 x 128
- Hybrid-Y method: 2.0000045 PFlops (GPU: 1.975 PFlops, CPU: 24.69 TFlops), single precision
- Efficiency vs. peak: 44.5% (2.000 PFlops / 4.497 PFlops)
- Hardware: NVIDIA Tesla M2050 cards / Intel Xeon X5670 2.93 GHz on TSUBAME 2.0

Power Consumption and Efficiency

The power consumption of application executions on TSUBAME 2.0 is measured in detail. Our phase-field simulation (a real application) sustained 2.000 PFlops (single precision), 44.5% of peak, at 1.36 MW for the compute nodes (1729 kW total): 1468 MFlops/W of green computing. We obtained the simulation results with modest electric power consumption.

Reference: Linpack benchmark, 1.192 PFlops (DP), 52.1% efficiency, 827.8 MFlops/W.
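Both efficiency figures come directly from the measured numbers (using the 1362 kW compute-node power from the next slide):

\[ \frac{2.000\ \text{PFlops}}{4.497\ \text{PFlops (SP peak)}} \approx 44.5\%, \qquad \frac{2.000 \times 10^{9}\ \text{MFlops}}{1.362 \times 10^{6}\ \text{W}} \approx 1468\ \text{MFlops/W} \]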

Power Consumption during the 2.0 PFlops Phase-Field Run (2011/10/5, 2:00-3:00)
- Compute nodes: 1362 kW
- Storage: 73 kW
- Cooling: 294 kW at max
- Total: 1729 kW at max
(Slide: Takayuki Aoki, Global Scientific Information and Computing Center, Tokyo Institute of Technology)

MUPHY: Multiphysics Simulation of Blood Flow (Melchionna, Bernaschi et al.)

- Combined Lattice-Boltzmann (LB) simulation for plasma and Molecular Dynamics (MD) for red blood cells
- Realistic geometry (from CAT scan)
- Fluid: blood plasma, lattice Boltzmann; the irregular mesh is partitioned with the PT-SCOTCH tool, considering cutoff distance
- Body: red blood cells (RBCs), represented as ellipsoidal particles, coupled via extended MD
- Two levels of parallelism: CUDA (on GPU) + MPI
- 1 billion mesh nodes for the LB component, 100 million RBCs

CARDIOVASCULAR HEMODYNAMICSA topic with enormous impact on society

Plaque rupture is followed by flow interruption and leads to heart attack: the leading cause of mortality in western society. It is essential to forecast where and when plaques form. The only way to obtain the patient-specific risk map (shear stress patterns) is by computing over the complete arterial geometry!

Results on the Tsubame2 Supercomputer (1)

Cluster of NVIDIA M2050 GPUs connected by QDR InfiniBand. Scaling study up to 512 nodes (each node has 3 GPUs). Very fast parallel I/O (reads 100 GB in ~10 s).

Lattice Boltzmann scaling (time per step), 1 billion mesh nodes:

GPUs     Time (s)   Efficiency
256      0.07616    N.A.
512      0.03852    98.86%
1,024    0.01995    95.37%
1,536    0.01343    94.43%

LB kernel: 1 GPU ~ 200 BG/P cores; 1536 GPUs are equivalent to a full BlueGene/P.

Lattice Boltzmann + cell dynamics scaling (time per step), 1 billion mesh nodes + 100 million RBCs: efficiency of roughly 79% at the largest GPU count measured. Time to completion for stationary flow: 23 minutes.

A new run on the FULL TSUBAME 2.0 (4000 GPUs) was just completed with an improved algorithm, exhibiting petascale performance(!)
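The efficiency column appears consistent with scaling efficiency relative to the 256-GPU run, i.e.:

\[ E(N) = \frac{T_{256} \cdot 256}{T_{N} \cdot N}, \qquad E(512) = \frac{0.07616 \times 256}{0.03852 \times 512} \approx 98.9\% \]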

Results on the Tsubame2 Supercomputer (2): Using 4,000 GPUs

Strong scaling results:
- Elapsed time per timestep for 1G mesh nodes and 450M RBCs (log scale)
- Parallel efficiency for 110, 220, and 450M RBCs: >80% with 4K GPUs
- Speeds per component: 0.6 PFlops with 4,000 GPUs for 1G mesh nodes and 450M RBCs
- A complete heartbeat at microsecond resolution can be simulated in 48 hours

- Fatalities: 19,508; strong shaking and devastating tsunamis
- Large source area: 500 km x 200 km (inner black rectangle on the aftershock distribution map)
- Large FDM region required: 960 km x 480 km in the horizontal, 240 km in depth (outer red rectangle)

FDTD Simulation of Wave Propagation

- Finite-Difference Time-Domain method (Okamoto et al. 2010) with topography, ocean layer, and heterogeneity
- Grid size: 6400 x 3200 x 1600; grid spacing: 150 m; time interval: 0.005 s
- 1000 GPUs of TSUBAME 2.0; preliminary source model; main part of the FDM region
- Visualization: vertical ground motion on land and on the ocean bottom
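The grid dimensions match the required FDM region stated on the previous slide:

\[ 6400 \times 150\ \text{m} = 960\ \text{km}, \qquad 3200 \times 150\ \text{m} = 480\ \text{km}, \qquad 1600 \times 150\ \text{m} = 240\ \text{km} \]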

FDTD Simulation of Wave PropagationMain part of the FDM region

Power Consumption during the 700-node Run
- Compute nodes: 903 kW in total, of which ~550 kW for this app (estimated from 540 nodes)
- Storage: 72 kW
- Cooling (chiller, shared by all jobs): 345 kW at max
- Total: 1320 kW at max

Next-Generation Numerical Weather Prediction [SC10]
Collaboration: Japan Meteorological Agency

- Meso-scale atmosphere model: cloud-resolving non-hydrostatic model [Shimokawabe et al., SC10 Best Student Paper finalist], e.g. WRF (Weather Research and Forecast); phenomena from typhoons (~1000 km) down to tornadoes, downbursts, and heavy rain (1-10 km)
- WSM5 (WRF Single Moment 5-tracer) microphysics: represents condensation, precipitation, and thermodynamic effects of latent heat release; about 1% of the lines of code but 25% of the elapsed time; a ~20x boost in microphysics gives a 1.2-1.3x overall improvement (worked out below)
- ASUCA: full GPU implementation of the non-hydrostatic model developed by the Japan Meteorological Agency; block division for advection, overlapping, 1-D Helmholtz equation
- TSUBAME 2.0: 145 TFlops, world record!!! 6956 x 6052 x 48 mesh on 528 (22 x 24) GPUs
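The 1.2-1.3x overall figure is consistent with Amdahl's law for accelerating the 25% microphysics fraction by roughly 20x:

\[ S_{\text{overall}} = \frac{1}{(1 - 0.25) + \dfrac{0.25}{20}} = \frac{1}{0.7625} \approx 1.31 \]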

Mesoscale Atmosphere Model ASUCA
(Visualizations: horizontal resolution of 5 km (present) vs. 500 m, roughly x1000 the computation.)

TSUBAME 2.0 ASUCA Performance (Weak Scaling)
- 145.0 TFlops single precision, 76.1 TFlops double precision on 3990 Fermi-core Tesla M2050 GPUs
- Previous WRF record on ORNL Jaguar: ~50 TFlops (DP); roughly x10 socket-to-socket

Power Consumption during the Full TSUBAME2 Test with ASUCA (2011/04/08, 2:18-2:26)
(Chart: power in kW vs. time during the ASUCA run.)
- Compute nodes: 960 kW
- Storage: 78 kW
- Cooling: 270 kW max
- Total: 1308 kW max

100-million-atom MD Simulation
M. Sekijima (Tokyo Tech), Jim Phillips (UIUC)

- NAMD is a parallel molecular dynamics code developed at the University of Illinois.
- This evaluation is the result of an interdisciplinary collaboration between UIUC and Tokyo Tech.
- The 100-million-atom benchmark in this work was assembled by replicating a million-atom satellite tobacco mosaic virus (STMV) simulation on a 5x5x4 grid. One STMV includes 1,066,628 atoms.
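The benchmark size follows from the replication:

\[ 5 \times 5 \times 4 = 100\ \text{replicas}, \qquad 100 \times 1{,}066{,}628 \approx 1.07 \times 10^{8}\ \text{atoms} \]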

100-million-atom MD Simulation: Performance Evaluation
(Chart: simulation throughput in ns/day.)
M. Sekijima (Tokyo Tech), Jim Phillips (UIUC)

100-million-atom MD SimulationM. Sekijima (Tokyo Tech), Jim Phillips (UIUC)

100-million-atom MD Simulation: Power Consumption during the 700-node Run
- Compute nodes: 1115 kW in total, of which ~706 kW for this app (estimated from 540 nodes)
- Storage: 72 kW
- Cooling (chiller, shared by all jobs): 340 kW max
- Total: 1527 kW max
M. Sekijima (Tokyo Tech), Jim Phillips (UIUC)

Petaflops-scale Turbulence Simulation on TSUBAME 2.0

Isotropic turbulence computed two ways:
- Pseudo-spectral method (2/3 dealiasing): Re = 500, N = 2048^3
- Vortex particle method (reinitialized CSM): Re = 500, N = 2048^3, 8 billion particles

R. Yokota (KAUST), L. A. Barba (Boston Univ.), T. Narumi (Univ. of Electro-Communications), K. Yasuoka (Keio Univ.)

Petaflops-scale Turbulence Simulation on TSUBAME 2.0: Weak Scaling
(Charts: wall clock time and parallel efficiency.)
R. Yokota (KAUST), L. A. Barba (Boston Univ.), T. Narumi (Univ. of Electro-Communications), K. Yasuoka (Keio Univ.)

Petaflops-scale Turbulence Simulation on TSUBAME 2.0: Comparison
- Present work: 64 billion in 100 seconds, 1.0 PFlops
- Rahimian et al. (2010 Gordon Bell): 90 billion in 300 seconds, 0.7 PFlops
R. Yokota (KAUST), L. A. Barba (Boston Univ.), T. Narumi (Univ. of Electro-Communications), K. Yasuoka (Keio Univ.)

Petaflops-scale Turbulence Simulation: Power Usage during the Full-System Test (2011/10/4, 5:00-6:00)
- Compute nodes: 1190 kW
- Storage: 72 kW
- Cooling: 240 kW
- Total: 1502 kW
R. Yokota (KAUST), L. A. Barba (Boston Univ.), T. Narumi (Univ. of Electro-Communications), K. Yasuoka (Keio Univ.)

Large-Scale Metagenomics [Akiyama et al., Tokyo Tech]

Combined effective use of GPUs and SSDs on TSUBAME 2.0.
- Metagenome analysis: the study of the genomes of uncultured microbes obtained from microbial communities in their natural habitats (here, collecting bacteria in soil)
- Data: 224 million DNA reads (75 b) per set; pre-filtering reduces this to 71 million reads; search: 71M DNA reads vs. the NCBI nr-aa DB (4.2 GB)
- Two homology search tools are available: (1) BLASTX, the standard software on CPUs; (2) GHOSTM, our GPU-based fast software compatible with BLASTX

Results on TSUBAME 2.0 (throughput vs. #CPU cores / #GPUs):
- BLASTX: 24.4 Mreads/hour with ~16K CPU cores
- GHOSTM: 60.6 Mreads/hour with 2520 GPUs; it would be more scalable with larger data sets

Graph500 on TSUBAME 2.0
Toyotaro Suzumura, Koji Ueno, Tokyo Institute of Technology

- Graph500 is a new benchmark that ranks supercomputers by executing a large-scale graph search problem on a Kronecker graph (A: 0.57, B: 0.19, C: 0.19, D: 0.05).
- The benchmark is ranked by so-called TEPS (Traversed Edges Per Second), which measures the number of edges traversed per second while searching all reachable vertices from one arbitrary vertex with each team's optimized BFS (Breadth-First Search) algorithm. A minimal BFS sketch follows below.
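As a generic illustration only (not the authors' 2D-partitioned, communication-compressed implementation), a level-synchronized top-down BFS on a single GPU over a CSR graph, the kind of building block a Graph500-style BFS starts from, might look like this; all names (row_offsets, col_indices, etc.) are invented for the example.

// Minimal single-GPU, level-synchronized BFS sketch over a CSR graph.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void bfs_level(const int *row_offsets, const int *col_indices,
                          int *levels, int num_vertices, int current_level,
                          int *frontier_not_empty)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= num_vertices || levels[v] != current_level) return;
    // Expand every vertex discovered in the previous level.
    for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {
        int u = col_indices[e];
        if (levels[u] == -1) {               // unvisited
            levels[u] = current_level + 1;   // benign race: all writers store the same level
            *frontier_not_empty = 1;
        }
    }
}

int main()
{
    // Tiny example graph in CSR form: edges 0-1, 0-2, 1-3, 2-3 (undirected).
    std::vector<int> h_row = {0, 2, 4, 6, 8};
    std::vector<int> h_col = {1, 2, 0, 3, 0, 3, 1, 2};
    int n = 4;

    int *d_row, *d_col, *d_lvl, *d_flag;
    cudaMalloc(&d_row, h_row.size() * sizeof(int));
    cudaMalloc(&d_col, h_col.size() * sizeof(int));
    cudaMalloc(&d_lvl, n * sizeof(int));
    cudaMalloc(&d_flag, sizeof(int));
    cudaMemcpy(d_row, h_row.data(), h_row.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_col, h_col.data(), h_col.size() * sizeof(int), cudaMemcpyHostToDevice);

    std::vector<int> h_lvl(n, -1);
    h_lvl[0] = 0;                            // source vertex 0
    cudaMemcpy(d_lvl, h_lvl.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    for (int level = 0, again = 1; again; ++level) {
        again = 0;
        cudaMemcpy(d_flag, &again, sizeof(int), cudaMemcpyHostToDevice);
        bfs_level<<<(n + 255) / 256, 256>>>(d_row, d_col, d_lvl, n, level, d_flag);
        cudaMemcpy(&again, d_flag, sizeof(int), cudaMemcpyDeviceToHost);
    }

    cudaMemcpy(h_lvl.data(), d_lvl, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int v = 0; v < n; ++v) printf("vertex %d: level %d\n", v, h_lvl[v]);
    return 0;
}

TEPS is then the number of edges reachable from the chosen source divided by the BFS time.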

Highly Scalable Graph Search Method for the Graph500

- An optimized method based on 2D partitioning and various other optimizations such as communication compression and vertex sorting (exploiting the scale-free nature of the Kronecker graph).
- Our optimized implementation can solve BFS (Breadth-First Search) of a large-scale graph with 2^36 (68.7 billion) vertices and 2^40 (1.1 trillion) edges in 10.58 seconds with 1366 nodes and 16,392 CPU cores on TSUBAME 2.0: 103.9 GE/s (TEPS), #3 on the Graph500, Nov. 2011.
- (Charts: TEPS (GE/s) vs. number of nodes for our optimized implementation against the reference implementations (simple, replicated-csr, replicated-csc) at Scale 24 per node, and for our optimized implementation at Scale 26 per node up to 128 nodes.)

Toyotaro Suzumura, Koji Ueno, Tokyo Institute of Technology
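The reported TEPS figure is consistent with the edge count and time:

\[ \text{TEPS} = \frac{\#\text{edges}}{\text{time}} \approx \frac{2^{40}}{10.58\ \text{s}} \approx \frac{1.10 \times 10^{12}}{10.58\ \text{s}} \approx 1.04 \times 10^{11} = 103.9\ \text{GE/s} \]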

Power Consumption during the Graph500 Run on TSUBAME 2.0 (2011/10/4, 18:00-22:00)
- Compute nodes: 902 kW
- Storage: 75 kW
- Cooling: 346 kW max
- Total: 1323 kW max
Toyotaro Suzumura, Koji Ueno, Tokyo Institute of Technology

TSUBAME 2.0 Power Consumption with Petascale Applications (kW)

Application               Compute (app/total)  Storage  Cooling  Total   Cooling/Total
Typical production        750                  72       230      980     23.5%
Earthquake (2000 GPUs)    550/903              72       345      1320    26.1%
NAMD MD (2000 GPUs)       706/1115             72       340      1527    22.3%
ASUCA weather             960                  78       270      1308    20.6%
Turbulence                1190                 72       240      1502    16.0%
Phase-field               1362                 73       294      1729    17.0%
GPU DGEMM                 1538                 72       410      2020    20.3%
Linpack (Top500)          1417                 72       -        -       -
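The last column is simply the cooling share of the total facility power, e.g. for the earthquake run:

\[ \frac{345\ \text{kW}}{1320\ \text{kW}} \approx 26.1\% \]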

HPC Programming Model Trend

From MPI alone, to MPI + OpenMP across cores, to MPI + CUDA:
- Conceptually getting more complicated
- Needs a deep understanding of the architecture
- "MPI + X": every new architecture brings a new "X" (a minimal MPI + CUDA skeleton is sketched below)
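As a generic illustration of the "MPI + X" pattern (not code from the talk), a minimal MPI + CUDA skeleton might look like this; the kernel and buffer names are invented for the example.

// Each rank scales its own slice of data on its GPU, then a checksum is reduced to rank 0.
#include <mpi.h>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void scale(float *x, int n, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;                        // the "X" part: on-GPU compute
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20;                       // elements per rank
    std::vector<float> host(n, 1.0f + rank);

    float *dev;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<(n + 255) / 256, 256>>>(dev, n, 2.0f);
    cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);

    // The "MPI" part: reduce a per-rank checksum to rank 0.
    double local = 0.0, global = 0.0;
    for (int i = 0; i < n; ++i) local += host[i];
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("global checksum = %f (ranks = %d)\n", global, size);

    cudaFree(dev);
    MPI_Finalize();
    return 0;
}

Even this toy example mixes two programming models and two memory spaces, which is the complexity the slide is pointing at.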

Example: Stencil Computation

Pc' = (Pc + Pn + Ps + Pw + Pe) * 1/5.0
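A minimal CUDA sketch of this 5-point averaging stencil (single GPU, interior points only); it is not the talk's code, and the array names and sizes are illustrative.

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void stencil5(const float *p, float *p_new, int nx, int ny)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x <= 0 || x >= nx - 1 || y <= 0 || y >= ny - 1) return;  // skip boundary
    int c = y * nx + x;
    // Pc' = (Pc + Pn + Ps + Pw + Pe) / 5
    p_new[c] = (p[c] + p[c - nx] + p[c + nx] + p[c - 1] + p[c + 1]) * 0.2f;
}

int main()
{
    const int nx = 256, ny = 256;
    std::vector<float> h(nx * ny, 1.0f);

    float *d_p, *d_q;
    cudaMalloc(&d_p, nx * ny * sizeof(float));
    cudaMalloc(&d_q, nx * ny * sizeof(float));
    cudaMemcpy(d_p, h.data(), nx * ny * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
    stencil5<<<grid, block>>>(d_p, d_q, nx, ny);

    cudaMemcpy(h.data(), d_q, nx * ny * sizeof(float), cudaMemcpyDeviceToHost);
    printf("p_new at center = %f\n", h[(ny / 2) * nx + nx / 2]);  // expect 1.0 for uniform input

    cudaFree(d_p);
    cudaFree(d_q);
    return 0;
}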

Physis (Φύσις) Framework [SC11]

Physis (φύσις) is a Greek theological, philosophical, and scientific term usually translated into English as "nature." (Wikipedia: Physis)

Stencil DSL: declarative, portable, global-view, C-based.

void diffusion(int x, int y, int z,
               PSGrid3DFloat g1, PSGrid3DFloat g2) {
  float v = PSGridGet(g1,x,y,z)
          + PSGridGet(g1,x-1,y,z) + PSGridGet(g1,x+1,y,z)
          + PSGridGet(g1,x,y-1,z) + PSGridGet(g1,x,y+1,z)
          + PSGridGet(g1,x,y,z-1) + PSGridGet(g1,x,y,z+1);
  PSGridEmit(g2, v/7.0);
}

DSL compiler: target-specific code generation and optimizations, automatic parallelization. Targets: C, C + MPI, CUDA, CUDA + MPI, OpenMP, OpenCL. A hand-written comparison is sketched below.
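For comparison, a hand-written single-GPU CUDA version of the same 7-point stencil might look roughly like the kernel below. This is not the Physis-generated code, only an illustration of what the DSL saves you from writing; only the kernel is shown and the names are invented.

#include <cuda_runtime.h>

__global__ void diffusion_cuda(const float *g1, float *g2, int nx, int ny, int nz)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x <= 0 || x >= nx - 1 || y <= 0 || y >= ny - 1 || z <= 0 || z >= nz - 1)
        return;                                    // interior points only
    int c = (z * ny + y) * nx + x;                 // linearized index
    float v = g1[c]
            + g1[c - 1]       + g1[c + 1]          // x-1, x+1
            + g1[c - nx]      + g1[c + nx]         // y-1, y+1
            + g1[c - nx * ny] + g1[c + nx * ny];   // z-1, z+1
    g2[c] = v / 7.0f;
}

On the multi-GPU targets, the generated code additionally has to handle domain decomposition and boundary exchange between GPUs (cf. the overlapped boundary exchange optimizations two slides below), which Physis produces automatically from the declarative source.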

Productivity
(Chart: increase in lines of code for Diffusion, Himeno, and Seismic, comparing Original, MPI, Physis, Generated (No Opt), and Generated (Opt) versions.)
Physis source is of similar size to the sequential code in C.

Optimization Effects
(Chart: Diffusion weak scaling performance in GFLOPS for problem sizes 256x256x256, 256x256x512, and 256x256x1024, comparing Baseline, Overlapped boundary exchange, Multistream boundary kernels, Full opt, and Manual versions.)

Diffusion Weak Scaling
(Chart: performance in GFlops vs. number of GPUs, up to ~300 GPUs.)

Seismic Weak Scaling
Problem size: 256x256x256 per GPU
(Chart: performance in GFlops vs. number of GPUs, up to ~200 GPUs.)

Summary: Tsubame 2.0, a year on since Nov. 2010

- Over 2000 users, ~100 users online; ~90% system utilization, ~50% GPU utilization
- System up 24/7; tolerated the 3/11 disaster
- Very power efficient, roughly 3/4 the power of Tsubame 1.0
- Collaborative R&D really paid off:
  - 2011 Gordon Bell x2
  - 2010-2011 Greenest Production SC (Green500)
  - 2010-2011: 3 consecutive Top5 in the Top500
  - 2011 #3 Green500
  - Lots and lots of publications, incl. 4 SC2011 papers
  - Many more awards and press accolades
- But most importantly, it works!
