An Overview On Cyclops-64 Architecture

Transcription

An Overview on Cyclops-64 Architecture
- A Status Report on the Programming Model and Software Infrastructure
Guang R. Gao
Endowed Distinguished Professor
Electrical & Computer Engineering
University of [...]

Outline
Introduction
Multi-Core Chip Technology
IBM Cyclops-64 Architecture/Software
Cyclops-64 Programming Model and System Software
Future Directions
Summary
2007/6/14 SOS11-06-2007.ppt 2

TIPs of compute power operating on Tera-bytes of data
[Figure: Transistor Growth in the near future]
Source: Keynote talk in CGO & PPoPP 03/14/07 by Jesse Fang from Intel

Outline
Introduction
Multi-Core Chip Technology
IBM Cyclops-64 Architecture/Software
Programming/Compiling for Cyclops-64
Looking Beyond Cyclops-64
Summary

Two Types of Multi-Core Architecture Trends
Type I: Glue “heavy cores” together with minor changes.
Type II: Explore the parallel architecture design space and search for the most suitable chip architecture models.

Multi-Core Type II
New factors to be considered:
– Flops are cheap!
– Memory per core is small
– Cache-coherence is expensive!
– On-chip bandwidth can be enormous!
– Examples: Cyclops-64, and others

Flops are Cheap!
An example to illustrate design tradeoffs:
A 64-bit FPU is 1 mm² and 50 pJ. Can fit over 200 on a 14mm x 14mm chip (130nm and 1 GHz).
[Figure: 64-bit FP unit (drawn to scale) on the chip]
If fed from small, local register files:
– 3200 GB/s, 10 pJ/op
– 1/Gflop (60 mW/Gflop)
If fed from global on-chip memory:
– 100 GB/s, 1 nJ/op
– 30/Gflop (1 W/Gflop)
If fed from off-chip memory:
– 16 GB/s
– 200/Gflop (many W/Gflop)
Courtesy: Numbers are due to Steve Scott [PACT2006 Keynote]
500 FPUs on a chip is possible! [M. Denneau: private communication]
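The power figures on the slide above follow from energy per operation: at 1 Gflop/s, one picojoule per flop costs one milliwatt. The following is a minimal Python sketch of that arithmetic. It assumes (a reading of the slide, not something it states) that the 60 mW/Gflop figure is the 50 pJ FPU operation plus the 10 pJ register-file feed; all names are illustrative.

```python
# Energy-per-operation figures quoted on the slide, in picojoules.
FPU_PJ_PER_FLOP = 50.0      # one 64-bit FPU operation
REGISTER_FEED_PJ = 10.0     # operands fed from a small local register file
ONCHIP_FEED_PJ = 1000.0     # operands fed from global on-chip memory (1 nJ)


def milliwatts_per_gflop(picojoules_per_flop):
    """Power (mW) needed to sustain 1 Gflop/s at the given energy per flop."""
    flops_per_second = 1e9
    watts = picojoules_per_flop * 1e-12 * flops_per_second
    return watts * 1e3


# Assumption: register-fed cost = FPU energy + register-file feed energy.
register_fed_mw = milliwatts_per_gflop(FPU_PJ_PER_FLOP + REGISTER_FEED_PJ)

# For on-chip memory the feed energy dominates, so the FPU's 50 pJ is
# ignored here, matching the slide's rounded 1 W/Gflop.
onchip_fed_mw = milliwatts_per_gflop(ONCHIP_FEED_PJ)

print(f"register-fed: {register_fed_mw:.0f} mW/Gflop")
print(f"on-chip-fed:  {onchip_fed_mw / 1000:.0f} W/Gflop")
```

Under these assumptions the two results line up with the slide's 60 mW/Gflop and 1 W/Gflop, which is the point of the slide: moving operands costs far more than computing on them.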

Examples of Type-II Architectures
– Intel 80-core Terascale Chip & Larrabee mini-cores chip
– IBM 160-core Cyclops-64 chip
– ClearSpeed 96-core CSX chip
– Cisco 188-core Metro Chip
– Many others are coming

Outline
Introduction
Multi-Core Chip Technology
IBM Cyclops-64 Architecture/Software
Programming/Compiling for Cyclops-64
Future Directions
Summary

Cyclops-64 (C64) Supercomputer
[Figure: system hierarchy]
Processor (2 Threads): 1 Gflops, 60KB SRAM
Chip (80 Processors): 80 Gflops, 4.7MB SRAM
Board (1 Chip): 80 Gflops, 1GB DRAM
BackPlane (48 Boards): 3.84 TFlops / 48GB
Cabinet (3 BackPlanes): 11.52 TFlops / 144GB
C64 System (96 [...]
Other figure labels: FPU, TU, SP, GM, I Cache, Intra chip Network, Other Devices, Disk, 3x8, 12 x 8
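The peak-performance and memory totals in the hierarchy above are products of the per-level counts. A short Python sketch of that roll-up follows; the constant names are illustrative, and the 1 GB of DRAM per board is taken from the figure label, which is consistent with the 48 GB per backplane total.

```python
# Per-level counts and per-unit figures as labeled in the system figure.
PROCESSOR_GFLOPS = 1          # one processor (2 thread units)
PROCESSORS_PER_CHIP = 80
CHIPS_PER_BOARD = 1
DRAM_GB_PER_BOARD = 1         # assumption: one 1GB DRAM per board
BOARDS_PER_BACKPLANE = 48
BACKPLANES_PER_CABINET = 3

# Peak floating-point rate at each level, in Gflops.
chip_gflops = PROCESSOR_GFLOPS * PROCESSORS_PER_CHIP
board_gflops = chip_gflops * CHIPS_PER_BOARD
backplane_gflops = board_gflops * BOARDS_PER_BACKPLANE
cabinet_gflops = backplane_gflops * BACKPLANES_PER_CABINET

# Off-chip DRAM capacity at each level, in GB.
backplane_dram_gb = DRAM_GB_PER_BOARD * BOARDS_PER_BACKPLANE
cabinet_dram_gb = backplane_dram_gb * BACKPLANES_PER_CABINET

print(f"chip:      {chip_gflops} Gflops")
print(f"backplane: {backplane_gflops / 1000} TFlops, {backplane_dram_gb} GB")
print(f"cabinet:   {cabinet_gflops / 1000} TFlops, {cabinet_dram_gb} GB")
```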

10,000 Feet High System Overview
[Figure labels: Front-end, Back-end, C64 on network]
Courtesy of E.T. International Inc

1,000 Feet High System Overview
[Figure labels: Host network, Monitoring nodes, Control network, Front-end cluster, Gig Ethernet, Login nodes, Admin nodes, C64 computing engine]

100 Feet High System Overview
ETI provides all system software for the C64 supercomputer:
– Boot up the system
– Monitor the status of HW & SW
– System resource manager
– Job scheduler
– Toolchain for application development
– Etc.
[Figure labels: Monitor, Cluster, Admin, Job scheduler, Application development execution, System initialization, Hardware monitoring, C64 Applications, C64 microkernel and libraries]

C-64 Chip Architecture
On-chip bisection BW 0.38 TB/s, total BW to 6 neighbors 48 GB/s

TiNy-Thread – The API of a Cyclops-64 Thread Virtual Machine
Multi-chip multiprocessor extension of the base C64 ISA.
Runs directly on top of C64 HW architecture.
Takes advantage of C64 HW features to achieve high scalability.
Three components: thread model, memory model and synchronization model.

Cyclops-64 Thread Model
TiNy Threads (TNT)
Software thread maps directly to TU.
Thread execution is non-preemptive.
Sleep on wait but don't preempt; no context switch.
Wakeup thru interrupt or wakeup signal.
Each thread controls a region of SPM, allocated at boot time.
Leverage familiar POSIX thread programming interface.
Fast thread creation and reuse.
– Thread creation: 280 cycles.
– Thread termination: 60 cycles.
– Thread reuse: 265 cycles.

C-64 Memory Consistency Model
The on-chip SRAM is sequential consistency compliant (SCC) [ZhangEtAl05].
Accesses to scratch-pad memory: no hardware coherence is imposed.
A weak memory consistency model (e.g. LC consistency) is studied as a natural choice for the scratch-pad

Cyclops-64 Software Toolchain
[Figure labels: User Application; Prog. APIs (e.g. SHMEM, ...); CNet Lib; Newlib; TNT Lib; A-Switch lib; C Compiler; Binutils (Assembler, Linker, etc.); C64 Kernel; Simulation Testbed; Mr.Clops Emulator; Chip; Processor; TU; I-Cache; SRAM; Intra-Chip Network; Simulation Toolset]

Outline
Introduction
A New Era – Multi-Core Chip Technology
IBM Cyclops-64 Architecture/Software
Programming/Compiling for Cyclops-64
Future Directions
Summary

Challenges and Opportunities
How to exploit the massive on-chip parallelism and bandwidth to tolerate off-chip memory bandwidth and latency?
[Figure: C64 memory hierarchy as seen from a TU]
Regs: 64, load 1 / store 1 cycle, 1.92TB/s
SPM: 16KB, load 2 / store 1 cycle, 640GB/s
SRAM: 2.5MB, load 20 cycle / store 10 cycle, 320GB/s
DRAM: 1GB, load 36 cycle / store 18 cycle, 16GB/s
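One way to quantify the question on this slide is to ask how many floating-point operations a kernel must perform per byte fetched from each memory level before that level's bandwidth stops being the limit. The Python sketch below does this with a simple bandwidth-bound model. It assumes the 80 Gflops per-chip peak from the system overview slide and one reading of the figure's bandwidth labels (registers 1.92 TB/s, SPM 640 GB/s, SRAM 320 GB/s, DRAM 16 GB/s); the function names are illustrative.

```python
PEAK_GFLOPS = 80.0  # per-chip peak, from the system overview slide

# Aggregate bandwidth per memory level in GB/s (assumed label mapping).
BANDWIDTH_GB_PER_S = {
    "registers": 1920.0,
    "SPM": 640.0,
    "SRAM": 320.0,
    "DRAM": 16.0,
}


def min_flops_per_byte(bandwidth_gb_per_s):
    """Arithmetic intensity needed to reach peak from a given memory level."""
    return PEAK_GFLOPS / bandwidth_gb_per_s


def attainable_gflops(flops_per_byte, bandwidth_gb_per_s):
    """Upper bound on performance: compute-bound or bandwidth-bound."""
    return min(PEAK_GFLOPS, flops_per_byte * bandwidth_gb_per_s)


for level, bandwidth in BANDWIDTH_GB_PER_S.items():
    print(f"{level:>9}: {min_flops_per_byte(bandwidth):.4f} flops/byte to reach peak")
```

Under this model a kernel streaming from DRAM needs 5 flops per byte (40 flops per 8-byte double) to keep the chip busy, while data held in SRAM or SPM needs well under 1 flop per byte, which is why the explicitly managed on-chip levels matter so much.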

C64 System Software Features
Explicit Segmented Shared Memory Hierarchy (hotel vs. cache)
Non-preemptive multi-threading model (“nap” vs. “sleep”)
Exploitation of hardware fine-grain synchronization support
Plenty of thread units (e.g. helper threads)

A Report on Early Experience on C64
Case Study I: Programming Kernels:
– Monte-Carlo
– Matrix Multiply
– FFT
– LU Decomposition
– Dynamic programming
– SCCA2
Case Study II: Mapping OpenMP on C64
Other Codes

Case Study I: C64 Programming Kernels
Problem Statement:
– How programmable is C64 using the TNT programming model?
– What is the set of optimizations that are effective on C64?
– What can be learned from this study for C64 compiler writers?

FFT: Performance with Various Optimizations
[Figure: Gflops for optimization levels 1 to 6, and Gflops vs. Number of threads]
1 – Base parallel version (2-point work unit)
2 – Using 8-point work unit
3 – Special approach in the first 4 stages
4 – Eliminate redundant memory operations (twiddle factors)
5 – Loop unrolling (bit-reversal permutation)
6 – Register allocation & Instruction scheduling (manually)
* The experiments were conducted with 128 threads.
Data size – 2^16 double precision 1D FFT
Experiments were conducted on ETI Cyclops-64 Toolchain 1.6.2.
ACK: Mike Merrill
To Appear: IPDPS07 Workshop

Performance with Optimizations - LU
Execute 1024x1024 LU in SRAM
[Figure: Performance with Optimizations (128 TUs), and Performance vs. Number of Threads]
1 – Base Parallel Version
2 – Dyn. Repartitioning Recursion
3 – Processor Adaptation
4 – Hardware Barrier
5 – Reg. Tiling (manually)
Work in Progress

Performance of SCCA2 (kernel 4)
Reasonable scalability
– Scales well with # threads
– Linear speedup for #threads 32
Commodity SMPs have poor performance
Competitive vs. MTA-2
Unit: TEPS (Traversed Edges Per Second)
SMPs: 4-way Xeon dual-core, 2MB L2 Cache
Work In Progress

Outline
Introduction
Multi-Core Chip Technology
IBM Cyclops-64 Architecture/Software
Programming/Compiling for Cyclops-64
Future Directions
Summary

Outline
Introduction
A New Era – Multi-Core Chip Technology
IBM Cyclops-64 Architecture/Software
Programming/Compiling for Cyclops-64
Future Directions
Summary

Summary
It is the program execution model that is important.
The "software problem" is not merely a problem to be solved by software engineers.
"I am dismayed by the tendency to split computer science education into "programming" and "hardware" with so little brought in about the way they interact."
- Jack B. Dennis

Some Cyclops-64 Publications
Juan del Cuvillo, Weirong Zhu, and Guang R. Gao. Landing OpenMP on Cyclops-64: An Efficient Mapping of OpenMP to a many-core System-on-a-chip. 3rd ACM International Conference on Computing Frontiers (CF'06), May 2-5, 2006.
Juan del Cuvillo, Weirong Zhu, and Guang R. Gao. Towards a Software Infrastructure for the Cyclops-64 Cellular Architecture. 20th International Symposium on High Performance Computing Systems and Applications (HPCS2006), St. John's, Newfoundland and Labrador, Canada, May 14-17, 2006.
Ying M. P. Zhang, Taikyeong Jeong, Fei Chen, Haiping Wu, Ronny Nitzsche, and Guang R. Gao. A Study of the On-Chip Interconnection Network for the IBM Cyclops-64 Multi-Core Architecture. 20th International Parallel and Distributed Processing Symposium (IPDPS2006), Rhodes Island, Greece, April 25-29, 2006.
Yanwei Niu, Ziang Hu, Kenneth E. Barner, and Guang R. Gao. Performance Modelling and Optimization of Memory Access on Cellular Computer Architecture Cyclops64. Network and Parallel Computing, IFIP International Conference, Beijing, China, November 30 - December 3, 2005.

Some Cyclops-64 Publications (cont’d)
Juan del Cuvillo, Weirong Zhu, Ziang Hu, and Guang R. Gao. FAST: a Functionally Accurate Simulator Toolset for the Cyclops-64 Cellular Architecture. Workshop on Modeling, Benchmarking, and Simulation (MoBS2005), in conjunction with the 32nd Annual International Symposium on Computer Architecture (ISCA2005), Madison, Wisconsin, June 4, 2005.
Juan del Cuvillo, Weirong Zhu, Ziang Hu, and Guang R. Gao. TiNy Threads: a Thread Virtual Machine for the Cyclops64 Cellular Architecture. 5th Workshop on Massively Parallel Processing (WMPP05), in conjunction with the 19th International Parallel and Distributed Processing Symposium (IPDPS2005), Denver, Colorado, April 4-8, 2005.
Yuan Zhang, Weirong Zhu, Fei Chen, Ziang Hu, and Guang R. Gao. Sequential Consistency Revisit: the Sufficient Condition and Method to Reason the Consistency Model of a Multiprocessor-on-a-chip Architecture. The IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN2005), Innsbruck, Austria, February 15-17, 2005.
Yuanwei Niu, Ziang Hu, and Guang Gao. Parallel Reconstruction for Parallel Imaging Space-RIP on Cellular Computer Architecture. The 16th IASTED International Conference on Parallel and Distributed Computing and Systems, MIT, Cambridge, MA, USA, November 9-11, 2004.

Acknowledgements
Our Sponsors
IBM Cyclops-64 Team (Monty Denneau, et al.)
ETI Cyclops-64 Team
Members of CAPSL
Other Collaborators
My Host

Contributors to Cyclops-64 Project
(Note: this is an incomplete list)
Fei Chen, Long Chen, Juan del Cuvillo, Brice Dobry, Alban Douillet, Ge Gan, Guang R. Gao, Geoffery Gerfin, Yuhei Hayashi, Ziang Hu, Dimitrij Krepis, Joseph Manzano, Andrew Russo, Hirofumi Sakane, Guangming Tan, Wesley Toland, John Tully, Ioannis Venetis, Matthew Wells, Haiping Wu, Liping Xue, Shuxin Yang, Peiheng Zhang, Yingping Zhang, Yuan Zhang, Weirong Zhu
