March 2018 WG Town Hall - WestGrid

Transcription

WestGrid Town Hall: March 23, 2018
Patrick Mann, Director of Operations

Admin
To ask questions:
- Webstream: Email info@westgrid.ca
- Vidyo: Click the "hand" icon and ask your question
Vidyo users: Please MUTE yourself when not speaking (click the microphone icon to mute/un-mute).

Outline
1. Welcome new CC Interim CEO - Robbin Tourangeau
2. SciNet CTO Danny Gruner - New Large Parallel System
3. System updates: Cedar expansion, Niagara launch
4. Legacy system migration
5. RAC 2018 results
6. CC account renewals (CCV requirement)
7. Upcoming training opportunities

Compute Canada Update
Robbin Tourangeau, Interim CEO
Compute Canada CEO Announcement

Niagara Large Parallel
Daniel Gruner, SciNet CTO
https://www.scinethpc.ca/staff/

Niagara
System Specifications
- 1,500 nodes (2 x 20-core Intel Skylake @ 2.4 GHz)
- 21 compute, 3 IB, 4 storage, 2 management racks
- 60,000 cores total
- 192 GB RAM per node
- EDR InfiniBand (Dragonfly+)
- 5 PB Scratch, 5.2 PB Project (GPFS)
- 256 TB Burst Buffer (Excelero/GPFS)
- Rpeak of 4.61 PF (GPC: 312 TF, BGQ: 839 TF, Graham: 2.6 PF, Cedar: 3.7 PF)
- Rmax of 3.0 PF
- 685 kW

Niagara - Compute Nodes
- Lenovo SD530 node
- Intel Skylake 6148 Gold (2.4 GHz, AVX512)
- 192 GB RAM (150 GB/s memory bandwidth)
- 3 TFlops/node
- 100 Gb/s EDR IB
- Stateless (diskless)

Niagara - High Speed Network
Dragonfly Network
- Introduced by Kim et al. (2008), aiming to decrease the cost/diameter of the network.
- Uses groups of high-radix virtual routers to create a completely connected topology.
- Less expensive and more scalable than a 1:1 fat-tree, with close to the same performance.
- Topology of the Cray XC40/50 "Aries" network.
- Requires only edge switches, no core switches.
- Adaptive routing
- Congestion control
- New for InfiniBand (requires ConnectX-5, Switch-IB 2)
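The scaling argument behind the dragonfly design can be sketched with the sizing formulas from Kim et al. (2008): each group of routers behaves as one high-radix virtual router, and one global link between every pair of groups makes the group-level topology completely connected. The parameter values below are purely illustrative and are not Niagara's actual switch configuration.

```python
# Minimal sketch of the dragonfly sizing formulas from Kim et al. (2008).
#   p = terminals (hosts) per router
#   a = routers per group
#   h = global links per router
# With one global link between every pair of groups, the topology is
# completely connected at the group level, so a minimal route needs at
# most one global hop (local -> global -> local).

def dragonfly_size(p: int, a: int, h: int) -> dict:
    groups = a * h + 1           # max groups for all-to-all group connectivity
    hosts = a * p * groups       # max hosts the network can attach
    radix = p + (a - 1) + h      # router ports: hosts + intra-group + global links
    return {"groups": groups, "hosts": hosts, "router_radix": radix,
            "max_router_hops": 3}

if __name__ == "__main__":
    # Balanced configuration suggested in the paper: a = 2p = 2h
    print(dragonfly_size(p=8, a=16, h=8))
    # -> {'groups': 129, 'hosts': 16512, 'router_radix': 31, 'max_router_hops': 3}
```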

Niagara - Dragonfly Topology


Niagara - Storage
Storage:
- 3 x Lenovo DSS G260 (504 x 10 TB)
- 10 PB regular disk
- 70-90 GB/s R/W
- Spectrum Scale (GPFS)
Burst Buffer:
- 256 TB burst buffer in RAID 1
- 10 nodes with 8 x 6.4 TB NVMe SSD
- Excelero NVMe fabric
- 160 GB/s R/W
- Very high IOPS performance
- Spectrum Scale (GPFS)

Niagara - Burst Buffer - Excelero NVMesh


Niagara
Software Configuration
- CentOS 7
- CC LDAP
- CC software stack available + specialized stack
- Slurm scheduler - by node only
- Target is large jobs (1024 cores)
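As a rough illustration of the "by node only" policy, the sketch below writes and submits a whole-node Slurm job script. The module and executable names are placeholders; 40 tasks per node matches the 2 x 20-core Skylake nodes described above, and 26 nodes puts the job above the roughly 1024-core "large job" target.

```python
# Hedged sketch: submit a whole-node MPI job on a by-node Slurm scheduler.
# Module and executable names are placeholders, not Niagara-specific values.
import subprocess
import textwrap

job_script = textwrap.dedent("""\
    #!/bin/bash
    #SBATCH --nodes=26
    #SBATCH --ntasks-per-node=40
    #SBATCH --time=03:00:00
    #SBATCH --job-name=large-parallel
    module load mymodule        # placeholder module
    mpirun ./my_mpi_app         # placeholder MPI executable
    """)

with open("job.sh", "w") as f:
    f.write(job_script)

# Requires Slurm's sbatch on the submitting host.
subprocess.run(["sbatch", "job.sh"], check=True)
```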

Niagara
GPC vs. Niagara

                     GPC                Niagara
Year                 2009               2018
Racks                45                 21
CPU                  Intel "Nehalem"    Intel "Skylake"
Vector extensions    SSE4.2             AVX512
Cores/node           8                  40
RAM/node             16 GB              192 GB
Total cores          30,240             60,000
Network              5:1 QDR            Dragonfly EDR
Storage              2 PB               10 PB + BB
OS                   CentOS 6           CentOS 7
Rmax/Rpeak           261/312 TF         3.0/4.61 PF
Power                1000 kW            685 kW

Niagara
Migration Details
- HPSS Archive (tape) stays as is
- Existing GPC HOME, SCRATCH, PROJECT will migrate
- 2018 RAC allocations are for Niagara

Niagara
Niagara Timeline
- RFP process (Jan - Sep 2017)
- Negotiation & contracts (Aug - Oct 2017)
- TCS decommission: Oct 2017
- 1/2 GPC decommission: Nov 2017
- Deployment (Dec 2017 - Feb 2018)
- Test/config (March 2018)
- Final storage (March 2018)
- Production (April 2018)

WestGrid Updates
Patrick Mann, WestGrid Director of Operations

National Compute Systems

System: Cedar (GP2, SFU)
  Current: 27,696 CPU cores; 146 GPU nodes (4 x NVIDIA P100). Storage and compute expansion in progress (2x!). Hardware delivered and being installed.
  Status: IN OPERATION (June 30, 2017). RAC 2018 (expansion may not be available until mid-April).

System: Graham (GP3, Waterloo)
  Current: 33,472 CPU cores; 160 GPU nodes (2 x NVIDIA P100). Storage expansion in progress.
  Status: IN OPERATION (June 30, 2017). RAC 2018.

System: Niagara (LP1, Toronto)
  Current: New large parallel system, 60,000 cores. Installed and currently in testing.
  Status: April 2018. RAC 2018.

System: GP4, CQ
  Current: New general purpose system.
  Status: 2019. Not included in RAC 2018.

System: HPC4Health (Sick Kids / UHN)
  Current: http://www.hpc4health.ca/ - Elastic secure cloud, 7,000 cores.
  Status: In operation as a local resource. National service in preparation.

National Cloud Systems

System: Arbutus (UVic)
  Current Status: 290 nodes - 7,640 cores (including original west.cloud); 100G network. Additional storage purchase in progress (Ceph for block storage): current 1.5 PB raw, increase by 2 PB (3x replicated); quotes received, PO in preparation. 1,500 additional cores planned. IN OPERATION (Sep 2016).

System: Cedar Cloud Partition (SFU)
  Current Status: 10 nodes with 2x16 cores/node, 500 TB usable Ceph storage. Elastic - could expand to 48 nodes as necessary. Under development.

System: Graham Cloud Partition (Waterloo)
  Current Status: 10 nodes with 2x16 cores/node, 256 GB, 100 TB usable Ceph storage. Elastic - could expand to 53 nodes as necessary. IN OPERATION.

System: east.cloud (Sherbrooke)
  Current Status: 36 nodes with 2x16 cores/node, 128 GB, 100 TB usable Ceph. IN OPERATION (Sep 15, 2017). Available on request.

Outages

Cloud High Availability issues* (Feb/Mar): Major issues with complex cloud network HA functionality.
Scheduling (Always!): Nothing major, but the usual issues with scheduling a wide mix of jobs into saturated systems.

*The Arbutus (west) cloud experienced network stability issues starting Friday Feb 16, which were intermittent until the resolution on Tuesday Feb 27. The network instability was caused by a bug in OpenStack's High-Availability (HA) router implementation. To resolve this, we added additional router capacity and migrated all project routers to non-HA configurations. The routers have been very stable since the initial work was completed on the evening of Tuesday, February 27. On Wednesday March 14, we converted the remaining routers to non-HA and did not experience any issues.

Arbutus Usage
- Jan 1 to Mar 19, 2018
- 7,640 physical cores with 2x hyperthreading
- Week 03: Spectre/Meltdown patching
- Week 09: Feb/Mar network HA issue

Cedar CPU Usage
- Nov 18: major Cedar outage for power, scheduler, and OS upgrades.
- Otherwise running at about 83% capacity.
- 24,192 CPU cores (expansion April 2018).

Cedar GPU Usage
- 584 GPUs; 4 x NVIDIA P100s per node.
- Very variable usage - only about 50%.
- RAC 2018 requests: 5x.
- Expect 2018 ramp-up as users become more familiar.
- Note: most use is from packages.

Legacy Systems & Migration

Continuing Systems

System: Orcinus
  Defunding: March 31, 2019
  Plans: Continuing for 1 more year. RAC "bonus".

System: Parallel
  Defunding: Mar 31, 2018
  Plans: Continued by UCalgary for UCalgary users. Expected to use UCalgary accounts.

System: Grex
  Defunding: Mar 31, 2018
  Plans: Continued by UManitoba for UManitoba users. Probably CC LDAP and accounts. UManitoba local "RAC".

System: Jasper/Hungabee
  Defunding: Defunded
  Plans: May be continued by UofA for UofA users. Probably CC LDAP and accounts.

All are "best effort".
WG web pages: Continue as a resource for institutional continuation.
Details: Ask local IT or HPC services.

Legacy System Availability

                                           2013    2014    2015    2016    2017
Unscheduled system outages (core-hours)     -      2.1%    2.5%    2.5%    0.87%
Overall availability                       94%    97.4%   95.2%   97.0%   98.9%

Notes:
- 2017: No major outages.
- Scheduled system outages: Hungabee/Jasper OS/Lustre upgrades; Bugaboo electrical work in preparation for GP2.
- 2013: 94% due to major upgrades.
- 2017 includes only full-year systems: orcinus, grex, parallel. Other systems were defunded/decommissioned during the year. Hungabee was defunded - a special system which really pulled down annual uptimes.
- NO MAJOR OUTAGES. In 2016, big (multi-week) outages were due to disc/storage failures. Grex and Parallel have relatively new storage systems (grex: 2017, parallel: 2016).
- (Due to CC monitoring issues, legacy statistics are not reliable. Expect usage to be around 85%.)

Legacy User Data
IMPORTANT: Data on defunded systems will be deleted 1 MONTH after the defunding date (April 1 for Bugaboo, April 30 for Grex & Parallel).
WestGrid will not retain any long-term or back-up copies of user data, and users must arrange for migration of their data. (/project space on Cedar or Graham is the recommended storage location.)
WestGrid Migration Details: https://www.westgrid.ca/migration process
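A minimal sketch of one way to copy data off a legacy system before the deletion date, pulling a directory into /project with rsync over SSH. The hostname, username, and paths are placeholders; Globus is another common option for large transfers.

```python
# Hedged sketch: pull a directory from a legacy system into /project on
# Cedar or Graham using rsync over SSH. Host, user, and paths are
# placeholders; run from a login node on the destination system.
import subprocess

SRC = "myuser@legacy.example.ca:/home/myuser/results/"   # placeholder source
DST = "/project/my-project-id/myuser/results/"           # placeholder destination

subprocess.run(
    ["rsync", "-avP", SRC, DST],   # -a preserve attributes, -v verbose, -P progress/resume
    check=True,
)
```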

RAC 2018
Resource Allocation Competition

RAC 2018

Award letters: Late March 2018
RAC 2018 allocations implemented: Apr 4, 2018

Cedar expansion is a few days late and may not be available on Apr 4. Cedar will be operational and RACs will be on Cedar, but expansion nodes may not all be installed, so the actual share will be smaller for a short period.
Niagara is currently on schedule.

RAC 2018 Stats (Preliminary)

Class                    Allocatable (100%)   Requested   Allocated   % of Requests Allocated
CPU (CY)                 185,404              287,457     140,712     49%
CPU Memory (RAM, TB)*    740                  930         N/A         N/A
GPU (GPUs)               976                  4,092       840         20.5%
Project Storage (PB)     54                   32          29          75%
Cloud compute (vCPU)     17,920               8,556       7,993       58%
Cloud persistent (vCPU)  8,384                3,906       3,838       56%
Cloud Storage (PB)*      4.5                  5           4.2         85%

*Job Accounting: Charged for Core-Year-Equivalents (CYE). Large-memory jobs get charged more because they use a larger share of resources. Please be as realistic as possible when requesting memory.
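The "large-memory jobs get charged more" rule can be illustrated with a core-equivalent calculation: a job is billed for whichever is larger, its core count or its memory expressed in core-sized bundles. The 4 GB-per-core bundle size below is an assumption for illustration only, not an official figure for any specific system.

```python
# Illustrative sketch of core-year-equivalent (CYE) charging. A job is
# billed for max(cores, memory in core-sized bundles) over its duration.
# The 4 GB-per-core bundle size is an assumption for this example.
GB_PER_CORE_BUNDLE = 4.0

def core_year_equivalents(cores: int, mem_gb: float, years: float) -> float:
    core_equivalents = max(cores, mem_gb / GB_PER_CORE_BUNDLE)
    return core_equivalents * years

# A 4-core job asking for 64 GB is charged like a 16-core job,
# which is why realistic memory requests matter.
print(core_year_equivalents(cores=4, mem_gb=64, years=0.5))   # -> 8.0
```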

Nearline
- Still working on the "Nearline" system (self-serve tape archiving).
- Investigating automated HSM (Hierarchical Storage Management).
- Nearline needs integration with the backup utility.
- Turning out to be much tougher than expected.
- No RAC 2018 Nearline allocations.
- Ask support@westgrid.ca if you want help with tape backup, nearline, group requests, shared storage, etc.

General Updates

Account Renewals
- CCDB accepting account renewals starting March 23. Notification emails will be sent out. DEADLINE: April 23.
- Purpose: Keep grad student, postdoc, etc. accounts up to date. Stats for reporting to funders (CFI in particular).
- CCV (Common Curriculum Vitae): Update capability now available. Very important for our reporting - research output, FWCI, etc.

WestGrid Online Sessions
*NEW* Bi-weekly training webinars until May 23, 2018. Every second Wednesday, 10am Pacific / 11am Mountain / 12pm Central. FREE!

Date | Topic | Speaker
March 28 | Visualization Series: Scientific visualization with Plotly | Alex Razoumov
April 11 | Bioinformatics Series: Designing and implementing a variant database | Mya Warren (Genome Sciences Centre)
April 25 | Digital Humanities topic (TBA) | John Simpson
May 9 | Tools for Automating Analysis Pipelines | Jamie Rosner
May 23 | Visualization Series topic (TBA) | Alex Razoumov

Watch: www.westgrid.ca/training

Other Training Sessions

Date & Location | Topic | Details
Apr 30 & May 1 - ONLINE | Software Carpentry | 018-04-30-ttt-canada
May 11 - SFU | Cedar Introduction Workshop | Omics @ SFU
May 16-18, 2018 - UNBC | Intro to HPC | ParaView, Python, Chapel; UNBC (Prince George)
May 28-30 - UofA | Spring Training at UofA | HPC Carpentry & ParaView visualization courses as part of Research Data Management events
June 4-15 - UVic | DHSI Victoria | https://dhsi.org (many CC courses)
June 11-14 - UBC | WG Summer School at UBC | Registration opens in April
June 25-28 - UMan | WG Summer School at UManitoba | Registration opens in April
July 3 - UofA | ParaView | UofA Faculty of Engineering Grad Research Symposium

User Training Materials
- Videos, slides, hands-on exercises & other materials from past training sessions
- Links to other guides, documentation & upcoming events

Support
Contact us: support@computecanada.ca
