Raghu Chandrasekar - CUG

Transcription

An Exploration into Object Storage for Exascale Supercomputers
Raghu Chandrasekar

Agenda
  Introduction
  Trends and Challenges
  Design and Implementation of SAROJA
  Preliminary Evaluations
  Summary and Conclusion
CUG 2017. Copyright 2017 Cray Inc.

Safe Harbor Statement
This presentation may contain forward-looking statements that are based on our current expectations. Forward looking statements may include statements about our financial guidance and expected operating results, our opportunities and future potential, our product development and new product introduction plans, our ability to expand and penetrate our addressable markets and other statements that are not historical facts. These statements are only predictions and actual results may materially vary from those projected. Please refer to Cray's documents filed with the SEC from time to time concerning factors that could affect the Company and these forward-looking statements.

Storage Hierarchy Data Path Concepts
O(5ns) In-Package Memory
  Customer Workload
O(50ns) DDR Memory
  Customer Workload
Distributed Storage Client
  Node-Local Cache & Working Memory
  Relaxed POSIX or Key-Value API
  Local Cache Control
  Sharding and Resiliency Controls
O(1µs) Nonvolatile Storage
  Private Scratch Namespace
  Relaxed POSIX or Key-Value API
  Sharable Namespace (upon flush to backing store)
  App-Controlled Caching; Close-to-Open Consistency
  Inter-Node Cache-Consistent in Small Clusters
O(100µs) Nonvolatile Storage
  Sharable Across Computers
  Lightweight Object or POSIX Interface
  Primary Resilient Random Storage
  For HPC / Analytics / General
O(10ms) Backing Store
  Site-Wide Access
  Bulk Object/Cloud Storage API
  High-9’s Data Resiliency
  Streaming Sequential Throughput
Multi-Second Data Distribution & Protection
  Multi-Site Reach
  Bulk Object Storage API
  Cloud Bursting, Disaster Recovery, Archival
Namespace Servers
  Scalable Metadata Service
  Serves Metadata to One Or More Tiers
  Provide Key-Value, POSIX, namespace APIs
  Attributes and Rich Metadata Structures
  Map App Structures to One Another
  Collections and Manifests of Large Data Sets
Diagram labels: CPUs, HBM, DRAM, NVRAM, NAND, HDD, Tape, High Speed Compute Fabric, Site Network, Switch, IO Nodes, Storage Servers, Cold Storage Servers, Archive, WAN / Cloud

Storage Media Latencies and IOPS
Disk-based (5 msec): network 100 µs, software 200 µs, storage 5000 µs; 200 IOPS*
Flash/pmem (0.0x msec):
  Flash with RDMA (0.05 msec): chart segments 25 and 25; 20k IOPS*
  With pmem (0.03 msec): chart segments 25 and 5; 32K IOPS*
Software becomes the largest fraction of latency when using persistent memory, even with 4x improved software efficiency.
* Max potential 1-thread random sector
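The IOPS figures on this slide follow from the latencies: a single thread that issues one request at a time cannot complete more than 1/latency operations per second. The sketch below reproduces that arithmetic from the slide's headline latencies. The function name and the dictionary are illustrative assumptions, not part of the presentation.

```python
# Single-thread IOPS ceiling: with one outstanding request at a time,
# a thread completes at most 1 / latency operations per second.

def max_single_thread_iops(latency_us: float) -> float:
    """Upper bound on IOPS for one synchronous thread (hypothetical helper)."""
    if latency_us <= 0:
        raise ValueError("latency must be positive")
    return 1_000_000 / latency_us

# End-to-end latencies taken from the slide's headline figures, in microseconds.
configs_us = {
    "disk-based (5 msec)": 5000.0,
    "flash with RDMA (0.05 msec)": 50.0,
    "with pmem (0.03 msec)": 30.0,
}

for name, latency_us in configs_us.items():
    print(f"{name}: {max_single_thread_iops(latency_us):,.0f} IOPS")
```

The disk and flash cases match the slide (200 and 20k IOPS). The slide quotes 32K IOPS for the pmem case, slightly below the bound computed from 30 µs, so the plotted latency may have been rounded.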

Cray Compute and Fabric Topology
High bandwidth Dragonfly fabric connecting Group 0 through Group 7
Flexible compute: enclosure-based storage, potentially 64k (or more) devices
High density compute: compute node-local storage, potential 256k nodes

Analytics and HPC Software Convergence
HPC: User Application (POSIX Files, HDF5 Containers, K/V) on HPC File or Object with Optional Caching
Analytics: User Application (Spark RDDs, K/V, or Other) on Analytics Framework with Local Caching
Storage: Pmem and Flash devices
High-speed dragonfly fabric
256k Node Management, Monitor, Service Infrastructure

SAROJA Proof-of-Concept

Scalable And Resilient ObJect StorAge (SAROJA)
Compute nodes: parallel applications
Interfaces: POSIX, Object, Native API
SAROJA client (libsaroja)
Consensus: PAXOS/RAFT/Zab cluster
Metadata: NoSQL service
Datapath: object storage cluster
Storage media shown: NVMe flash, persistent memory
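The architecture slide separates metadata (a NoSQL service) from the data path (an object storage cluster), with a client library in between. The toy model below illustrates that split only. Every class and method name is hypothetical and is not the libsaroja API; both stores are in-memory dictionaries, and the consensus service is omitted.

```python
import uuid


class ToyMetadataStore:
    """Stand-in for the NoSQL metadata service: path -> attributes."""

    def __init__(self):
        self._rows = {}

    def insert(self, path: str, attrs: dict) -> None:
        if path in self._rows:
            raise FileExistsError(path)
        self._rows[path] = attrs

    def lookup(self, path: str) -> dict:
        return self._rows[path]  # raises KeyError if the path is unknown


class ToyObjectStore:
    """Stand-in for the object storage cluster: object ID -> bytes."""

    def __init__(self):
        self._objects = {}

    def put(self, object_id: str, data: bytes) -> None:
        self._objects[object_id] = data

    def get(self, object_id: str) -> bytes:
        return self._objects[object_id]


class ToyClient:
    """Hypothetical client that maps file-like calls onto the two stores."""

    def __init__(self, metadata: ToyMetadataStore, objects: ToyObjectStore):
        self.metadata = metadata
        self.objects = objects

    def write_file(self, path: str, data: bytes) -> None:
        object_id = uuid.uuid4().hex
        # Claim the name first so a duplicate path fails before any data is stored.
        # A real system would also have to handle a failure between the two steps.
        self.metadata.insert(path, {"object_id": object_id, "size": len(data)})
        self.objects.put(object_id, data)

    def read_file(self, path: str) -> bytes:
        attrs = self.metadata.lookup(path)
        return self.objects.get(attrs["object_id"])


client = ToyClient(ToyMetadataStore(), ToyObjectStore())
client.write_file("/scratch/run1/output.dat", b"checkpoint bytes")
print(client.read_file("/scratch/run1/output.dat"))
```

In this sketch a file create touches only the metadata store plus one object write, which is why the metadata service's scaling matters for file creation rates.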

Preliminary Evaluations

Metadata Evaluations
[Chart: Ceph vs Lustre: File Creation Rates (higher is better). Series: ceph fuse, cephfs (kernel), Lustre. Y axis: creates per second, 0 to 12,000. X axis: number of client processes, 1 to 128.]
Ceph POSIX support still has a long way to go

Metadata Evaluations: POSIX over SAROJA
Test setup: 4480 MPI ranks; 56 XC compute nodes; 500 files/rank; TCP over GNI; replication disabled
[Chart: SAROJA File Creates vs. Cassandra. Y axis: creates per second. X axis: number of Cassandra servers (1, 2, 4, 8). Reference line: peak Lustre file creation rate (w/o DNE).]
Scaling trends not ideal; but promising approach functionally
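For a sense of scale, the test setup above implies the following workload size. The sketch uses only the rank, node, and files-per-rank counts from the slide; the create rates in the loop are hypothetical values chosen for illustration, not measured results.

```python
# Workload parameters stated on the slide.
ranks = 4480
nodes = 56
files_per_rank = 500

ranks_per_node = ranks // nodes          # MPI ranks placed on each compute node
total_creates = ranks * files_per_rank   # file creates issued by the whole job


def elapsed_seconds(total_ops: int, ops_per_second: float) -> float:
    """Time to finish total_ops at a sustained aggregate rate (hypothetical helper)."""
    if ops_per_second <= 0:
        raise ValueError("rate must be positive")
    return total_ops / ops_per_second


print(f"ranks per node: {ranks_per_node}")
print(f"total file creates: {total_creates:,}")

# Hypothetical aggregate create rates, for illustration only.
for rate in (10_000, 20_000, 40_000):
    print(f"at {rate:,} creates/s: {elapsed_seconds(total_creates, rate):.0f} s")
```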

Data Path Evaluation
[Chart: Ceph vs Lustre Throughput. Series: Ceph (mean), Lustre (mean), Ceph (peak), Lustre (peak). Y axis: throughput (MB/s), 0 to 12,000. X axis: number of client processes, 1 to 128.]
Viable for use in the data path; plenty of opportunities for tuning

Summary
  Inflection point in storage system design
  Three-tier storage topology for supercomputers
  Promising early investigations with object storage tech
  Gradual transition
  Call for feedback

Legal Disclaimer
Information in this document is provided in connection with Cray Inc. products. No license, express or implied, to any intellectual property rights is granted by this document.
Cray Inc. may make changes to specifications and product descriptions at any time, without notice.
All products, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.
Cray hardware and software products may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Cray uses codenames internally to identify products that are in development and not yet publicly announced for release. Customers and other third parties are not authorized by Cray Inc. to use codenames in advertising, promotion or marketing and any use of Cray Inc. internal codenames is at the sole risk of the user.
Performance tests and ratings are measured using specific systems and/or components and reflect the approximate performance of Cray Inc. products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.
The following are trademarks of Cray Inc. and are registered in the United States and other countries: CRAY and design, SONEXION, and URIKA. The following are trademarks of Cray Inc.: APPRENTICE2, CHAPEL, CLUSTER CONNECT, CRAYPAT, CRAYPORT, ECOPHLEX, LIBSCI, NODEKARE, REVEAL, THREADSTORM. The following system family marks, and associated model number marks, are trademarks of Cray Inc.: CS, CX, XC, XE, XK, XMT, and XT. The registered trademark LINUX is used pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis. Other trademarks used in this document are the property of their respective owners.

Questions & Answers
Raghu Chandrasekar
raghu@cray.com
