HPC Growing Pains

Transcription

HPC Growing Pains
IT Lessons Learned from the Biomedical Data Deluge
John L. Wofford
Center for Computational Biology & Bioinformatics, Columbia University
Tuesday, March 27, 2012

What is C2B2?
- An internationally recognized biomedical computing center.
- More than 15 labs and nearly 200 faculty, staff, students and postdocs across multiple campuses and 8 departments.
- IT staff of 8, covering everything from desktop to HPC & datacenter.
- Broad range of computational biomedical and biology research, from biophysics to genomics.
- Affiliates

HPC Growing Pains
Over the past 3 years we have grown our HPC resources by an order of magnitude, driven largely by genomic data storage and processing demands.

                              Before (2008)   After (2012)
Total CPU-cores               500             4,500
Largest cluster: CPU          400 cores       4,000 cores
Largest cluster: memory       800 GB          8 TB
Annual CPU-hours              2 M             50 M
Average daily active users    20              120
Storage capacity              30 TB           1 PB
Data center space             800 sq. ft.     4,000 sq. ft.

Outline
IT Lessons Learned from the Biomedical Data Deluge
I. Intro: biomedical data growth
II. Storage challenges
    II.1. Performance
    II.2. Capacity
    II.3. Data integrity
III. Conclusions

What data deluge? A few stock facts:
- With a collective 269 petabytes of data, education was among the U.S. economy's top 10 sectors storing the largest amount of data in 2009, according to a McKinsey Global Institute survey.
- The world will generate 1.8 zettabytes of data this year alone, according to IDC's 2011 Digital Universe survey.
- Worldwide data volume is growing a minimum of 59% annually, a Gartner report estimates, outrunning Kryder's law for disk capacity-per-cost growth.
- Biomedical data, primarily driven by gene sequencing, is growing dramatically faster than the industry average and Kryder's law.
...not only does that data need to be stored, it needs to be heavily analyzed, demanding both performance and capacity.
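To put the 59% figure in perspective, a quick compounding calculation (a rough sketch; the four-year horizon is simply the span covered by this talk) shows how fast volumes double at that rate and how C2B2's own growth compares:

```python
import math

# Compounding check on the 59%-per-year industry growth figure.
annual_growth = 0.59
doubling_years = math.log(2) / math.log(1 + annual_growth)  # ~1.5 years
four_year_factor = (1 + annual_growth) ** 4                 # ~6.4x over 2008-2012

print(f"doubling time ~{doubling_years:.1f} years, "
      f"~{four_year_factor:.1f}x growth over 4 years")
# C2B2's jump from 30 TB to 1 PB over the same span (~33x) is far above this
# industry rate, consistent with the slide's point about sequencing-driven data.
```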

Sequence data production rates
[Figure not reproduced in the transcription.]

[Figure from Kahn S.D., "On the future of genomic data," Science 2011;331:728-729.]

C2B2 data growth, June '08 - March '12
[Chart: data usage (TB) versus months since June '08. Series show actual usage, the usage trend, the requirement (usage trend plus 50% overhead), raw capacity, and the industry trend of 59% annual growth.]

The storage challenge: design a storage system that can:
1. Perform well enough to analyze the data (i.e. stand up to a top500 supercomputer);
2. Scale from terabytes to petabytes (without having to constantly rebuild);
3. Protect important data (from crashes, users and floods).

Performance
For parallel processes you need parallel file access.
- We have 4,000 CPUs working around the clock analyzing data; we want to keep them all fed with data all of the time.
- Our workload is "random" and not well behaved. It's notoriously difficult to design for this kind of workload. Ideally, we want a solution that can stay flexible as our workloads change.
- The more disks we have spinning in parallel, the better the performance. We're going to need a lot of disks: using a rough heuristic of 1 disk per compute node, that would mean 500 disks.
- But to make that many disks useful, we're going to need a lot of processing and network capability.
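A back-of-envelope sketch of that heuristic; the 8 cores per node and 100 MB/s per SATA disk are assumptions for illustration, not figures from the talk:

```python
# Rough sizing: one spindle per compute node, aggregate streaming ceiling.
compute_cores = 4000
cores_per_node = 8                       # assumed node size
disks = compute_cores // cores_per_node  # ~500 spindles, matching the slide
per_disk_mb_s = 100                      # assumed streaming rate per SATA disk

aggregate_gb_s = disks * per_disk_mb_s / 1000
print(f"{disks} disks -> ~{aggregate_gb_s:.0f} GB/s aggregate, best case")
# A random, metadata-heavy workload delivers far less than this streaming
# ceiling, which is why spindle count alone is not the whole story.
```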

Traditional NAS Architecture
Single NAS head with multiple disk arrays (NetApp, BlueArc, EMC).

Pro:
- Support: time-tested architecture with many major, competing vendors.
- Capacity scaling: relatively easy on modern NAS.

Con:
- Performance scaling: difficult & unpredictable.
- Management: storage pools must be managed and tuned.
- Reliability: the NAS head provides a single failure point.

[Diagram: traditional NAS, single controller with many arrays. The NAS head (CPU, cache, network) processes network file storage requests (NFS, CIFS, etc.) and manages SAN storage pools; disk arrays (JBODs or RAIDs, SSD and SATA) attach via SAN or direct-attached interconnect (FC, SATA, ...); storage clients (cluster nodes, servers, desktops, ...) connect to the NAS head.]

Clustered NAS architecture
Single filesystem distributed across many nodes (Isilon, Panasas, Gluster, ...).

Pro:
- Capacity scaling: new nodes automatically integrate.
- Performance scaling: new nodes add CPU, cache and network performance.
- Reliability: most architectures can survive multiple node failures.

Con:
- Support: relatively new technology; few vendors (but rapidly growing).

[Diagram: clustered NAS. A single filesystem is presented by multiple nodes, each of which processes filesystem requests. Every clustered NAS node has its own CPU, cache, network interface and disk pool; a high-speed backend network transfers data between nodes; storage clients (cluster nodes, servers, desktops, ...) can connect to any node.]

Clustered NAS doesn't solve everything.
- In late 2009 we had scaled to where we thought we should be, but our system was unresponsive, with constant, very high load.
- More puzzling was that the load didn't seem to have any correlation with network or disk throughput, or even with load on the compute cluster.

What we found (with the help of good analytics)

Namespace reads
Namespace operations consume CPU and waste I/O capability.
- It's common in biomedical data to have thousands, or millions, of small files in a project.
- We have 500M files, with an average file size of less than 8 KB.
- Many genome "databases" are directory structures of flat files that get "indexed" by filename (NCBI, for instance, hosts 20k files in their databases).
- Our system was thrashing, and we weren't getting a lot of I/O: the 40% namespace reads were killing our performance.

[Chart: distribution of protocol operations. Namespace read 40%, read 32%, write 15%, other 13%.]
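A minimal sketch of the kind of check that exposes this pattern: walk a project tree and report file count and average size. The path below is a hypothetical example, not a real C2B2 location:

```python
import os

def profile_tree(root: str) -> tuple[int, float]:
    """Return (file_count, average_size_bytes) for everything under root."""
    count, total = 0, 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
                count += 1
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return count, (total / count if count else 0.0)

if __name__ == "__main__":
    files, avg = profile_tree("/data/projects/example_genome_db")  # hypothetical path
    print(f"{files} files, average size {avg / 1024:.1f} KB")
    # Millions of tiny files with a small average size is a hint that the
    # workload is dominated by namespace (metadata) operations, not data I/O.
```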

SSD-accelerated namespace
Solid-state disks provide fast seek times for namespace reads.
- Namespace reads are seek intensive (not throughput intensive).
- SSD seek times are generally ~40x faster than spinning disks.
- We were able to spread our filesystem metadata over SSDs on our nodes, dramatically increasing namespace performance and decreasing system load.
- We experienced an immediate 8x increase in filesystem responsiveness, and an overall system performance increase.

[Diagram: SSD-accelerated namespace. Namespace (metadata) data lives on SSDs, which have very low seek times; ordinary data lives on ordinary disks. Each SSD-enabled node holds filesystem metadata on its SSD and data on its spinning disks.]
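A rough, illustrative model of why the change helps. The 8 ms HDD seek time and the serial million-operation scan are assumptions; only the ~40x ratio comes from the slide:

```python
# Serial worst-case time for a metadata-heavy scan, HDD vs. SSD metadata.
hdd_seek_ms = 8.0                 # assumed random seek for a SATA spindle
ssd_seek_ms = hdd_seek_ms / 40    # the ~40x faster seek cited above
namespace_ops = 1_000_000         # e.g. a scan touching a million entries

hdd_hours = namespace_ops * hdd_seek_ms / 1000 / 3600
ssd_minutes = namespace_ops * ssd_seek_ms / 1000 / 60
print(f"HDD metadata: ~{hdd_hours:.1f} h; SSD metadata: ~{ssd_minutes:.1f} min")
# Real filesystems batch and cache, but the ratio is what drives the drop in
# load once metadata moves to SSD.
```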

Capacity
How we scaled a single filesystem from 168 TB to 1 PB.
1. While high-performance clustered NAS naturally scales capacity as well as performance, it's an expensive way to build capacity.
2. We don't want to have the "big" filesystem and the "fast" filesystem. We want everyone to see the same files from everywhere.
3. High-performance systems benefit from uniform hardware. Capacity scaling benefits from being able to use the latest, biggest, densest disks.
4. In fact, a clustered NAS is typically made of entirely uniform hardware, so how do you update without time-consuming data migrations?

Multi-tiered clustered NAS
Single filesystem distributed across many pools of nodes.
- Multiple "pools" of nodes share a single filesystem namespace.
- Different pools can have different performance/capacity, allowing independent scaling of capacity and performance.
- New pools can be added, and old ones removed, allowing seamless upgrades.
- All pools are active, so nodes configured for large capacity can still serve data to low-demand devices.

[Diagram: multi-tiered clustered NAS with multiple pools and a single namespace. A high-speed storage pool of SSD-accelerated SSD/SATA nodes serves the high-performance compute clusters and data-generating research equipment (e.g. gene sequencers); a high-capacity pool of nearline storage nodes serves infrastructure servers, virtual machines, desktops and workstations; rule-based data migration moves data between pools over a shared backend network.]
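A minimal sketch of the idea behind rule-based migration between pools. This is not the vendor's policy engine; the 90-day age threshold, the pool paths and the use of last-access time are illustrative assumptions:

```python
import os
import time

AGE_DAYS = 90                      # assumed: files untouched this long get demoted
FAST_POOL = "/ifs/fast"            # hypothetical high-speed pool path
CAPACITY_POOL = "/ifs/nearline"    # hypothetical high-capacity pool path

def candidates_for_demotion(root: str, age_days: int = AGE_DAYS):
    """Yield files under root whose last access is older than age_days."""
    cutoff = time.time() - age_days * 86400
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_atime < cutoff:
                    yield path
            except OSError:
                continue  # skip files that disappear mid-scan

for path in candidates_for_demotion(FAST_POOL):
    # In a real clustered NAS the data moves between pools transparently and
    # the namespace path never changes; printing stands in for that action.
    print(f"would migrate {path} -> {CAPACITY_POOL} pool")
```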

Evolution of a filesystem
- 2008, inception: 168 TB
- 2009, new nodes: 276 TB
- 2010 Q1, new nodes (separate cluster): 496 TB
- 2010 Q3, merged pools, new nodes (SSD): 672 TB
- 2010 Q4, swapped original nodes for new nodes: 648 TB
- 2011, capacity upgrade (denser nodes): 984 TB

[Chart: raw capacity versus requirement over the life of the filesystem.]

Caveat: a single namespace has challenges
- Since the namespace doesn't refresh, it tends to grow and grow. We currently have 500 million files in our filesystem.
- While it's nice to have all of your files in one space, it takes a lot of effort to keep it organized.
- In the initial deployment, we spent roughly 30x longer planning the filesystem structure than deploying the hardware.
- If you plan your filesystem poorly, it could take a long time to relocate or remove all of those files.

Data integrity
How do you protect large-scale, important data from users, glitches & floods?
1. Users: "Oops! I didn't mean to delete that!"
2. Glitches: "Error mounting: mount: wrong fs type, bad option, bad superblock on /dev/..."
3. Floods: "Who installed the water main over the data center?!"

Tape vs. Disk-to-Disk
Tape is dead. Right?

Tape: cheap, low-power, long shelf life.
Disk-to-disk: fast, easy to maintain, reliable.

[Diagrams: tape backup (file server to backup server to tape library); disk-to-disk backup (file server to backup server, with a primary NAS replicating to a secondary NAS).]

Neither option is ideal on its own.
- Using only tape is impractical. LTO5 can write 1 TB in 6 hr, or 200 TB in 50 days. You can split the work across drives, but this becomes a management nightmare.
- Using only disk is cost prohibitive (a complete disk system versus a complete tape system). Plus:
  - you need to keep disk powered;
  - there's no easy (or safe) way to archive disks;
  - it has to grow faster than primary storage (if you want historical archives).
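A quick check of that tape arithmetic, plus how many drives a given backup window would take. The drive rate comes from the slide's "1 TB in 6 hr" figure; the 14-day window is an assumed example:

```python
import math

TB_PER_DRIVE_PER_HOUR = 1 / 6   # from "LTO5 can write 1 TB in 6 hr"
dataset_tb = 200
window_days = 14                # hypothetical acceptable backup window

single_drive_days = dataset_tb / TB_PER_DRIVE_PER_HOUR / 24
drives_needed = math.ceil(single_drive_days / window_days)

print(f"one drive: ~{single_drive_days:.0f} days (the slide's 50-day figure); "
      f"{drives_needed} drives needed for a {window_days}-day window")
```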

Our middle ground
Snapshots + replication + tape: protection from:
- Users: frequent snapshots on the source provide easy "oops" recovery to the user.
- Glitches: replication provides short-term rapid recovery. Added snapshots extend the replication archives to the mid-term (~6 mo.).
- Floods: tape backup provides cheap, reliable archival of data for large-scale disaster recovery (or important files from '06). Leaving backup windows flexible keeps tape manageable.

[Diagram: the data backup path. The primary storage cluster keeps short-term snapshots for easy user recovery; frequent (daily) replication feeds a replication cluster that holds a live copy of critical data plus historical snapshots; lazy archival (~6 mo.) writes to a tape library for offsite DR and long-term archival, one copy per year.]
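The scheme can be summarized as a small retention table, sketched below. The daily replication and yearly tape copy come from the slides; the snapshot frequency and exact retention windows are assumptions for illustration:

```python
# Three protection tiers mapped to frequency, retention and the failure mode
# each one covers. Values marked "assumed" are not from the talk.
RETENTION_POLICY = {
    "snapshots":   {"where": "primary cluster",      "every": "4 h (assumed)",
                    "keep": "2 weeks (assumed)",     "covers": "user error"},
    "replication": {"where": "replication cluster",  "every": "daily",
                    "keep": "~6 months",             "covers": "glitches"},
    "tape":        {"where": "offsite tape library", "every": "yearly",
                    "keep": "long term",             "covers": "site disasters"},
}

for tier, rule in RETENTION_POLICY.items():
    print(f"{tier:<11} every {rule['every']:<14} keep {rule['keep']:<18} "
          f"against {rule['covers']}")
```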

The big picture: a multi-tiered, scale-out storage architecture, from HPC infrastructure to the desktop.

[Diagram: high-performance compute clusters (blade chassis with compute nodes and virtual machines) and data-generating research equipment (e.g. gene sequencers) connect over a 10 Gbps aggregation network to the primary storage cluster, which has multiple pools but a single filesystem. The high-speed pool of SSD-accelerated SSD/SATA storage nodes serves the compute clusters and sequencers; the nearline (high-capacity) pool serves infrastructure servers, the virtualization infrastructure, and desktops & workstations; rule-based data migration moves data between the pools. The backup infrastructure replicates critical data to a replication cluster and archives long-term copies to a tape library for offsite DR.]

Conclusion
Putting it all together.
- With clustered NAS and SSD acceleration, we're regularly seeing filesystem throughput in excess of 10 Gbps and IOPS well over 500k without an issue.
- So far we've managed to stay ahead of our data-growth curve with multi-tiered storage. We plan to at least double capacity in the next 6-12 months with no major architectural changes.
- With a combination of snapshots, disk-to-disk replication and tape, we're getting daily backups of all important data as well as long-term archives.

Thank you! Questions?
