Implementing a Hierarchical Storage Management System in a Large-Scale Lustre and HPSS Environment (CUG 2017)

Transcription

Implementing a Hierarchical Storage Management system in a large-scale Lustre and HPSS environment
Brett Bode, Michelle Butler, Sean Stevens, Jim Glasgow - National Center for Supercomputing Applications / University of Illinois
Nate Schumann, Frank Zago - Cray Inc.

Background: What is Hierarchical Storage Management?
- A single namespace view for multiple physical/logical storage systems
- Automated movement of data to the lower tiers, usually based on configurable policies
- Data is returned to the top tier based on request or access.

Traditional HSM
- HSM environments have been used for many years to front-end tape archives with a "disk cache".
- Usually the environment was isolated; the only actions on data are to transfer to/from the system.
- All data is expected to be written to the back-end.
- Most data is accessed infrequently.
[Diagram: the user accesses a single namespace via explicit data transfer to/from the environment; the disk cache automatically migrates data to the tape archive by policy (dual resident, released as needed); released data is retrieved when requested.]

Blue Waters Computing System
[Diagram: aggregate memory 1.66 PB; Sonexion online storage, 26 usable PB at 1 TB/sec; Spectra Logic near-line storage, 200 usable PB at 66 GB/sec; Scuba subsystem configured for best user access at 120 GB/sec; external servers connected through an IB switch and a 10/40/100 Gb Ethernet switch; 400 Gbps WAN.]

[System diagram: Cray XE6/XK7, 288 cabinets on the Gemini fabric (HSN).
- XE6 compute nodes: 5,659 blades, 22,636 nodes, 362,176 FP (Bulldozer) cores, 724,352 integer cores
- XK7 GPU nodes: 1,057 blades, 4,228 nodes, 33,824 FP cores, 4,228 GPUs
- Service nodes: DSL 48, resource manager (MOM) 64, boot 2, SDB 2 (with boot RAID), RSIP 12, network gateway 8, unassigned 74, LNET routers 582, esLogin 4, SMW, boot cabinet
- External infrastructure via InfiniBand fabric and 10/40/100 Gb Ethernet switch: import/export nodes, HPSS data mover nodes, cyber protection IDPS, management node, NCSAnet, NPCF
- Storage: Sonexion, 25 usable PB online storage in 36 racks; near-line storage, 200 usable PB; esServers cabinets
- Supporting systems: LDAP, RSA, Portal, JIRA, Globus CA, Bro, test systems, Accounts/Allocations, Wiki]

Today's Data Management
[Diagram: the user manages file location between scratch and Nearline via Globus commands; scratch is subject to a 30-day purge.]

Hierarchical Storage Management (HSM) Vision:
[Diagram: the user accesses a single namespace on scratch; data is automatically migrated by policy to Nearline or an alternate online target (dual resident, released as needed); released data is retrieved when requested.]
- Purge no longer employed
- Allows policy parameters to manage file system free space
- Users limited by Lustre quota and back-end quota (out of band)

HSM Design
- Lustre 2.5 or later required for HSM support - NCSA completed the upgrade for all file systems in August 2016 (Sonexion Neo 2.0)
- Lustre copy tool provided via co-design and development with the Cray Tiered Adaptive Storage Connector product
  - Cray provides the bulk of the copy tool, with a plugin architecture for various back-ends.
  - NCSA develops a plugin to interface with HPSS tape systems.
- Specifications created for resiliency, quotas, disaster recovery, etc.

Lustre HSM Support
- Lustre (2.5 and later) provides tracking for files regardless of the location of the data blocks.
  - File information remains on the MDS, but none on the OSTs.
- Commands are provided to initiate data movement and to release the Lustre data blocks: lfs hsm_[archive|release|restore|...] (see the examples below).
- Lustre tracks the requests, but does not provide any built-in data movement.
- A copy tool component is required to register with Lustre in order for any data movement to occur.
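For illustration, a minimal shell sketch of the lfs hsm_* commands referenced above; the file path and archive ID are illustrative, not taken from the presentation:

    # Archive a file to back-end archive ID 1
    lfs hsm_archive --archive 1 /scratch/project/bigfile.dat

    # Check the HSM state and any in-flight HSM action for the file
    lfs hsm_state  /scratch/project/bigfile.dat
    lfs hsm_action /scratch/project/bigfile.dat

    # Free the Lustre data blocks once an archived copy exists
    lfs hsm_release /scratch/project/bigfile.dat

    # Explicitly stage the file back to Lustre
    lfs hsm_restore /scratch/project/bigfile.dat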

HSM Policies
- Policy engine provided by Robinhood
- Policies drive scripts that execute the lfs hsm_* commands (a sketch follows this list).
- It is very early in the policy development process, so these are early thoughts:
  - Files will be copied to the back end after 7 days
    - Estimated based on a review of the churn in the scratch file system.
  - Files will be released when the file system reaches 80% full, based on age and size.
  - Files below 16 MB will never be released and will be copied to a secondary file system rather than HPSS.
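As a rough illustration of the kind of script such a policy might drive, the sketch below selects files untouched for 7+ days and issues archive requests. The mount point, archive ID, and selection criteria are assumptions for illustration, not the site's actual policy configuration:

    #!/bin/bash
    # Archive regular files in scratch that have not been modified for 7+ days.
    FS=/scratch          # hypothetical mount point
    ARCHIVE_ID=1         # hypothetical back-end archive ID

    lfs find "$FS" -type f -mtime +7 | while read -r f; do
        # Skip files that already have an archived copy
        if ! lfs hsm_state "$f" | grep -q archived; then
            lfs hsm_archive --archive "$ARCHIVE_ID" "$f"
        fi
    done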

Cray's Connector
- Cray's connector consists of two components:
  - Connector Migration Manager (CMM)
    - Registers with the Lustre file system as a copy tool.
    - Responsible for queuing and scheduling Lustre HSM requests that are received from the MDT across one or more CMAs.
  - Connector Migration Agents (CMA)
    - Perform all data movement within the Connector, and are also responsible for removing archive files from back-end storage if requested.

CMA Plugins
- The CMA plugin architecture allows multiple back-ends to interface with the CMM.
- The CMA also allows threading transfers across multiple agents.
- Several sample CMAs are provided to copy data to a secondary file system.

HPSS Plugin
- The NCSA HPSS CMA plugin is called HTAP.
  - Will be released as open source soon.
- Utilizes the HPSS API to provide authentication and data transfer to/from the HPSS environment.
- Transfers can be parallelized to match the HPSS COS (class of service).

Data IO
- Data IO for all file transfers is done at the native stripe width for the selected HPSS class of service.
  - This is generally smaller than the Lustre stripe width.
  - However, files are restored to their original Lustre stripe width.
- All file archive and retrieve operations are further verified by full checksums.
  - Checksums are kept with the files in user extended attributes and HPSS UDAs.
  - The CMA provides parallel and inline checksum capabilities in a variety of standard methods (see the sketch below).
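A minimal sketch of the general idea of keeping a checksum with the file in a user extended attribute. The attribute name and the checksum algorithm here are illustrative assumptions; the connector's actual attribute names, checksum methods, and HPSS UDA handling are not specified here:

    # Compute a checksum and record it as a user extended attribute
    # (attribute name "user.hsm_md5" and md5 are illustrative choices)
    f=/scratch/project/bigfile.dat
    sum=$(md5sum "$f" | awk '{print $1}')
    setfattr -n user.hsm_md5 -v "$sum" "$f"

    # Later, re-read the stored value to verify after a retrieve
    getfattr -n user.hsm_md5 --only-values "$f"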

Lustre HSM / Cray TAS / HPSS
[Diagram: a user application writes files to Lustre ("scratch"); the Robinhood DB is updated via Lustre changelogs; the data mover (CMM/CMA) sits between scratch and Nearline.]

Time Passes
[Diagram as before: scratch, data mover (CMM/CMA), Robinhood DB, Nearline.]

  lfs hsm_state afile
  afile: (0x00000000)  --- new file, not archived

Policy Triggers Copy to Back End
[Diagram: the Robinhood policy selects files based on size/age/etc. file attributes; the policy engine issues lfs hsm_archive commands.]

File Copy
[Diagram: Lustre issues the copy request to the copy tool (CMM).]

File Copy (continued)
[Diagram: the CMM kicks off CMA(s), which copy the file to the back end and then store the back-end metadata as extended Lustre attributes.]

  lfs hsm_state somefile
  somefile: (0x00000009) exists archived, archive_id:1  --- file archived, not released
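To see what metadata the copy tool has attached to the file, one could list its extended attributes. The exact attribute names and namespace are specific to the Cray connector and are not given in the presentation, so this is only a generic sketch:

    # List user.* extended attributes stored on the archived file
    # (connector-specific names not assumed; trusted.* attributes, if used,
    #  would require root and -m '^trusted\.' instead)
    getfattr -d -m '^user\.' /scratch/project/somefile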

File Release
[Diagram: at a later time, Robinhood policies choose a list of files to release in order to free file system space. Data blocks are freed, but metadata remains.]

  lfs hsm_state somefile
  somefile: (0x0000000d) released exists archived, archive_id:1  --- file archived and released

File Restore
[Diagram: the user makes a request to stage files (or simply issues a file open), which triggers a restore through the data mover (CMM/CMA).]

File Restore (continued)
[Diagram: once the restore completes, the file's data blocks are back on Lustre; the "released" flag is cleared.]

  lfs hsm_state somefile
  somefile: (0x00000009) exists archived, archive_id:1  --- file archived, not released
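Since data must be staged before use in a batch job (see Conclusions), a user could stage explicitly and wait for completion. A minimal sketch with an illustrative path:

    # Explicitly stage a released file back to Lustre before a batch job uses it
    f=/scratch/project/somefile
    lfs hsm_restore "$f"

    # Wait until the restore has completed (the "released" flag is gone)
    while lfs hsm_state "$f" | grep -q released; do
        sleep 10
    done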

Backup?
- HSM is NOT equivalent to a backup!
- One site uses HSM functions to dual-copy data.
  - Sort of a backup in the case of a fault in the primary storage.
  - However, deleting a file in the file system quickly results in it being deleted from the back-end.
  - Thus, HSM does not protect against user mistakes.
- One could create a back-end that did file versioning and delayed deletes, but that goes well beyond the current work.

Initial Testing
- 900 files/min
  - OK rate; the bottleneck is unclear.
- 900 MB/s
  - Good rate, a reasonable fraction of the resources for the single data mover node/Lustre client.
- The Lustre hsm.max_requests setting must be tuned. In the limited test system, increasing it from 3 to 6 gave good results (see below).
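For reference, a sketch of how this tunable might be adjusted on the MDS; "scratch" is an illustrative file system name, not necessarily the site's:

    # Inspect and raise the number of concurrent HSM requests on the MDT
    lctl get_param mdt.scratch-MDT0000.hsm.max_requests
    lctl set_param mdt.scratch-MDT0000.hsm.max_requests=6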

Future Work
- Much more testing, particularly testing at scale.
  - Development and scaled testing of the HPSS plugin
- Create an HA Robinhood (policy engine) setup
- Workload manager integration
- Cray and the Blue Waters site team are investigating a Lustre bug that requires MDS failover to clear hung transfers

Conclusions
- The initial development and testing of the HPSS/Cray TAS HSM implementation is showing full functionality and good initial performance.
- Challenges remain in crafting effective policies.
- Effective production use will require user education and assistance.
  - Data must be staged before use in a batch job!
  - The lack of a unified quota system will confuse users.

Questions/Acknowledgements
Supported by:
- The National Science Foundation through awards OCI-0725070 and ACI-1238993
- The State and University of Illinois
