Experiences on File Systems: Which Is the Best File System?

Transcription

Experiences on File Systems: Which is the best file system for you?
Jakob Blomer, CERN PH/SFT
CHEP 2015, Okinawa, Japan

Why Distributed File Systems?

Physics experiments store their files in a variety of file systems, for a good reason:
- The file system interface is portable: we can take our local analysis application and run it anywhere on a big data set.
- The file system as a storage abstraction is a sweet spot between data flexibility and data organization.

We like file systems for their rich and standardized interface.
We struggle to find an optimal implementation of that interface.

Shortlist of Distributed File Systems

[Logo collage of distributed file systems; only the Quantcast File System and CernVM-FS labels survive in the transcription.]

Agenda
1. What do we want from a distributed file system?
2. Sorting and searching the file system landscape
3. Technology trends and future challenges

Can one size fit all?

[Radar chart, data illustrative: data classes (home folders, recorded physics data, simulated data, analysis results, software binaries, scratch area) compared along throughput (MB/s), throughput (IOPS), mean file size, change frequency, data value, volume, confidentiality, cache hit rate, and redundancy.]

Depending on the use case, the dimensions span orders of magnitude.

POSIX Interface

- No DFS is fully POSIX compliant.
- It must provide just enough to not break applications.
- Often this can only be discovered by testing.

Essential file system operations:
    create(), unlink(), stat(), open(), close(), read(), write(), seek()

Difficult for DFSs:
    file locks, write-through, atomic rename(), file ownership, extended attributes,
    unlinking opened files, symbolic links and hard links, device files, IPC files

Missing APIs:
    physical file location, file replication properties, file temperature, ...
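As an illustration of the "essential" subset, here is a minimal C sketch (mine, not from the slides) that exercises exactly the calls listed above; the commented-out calls at the end are the ones that typically misbehave on a DFS. The file name susy.dat is just the placeholder already used in this talk.

    /* Sketch: the POSIX subset most analysis applications rely on (illustrative only). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        /* Essential operations: create, write, stat, read, seek, unlink. */
        int fd = open("susy.dat", O_CREAT | O_WRONLY | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }
        write(fd, "event data\n", 11);
        close(fd);

        struct stat info;
        stat("susy.dat", &info);
        printf("size: %lld bytes\n", (long long)info.st_size);

        fd = open("susy.dat", O_RDONLY);
        char buf[64];
        read(fd, buf, sizeof(buf));
        lseek(fd, 0, SEEK_SET);
        close(fd);

        /* Operations that are often problematic on a DFS (see the list above):
         *   flock(fd, LOCK_EX);     -- file locks
         *   rename("a", "b");       -- atomicity not always guaranteed
         *   unlink("susy.dat");     -- while another process still holds it open
         *   symlink(), chown(), setxattr(), ...
         */
        unlink("susy.dat");
        return 0;
    }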

Application-Defined File Systems?

Mounted file system:

    FILE *f = fopen("susy.dat", "r");
    while (...) {
        fread(...);
        ...
    }
    fclose(f);

- Application independent from the file system
- Allows for standard tools (ls, grep, ...)
- The system administrator selects the file system

File system library (here: libhdfs):

    hdfsFS fs = hdfsConnect("default", 0);
    hdfsFile f = hdfsOpenFile(fs, "susy.dat", ...);
    while (...) {
        hdfsRead(fs, f, ...);
        ...
    }
    hdfsCloseFile(fs, f);

- Performance-tuned API
- Requires code changes
- The application selects the file system

What we ideally want is an application-defined, mountable file system (FUSE, Parrot).
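To make "mountable but defined by the application" concrete, here is a minimal read-only FUSE sketch in C (my own illustration, not from the talk), exposing a single hypothetical file susy.dat whose content could just as well come from a remote store. It assumes libfuse 2.x, built e.g. with gcc sketch.c $(pkg-config fuse --cflags --libs).

    /* Sketch: a tiny read-only FUSE file system exposing one file. */
    #define FUSE_USE_VERSION 26
    #include <fuse.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <string.h>
    #include <sys/stat.h>

    static const char *content = "pretend this came from a remote store\n";

    static int fs_getattr(const char *path, struct stat *st)
    {
        memset(st, 0, sizeof(*st));
        if (strcmp(path, "/") == 0) {
            st->st_mode = S_IFDIR | 0755;  st->st_nlink = 2;
        } else if (strcmp(path, "/susy.dat") == 0) {
            st->st_mode = S_IFREG | 0444;  st->st_nlink = 1;
            st->st_size = strlen(content);
        } else {
            return -ENOENT;
        }
        return 0;
    }

    static int fs_readdir(const char *path, void *buf, fuse_fill_dir_t filler,
                          off_t offset, struct fuse_file_info *fi)
    {
        if (strcmp(path, "/") != 0) return -ENOENT;
        filler(buf, ".", NULL, 0);
        filler(buf, "..", NULL, 0);
        filler(buf, "susy.dat", NULL, 0);
        return 0;
    }

    static int fs_open(const char *path, struct fuse_file_info *fi)
    {
        if (strcmp(path, "/susy.dat") != 0) return -ENOENT;
        if ((fi->flags & O_ACCMODE) != O_RDONLY) return -EACCES;
        return 0;
    }

    static int fs_read(const char *path, char *buf, size_t size, off_t offset,
                       struct fuse_file_info *fi)
    {
        /* read() issued against the mount point is served from user space */
        size_t len = strlen(content);
        if (offset >= (off_t)len) return 0;
        if (offset + size > len) size = len - offset;
        memcpy(buf, content + offset, size);
        return (int)size;
    }

    static struct fuse_operations fs_ops = {
        .getattr = fs_getattr,
        .readdir = fs_readdir,
        .open    = fs_open,
        .read    = fs_read,
    };

    int main(int argc, char *argv[])
    {
        return fuse_main(argc, argv, &fs_ops, NULL);
    }

Mounted on an empty directory, the unmodified fopen/fread code above works against it, while the data-provisioning logic stays entirely in the application-defined daemon.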

Agenda
1. What do we want from a distributed file system?
2. Sorting and searching the file system landscape
3. Technology trends and future challenges

The Distributed File System Landscape

[Mind map grouping distributed file systems by use case:]

- Personal files (privacy, personal sharing, file sync):
  AFS, dropbox/owncloud, Tahoe-LAFS
- Big Data (MapReduce workflows, commodity hardware, incremental scalability):
  HDFS, QFS, MooseFS, MapR FS, XtreemFS
- General purpose (ease of administration, high level of POSIX compliance):
  Ceph, GlusterFS, (p)NFS
- Supercomputers (fast parallel writes, InfiniBand, Myrinet, ...):
  Lustre, GPFS, OrangeFS, BeeGFS, Panasas
- Shared disk:
  GFS2, OCFS2
- HEP-specific storage systems (tape access, WAN federation, software distribution, fault tolerance):
  dCache, XRootD, EOS, CernVM-FS; several of the general-purpose and supercomputer file systems above are also used in HEP

File System Architecture: Object-Based File System

Examples: Hadoop File System, Quantcast File System

[Diagram: a meta-data head node handles create() and delete(); clients read() and write() directly against the data nodes.]

- Target: incremental scaling; large and immutable files
- Typical for Big Data applications
- The head node can help in job scheduling

File System Architecture: Parallel File System

Examples: Lustre, MooseFS, pNFS, XtreemFS

[Diagram: a meta-data server handles create() and delete(); clients read() and write() against the data servers.]

- Target: maximum aggregated throughput; large files
- Typical for High-Performance Computing

File System Architecture: Distributed Meta-Data

Examples: Ceph, OrangeFS

[Diagram: meta-data, like data, is spread over several nodes; create()/delete() and read()/write() go against the distributed services.]

- Target: avoid a single point of failure and the meta-data bottleneck
- The modern general-purpose distributed file system

File System Architecture: Symmetric, Peer-to-Peer

Examples: GlusterFS

[Diagram: a distributed hash table; hash(path_n) determines the hosts that store path_n.]

- Target: conceptual simplicity, inherently scalable
- Difficult to deal with node churn
- Slow lookup beyond the LAN
- In HEP we use caching and catalog-based data management instead
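To illustrate the hash-based placement idea (my sketch, not GlusterFS's actual elastic-hashing algorithm), here is how a path alone can be mapped to one of N storage hosts without any central catalog; the hash function, host names, and paths are all hypothetical.

    /* Sketch: DHT-style placement -- the path alone determines the host.
     * FNV-1a is used purely for illustration; real systems use stronger hashes
     * and consistent hashing to limit data movement when hosts join or leave. */
    #include <stdint.h>
    #include <stdio.h>

    static uint64_t fnv1a(const char *s)
    {
        uint64_t h = 1469598103934665603ULL;      /* FNV offset basis */
        for (; *s; ++s) {
            h ^= (unsigned char)*s;
            h *= 1099511628211ULL;                /* FNV prime */
        }
        return h;
    }

    int main(void)
    {
        const char *hosts[] = { "node01", "node02", "node03", "node04" };
        const int n_hosts = 4;
        const char *paths[] = { "/data/susy.dat", "/data/run2015/evt.root",
                                "/home/alice/notes.txt" };

        for (int i = 0; i < 3; ++i) {
            /* Every client computes the same mapping independently: no lookup service,
             * but also no easy way to cope with hosts leaving (node churn). */
            int host = (int)(fnv1a(paths[i]) % n_hosts);
            printf("%-28s -> %s\n", paths[i], hosts[host]);
        }
        return 0;
    }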

Agenda
1. What do we want from a distributed file system?
2. Sorting and searching the file system landscape
3. Technology trends and future challenges

Trends and Challenges

We are lucky: large data sets tend to be immutable everywhere.
For instance: media, backups, VM images, scientific data sets, ...
- Reflected in hardware: shingled magnetic recording drives
- Reflected in software: log-structured space management

We need to invest in scaling fault-tolerance and speed together with the capacity.
- Replication becomes too expensive at the Petabyte and Exabyte scale: erasure codes
- Explicit use of SSDs:
  - for meta-data (high IOPS requirement)
  - as a fast storage pool
  - as a node-local cache
  (The Amazon Elastic File System, for instance, will be on SSD.)

Log-Structured Data

Idea: store all modifications in a change log.

[Diagram: a classic Unix file system updates inodes, directories, and index blocks in place on disk; a log-structured file system appends everything to a log.]

Advantages:
- Minimal seeking
- Fast and robust crash recovery
- Efficient allocation in DRAM, flash, and on disk
- Ideal for immutable data

Used by:
- The Zebra experimental DFS
- Commercial filers (e.g. NetApp)
- Key-value and BLOB stores
- File systems for flash memory
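A minimal sketch of the idea (my own, illustrative only): every modification is appended to a log as a self-describing record, so writes are sequential and crash recovery is a linear replay. A real log-structured file system adds checkpoints, an index, and a cleaner that reclaims space from superseded records.

    /* Sketch: append-only change log. */
    #include <stdio.h>
    #include <string.h>

    struct record {
        char op;            /* 'P' = put, 'D' = delete */
        char key[32];
        char value[64];
    };

    static void log_append(FILE *log, const struct record *r)
    {
        fwrite(r, sizeof(*r), 1, log);   /* sequential write, no seeking */
        fflush(log);
    }

    int main(void)
    {
        FILE *log = fopen("changes.log", "ab+");
        if (!log) return 1;

        struct record r1 = { 'P', "susy.dat", "blob-0001" };
        struct record r2 = { 'P', "susy.dat", "blob-0002" };  /* an update is a new record */
        log_append(log, &r1);
        log_append(log, &r2);

        /* Recovery / lookup: replay the log; the last matching record wins. */
        rewind(log);
        struct record r, latest = { 0 };
        while (fread(&r, sizeof(r), 1, log) == 1)
            if (r.op == 'P' && strcmp(r.key, "susy.dat") == 0)
                latest = r;
        printf("susy.dat -> %s\n", latest.value);

        fclose(log);
        return 0;
    }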

Bandwidth Lags Behind Capacity Improvement

[Plot, 1995-2015: relative improvement (1x to ~1000x, log scale) of capacity vs. bandwidth for HDD, DRAM, SSD, and Ethernet.]

- Capacity and bandwidth of affordable storage scale at different paces
- It will be prohibitively costly to constantly "move data in and out"
- Ethernet bandwidth scaled similarly to capacity
- We should use the high bisection bandwidth among worker nodes and make them part of the storage network
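To make the gap concrete, a quick calculation (mine) using the drives listed in the backup slide "Hardware Bandwidth and Capacity Numbers": the time to read a full drive sequentially has grown from minutes to hours.

    /* Sketch: full-drive scan time = capacity / sequential bandwidth.
     * Numbers taken from the backup table in this talk. */
    #include <stdio.h>

    int main(void)
    {
        struct { int year; double cap_gb; double bw_mbs; } hdd[] = {
            { 1994,    4.3,   9.0 },   /* Seagate ST15150 */
            { 2004,   73.4,  86.0 },   /* Seagate 373453 */
            { 2014, 6000.0, 220.0 },   /* Seagate ST6000NM0034 */
        };
        for (int i = 0; i < 3; ++i) {
            double hours = hdd[i].cap_gb * 1000.0 / hdd[i].bw_mbs / 3600.0;
            printf("%d: full scan takes %.1f hours\n", hdd[i].year, hours);
        }
        return 0;   /* roughly 0.1 h (1994), 0.2 h (2004), 7.6 h (2014) */
    }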

Fault Tolerance at the Petascale

How to prevent data loss:
1. Replication: simple, large storage overhead (similar to RAID 1)
2. Erasure codes: based on parity blocks (similar to RAID 5); requires extra compute power

[Diagram: data blocks plus parity blocks, all blocks distributed across nodes.]

- Available commercially: GPFS, NEC Hydrastor, Scality RING, ...
- Emerging in open-source file systems: EOS, Ceph, QFS

Engineering challenges:
- Fault detection
- De-correlation of failure domains instead of random placement
- Failure prediction, e.g. based on MTTF and Markov models

Fault Tolerance at the Petascale: Example from Facebook

[Figure 9 from Muralidhar et al. (2014), "Geo-replicated XOR Coding": block A in data center 1 and block B in data center 2 are XORed into block A XOR B stored in data center 3.]

- Replication factor 2.1, using erasure coding within every data center plus cross-data-center XOR encoding
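A minimal sketch of the XOR idea behind both RAID-5-style parity and the cross-data-center coding above (illustrative only, not Facebook's implementation): the parity block is the XOR of the data blocks, and any single lost block can be rebuilt by XORing the survivors.

    /* Sketch: XOR parity over two data blocks; losing either one is recoverable. */
    #include <stdio.h>
    #include <string.h>

    #define BLOCK 8

    static void xor_blocks(const unsigned char *a, const unsigned char *b,
                           unsigned char *out)
    {
        for (int i = 0; i < BLOCK; ++i)
            out[i] = a[i] ^ b[i];
    }

    int main(void)
    {
        unsigned char block_a[BLOCK] = "AAAAAAA";   /* data center 1 */
        unsigned char block_b[BLOCK] = "BBBBBBB";   /* data center 2 */
        unsigned char parity[BLOCK];                /* data center 3: A XOR B */
        xor_blocks(block_a, block_b, parity);

        /* Suppose data center 2 loses block B: rebuild it from A and the parity. */
        unsigned char rebuilt[BLOCK];
        xor_blocks(block_a, parity, rebuilt);
        printf("recovered block B: %s (match: %s)\n", rebuilt,
               memcmp(rebuilt, block_b, BLOCK) == 0 ? "yes" : "no");
        return 0;
    }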

Towards a Better Implementation of File Systems

Distributed file systems (workflows):
- Integration with applications
- Feature-rich: quotas, permissions, links
- Globally federated namespace

Object stores (building blocks, data provisioning):
- Very good scalability on local networks
- Flat namespace
- It turns out that certain types of data do not need a hierarchical namespace, e.g. cached objects, media, VM images

Conclusion

File systems are at the heart of our distributed computing:
- The most convenient way of connecting our applications to data
- The hierarchical namespace is a natural way to organize data

We can tailor file systems to our needs using existing building blocks, with a focus on:
1. Wide-area data federation
2. A multi-level storage hierarchy from tape to memory
3. Erasure coding
4. Coalescing storage and compute nodes

Thank you for your time!

Backup Slides

Milestones in Distributed File Systems
(Biased towards open-source, production file systems)

Timeline: 1983 AFS, 1985 NFS, 1995 Zebra, 2000 OceanStore, 2002 Venti, 2003 GFS, 2005 XRootD, 2007 Ceph

Milestones: Andrew File System (AFS, 1983)
- Client-server
- Roaming home folders
- Identity tokens and access control lists (ACLs)
- Decentralized operation

"AFS was the first safe and efficient distributed computing system, available [...] on campus. It was a clear precursor to the Dropbox-like software packages today. [...] [It] allowed students (like Drew Houston and Arash Ferdowsi) access to all their stuff from any connected computer." -- dropbox

Milestones: Network File System (NFS, 1985)
- Client-server
- Focus on portability
- Separation of protocol and implementation
- Stateless servers
- Fast crash recovery
Sandberg, Goldberg, Kleiman, Walsh, Lyon (1985)

Milestones: Zebra File System (1995)
- Striping and parity
- Per-client striping: each client forms its new file data into a single append-only log and stripes this log across the servers; parity is computed for the log, not for individual files
- Redundant array of inexpensive nodes (RAIN)
[Figures 4 and 5 from Hartman, Ousterhout (1995): per-client striping in Zebra, and the Zebra schematic with clients, storage servers, file manager, and stripe cleaner.]

Milestones: OceanStore (2000)
- Peer-to-peer
- "Global scale": 10^10 users, 10^14 files
- Untrusted infrastructure
- Based on a peer-to-peer overlay network
- Nomadic data through aggressive caching
- Foundation for today's decentralized dropbox replacements
[Figure 1 from Kubiatowicz et al. (2000): the core of the system is a multitude of highly connected "pools", among which data is allowed to "flow" freely; clients connect to one or more pools, perhaps intermittently.]

Milestones: Venti (2002)
- Archival storage
- De-duplication through content-addressable storage
- Content hashes provide intrinsic file integrity
- Merkle trees verify the file system
[Plots from Quinlan and Dorward (2002): storage size of the active file system vs. Venti, and the ratio of archival to active data, for two Plan 9 file servers ("Emelie").]

Milestones: Google File System (GFS, 2003)
- Object-based
- Co-designed for map-reduce
- Coalesces storage and compute nodes
- Serializes data access
[Illustration from the Google AppEngine documentation.]

Milestones: XRootD (2005)
- Namespace delegation: a client open() is redirected down a global tree of redirectors (xrootd/cmsd pairs) until it reaches a data server
- Flexibility through decomposition into pluggable components
- Namespace independent from the data
[Diagram from Hanushevsky (2013): redirector tree with a fan-out of 64 per level (64, 4096, ... servers).]

Milestones: Ceph File System and RADOS (2007)
- Parallel, distributed meta-data
- Peer-to-peer file system at the cluster scale
- Data placement across failure domains
- Adaptive workload distribution
[Figure from Weil (2007): a partial view of a four-level cluster map hierarchy consisting of rows, cabinets, and disks; a simple rule distributes three replicas across three cabinets in the same row.]

Hardware Bandwidth and Capacity Numbers
Method and entries marked † from Patterson (2004)

            HDD                       SSD (MLC flash)              DRAM                         Ethernet
  Year      Capacity    Bandwidth     Capacity  Bandwidth (r/w)    Capacity        Bandwidth    Bandwidth
  1993                                                             16 Mibit/chip†  267 MiB/s†
  1994      4.3 GB†     9 MB/s†
  1995                                                                                          100 Mbit/s†
  2003                                                                                          10 Gbit/s†
  2004      73.4 GB†    86 MB/s†                                   512 Mibit/chip  3.2 GiB/s
  2008                                160 GB    250/100 MB/s [3]
  2012                                800 GB    500/460 MB/s [4]
  2014      6 TB        220 MB/s [1]  2 TB      2.8/1.9 GB/s [5]   8 Gibit/chip    25.6 GiB/s   100 Gbit/s
  2015      8 TB        195 MB/s [2]
  Increase  x1860       x24           x12.5     x11.2              x512            x98          x1000

[1] http://www.storagereview.com/seagate enterprise capacity 6tb 35 sas hdd review
[2] http://www.storagereview.com/seagate archive hdd review 8tb
[3] http://www.storagereview.com/intel x25-m ssd review
[4] http://www.storagereview.com/intel ssd dc s3700 series enterprise ssd
[5] ... sd-dc-p3700-nvme,3858.html

HDD:      Seagate ST15150 (1994)†, Seagate 373453 (2004)†, Seagate ST6000NM0034 (2014), Seagate ST8000AS0012 [SMR] (2015)
SSD:      Intel X25-M (2008), Intel SSD DC S3700 (2012), Intel SSD DC P3700 (2014)
DRAM:     Fast Page DRAM (1993)†, DDR2-400 (2004), DDR4-3200 (2014)
Ethernet: Fast Ethernet IEEE 802.3u (1995)†, 10 GbitE IEEE 802.3ae (2003)†, 100 GbitE IEEE 802.3bj (2014)

Data Integrity and File System Snapshots

[Diagram: a Merkle tree over a small namespace. The root (/) stores hash(h_A, h_B); the directory /A with contents alpha and beta stores hash(h_alpha, h_beta) = h_A; the files /A/alpha and /A/beta store hash(data_alpha) = h_alpha and hash(data_beta) = h_beta.]

- A hash tree with a cryptographic hash function provides a secure identifier for every sub-tree
- It is easy to sign a small hash value (data authenticity)
- Efficient calculation of changes (fast replication)
- Bonus: versioning and data de-duplication
- Full potential together with content-addressable storage
- Self-verifying data chunks, trivial to distribute and cache
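A minimal sketch of the hash-tree construction (my own illustration, using OpenSSL's SHA256; the file names and contents are made up): each leaf is the hash of a file's data, each inner node is the hash of the concatenated child hashes, and the small root hash identifies, and can be used to verify, the whole tree.

    /* Sketch: Merkle node over two files.  Compile with -lcrypto (OpenSSL). */
    #include <openssl/sha.h>
    #include <stdio.h>
    #include <string.h>

    static void hex(const unsigned char *d, char *out)   /* digest -> hex string */
    {
        for (int i = 0; i < SHA256_DIGEST_LENGTH; ++i)
            sprintf(out + 2 * i, "%02x", d[i]);
    }

    int main(void)
    {
        const char *data_alpha = "contents of /A/alpha";   /* made-up file contents */
        const char *data_beta  = "contents of /A/beta";

        /* Leaves: hash of the file data (h_alpha, h_beta). */
        unsigned char h_alpha[SHA256_DIGEST_LENGTH], h_beta[SHA256_DIGEST_LENGTH];
        SHA256((const unsigned char *)data_alpha, strlen(data_alpha), h_alpha);
        SHA256((const unsigned char *)data_beta,  strlen(data_beta),  h_beta);

        /* Inner node h_A = hash(h_alpha || h_beta): identifies the directory /A. */
        unsigned char concat[2 * SHA256_DIGEST_LENGTH], h_A[SHA256_DIGEST_LENGTH];
        memcpy(concat, h_alpha, SHA256_DIGEST_LENGTH);
        memcpy(concat + SHA256_DIGEST_LENGTH, h_beta, SHA256_DIGEST_LENGTH);
        SHA256(concat, sizeof(concat), h_A);

        /* Changing a single byte in either file changes h_A and the root above it;
         * signing the small root hash therefore authenticates the whole sub-tree. */
        char out[2 * SHA256_DIGEST_LENGTH + 1];
        hex(h_A, out);
        printf("h_A = %s\n", out);
        return 0;
    }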

System Integration

System-aided (privileged):
- Kernel-level file system: fast, but intrusive
- Virtual NFS server: easy deployment, but limited by NFS semantics

Unprivileged options (mostly):
- Library (e.g. libhdfs, libXrd, ...): tuned API, but not transparent
- Interposition systems: pre-loaded libraries, ptrace based
- FUSE ("file system in user space")

[Diagram: the application's open() goes through libc and a system call into the OS kernel, which makes an upcall to the FUSE daemon.]
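To illustrate the "pre-loaded library" flavour of interposition (my sketch; any path-rewriting policy would live where the comment is), a shared library can wrap open() and forward to the real implementation obtained via dlsym(RTLD_NEXT, ...). Parrot achieves a similar effect without preloading by tracing the application with ptrace.

    /* Sketch: intercept open() with LD_PRELOAD.
     * Build:  gcc -shared -fPIC -o shim.so shim.c -ldl
     * Use:    LD_PRELOAD=./shim.so ./analysis_app        */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <fcntl.h>
    #include <stdarg.h>
    #include <stdio.h>
    #include <sys/types.h>

    int open(const char *path, int flags, ...)
    {
        /* Fetch the real open() from the next object in the lookup chain. */
        static int (*real_open)(const char *, int, ...) = NULL;
        if (!real_open)
            real_open = (int (*)(const char *, int, ...))dlsym(RTLD_NEXT, "open");

        mode_t mode = 0;
        if (flags & O_CREAT) {            /* open() only carries a mode with O_CREAT */
            va_list ap;
            va_start(ap, flags);
            mode = va_arg(ap, mode_t);
            va_end(ap);
        }

        /* An interposition layer could rewrite the path here, e.g. redirect it
         * to a local cache of a remote file system, before passing it on. */
        fprintf(stderr, "[shim] open(%s)\n", path);

        return real_open(path, flags, mode);
    }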

Bibliography I

Survey Articles

Satyanarayanan, M. (1990). A survey of distributed file systems. Annual Review of Computer Science, 4(1):73-104.

Guan, P., Kuhl, M., Li, Z., and Liu, X. (2000). A survey of distributed file systems. University of California, San Diego.

Thanh, T. D., Mohan, S., Choi, E., Kim, S., and Kim, P. (2008). A taxonomy and survey on distributed file systems. In Proc. int. conf. on Networked Computing and Advanced Information Management (NCM'08), pages 144-149.

Peddemors, A., Kuun, C., Spoor, R., Dekkers, P., and den Besten, C. (2010). Survey of technologies for wide area distributed storage. Technical report, SURFnet.

Depardon, B., Séguin, C., and Mahec, G. L. (2013). Analysis of six distributed file systems. Technical Report hal-00789086, Université de Picardie Jules Verne.

Bibliography II

Donvito, G., Marzulli, G., and Diacono, D. (2014). Testing of several distributed file-systems (HDFS, Ceph and GlusterFS) for supporting the HEP experiment analysis. Journal of Physics: Conference Series, 513.

File Systems

Sandberg, R., Goldberg, D., Kleiman, S., Walsh, D., and Lyon, B. (1985). Design and implementation of the Sun Network Filesystem. In Proc. of the Summer USENIX conference, pages 119-130.

Morris, J. H., Satyanarayanan, M., Conner, M. H., Howard, J. H., Rosenthal, D. S. H., and Smith, F. D. (1986). Andrew: A distributed personal computing environment. Communications of the ACM, 29(3):184-201.

Hartman, J. H. and Ousterhout, J. K. (1995). The Zebra striped network file system. ACM Transactions on Computer Systems, 13(3):274-310.

Bibliography III

Kubiatowicz, J., Bindel, D., Chen, Y., Czerwinski, S., Eaton, P., Geels, D., Gummadi, R., Rhea, S., Weatherspoon, H., Weimer, W., Wells, C., and Zhao, B. (2000). OceanStore: An architecture for global-scale persistent storage. ACM SIGPLAN Notices, 35(11):190-201.

Quinlan, S. and Dorward, S. (2002). Venti: a new approach to archival storage. In Proc. of the 1st USENIX Conf. on File and Storage Technologies (FAST'02), pages 89-102.

Ghemawat, S., Gobioff, H., and Leung, S.-T. (2003). The Google file system. ACM SIGOPS Operating Systems Review, 37(5):29-43.

Schwan, P. (2003). Lustre: Building a file system for 1,000-node clusters. In Proc. of the 2003 Linux Symposium, pages 380-386.

Dorigo, A., Elmer, P., Furano, F., and Hanushevsky, A. (2005). XROOTD - a highly scalable architecture for data access. WSEAS Transactions on Computers, 4(4):348-353.

Bibliography IV

Weil, S. A. (2007). Ceph: reliable, scalable, and high-performance distributed storage. PhD thesis, University of California Santa Cruz.
