Exporting Lustre with NFS-Ganesha

Transcription

EXPORTING LUSTRE WITH NFS-GANESHA
Philippe Deniel, Dominique Martinet (philippe.deniel@cea.fr, dominique.martinet@cea.fr)
LAD 2013, September 16-17, 2013

WHAT IS NFS-GANESHA
- NFS-Ganesha was born at CEA/DAM in 2005
  - The original need was to export HPSS over NFS
    - IBM stopped supporting this feature
    - The HPSS NFS daemon was unreliable and had poor caching capabilities
  - We designed something of our own in 4Q2004
  - We started coding in January 2005, once a design document had been written
  - Ganesha was designed with more than HPSS in mind
- NFS-Ganesha has been in production since early 2006
  - First used to export HPSS to the TERA10 system
  - Used to export LUSTRE at TGCC in 2011, in front of CCRT's compute machines

FEATURES
NFS-Ganesha has known many evolutions. It currently includes the following features (non-exhaustive list):
- Supported protocols
  - NFSv3 and ancillary protocols (MOUNTDv3, NLMv4, client side of NSM)
    - The NLMv4 implementation supports SHARE/UNSHARE, used by the Microsoft NFS client
  - NFSv4.0 (including lock support)
  - NFSv4.1 (including pNFS support)
  - 9P2000.L (with TCP and RDMA transport layers)
- Supported backends (known as FSALs: File System Abstraction Layer)
  - CEPH
  - GPFS
  - HPSS
  - PROXY (operates as an NFSv4 client to turn Ganesha into an NFS proxy)
  - LUSTRE 2.x
  - ZFS (content of a ZFS pool)
  - VFS (with kernel 2.6.39 or later; makes it possible to export every FS managed by the kernel's VFS)

GANESHA'S ARCHITECTURE
[Architecture diagram] Client requests enter through the network forechannel and backchannel and are handled by the RPC dispatcher threads, RPCSEC_GSS/GSSAPI and the duplicate request layer. Protocol layers: NFSv3, NFSv4.x/pNFS, MOUNTv3, NLMv4, RQUOTA, 9P (TCP and RDMA). Below them sit the SAL (state layer), Cache Inode and Cache Inode UP (cache upcalls), the File Content layer, and the File System Abstraction Layer with FSAL UP, on top of the backends (POSIX, XFS, ZFS, PROXY, GPFS, SNMP, CEPH, HPSS, LUSTRE, VFS). Supporting services: hash tables, logging, statistics, and administration via SNMP AgentX (optional).

NFS-GANESHA COMMUNITY
- NFS-Ganesha was released as free software on July 4th, 2007
  - Ganesha is available under the terms of the LGPLv3 license
- A community started to develop
  - CEA/DAM is still active in the development
    - Maintains FSAL HPSS, FSAL PROXY and FSAL LUSTRE, 9P, and the RDMA-based transport
  - IBM became an active member of the community in late 2009
    - Ganesha is to be integrated in SONAS as an NFS gateway
    - IBM is in charge of FSAL GPFS and SAL (state management layer)
  - LinuxBox (a small company created by former CITI folks) joined the community in September 2010
    - They are very active on NFSv4.1, with a focus on CEPH
  - Panasas joined the community in May 2011
    - Ganesha is to be used as the NFSv4.1/pNFS MDS in Panasas products

MORE FOCUS: NFS-GANESHA + LUSTRE
- FSAL LUSTRE provides access to LUSTRE for the NFS-Ganesha daemon
  - FSALs are provided as dynamic libraries to be dlopen-ed at startup by the ganesha.nfsd daemon (in Ganesha 2.0)
- Based on a few LUSTRE features
  - Uses the ".lustre/fid" special directory to access objects
  - Calls from liblustreapi (a minimal usage sketch follows this slide):
    - fid2path
    - path2fid
- Provides access to xattrs
  - Native feature in 9P2000.L and NFSv4.x
  - Makes use of "ghost directories" in NFSv3 and NFSv4 (Linux has no NFSv4 client support for extended attributes, as Solaris does)
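The liblustreapi calls named above are the building blocks of FSAL LUSTRE's object access. Here is a minimal sketch, not taken from the slides or from Ganesha's source, of how they can be used: it resolves a path to a FID with llapi_path2fid() and a FID back to a path with llapi_fid2path(). The mount point /mnt/lustre and the test file path are illustrative assumptions; link against liblustreapi (roughly: cc fid_demo.c -llustreapi).

/* Sketch only: resolve path -> FID and FID -> path with liblustreapi. */
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <lustre/lustreapi.h>

int main(void)
{
        const char *mnt  = "/mnt/lustre";            /* assumed Lustre mount point */
        const char *file = "/mnt/lustre/some/file";  /* assumed test file          */
        struct lu_fid fid;
        char fidstr[64];
        char path[PATH_MAX];
        long long recno = -1;
        int linkno = 0;
        int rc;

        /* path -> FID (what the slide calls "path2fid") */
        rc = llapi_path2fid(file, &fid);
        if (rc) {
                fprintf(stderr, "llapi_path2fid: %s\n", strerror(-rc));
                return 1;
        }
        snprintf(fidstr, sizeof(fidstr), DFID, PFID(&fid));
        printf("FID of %s is %s\n", file, fidstr);

        /* FID -> path relative to the mount point (what the slide calls "fid2path") */
        rc = llapi_fid2path(mnt, fidstr, path, sizeof(path), &recno, &linkno);
        if (rc) {
                fprintf(stderr, "llapi_fid2path: %s\n", strerror(-rc));
                return 1;
        }
        printf("FID %s resolves to %s/%s\n", fidstr, mnt, path);
        return 0;
}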

MORE FOCUS: NFS-GANESHA + LUSTRE
Future cool features for LUSTRE
- pNFS support (using file-based layouts) for FSAL LUSTRE
  - The main discussion is about placing the pNFS Data Servers correctly
  - It seems logical to place them as close as possible to the OSSs, or even running on the OSSs
    - The latter choice would make the translation from LUSTRE layouts to pNFS layouts easier
  - Memory pressure should be considered
    - pNFS/DS are rather stateless creatures (the states are managed by the pNFS/MDS)
    - Ganesha as a pNFS/DS would be redesigned with reduced caches
- Use LUSTRE changelogs to implement "FSAL upcalls" (as GPFS does) to update caches as LUSTRE changes (see the sketch after this slide)
  - Upcalls are trapped by a pool of Ganesha's threads
    - The related cached inode is removed from the cache
  - Would make NFS-Ganesha's caches coherent with LUSTRE
    - Would make Ganesha fully compliant with NFSv4.1 (as RFC 5661 says)
  - Would help in clustering NFS-Ganesha servers on top of LUSTRE
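As a rough illustration of the changelog-driven upcall idea above, here is a hedged sketch of a consumer loop built on liblustreapi's changelog interface. It is an assumption, not Ganesha code: the MDT name "lustre-MDT0000" and the invalidate_cached_inode() helper are hypothetical, and the exact changelog record structure and prototypes vary across Lustre versions.

/* Sketch only: consume LUSTRE changelogs and invalidate cached inodes. */
#include <stdio.h>
#include <lustre/lustreapi.h>

/* Hypothetical upcall: drop the matching entry from Ganesha's Cache Inode. */
static void invalidate_cached_inode(const struct lu_fid *fid)
{
        printf("invalidate " DFID "\n", PFID(fid));
}

int main(void)
{
        void *ctx;
        struct changelog_rec *rec;
        int rc;

        /* Follow the changelog of one MDT, blocking until records arrive. */
        rc = llapi_changelog_start(&ctx,
                                   CHANGELOG_FLAG_BLOCK | CHANGELOG_FLAG_FOLLOW,
                                   "lustre-MDT0000", 0LL /* start record */);
        if (rc < 0)
                return 1;

        while ((rc = llapi_changelog_recv(ctx, &rec)) == 0) {
                /* Any namespace or data change invalidates the target FID. */
                invalidate_cached_inode(&rec->cr_tfid);
                llapi_changelog_free(&rec);
        }

        llapi_changelog_fini(&ctx);
        return rc < 0 ? 1 : 0;
}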

BENCH: HARDWARE AND SOFTWARE USED
Details of the benchmark configuration
- Hardware
  - Clients are BULL B500 nodes
    - 4 sockets, Nehalem processors (8 cores)
    - 64 GB RAM
  - Lustre MDS and OSS
    - Bull MESCA S6030 nodes, 4 sockets Nehalem (8 cores), 64 GB RAM
  - Network is Mellanox QDR InfiniBand
- Software
  - Lustre 2.1.4 on BULL AE2.2 (based on EL6.1)
  - Clients are running BULL AE2.2
  - Ganesha pre-2.0-dev 42-40-gd3b8c25 (yes, that's a "git describe --long" ;-) ) with mooshika-0.3.7-gb3e264a

BENCH: GANESHA VS KNFSD (METADATA 1/3)
Results of mdtest: directory create/stat/rm
[Charts: directory creates/s, stats/s and unlinks/s by number of clients, comparing knfsd and Ganesha (NFSv3/TCP and 9P/RDMA)]
- knfsd is better than Ganesha on directory metadata management, especially on stats

BENCH: GANESHA VS KNFSD (METADATA 2/3)
Results of mdtest: file create/stat/rm
[Charts: file creates/s, stats/s and unlinks/s by number of clients, comparing knfsd/tcp, gshv3/tcp and gsh9p/rdma]
- knfsd is better than Ganesha on file metadata management, too

BENCH: GANESHA VS KNFSD (METADATA 3/3)
Results of mdtest: tree create/remove
[Charts: trees created/s and trees removed/s by number of clients, comparing knfsd/tcp, gshv3/tcp and gsh9p/rdma]
- knfsd and Ganesha have similar performances on tree operations
- Ganesha becomes slightly better as the number of clients increases

BENCH: GANESHA VS KNFSD (DD READ)
[Charts: single-client read throughput (MB/s) with dd versus file size (GB), comparing native Lustre, Ganesha v3/tcp and knfsd v3/tcp; a second chart zooms in on the two NFS servers only]

BENCH: GANESHA VS KNFSD (DD WRITE)
[Charts: single-client write throughput (MB/s) with dd versus file size (GB), comparing native Lustre, Ganesha v3/tcp and knfsd v3/tcp; a second chart zooms in on the two NFS servers only]

BENCH: GANESHA VS KNFSD (IOR READ)
[Charts: aggregate read throughput (MB/s) via IOR versus number of clients, comparing native Lustre, Ganesha v3/tcp and knfsd v3/tcp; a second chart zooms in on the two NFS servers only]

BENCH: GANESHA VS KNFSD (IOR WRITE)
[Charts: aggregate write throughput (MB/s) via IOR versus number of clients, comparing native Lustre, Ganesha v3/tcp and knfsd v3/tcp; a second chart zooms in on the two NFS servers only]

COMMENTS ABOUT IO BENCHMARKS
- Ganesha and knfsd have similar single-client performances
  - knfsd is faster on writes (about 7% better)
  - Ganesha is faster on reads (about 3% better)
  - Read operations are strongly impacted by Lustre's caches and the NFS client caches
- Ganesha is interesting in clustered environments
  - Ganesha's performance is about 30% better than knfsd's when multiple clients do write operations on the same server
  - Read operations suffer from huge cache effects
    - Read operations (with both servers) are faster than native LUSTRE reads!
    - Both Ganesha and knfsd behave the same way

COMMENTS ABOUT METADATA BENCHMARKS
- Ganesha accesses objects "by fid"
  - NFS file handles carry the Lustre FID of the related object
  - Ganesha builds the related path in /mountpoint/.lustre
  - Ganesha then uses this "fid path" to access the object (see the sketch after this slide)
- knfsd runs in kernel space but Ganesha runs in user space
  - Information has to be moved from kernel space to Ganesha
  - Lustre seems to behave differently depending on whether objects are accessed by path or by FID
    - Any comment in the room? Feedback is wanted on this point
- Both Ganesha and knfsd run on a single Lustre client
  - Their performance will never exceed that of a single client
  - Using pNFS will break this bottleneck
  - A single client in Lustre 2.1 suffers from the "single shared file" issue, as multiple accesses are done to a single file, with a direct impact on NFS performance
    - See LU1666, LU1669, LU2481 (mostly fixed in 2.1.5)
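To make the "fid path" idea above concrete, here is an illustrative sketch, an assumption rather than Ganesha source, of how the FID carried in an NFS file handle can be turned into a path under <mountpoint>/.lustre/fid and opened there. The mount point is illustrative, and the exact FID string format accepted under .lustre/fid (with or without brackets) may vary with the Lustre version.

/* Sketch only: open a Lustre object "by fid" through the .lustre/fid directory. */
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <lustre/lustreapi.h>

int open_by_fid(const char *mountpoint, const struct lu_fid *fid, int flags)
{
        char fidpath[PATH_MAX];

        /* e.g. /mnt/lustre/.lustre/fid/[0x200000400:0x1:0x0] */
        snprintf(fidpath, sizeof(fidpath), "%s/.lustre/fid/" DFID,
                 mountpoint, PFID(fid));
        return open(fidpath, flags);
}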

FEEDBACK FROM PRODUCTION CASES
- Ganesha is used in production at CEA
  - Ganesha exports the HPSS namespace (metadata only) on TERA and TGCC
  - Ganesha exports LUSTRE (full rw access) on TGCC
    - Part of the compute machines used an obsolete kernel (no LUSTRE client)
    - NFSv3 was used as a fallback
    - Ganesha was providing NFS shares in RW
  - We know Ganesha can be used in HPC environments: we did use it
- What about crashes?
  - Ganesha resides in user space
    - NFSv3 is a stateless protocol
    - NFSv4.x has client recovery features
    - If the daemon crashes, just restart it and continue working
  - Big issue related to knfsd
    - Depending on some access patterns, knfsd could generate LBUGs
    - If knfsd crashes, it crashes the whole node and you need to reboot

AS A CONCLUSION
- Ganesha's development is continuing
  - More NFSv4.x features, including better ACL support and delegation support
  - More pNFS for LUSTRE
  - LUSTRE changelogs to implement upcalls for FSAL LUSTRE
  - Support for NFS/RDMA
    - Ganesha already has RDMA support for 9P2000.L
- Ganesha is stable enough to be used in production
  - Ganesha keeps good performance with many clients
- User space is a nice place
  - Easy access to many services (Kerberos, idmapper, DNS, ...)
  - Makes it easy to build a sandbox
  - It's easier to update a daemon than a kernel element
- Security
  - Ganesha has efficient NFS/krb5 support via RPCSEC_GSS
  - We will make Ganesha capable of running as a non-root user
    - The service will be restricted to NFSv4.x and 9P2000.L

QUESTIONS?
