IBM Spectrum Scale HDFS Support - Files.gpfsug

Transcription

IBM Spectrum Scale HDFS Support
IBM Spectrum Scale German User Meeting 2017
Mar 8th-9th, 2017
Ulf Troppens

http://files.gpfsug.org/presentations/2016/SC16/06 - Carrie Spear - Spectrum Sclale and HDFS.pdf

Essentials
- The default storage for Hadoop is HDFS.
- HDFS is the Hadoop Distributed File System, which runs on storage-rich servers (storage internal to servers).
- Spectrum Scale provides an HDFS connector and allows existing Hadoop applications to run directly on Spectrum Scale.
- This enables customers to create complex analytics workflows, minimize data movement and copies, and speed up time to insight.
- HDFS is a shared-nothing architecture, which is very inefficient for high-throughput jobs (disks and cores grow in the same ratio).
- Costly data protection: uses 3-way replication; limited RAID/erasure coding.
- Works only with Hadoop, i.e. weak support for File or Object protocols.
- Clients have to copy data from enterprise storage to HDFS in order to run Hadoop jobs; this can result in running on stale data.
The Spectrum Scale Transparency Connector brings analytics to the data.
IBM Systems
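The "costly data protection" point can be made concrete with a little arithmetic. The sketch below compares the raw capacity needed for 3-way replication with that of an erasure-coded layout. The 8 data + 2 parity layout and the 100 TB figure are illustrative assumptions, not numbers taken from the slides.

```python
def raw_capacity_tb(usable_tb: float, data_units: int, redundancy_units: int) -> float:
    """Raw capacity needed to store usable_tb of data.

    data_units / redundancy_units describe the protection layout:
    3-way replication is 1 data copy + 2 extra copies,
    an 8+2 erasure code is 8 data strips + 2 parity strips.
    """
    return usable_tb * (data_units + redundancy_units) / data_units


usable = 100.0  # hypothetical amount of user data, in TB

replication_3way = raw_capacity_tb(usable, data_units=1, redundancy_units=2)
erasure_8_plus_2 = raw_capacity_tb(usable, data_units=8, redundancy_units=2)  # assumed layout

print(f"3-way replication: {replication_3way:.0f} TB raw")
print(f"8+2 erasure code:  {erasure_8_plus_2:.0f} TB raw")
```

Under these assumptions, replication needs three times the usable capacity, while the erasure-coded layout needs 1.25 times.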

Unleash New Storage Economics on a Global Scale
[Diagram: users and applications (client workstations, compute farm, Hadoop, mainframe, SAP, Oracle, SAS, DB2, MQ, etc.) access IBM Spectrum Scale, which spans Site A, Site B and Site C with automated data placement and data migration. Storage tiers: active/hot/warm “online” (flash, disk, storage-rich servers), archive/cold “offline” (tape), and off-premise (object storage, multi-cloud storage toolkit). Interfaces shown include Swift, Cinder, Docker, Glance and Manila.]

Store everywhere. Run anywhere.
Analytics without complexity
Challenge: separate storage systems for ingest, analysis, results
- HDFS requires locality-aware storage (namenode)
- Data transfer slows time to results
- Different frameworks & analytics tools use data differently
HDFS Transparency
- Map/Reduce on shared, or shared-nothing storage
- No waiting for data transfer between storage systems
- Immediately share results
- Single ‘Data Lake’ for all applications
- Enterprise data management
- Archive and analysis in-place
[Diagram: raw data flows through ingest and analysis; direct access via Object, File and POSIX.]
Analyze object and file data without copying into HDFS
Copyright IBM Corporation 2015

Backup Of Large Spectrum Scale File Systems
[Diagram: a Spectrum Scale cluster is backed up (mmbackup) to, and restored (GUI or CLI) from, a Spectrum Protect Server. The Spectrum Protect backup-archive client is typically installed on several cluster nodes; the Spectrum Scale mmbackup tool coordinates processing.]
Function
- Massively parallel file system backup processing
- Spectrum Scale mmbackup creates a local shadow of the Spectrum Protect DB and uses the policy engine to identify files for backup
- The Spectrum Protect backup-archive client is used under the hood to back up files to the Spectrum Protect Server
- Spectrum Protect restore (CLI or GUI) can be used to restore files
Use any backup program to back up file, object and Hadoop data.
Use Spectrum Protect to benefit from mmbackup and SOBAR to back up and restore huge amounts of data.

IBM Delivers New Platform to Help Clients Address Storage Challenges at Massive Scale
Las Vegas, NV (IBM PartnerWorld) - 14 Feb 2017: IBM (NYSE: IBM) and Hortonworks (NASDAQ: HDP) today announced the planned availability of Hortonworks Data Platform (HDP) for IBM Elastic Storage Server (ESS) and IBM Spectrum Scale. The agreement with Hortonworks will lead to certification of Hortonworks HDP on Power with IBM Spectrum Scale and Hortonworks HDP on x86 with IBM Spectrum [...]
[...]challenges-massive-scale/

A Tale of Two Connectors
GPFS Hadoop Connector (henceforth known as the “old” connector)
- Emulates a Hadoop-compatible filesystem, i.e. replaces HDFS
- Stateless
- Free download (link)
- Supports Spectrum Scale 4.1.x, 4.1.1.x and 4.2
- Currently supported with IOP 4.0.x and 4.1.x
- Integrated with Ambari (IOP 4.1.x)
Spectrum Scale HDFS Transparency Connector (henceforth known as the “new” connector)
- Integrates with HDFS: reuses the HDFS client and implements the NameNode and DataNode RPCs
- Stateless
- Free download (link)
- Supports Spectrum Scale 4.1.x, 4.1.1.x and 4.2
- Planned for IOP 4.2 (April timeframe)
- Ambari integration supported with IOP 4.2

Old GPFS Hadoop Connector Approach
How can we be sure we’re compatible?
- The Hadoop File System API is intended to be open: public abstract class org.apache.hadoop.fs.FileSystem
- “All user code that may potentially use the Hadoop Distributed File System should be written to use a FileSystem object.” (Source: hadoop.apache.org)
- Latest File System APIs are described at [...]g/apache/hadoop/fs/FileSystem.html

Old GPFS Hadoop Connector Approach
- All based on the org.apache.hadoop.fs.FileSystem API
- Spectrum Scale (GPFS) is no different
Source: https://wiki.apache.org/hadoop/HCFS
3/15/2017

Old GPFS Hadoop Connector Approach
Applications communicate with Hadoop using the FileSystem API. Therefore, transparency is preserved.
[Diagram: two stacks side by side. HDFS stack: Hadoop Application (API level), Hadoop FileSystem API, HDFS (Hadoop level), Ext4 (kernel-level file system), disks. Spectrum Scale stack: Hadoop Application, Hadoop FileSystem API, GPFS Hadoop Connector, Spectrum Scale FPO, disks.]
“All user code that may potentially use the Hadoop Distributed File System should be written to use a FileSystem object.” (Source: hadoop.apache.org)
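The transparency argument can be illustrated with a small Python analogy. This is not the Hadoop Java API: `FileSystem`, `ToyHdfs`, `ToyGpfsConnector` and `word_count` are hypothetical stand-ins that only mirror the idea of application code written against an abstract file system class, with the backend swapped underneath.

```python
from abc import ABC, abstractmethod


class FileSystem(ABC):
    """Toy stand-in for an abstract file system API (hypothetical, simplified)."""

    @abstractmethod
    def create(self, path: str, data: bytes) -> None:
        """Store data under path."""

    @abstractmethod
    def open(self, path: str) -> bytes:
        """Return the data stored under path."""


class _InMemoryFileSystem(FileSystem):
    """Shared in-memory storage so the example runs without any cluster."""

    def __init__(self) -> None:
        self._files: dict[str, bytes] = {}

    def create(self, path: str, data: bytes) -> None:
        self._files[path] = data

    def open(self, path: str) -> bytes:
        return self._files[path]


class ToyHdfs(_InMemoryFileSystem):
    """Pretend HDFS backend (hypothetical)."""


class ToyGpfsConnector(_InMemoryFileSystem):
    """Pretend connector backend that would sit on Spectrum Scale (hypothetical)."""


def word_count(fs: FileSystem, path: str) -> int:
    """Application code: depends only on the abstract FileSystem interface."""
    return len(fs.open(path).split())


results = {}
for backend in (ToyHdfs(), ToyGpfsConnector()):
    backend.create("/data/input.txt", b"store everywhere run anywhere")
    results[type(backend).__name__] = word_count(backend, "/data/input.txt")
```

The application function never names a concrete backend, which is the property the old connector relies on.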

New Spectrum Scale HDFS Transparency Design
[Diagram: applications and higher-level languages (Hive, BigSQL, JAQL, Pig) address storage as hdfs://hostnameX:portnumber. Each Hadoop client uses the Map/Reduce API and the Hadoop FileSystem API on top of the HDFS client, which talks HDFS RPC over the network to the Spectrum Scale connector server. On each GPFS node, the GPFS connector service implements the Hadoop FS APIs on top of libgpfs and the POSIX API. Platforms shown: Power, Linux; commodity hardware or shared storage.]
Supported Hadoop versions: 2.7.1

New Spectrum Scale HDFS Transparency Design
- Each node will be installed with a connector datanode server
- Only one node will be installed with the connector namenode server
- The connector namenode server will be configured with HA, similar to HDFS
- GA’ed
[Diagram: the Hadoop cluster reaches the connector namenode service and the connector datanode services via HDFS RPC over the network; each service runs on a GPFS FPO node in the GPFS/FPO cluster.]

New Spectrum Scale HDFS Transparency Design
- Connector servers are installed on a limited number of nodes (e.g. GPFS NSD servers)
- The GPFS client is not needed on the Hadoop compute nodes
- DNS rotation or CES can be used to load balance HDFS clients
- GA’ed on 2016/1/22
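The DNS rotation mentioned above can be sketched as a toy round-robin resolver. This is a simulation only and does not query real DNS; the host name and addresses are hypothetical, and `RoundRobinResolver` and `pick_server` are illustrative names rather than part of any connector.

```python
from collections import Counter, deque


class RoundRobinResolver:
    """Simulates a DNS name that rotates its address list on every lookup."""

    def __init__(self, name: str, addresses: list[str]) -> None:
        self.name = name
        self._addresses = deque(addresses)

    def resolve(self) -> list[str]:
        """Return the current address order, then rotate for the next lookup."""
        answer = list(self._addresses)
        self._addresses.rotate(-1)
        return answer


def pick_server(resolver: RoundRobinResolver) -> str:
    """A client that simply connects to the first address it is given."""
    return resolver.resolve()[0]


# Hypothetical connector server addresses behind one name.
resolver = RoundRobinResolver(
    "hdfs-connector.example.com",
    ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
)

picks = [pick_server(resolver) for _ in range(6)]
load = Counter(picks)
print(load)
```

Because each lookup sees a rotated list, successive clients spread evenly across the connector servers without needing a dedicated load balancer.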

New Spectrum Scale HDFS Transparency Design
Key Advantages
- Support workloads that have hard-coded HDFS dependencies
- Simpler integration for currently compatible workloads & components
- Leverage HDFS client cache for better performance
- No need to install Spectrum Scale clients on all nodes
- Full Kerberos support for Hadoop ecosystem
Coming soon
- BigInsights 4.2 support (additional components)
- HDFS Spectrum Scale Federation
- Federate multiple Spectrum Scale clusters
- Isolate multiple Hadoop clusters on the same filesystem (restrict to sub-directory)

Current Ambari Integration
New BigInsights 4.1.SpectrumScale stack
- Inherits from BigInsights 4.1 stack
- Removes HDFS, adds Spectrum Scale, changes all dependencies
- Can install IOP Spectrum Scale (either new GPFS filesystem or integrate with existing filesystem)
Value-add integration
- Basic Spectrum Scale monitoring (AMS)
- Support separate connector control
- Support GPFS and connector upgrades
- Collect GPFS snap
- Change GPFS parameters
- Add new nodes
- Remove nodes
- Provide quick link to Spectrum Scale GUI for full management and monitoring

[Figure: Current Ambari Integration]

Ambari Integration with HDFS Transparency
- Biggest change is that there is no new stack
- Spectrum Scale is added as a new service after full IOP install with HDFS (use dummy directory / mount point for HDFS)
- Spectrum Scale service “integrates” with HDFS
- Will support “un-integrate” capability
  - Flip back and forth between HDFS & GPFS
  - Will not move data back and forth between HDFS & GPFS
- Will simplify future upgrades

References
- Spectrum Scale Knowledge [...] STXKQY 4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1adv hadoop.htm
- IBM [...] ommunity/wikis/home?lang [...] r%20Hadoop
- Deployment Guide and other useful [...] unity/wikis/home?lang [...] GPFS%29/page/References?section HDFSTIG
