DriveScale-Cloudera Reference Architecture

Transcription

DriveScale-ClouderaReference Architecture

Drivescale-ClouderaREFERENCE ARCHITECTURETable of Contents1.Executive Summary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.Audience and Scope. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43.DriveScale Cloudera Enterprise Solution Overview. . . . . . . . . . . . . . . . . . . . . . . . 44.DriveScale Components Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64.1 Hardware: DriveScale SAS Adapter. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64.2 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65.Reference Architecture Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75.1 Physical Cluster Component List. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75.2 Logical Cluster Topology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85.3 Physical Cluster Topology. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105.4 Cluster Management. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125.5 Enabling Hadoop Virtualization Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145.6 Disk and Filesystem Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185.7 OS Supportability/Compatibility Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195.8 JBOD Supportability/Compatibility Matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196.Rack Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2019 DriveScale Inc. All Right Reserved.2 of 54

Drivescale-ClouderaREFERENCE ARCHITECTURE7.References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208.Bill of Materials. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219.Conclusion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2110. Appendix A: Glossary of Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2211. Appendix B: DriveScale Cluster Install . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2411.1 Configure Your Domains with DriveScale Central (DSC). . . . . . . . . . . . . . . . . . 2411.2 Set up DriveScale Composer nodes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2611.3 Set up DriveScale Adapter (DSA). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2611.4 Start the Composer login to Composer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2711.5 Set up Servers/DataNodes/MasterNodes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2811.6 Tagging JBOD and drives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2811.7 Creating Server Nodes and Clusters from templates . . . . . . . . . . . . . . . . . . . . 2911.7.1 Creating Node and Cluster Template. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2911.7.2 Creating Cluster Template . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3411.7.2 DriveScale Cluster Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3412. Appendix B: Cloudera Manager Install . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3512.1 Cloudera Manager Installation Procedure for Reference Architecture . . . . . . 36 2019 DriveScale Inc. All Right Reserved.3 of 54

REFERENCE ARCHITECTUREDRIVESCALE-CLOUDERA1. Executive SummaryThis document is a high-level design reference architecture guide for implementing ClouderaEnterprise on a DriveScale solution with industry standard servers and JBOD.The reference architecture introduces all the high-level hardware, and software that are included inthe stack. Each high-level component is then described individually. This reference architecture doesnot describe the Cloudera data components or their applications.DriveScale Technology OverviewDriveScale is leading the charge in bringing hyperscale computing capabilities to mainstreamenterprises. It’s composable data center architecture transforms rigid data centers into flexibleand responsive scale-out deployments. Using DriveScale, data center administrators can deployindependent pools of commodity compute and storage resources, automatically discover availableassets, and combine and recombine these resources as needed. The solution is provided througha set of on-premises and SaaS tools that coordinate between multiple levels of infrastructure. WithDriveScale, Hadoop architects can more easily support Hadoop deployments of any size as well asother modern application workloads.DriveScale provides hardware and software technology that allows separate deployment of computeand storage using commodity servers with minimal drives for Operating System and JBODs (Justa Bunch of Disks), with flexible binding of storage-to-compute resources in any ratio required by anapplication. As needs change, these bindings can be dissolved and reconfigured on demand, allunder software control.DriveScale makes it possible for data-driven companies to deploy high scale, high performanceserver infrastructure through composability for data-intensive computing. The DriveScale ComposablePlatform enables agility at cloud scale by creating flexible pools of disaggregated heterogeneouscompute nodes and storage systems and composing them into secure, highly available serversoptimized to each workload. DriveScale empowers IT to quickly and cost-effectively expand their dataand analytics deployments and drive revenue.2. Audience and ScopeThis reference architecture guide is for Hadoop and IT architects who are responsible for the designand deployment of Cloudera Enterprise solutions on premises, as well as for Apache Hadoopadministrators and architects and data center architects/engineers who collaborate with specialists inthat space.3. DriveScale Cloudera Enterprise Solution OverviewApache Hadoop is designed to address the ever so changing hardware requirements from customersfor a more flexible and dynamic hardware infrastructure that provides significant cost and operationalbenefits. It is designed with composability as the primary goal, saving money, improving utilization andgreatly simplifying the deployment of Hadoop clusters. 2017 DriveScale Inc. All Right Reserved.4 of 54

Drivescale-ClouderaREFERENCE ARCHITECTUREHadoop is an Apache project being developed in the Java programming language by a globalcommunity of contributors. Yahoo!, has been the largest contributor to this project, and uses ApacheHadoop extensively across its businesses. Core committers on the Hadoop project include employeesfrom Cloudera, eBay, Facebook, Getopt, Hortonworks, Huawei, IBM, InMobi, INRIA, LinkedIn, MapR,Microsoft, Pivotal, Twitter, UC Berkeley, VMware, WANdisco, and Yahoo!, with contributions from manymore individuals and organizations.Although Hadoop is popular and widely used, installing, configuring, and running a productionHadoop cluster involves many concerns, including: Choosing the appropriate Hadoop software distribution and extensions Installing monitoring and management software Allocation of Hadoop services to physical nodes Selection of appropriate server hardware Rightsizing the storage configuration Implementing data locality Design of the network fabric Sizing and system scalability Overall performanceThese concerns are complicated by the need to understand the workloads that will be running on thecluster, the fast-moving pace of the core Hadoop project, and the challenges to managing a systemdesigned to scale to thousands of nodes in a single cluster.The DriveScale Cloudera Solution was designed by DriveScale in collaboration with Cloudera, andembodies all the hardware, software, resources and services needed to run Hadoop in a productionenvironment. This end-to-end solution approach means that you can be in production with Hadoop ina shorter time than is typically possible with homegrown solutions. The solution is based on ClouderaEnterprise Data Hub 5.x (including Cloudera Distributed Hadoop), DriveScale hardware and software,industry standard servers, network switches and JBODs.This solution includes components that span the entire solution stack: Reference architecture and best practices Optimized storage configurations Optimized network infrastructure Cloudera Enterprise Data Hub including Cloudera Distributed HadoopThis solution is designed to address the clear majority of Apache Hadoop use cases including, but notlimited to: 2019 DriveScale Inc. All Right Reserved.Big data analytics5 of 54

Drivescale-ClouderaREFERENCE ARCHITECTURE ETL Offload Data Warehouse Optimization Batch processing of unstructured data Big data visualization Search and predictive analysis4. DriveScale Components OverviewDriveScale system is composed of one hardware component and four software components whichare described below:4.1 Hardware: DriveScale SAS AdapterThis is a 1U appliance with adapters that connect to servers via 10Gb Ethernet interfaces and toJBOD’s via SAS interfaces.4.2 SoftwareThere are four principal components of the DriveScale software:a)b) 2019 DriveScale Inc. All Right Reserved.DriveScale Composer The server running the DriveScale Composer (Composer) bundle is called the Composer node. A typical deployment consists of three Composer's in a clustered configuration for high availability(HA). The software manages and configure resources and contains the inventory/configurationinformation repository and database:3Inventory: Composer's, DS Adapters, switches, JBOD chassis, disks, server nodes3Configuration: node templates, cluster templates, configured clusters3Composer's Database: used as a message bus to communicate with the end points.DriveScale Server Agent6 of 54

Drivescale-ClouderaREFERENCE ARCHITECTURE c)DriveScale Server Agent discovery action provides inventory for hardware and servers, and createsmappings between server nodes and the disks they consume.DriveScale Server AgentCloud-based software management portal that acts as the:d)osoftware distribution repositories for subscribersoDriveScale keys repositoryocentralized log file repositoryouser documentation repositoryolicense managerDriveScale SAS Adapter Software DriveScale SAS Adapter software enables the JBODs to be mapped to the servers and over thenetwork to be used as local drives.5. Reference Architecture Details5.1 Physical Cluster Component ListThe following table lists the physical components for the riveScale SASAdapterDHCP, Jumbo frameenabled1U appliance withadapters that connectto servers via Ethernet,and to JBOD’s via SAS.2DriveScale AdapterDHCP, Jumbo frameenabledProvides the datanetwork.4 for each chassisDriveScale ComposerDriveScale Composerrunning as a VMManages andconfigures the nodesand cluster and alsostores the inventory/configurationrepository of everyhardware in thecluster.Min 1, for HA3 DriveScaleComposer's should beconfigured as masterand slaveServers2 socket CPU andmemory according tothe individual Hadoopcluster requirementsCommodity x86servers that house allthe Node Manager,compute instancesand DriveScaleagents.Min 3 Master nodes 5 Data nodesHDD for Servers2 drives configured inRAID 1The internal drives areused for OS install.2 for each server 2019 DriveScale Inc. All Right Reserved.7 of 54

Drivescale-ClouderaREFERENCE ityNICsDual- port 10 GbpsEthernet SFP NICs.Provides the datanetwork1 for each serverJBOD ChassisDefault configurationHouses the drive withdual IO controllers.Min 2, Recommended3 for productionenvironment byClouderaHDD for JBODDefault configurationDrives to house thedata for the cluster.Depending on thecluster requirementsToR 10G switchLLDP, MLAG, 9K JumboFrame configuredProvides data networkconnectivity.2 for each rackToR 1G switchDefault configurationProvides managementnetwork connectivity.1 for each rack5.2 Logical Cluster TopologyThe minimum requirements to build out the cluster are: 3 Master Nodes 5 Data Nodes 1 DriveScale SAS Adapter 1 DriveScale Composer 2 10G Switches 1 1G Switch 2 JBOD’s chassis with drivesThis reference architecture is built on 3 master nodes and 5 data nodes with 2 JBOD chassis and 126drives of 1or 2 or 3 TB HDD. The following table lists the configurations of the servers and number ofdrives used.For clusters that require the maximum read bandwidth out of each attached drive concurrently, itis recommended that the nodes in such a cluster be configured with a maximum of 8 drives each,assuming 2 x 10Gbps Ethernet bandwidth per node. However, this is an extreme case. A general ruleof thumb for calculating the number of drives to allocate to each node in a cluster is dependent onthe application but it is safe to allocate up to 16 drives per node, again assuming 2 x 10Gbps Ethernetbandwidth per node.With the availability of quad-port 10Gbps Ethernet adapters, one can add significantly higher I/O pernode and therefore greater numbers of drives as well. 2019 DriveScale Inc. All Right Reserved.8 of 54

Drivescale-ClouderaREFERENCE ityMaster nodes2 sockets 8 core CPU, 64GBRAM, 10GbE Intel NIC with 2internal HDD for OS and 4 highcapacity HDD mounted fromthe JBOD.Master nodes hosts the Clouderamaster services and DriveScaleagents.3Worker nodes2 sockets 8 core CPU, 64GBRAM, 10GbE Intel NIC with 2internal HDD for OS and 16high capacity HDD mountedfrom the JBOD. For Impalanodes, the minimum RAMshould be 128GB.Data nodes house the HDFS DataNodes and YARN Node managers,any additional required servicesand DriveScale agents.5Notes:- Customers with higher (or lower) compute needs can acquire bigger (or smaller) data nodesconfigured with CPU and memory that fits the specific requirements of their applications.- Similarly, depending on the data requirements, customers can add or remove disk drives to matchthe specific needs of their applications.The following table identifies service roles for different node types.Master NodeMaster NodeMaster ceManagerResourceManagerHistory ServerHiveManagement(misc)Navigator 2019 DriveScale Inc. All Right Reserved.Worker NodeNode ManagerMetaStore,WebHCat,HiveServer2Cloudera AgentCloudera AgentCloudera Agent,Oozie, ClouderaManager,ManagementServicesCloudera AgentNavigator, KeyManagementServices9 of 54

Drivescale-ClouderaREFERENCE ARCHITECTUREMaster NodeMaster NodeHUEHBASEMaster NodeWorker NodeHUEHMasterHMasterImpalaHMasterRegion ServerStateStore,CatalogImpala DaemonSearchSolrKafkaBrokerSparkHistory ServerHDFSNameNode, QJNRuns on YARNNameNode, QJNQJNDataNode5.3 Physical Cluster TopologyDiagram 1: DriveScale lab Architecture with 2x Adapter Chassis (8x Adapters in use), 2x JBOD,3 Master Nodes and 5 Data Nodes 2019 DriveScale Inc. All Right Reserved.10 of 54

Drivescale-ClouderaREFERENCE ARCHITECTUREDiagram 2: DriveScale lab Architecture with 3xDSA Chassis (12x Adapters in use), 2x JBOD, 3 MasterNodes and 5 Data NodesNotes:- The 1GbE management connections were made to 10GbE switches, Adapter's and JBOD’s. Theconnections are omitted in the diagram to ease readability.- 1GbE connection is used only for server management purpose with BMC IDRAC (Dell) or iLO (HPE).It is not a part of the Hadoop network. Please note that multi-homed clusters are not supported byCloudera.- SAS connections from DriveScale SAS Adapter chassis 2 to JBOD 2 was also replicated as Adapterchassis 1 to JBOD 1. The connections are omitted in the diagram to make it look less congested.- The drives to the master nodes and data nodes were distributed across the two JBOD’s chassis. 2019 DriveScale Inc. All Right Reserved.11 of 54

Drivescale-ClouderaREFERENCE ARCHITECTURE5.4 Cluster ManagementThis section details the steps for setting up a DriveScale enabled Hadoop cluster using Clouderamanager.Setting up DriveScale clusterBefore installing Cloudera Manager or using an existing install of Cloudera Manager, you mustcomplete the following tasks for setting up the DriveScale solution:1.Rack and install the DriveScale SAS Adapter and controllers (Adapters) using the documentationprovided by DriveScale.2.Rack and install the JBOD’s using the documentation provided by the vendor.3.Rack and install the server using the documentation provided by the vendor.4.Create RAID 1 for the internal HDD on the server and install the OS on all the servers.5.Install and configure DriveScale Composer either as a VM or on a standalone server.6.Set up DriveScale SAS Adapter configuration from DriveScale Composer.7.Install and configure DriveScale agents on the master and data nodes.8.Create master/data node and cluster template with required drives using DriveScale Composer.9.Create the cluster from the template using DriveScale Composer.10. Ensure that DriveScale cluster is up and running before proceeding ahead. 2019 DriveScale Inc. All Right Reserved.12 of 54

Drivescale-ClouderaREFERENCE ARCHITECTURESetting up Cloudera cluster1.After the successful completion of the steps mentioned above, install Cloudera Manager using theCloudera CDH Installation guide.2.Ensure that Cloudera HDFS cluster is set up in a high availability mode.3.The following services were set up for this reference architecture. keeperEnsure that the master and data nodes are up and running with the right assigned roles andstorage. 2019 DriveScale Inc. All Right Reserved.13 of 54

Drivescale-ClouderaREFERENCE ARCHITECTURE5.5 Enabling Hadoop Virtualization ExtensionsWith DriveScale solution, we enable configuration of a highly available Hadoop cluster including rackawareness. Hadoop Virtualization Extensions (HVE), enables customers to get additional capabilitiesfor failure mitigation and rack awareness thereby enabling the cluster to survive the worst-casescenario of total power or hardware failure of any component including JBOD failures for an extendedperiod. HVE can be enabled in Cloudera Manager. To enable HVE, follow the documentation on HVEfrom Cloudera. Also, below are the steps we followed for this reference architecture.For this reference architecture, below are the name details of the master and data nodes:Node TypesServer NamesMaster .drivescale.comu34.data1.r3.hq.drivescale.comData e.com1.Go to the Cloudera Manager.a)Configure the following safety valves based on your environment:oHDFS hdfs coresite.xml: property name net.topology.impl /name value org.apache.hadoop.net.NetworkTopologyWithNodeGroup /value /property property name net.topology.nodegroup.aware /name value true /value /property property name dfs.block.replicator.classname /name value kPlacementPolicyWithNodeGroup /value /property 2019 DriveScale Inc. All Right Reserved.14 of 54

Drivescale-ClouderaREFERENCE ARCHITECTUREoYARN YARN Service MapReduce Advanced Configuration Snippet (Safety Valve), add thefollowing properties and values: property name mapred.jobtracker.nodegroup.aware /name value true /value /property property

REFERENCE ARCHITECTURE 2017 DriveScale Inc. All Right Reserved. DRIVESCALE-CLOUDERA 1. Executive Summary This document is a high-level design reference architecture guide for implementing Cloudera Enterprise on a Dri