Lenovo Big Data Validated Design for Cloudera Enterprise


Lenovo Big Data Validated Design for Cloudera Enterprise with Local and Decoupled SAS Storage

Last update: 24 October 2018
Version 1.3
Configuration Reference Number: BDCLDRXX83

Solution based on the ThinkSystem SR650 server, bare-metal and virtualized
Reference architecture for Cloudera Enterprise with Apache Hadoop and Apache Spark
Deployment considerations for scalable racks including detailed validated bills of material
Solution based on ThinkSystem SD530 compute node with D3284 SAS storage expansion enclosure

Dan Kangas (Lenovo)
Weixu Yang (Lenovo)
Ajay Dholakia (Lenovo)
Dwai Lahiri (Cloudera)

Table of Contents

1 Introduction
2 Business problem and business value
3 Requirements
    Functional Requirements
    Non-functional Requirements
4 Architectural Overview
    Cloudera Enterprise
    Bare-metal Cluster - Local and External SAS Storage (JBOD)
    Virtualized Cluster with VMware vSphere
5 Component Model
    Cloudera Components
    Apache Spark on Cloudera
6 Operational Model
    Hardware Description
        6.1.1 Lenovo ThinkSystem SR650 Server
        6.1.2 Lenovo ThinkSystem SR630 Server
        6.1.3 Lenovo ThinkSystem SD530 Compute Server
        6.1.4 Lenovo RackSwitch G8052
        6.1.5 Lenovo RackSwitch G8272
        6.1.6 Lenovo RackSwitch NE2572
        6.1.7 Lenovo RackSwitch NE10032
        6.1.8 Lenovo D3284 SAS Expansion Enclosure
    Cluster Node Configurations
        6.2.1 Worker Nodes
        6.2.2 Master and Utility Nodes
        6.2.3 System Management and Edge Nodes
        6.2.4 External SAS Storage Node
    Cluster Software Stack
        6.3.1 Cloudera Enterprise CDH
        6.3.2 Red Hat Operating System
    Cloudera Service Role Layouts
    System Management
    Networking
        6.6.1 Data Network
        6.6.2 Hardware Management Network
        6.6.3 Multi-rack Network
        6.6.4 10Gb and 25Gb Data Network Configurations
    Predefined Cluster Configurations
        6.7.1 SR650 Configurations
        6.7.2 SD530 with D3284 Configurations
        6.7.3 Cluster Storage Capacity
        6.7.4 Storage Tiering with NVMe and SSD Drives
        6.7.5 D3284 Storage Tiering
        6.7.6 SD530 and D3284 Configuration Options
7 Deployment considerations
    Increasing Cluster Performance
    Processor Selection
        7.2.1 SR630/SR650 Processors
        7.2.2 SD530 Processors
    Designing for Storage Capacity and Performance
        7.3.1 Node Capacity
        7.3.2 Node Throughput
        7.3.3 HDD Controller
    Memory Size and Performance
    Data Network Considerations
    Designing with Hadoop Virtualization Extensions (HVE)
        7.6.1 Enabling Hadoop Virtualization Extensions (HVE)
    Cloudera VMware Virtualized Configuration
        7.7.1 Cluster Software Stack
        7.7.2 ESXi Hypervisor and Guest OS Configuration
    Estimating Disk Space
    Scaling Considerations
        7.9.1 Scaling D3284 External SAS JBOD Storage
        7.9.2 Scaling D3284 Storage and SD530 Compute Independently
    High Availability Considerations
        7.10.1 Network Availability
        7.10.2 Cluster Node Availability
        7.10.3 Storage Availability
        7.10.4 Software Availability
    Linux OS Configuration Guidelines
        7.11.1 OS configuration for Cloudera CDH
        7.11.2 OS Configuration for SAS Multipath
    Designing for High Ingest Rates
8 Bill of Materials - SR650 Nodes
    Master Node
    Worker Node
    System Management Node
    Management Network Switch
    Data Network Switch
    Rack
    Cables
9 Bill of Materials - SD530 with D3284
    Master Node
    Worker Node
    Systems Management Node
    External SAS Storage Enclosure
    Management Network Switch
    Data Network Switch
    Rack
    Cables
    Software
10 Acknowledgements
11 Resources
Document history

1 Introduction

This document describes the reference architecture for Cloudera Enterprise on bare-metal with locally attached storage and with decoupled compute and storage, and on a virtualized platform with VMware vSphere. It provides a predefined and optimized hardware infrastructure for Cloudera Enterprise, a distribution of Apache Hadoop and Apache Spark with enterprise-ready capabilities from Cloudera. This reference architecture provides the planning, design considerations, and best practices for implementing Cloudera Enterprise with Lenovo products.

Lenovo and Cloudera worked together on this document, and the reference architecture that is described herein was validated by Lenovo and Cloudera.

With the ever-increasing volume, variety and velocity of data becoming available to an enterprise comes the challenge of deriving the most value from it. This task requires the use of suitable data processing and management software running on a tuned hardware platform. With Apache Hadoop and Apache Spark emerging as popular big data storage and processing frameworks, enterprises are building so-called Data Lakes by employing these components.

Cloudera brings the power of Hadoop to the customer's enterprise. Hadoop is an open source software framework that is used to reliably manage large volumes of structured and unstructured data. Cloudera expands and enhances this technology to withstand the demands of your enterprise, adding management, security, governance, and analytics features. The result is a more enterprise-ready solution for complex, large-scale analytics.

VMware vSphere brings virtualization to Hadoop with many benefits that cannot be obtained on physical infrastructure or in the cloud. Virtualization simplifies the management of your big data infrastructure, enables faster time to results, and makes it more cost effective. It is a proven software technology that makes it possible to run multiple operating systems and applications on the same server at the same time. Virtualization can increase IT agility, flexibility, and scalability while creating significant cost savings. Workloads are deployed faster, performance and availability increase, and operations become automated, resulting in IT that is simpler to manage and less costly to own and operate.

This reference architecture is intended for IT professionals, technical architects, sales engineers, and consultants who plan, design, and implement the big data solution with Lenovo hardware. It is assumed that you are familiar with Hadoop components and capabilities. For more information about Hadoop, see the “Resources” section.

2 Business problem and business value

Business Problem

The world is well on its way to generating more than 40 million TB of data by 2020. In all, 90% of the data in the world today was created in the last two years alone. This data comes from everywhere, including sensors that are used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone global positioning system (GPS) signals. This data is big data.

Big data spans the following dimensions:
- Volume: Big data comes in one size: large, whether in size, quantity, or scale. Enterprises are awash with data, easily amassing terabytes and even petabytes of information.
- Velocity: Often time-sensitive, big data must be used as it is streaming into the enterprise to maximize its value to the business.
- Variety: Big data extends beyond structured data, including unstructured data of all varieties, such as text, audio, video, click streams, and log files.

Enterprises are incorporating large data lakes into their IT architecture to store all their data. The expectation is that ready access to all the available data can lead to higher quality insights obtained through the use of analytics, which in turn drive better business decisions. A key challenge faced today by these enterprises is setting up an easy-to-deploy data storage and processing infrastructure that can start to deliver the promised value in a very short amount of time. Spending months and hiring dozens of skilled engineers to piece together a data management environment is very costly and often leads to frustration from unrealized goals. Furthermore, the data processing infrastructure needs to be easily scalable in addition to achieving desired performance and reliability objectives.

Big data is more than a challenge; it is an opportunity to find insight into new and emerging types of data to make your business more agile. Big data also is an opportunity to answer questions that, in the past, were beyond reach. Until now, there was no effective way to harvest this opportunity. Today, Cloudera uses the latest big data technologies, such as the in-memory processing capabilities of Spark in addition to the standard MapReduce scale-out capabilities of Hadoop, to open the door to a world of possibilities.

Business Value

Hadoop is an open source software framework that is used to reliably manage and analyze large volumes of structured and unstructured data. Cloudera enhances this technology to withstand the demands of your enterprise, adding management, security, governance, and analytics features. The result is an enterprise-ready solution for complex, large-scale analytics.

How can businesses process tremendous amounts of raw data in an efficient and timely manner to gain actionable insights? Cloudera allows organizations to run large-scale, distributed analytics jobs on clusters of cost-effective server hardware. This infrastructure can be used to tackle large data sets by breaking up the data into “chunks” and coordinating data processing across a massively parallel environment. After the raw data is stored across the nodes of a distributed cluster, queries and analysis of the data can be handled efficiently, with dynamic interpretation of the data formatted at read time. The bottom line: Businesses can finally get their arms around massive amounts of untapped data and mine that data for valuable insights in a more efficient, optimized, and scalable way.
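The idea of breaking data into chunks, processing the chunks in parallel, and interpreting raw records only at read time can be illustrated with a minimal Python sketch. This is not Cloudera or Hadoop code: the "user,action" record layout, the function names, and the use of local threads in place of cluster nodes are all assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, chunk_size):
    """Break the raw records into fixed-size chunks (a stand-in for distributed blocks)."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def parse_at_read_time(line):
    """Interpret a raw line only when it is read (hypothetical 'user,action' layout)."""
    user, action = line.split(",")
    return user.strip(), action.strip()


def map_chunk(chunk):
    """Map step: count actions within one chunk, independently of the other chunks."""
    return Counter(action for _, action in map(parse_at_read_time, chunk))


def count_actions(records, chunk_size=2, workers=4):
    """Run the map step across workers, then reduce the partial counts into one total."""
    chunks = split_into_chunks(records, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)  # reduce step: merge per-chunk counts
    return total


# Raw data is stored as plain text; no schema is applied until it is read.
raw = ["alice,click", "bob,view", "carol,click", "dave,purchase", "erin,view"]
totals = count_actions(raw)
```

In a real cluster the chunks would live on different nodes and the map step would run where the data is stored; the threads here only mimic that coordination on a single machine.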

Cloudera that is deployed on Lenovo System x servers with Lenovo networking components provides superior performance, reliability, and scalability. The reference architecture supports entry through high-end configurations and the ability to easily scale as the use of big data grows. A choice of infrastructure components provides flexibility in meeting varying big data analytics requirements.

There is growing interest in deploying Hadoop on a virtualized infrastructure, driven by the promise of easier cluster management during initial deployment and when adding more nodes as data storage and processing requirements grow. The ability to have a virtualized Hadoop environment look and feel the same as it does on a bare-metal infrastructure allows flexibility in incorporating the solution within an enterprise’s data management architecture.

3 Requirements

The functional and non-functional requirements for this reference architecture are described in this section.

Functional Requirements

A big data solution supports the following key functional requirements:
- Ability to handle various workloads, including batch and real-time analytics
- Indus
