Dell EMC PowerStore: Apache Spark Solution Guide

Transcription

Technical White PaperDell EMC PowerStore: Apache Spark SolutionGuideAbstractThis document provides a solution overview for Apache Spark running on a DellEMC PowerStore appliance.June 2021H18663

RevisionsRevisionsDateDescriptionJune 2021Initial releaseAcknowledgmentsAuthor: Henry WongThis document may contain certain words that are not consistent with Dell's current language guidelines. Dell plans to update the document oversubsequent future releases to revise these words accordingly.This document may contain language from third party content that is not under Dell's control and is not consistent with Dell's current guidelines for Dell'sown content. When such third party content is updated by the relevant third parties, this document will be revised accordingly.The information in this publication is provided “as is.” Dell Inc. makes no representations or warranties of any kind with respect to the information in thispublication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.Use, copying, and distribution of any software described in this publication requires an applicable software license.Copyright 2021 Dell Inc. or its subsidiaries. All Rights Reserved. Dell Technologies, Dell, EMC, Dell EMC and other trademarks are trademarks of DellInc. or its subsidiaries. Other trademarks may be trademarks of their respective owners. [6/9/2021] [Technical White Paper] [H18663]2Dell EMC PowerStore: Apache Spark Solution Guide H18663

Table of contentsTable of contentsRevisions.2Acknowledgments .2Table of contents .3Executive summary.5Audience .51Introduction .61.1PowerStore overview .61.2Apache Spark overview .61.3Apache Hadoop Distributed File System overview .81.4The advantages of Spark and Hadoop on PowerStore .91.4.1 AppsON brings applications closer to the infrastructure and storage .91.4.2 Agile infrastructure, flexible scaling on a high-performing storage and compute platform.91.4.3 Mission-critical high availability and fault-tolerant platform .91.4.4 PowerStore inline data reduction reduces storage consumption and cost .101.4.5 Efficient and convenient snapshot data backup .101.4.6 Secure data protection with ease of mind .101.4.7 Unified infrastructure and services management .101.4.8 Spark value and future expansion .111.5Terminology .112Sizing considerations .133Deploying a Spark cluster with HDFS .143.1Planning for the virtual machines that run Spark and Hadoop .143.1.1 PowerStore X model appliance .153.1.2 PowerStore storage containers and virtual volumes .163.1.3 Creating virtual machines on PowerStore X model appliance .183.2Installation and configuration of Apache Hadoop .223.2.1 Installing Hadoop .223.2.2 Configuring Hadoop HDFS cluster .243.3Installation and configuration of Apache Spark .283.3.1 Installing Spark .283.3.2 Configuring a Spark standalone cluster .303.3.3 Configuring Spark History Server .344Testing Spark with Spark-bench .364.13Installing Spark-bench tool .36Dell EMC PowerStore: Apache Spark Solution Guide H18663

Table of contents4.1.1 Installation prerequisites .364.1.2 Installing Spark-bench .374.2Running Spark-bench workloads.384.2.1 Generate KMeans dataset.394.2.2 Run KMeans workload .404.2.3 Spark memory and CPU cores .424.2.4 Spark network timeout .444.2.5 Monitoring Spark applications .445Interactive analysis of PowerStore metrics with Jupyter notebook .485.1Installing prerequisite software .485.1.1 JupyterLab .485.1.2 Python modules .495.1.3 PowerStore command-line interface (CLI) .49Extract PowerStore space metrics .505.3Import PowerStore space metrics into HDFS.505.4Perform analysis on the PowerStore space metrics.546Automation .587Data protection .597.1Snapshots and thin clones .597.2AppSync .607.3RecoverPoint for Virtual Machines .607.4Hadoop distributed copy and HDFS snapshots .60AConfigure passwordless SSH.61BPython codes .62C45.2B.1Import .csv files into HDFS .62B.2Analyze PowerStore space metrics .63Additional resources .67C.1Technical support and resources .67C.2Other resources .67C.3Ansible resources .67Dell EMC PowerStore: Apache Spark Solution Guide H18663

Executive summaryExecutive summaryApache Spark has seen tremendous growth in the past few years. It is the leading platform for big datadistributed processing because of its innovation, speed, and developer-friendly framework. This documentoffers a high-level overview of the Dell EMC PowerStore appliance and the benefits of running ApacheSpark and Hadoop HDFS on PowerStore. The document also provides installation, configuration, testing,and a simple use case for Spark and HDFS on PowerStore.AudienceThis document is intended for IT administrators, storage architects, partners, and Dell Technologies employees. This audience also includes individuals who may evaluate, acquire, manage, operate, or design aDell EMC networked storage environment using PowerStore systems.5Dell EMC PowerStore: Apache Spark Solution Guide H18663

Introduction1IntroductionThis document was developed using the PowerStore X model appliance, Apache Spark, Apache HDFS, andRed Hat Enterprise Linux . This section provides an overview for PowerStore, Apache Spark, and ApacheHDFS.1.1PowerStore overviewPowerStore achieves new levels of operational simplicity and agility. It uses a container-based microservicesarchitecture, advanced storage technologies, and integrated machine learning to unlock the power of yourdata. PowerStore is a versatile platform with a performance-centric design that delivers multidimensionalscale, always-on data reduction, and support for next-generation media.PowerStore brings the simplicity of public cloud to on-premises infrastructure, streamlining operations with anintegrated machine-learning engine and seamless automation. It also offers predictive analytics to easilymonitor, analyze, and troubleshoot the environment. PowerStore is highly adaptable, providing the flexibility tohost specialized workloads directly on the appliance and modernize infrastructure without disruption. It alsooffers investment protection through flexible payment solutions and data-in-place upgrades.The PowerStore platform is available in two different product models: PowerStore T models and PowerStoreX models. PowerStore T models are bare-metal, unified storage arrays which can service block, file, andVMware vSphere Virtual Volumes (vVols) resources along with numerous data services and efficiencies.PowerStore X model appliances enable running applications directly on the appliance through the AppsONcapability. A native VMware ESXi layer runs embedded applications alongside the PowerStore operatingsystem, all in the form of virtual machines. This feature adds to the traditional storage functionality ofPowerStore X model appliances, and supports serving external block and vVol storage to servers withmultiple protocols.For more information about PowerStore T models and PowerStore X models, see the documents Dell EMCPowerStore: Introduction to the Platform and Dell EMC PowerStore Virtualization Infrastructure Guide.1.2Apache Spark overviewApache Spark is an open-source distributed processing engine designed to be high performing, scalable, andcapable of computing massive amount of data. It can perform a wide range of analytic tasks such as SQLqueries, streaming, and machine learning.Spark supports several popular programming languages such as Java, Scala, Python, and R and provides aunified and consistent set of APIs for these programming languages. Also, it has an extensive set of librariesfor SQL (dataframes), machine learning (MLlib), Spark Streaming, and GraphX. These capabilities allowdevelopers to easily build Spark applications by combining different APIs, libraries, and functions.Spark is built for speed and high performance. Spark loads the entire dataset in memory on the cluster andperforms computation on it. The data is kept in memory to minimize disk access. Spark performsexceptionally well for iterative computations that require passing the same data multiple times. Machinelearning is a great example of such iterative computations.6Dell EMC PowerStore: Apache Spark Solution Guide H18663

IntroductionSpark supports a wide range of storage systems such as local file systems, Apache Hadoop HDFS, ApacheHive, Apache HBase, Cassandra, and more. Figure 1 shows the Spark components in blue. For moreinformation, see the corresponding documentation on sJavaPythonStreamingCore Spark CoreHDFSLocal filesystemHBaseHiveCassandraOthersSpark componentsSpark supports several cluster managers including Spark standalone cluster, Apache Hadoop YARN, ApacheMesos, and Kubernetes. This paper focuses on the Spark standalone cluster.7Dell EMC PowerStore: Apache Spark Solution Guide H18663

IntroductionA Spark standalone cluster (see Figure 2) consists of one master node and multiple worker nodes. Thecluster manager on the master node manages the cluster resources, such as CPUs and memory, andassigns application tasks to the worker nodes. A Spark application is a driver program that establishes aSpark session to the cluster manager and requests resources to perform multiple tasks on the worker nodes.Executors are the Java VMs (JVM) on the worker nodes that perform the tasks and report the status andresults back to the driver program.Spark Worker NodeTaskApplicationDriver kTaskSpark Worker NodeTaskSpark Master NodeExecutorSpark ClusterManagerTaskTaskTaskExecutorTaskTaskSpark standalone cluster overview1.3Apache Hadoop Distributed File System overviewApache Hadoop is an open-source software suite and framework for big-data processing. Hadoop DistributedFile System (HDFS) is one of the core components of Hadoop. It is a distributed file system designed to bemassively scalable, fault tolerant, and have high throughput. HDFS can scale up to hundreds of servers andsupports large files. It is well suited for applications, such as Spark, that require access to large datasets.HDFS files are divided into blocks and stored on multiple servers. Data block replication (replication factor)places replicas of the block across the cluster to increase data availability and read performance.Other Hadoop core components include the following: Hadoop YARN: A cluster and resource managerHadoop MapReduce: A distributed parallel data processing systemHadoop Common: Core common utilities shared by other modulesHDFS provides the persistent storage and data source for Spark. The configuration in this document does notuse Hadoop YARN and MapReduce.A Hadoop cluster consists of a NameNode and multiple DataNodes. The NameNode maintains the files anddirectory information of the distributed file system and tracks where the data blocks are located within thecluster. The DataNodes store the data blocks and the replicas in the local file systems.8Dell EMC PowerStore: Apache Spark Solution Guide H18663

Introduction1.4The advantages of Spark and Hadoop on PowerStoreBoth Spark and HDFS share a similar distributed architecture that requires a powerful, highly scalable, andflexible infrastructure. PowerStore is performance-optimized for any workload, and its adaptable scale-up andscale-out architecture complements the distributed model of these applications. This section highlights thePowerStore features that benefit and extend the application environment.1.4.1AppsON brings applications closer to the infrastructure and storageBringing applications closer to data increases density and simplifies infrastructure operations. ThePowerStore AppsON capability integrates with VMware vSphere , resulting in streamlined management inwhich storage resources plug directly into the virtualization layer. Using VMware as the onboard applicationenvironment results in unmatched simplicity, since support is inherently available for any standard VM-basedapplications. When a new PowerStore X model is deployed, the VASA provider is automatically registered,and the datastore is created, eliminating manual steps and saving time. PowerStore seamlessly integrates theVMware ESXi software into the same hardware. Two ESXi nodes are embedded inside the appliance whichhas direct access to the same storage resources. This close integration allows applications such as Sparkand Hadoop to take full advantage of server and storage virtualization with simplified deployment andmanagement. AppsON is available on the PowerStore X model exclusively.1.4.2Agile infrastructure, flexible scaling on a high-performing storage and computeplatformPowerStore provides flexible scaling with ease of management that compliments the Spark and Hadoopscale-up and scale-out distribution model. The integrated hypervisor dynamically scales up the cluster nodeswhen the workload requires it, while you can rapidly provision new nodes on the same or on other appliancesin a different location.Big-data applications require large amount of data and computational power for analytics, machining learning,model training, and other workloads. With a PowerStore appliance, administrators can scale up the storagecapacity by adding disks and disk expansion enclosures without service interruption at any time. You can alsoconfigure multiple PowerStore appliances into a cluster to increase CPUs, memory, storage capacity, andfront-end connectivity. Clustering simplifies and centralizes the management of multiple appliances fromPowerStore Manager, a single HTML5-based management interface. A cluster can consist of up to fourPowerStore T appliances or four PowerStore X appliances. Each appliance within the cluster can havedifferent configurations of CPUs, memory, NVMe drives, and expansion enclosures.The NVMe architecture is designed for the next-generation NVMe-based storage and takes advantage of lowoverhead NVRAM cache. PowerStore is engineered to handle the most demanding workloads.1.4.3Mission-critical high availability and fault-tolerant platformPowerStore provides a high level of stability and reliability for Spark and HDFS. At the hardware level,PowerStore is highly available and fault tolerant. It monitors the storage devices continuously, and itautomatically relocates data from failing devices to avoid data loss. The PowerStore X model applianceincludes two ESXi nodes and redundant hardware components. The nondisruptive upgrade (NDU) featurefurther increases overall PowerStore availability. The updates take place on the nodes in a rolling fashion.NDU supports PowerStore software releases, hotfixes, and hardware and disk firmware.The dynamic resiliency engine (DRE) feature automatically protects and repairs the underlying storage fromdrive failures. Administrators are not required to manually configure or manage the protection settings for thedrives.9Dell EMC PowerStore: Apache Spark Solution Guide H18663

IntroductionTo support high-value business workloads and service requirements on the application level, it is essential toprotect and ensure the availability of the Spark and HDFS nodes. The Spark master node and the HDFSNameNode are central to all operations of the application in the cluster. When they become inaccessible, allapplications and storage operations are affected. Also, if a DataNode is not reachable for an extended period,the Namenode determines the blocks on the failed node and starts making copies of the blocks from otherreplicas until the replication factor is met.With standard VMware vSphere High Availability (HA) integrated into PowerStore, the embedded VMwareESXi hypervisor automatically restarts or migrates failed Spark and HDFS servers to a different ESXi node.This process resumes operations by helping restore Spark and HDFS to its full operation capacity, and itminimizes the chance of the DataNodes being marked dead.To achieve an even higher level of redundancy and application availability, you can deploy the Spark clusterand HDFS cluster across multiple PowerStore appliances in different racks, floors, or locations. PowerStoreimproves application availability and provides unparalleled flexibility and mobility to relocate and move acrossdata centers and appliances.1.4.4PowerStore inline data reduction reduces storage consumption and costData science and big-data applications continuously pull in a tremendous amount of data from varioussources. To help reduce storage consumption and cost, the PowerStore inline data-reduction featuremaximizes space savings by combining both software data deduplication and hardware compression. Datareduction works seamlessly in the background, is always enabled, and cannot be disabled. Since datareduction is always active in PowerStore, enabling application or operating system compression may notprovide additional savings.1.4.5Efficient and convenient snapshot data backupPowerStore provides Spark and HDFS with extra data protection through array-based snapshots. APowerStore snapshot is a point-in-time copy of the data. The snapshots are space efficient and requireseconds to create. Snapshot data are exact copies of the source data and can be used for application testing,backup, or DevOps. Because of the tight integration with VMware vSphere, PowerStore can take vVol VMsnapshots directly from PowerStore Manager using a protection policy schedule or on demand. You can viewthe VM snapshot information in PowerStore and vCenter.1.4.6Secure data protection with ease of mindWith high-value data driving business applications, data security is a top concern for all organizations. Lost orstolen data can seriously damage the reputation of an organization and result in huge financial costs and lossof customer trust. Dell Technologies engineered PowerStore with Data at Rest Encryption (D@RE) whichuses self-encrypting drives and supports array-based, self-managed keys. When D@RE is activated, data isencrypted as it is written to disk using the 256-bit Advanced Encryption Standard (AES). PowerStore D@REprovides this data security benefit to Spark applications while eliminating application overhead, performancepenalties, and administrative overhead that is typically associated with software-based solutions.1.4.7Unified infrastructure and services managementPowerStore provides deep integration with VMware management tools and services with Dell EMC VirtualStorage Integrator (VSI), VMware vRealize Operations Manager (vROps), VMware vRealize Orchestrator(vRO), and VMware Storage Replication Adapter (SRA). You can easily incorporate ESXi on PowerStore Xmodels into your existing vCenter and manage all VMware infrastructure and services from a unifiedmanagement platform.10Dell EMC PowerStore: Apache Spark Solution Guide H18663

Introduction1.4.8Spark value and future expansionBig-data platforms such as Spark and Hadoop create enormous value for organizations. As the value andscale of this data grows, it is critical to have a future-proof platform that is easy to manage. The platform mustalso provide technical innovation for future growth, and support the application architecture. Spark andHadoop on PowerStore bring IT organizations the ability to be agile, efficient, and responsive to businessdemands.1.5TerminologyThe following terms are used with PowerStore.Appliance: Solution containing a base enclosure and attached expansion enclosures. The size of anappliance could be only the base enclosure or the base enclosure plus expansion enclosures.PowerStore node: Storage controller that provides the processing resources for performing storageoperations and servicing I/O between storage and hosts. Each PowerStore appliance contains two nodes.Base enclosure: Enclosure containing both nodes (node A and node B) and 25 NVMe drive slotsExpansion enclosure: Enclosures that can be attached to a base enclosure to provide additional storage.Fibre Channel (FC) protocol: Protocol used to perform SCSI commands over a Fibre Channel network.iSCSI: Provides a mechanism for accessing block-level data storage over network connections.NDU: A nondisruptive upgrade (NDU) updates PowerStore and maximizes its availability by performing rollingupdates. This includes updates for PowerStore software releases, hotfixes, and hardware and disk firmware.NVMe: Non-Volatile Memory Express is a communication interface and driver for accessing nonvolatilestorage media such as solid-state drives (SSD) and SCM drives through the PCIe bus.NVMe over Fibre Channel (NVMe-FC): Allows hosts to access storage systems across a networkfabric with the NVMe protocol using Fibre Channel as the underlying transport.NVRAM: Nonvolatile random-access memory is persistent random-access memory that retains data withoutan electrical charge. NVRAM drives are used in PowerStore appliance as additional system write caching.Volume: A block-level storage device that can be shared out using a protocol such as iSCSI or FibreChannel.Snapshot: A point-in-time view of data that is stored on a storage resource. You can recover files from asnapshot, restore a storage resource from a snapshot, or provide access to a host.Storage container: A VMware term for a logical entity that consists of one or more capability profiles andtheir storage limits. This entity is known as a vVol datastore when it is mounted in vSphere.PCIe: Peripheral Component Interconnect Express is a high-speed serial computer expansion bus standard.PowerStore Manager: An HTML5 management interface for creating storage resources and configuring andscheduling protection of stored data on PowerStore. PowerStore Manager can be used for all management ofPowerStore native replication.11Dell EMC PowerStore: Apache Spark Solution Guide H18663

IntroductionPowerStore T model: Container-based storage system that is running on purpose-built hardware. Thisstorage system supports unified (block and file) workloads, or block-optimized workloads.PowerStore X model: Container-based storage system that runs inside a virtual machine that is deployed ona VMware hypervisor. Besides offering block-optimized workloads, PowerStore also allows you to deployapplications directly on the array.RecoverPoint for Virtual Machines: Protects VMs in a VMware environment with VM-level granularity

1.3 Apache Hadoop Distributed File System overview Apache Hadoop is an open-source software suite and framework for big-data processing. Hadoop Distributed File System (HDFS) is one of the core components of Hadoop. It is a distributed file system designed to be massively scalable, fault tolerant, and have high throughput.