Apache Hadoop YARN


Apache Hadoop YARN

The Addison-Wesley Data and Analytics Series

Visit informit.com/awdataseries for a complete list of available publications.

The Addison-Wesley Data and Analytics Series provides readers with practical knowledge for solving problems and answering questions with data. Titles in this series primarily focus on three areas:

1. Infrastructure: how to store, move, and manage data
2. Algorithms: how to mine intelligence or make predictions based on data
3. Visualizations: how to represent data and insights in a meaningful and compelling way

The series aims to tie all three of these areas together to help the reader build end-to-end systems for fighting spam; making recommendations; building personalization; detecting trends, patterns, or problems; and gaining insight from the data exhaust of systems and user interactions.

Make sure to connect with us! informit.com/socialconnect

Apache Hadoop YARN
Moving beyond MapReduce and Batch Processing with Apache Hadoop 2

Arun C. Murthy
Vinod Kumar Vavilapalli
Doug Eadline
Joseph Niemiec
Jeff Markham

Upper Saddle River, NJ • Boston • Indianapolis • San Francisco • New York • Toronto • Montreal • London • Munich • Paris • Madrid • Capetown • Sydney • Tokyo • Singapore • Mexico City

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of a trademark claim, the designations have been printed with initial capital letters or in all capitals.

The authors and publisher have taken care in the preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein.

For information about buying this title in bulk quantities, or for special sales opportunities (which may include electronic versions; custom cover designs; and content particular to your business, training goals, marketing focus, or branding interests), please contact our corporate sales department at corpsales@pearsoned.com or (800) 382-3419.

For government sales inquiries, please contact governmentsales@pearsoned.com.

For questions about sales outside the United States, please contact international@pearsoned.com.

Visit us on the Web: informit.com/aw

Library of Congress Cataloging-in-Publication Data
Murthy, Arun C.
Apache Hadoop YARN : moving beyond MapReduce and batch processing with Apache Hadoop 2 / Arun C. Murthy, Vinod Kumar Vavilapalli, Doug Eadline, Joseph Niemiec, Jeff Markham.
pages cm
Includes index.
ISBN 978-0-321-93450-5 (pbk. : alk. paper)
1. Apache Hadoop. 2. Electronic data processing—Distributed processing. I. Title.
QA76.9.D5M97 2014
004'.36—dc23
2014003391

Copyright 2014 Hortonworks Inc.

Apache, Apache Hadoop, Hadoop, and the Hadoop elephant logo are trademarks of The Apache Software Foundation. Used with permission. No endorsement by The Apache Software Foundation is implied by the use of these marks.

Hortonworks is a trademark of Hortonworks, Inc., registered in the U.S. and other countries.

All rights reserved.
Printed in the United States of America. This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. To obtain permission to use material from this work, please submit a written request to Pearson Education, Inc., Permissions Department, One Lake Street, Upper Saddle River, New Jersey 07458, or you may fax your request to (201) 236-3290.

ISBN-13: 978-0-321-93450-5
ISBN-10: 0-321-93450-4

Text printed in the United States on recycled paper at RR Donnelley in Crawfordsville, Indiana.
First printing, March 2014

Contents

Foreword by Raymie Stata   xiii
Foreword by Paul Dix   xv
Preface   xvii
Acknowledgments   xxi
About the Authors   xxv

1 Apache Hadoop YARN: A Brief History and Rationale   1
  Introduction   1
  Apache Hadoop   2
  Phase 0: The Era of Ad Hoc Clusters   3
  Phase 1: Hadoop on Demand   3
    HDFS in the HOD World   5
    Features and Advantages of HOD   6
    Shortcomings of Hadoop on Demand   7
  Phase 2: Dawn of the Shared Compute Clusters   9
    Evolution of Shared Clusters   9
    Issues with Shared MapReduce Clusters   15
  Phase 3: Emergence of YARN   18
  Conclusion   20

2 Apache Hadoop YARN Install Quick Start   21
  Getting Started   22
  Steps to Configure a Single-Node YARN Cluster   22
    Step 1: Download Apache Hadoop   22
    Step 2: Set JAVA_HOME   23
    Step 3: Create Users and Groups   23
    Step 4: Make Data and Log Directories   23
    Step 5: Configure core-site.xml   24
    Step 6: Configure hdfs-site.xml   24
    Step 7: Configure mapred-site.xml   25
    Step 8: Configure yarn-site.xml   25
    Step 9: Modify Java Heap Sizes   26
    Step 10: Format HDFS   26
    Step 11: Start the HDFS Services   27

    Step 12: Start YARN Services   28
    Step 13: Verify the Running Services Using the Web Interface   28
  Run Sample MapReduce Examples   30
  Wrap-up   31

3 Apache Hadoop YARN Core Concepts   33
  Beyond MapReduce   33
  The MapReduce Paradigm   35
  Apache Hadoop MapReduce   35
  The Need for Non-MapReduce Workloads   37
    Addressing Scalability   37
    Improved Utilization   38
    User Agility   38
  Apache Hadoop YARN   38
  YARN Components
  Resource Model   41
  ResourceRequests and Containers   41
  Container Specification   42
  Wrap-up   42

4 Functional Overview of YARN Components   43
  Architecture Overview   43
  ResourceManager   45
  YARN Scheduling Components   46
    FIFO Scheduler   46
    Capacity Scheduler   47
    Fair Scheduler
  ApplicationMaster   50
  YARN Resource Model   50
  Client Resource Request   51
  ApplicationMaster Container Allocation   51
  ApplicationMaster–ContainerManager Communication   52

  Managing Application Dependencies   53
    LocalResources Definitions   54
    LocalResource Timestamps   55
    LocalResource Types   55
    LocalResource Visibilities   56
    Lifetime of LocalResources   57
  Wrap-up   57

5 Installing Apache Hadoop YARN   59
  The Basics   59
  System Preparation   60
    Step 1: Install EPEL and pdsh   60
    Step 2: Generate and Distribute ssh Keys   61
  Script-based Installation of Hadoop 2   62
    JDK Options   62
    Step 1: Download and Extract the Scripts   63
    Step 2: Set the Script Variables   63
    Step 3: Provide Node Names   64
    Step 4: Run the Script   64
    Step 5: Verify the Installation   65
  Script-based Uninstall   68
  Configuration File Processing   68
  Configuration File Settings
    -site.xml   69
    yarn-site.xml   70
  Start-up Scripts   71
  Installing Hadoop with Apache Ambari   71
  Performing an Ambari-based Hadoop Installation   72
    Step 1: Check Requirements   73
    Step 2: Install the Ambari Server   73
    Step 3: Install and Start Ambari Agents   73
    Step 4: Start the Ambari Server   74
    Step 5: Install an HDP2.X Cluster   75
  Wrap-up   84

6 Apache Hadoop YARN Administration   85
  Script-based Configuration   85
  Monitoring Cluster Health: Nagios   90
    Monitoring Basic Hadoop Services   92
    Monitoring the JVM   95
  Real-time Monitoring: Ganglia   97
  Administration with Ambari   99
  JVM Analysis   103
  Basic YARN Administration   106
    YARN Administrative Tools   106
    Adding and Decommissioning YARN Nodes   107
    Capacity Scheduler Configuration   108
    YARN WebProxy   108
    Using the JobHistoryServer   108
    Refreshing User-to-Groups Mappings   108
    Refreshing Superuser Proxy Groups Mappings   109
    Refreshing ACLs for Administration of ResourceManager   109
    Reloading the Service-level Authorization Policy File   109
    Managing YARN Jobs   109
    Setting Container Memory   110
    Setting Container Cores   110
    Setting MapReduce Properties   110
  User Log Management   111
  Wrap-up   114

7 Apache Hadoop YARN Architecture Guide
  Overview of the ResourceManager Components   118
  Client Interaction with the ResourceManager   118
  Application Interaction with the ResourceManager   120

  Interaction of Nodes with the ResourceManager   121
  Core ResourceManager Components   122
  Security-related Components in the ResourceManager
  Overview of the NodeManager Components   128
  NodeManager Components   129
  NodeManager Security Components   136
  Important NodeManager Functions
  Liveliness   139
  Resource Requirements   140
  Scheduling   140
  Scheduling Protocol and Locality   142
  Launching Containers   145
  Completed Containers   146
  ApplicationMaster Failures and Recovery   146
  Coordination and Output Commit   146
  Information for Clients   147
  Security   147
  Cleanup on ApplicationMaster Exit   147
  YARN Containers   148
    Container Environment   148
    Communication with the ApplicationMaster   149
  Summary for Application-writers   150
  Wrap-up   151

8 Capacity Scheduler in YARN   153
  Introduction to the Capacity Scheduler   153
    Elasticity with Multitenancy   154
    Security   154
    Resource Awareness   154
    Granular Scheduling   154
    Locality   155
    Scheduling Policies   155
  Capacity Scheduler Configuration   155

  Queues   156
  Hierarchical Queues   156
    Key Characteristics   157
    Scheduling Among Queues   157
    Defining Hierarchical Queues   158
  Queue Access Control   159
  Capacity Management with Queues   160
  User Limits   163
  Reservations   166
  State of the Queues   167
  Limits on Applications   168
  User Interface   169
  Wrap-up   169

9 MapReduce with Apache Hadoop YARN   171
  Running Hadoop YARN MapReduce Examples   171
    Listing Available Examples   171
    Running the Pi Example   172
    Using the Web GUI to Monitor Examples   174
    Running the Terasort Test   180
    Run the TestDFSIO Benchmark   180
  MapReduce Compatibility   181
  The MapReduce ApplicationMaster   181
    Enabling Application Master Restarts   182
    Enabling Recovery of Completed Tasks   182
    The JobHistory Server   182
  Calculating the Capacity of a Node   182
  Changes to the Shuffle Service   184
  Running Existing Hadoop Version 1 Applications   184
    Binary Compatibility of org.apache.hadoop.mapred APIs   184
    Source Compatibility of org.apache.hadoop.mapreduce APIs   185
    Compatibility of Command-line Scripts   185
    Compatibility Tradeoff Between MRv1 and Early MRv2 (0.23.x) Applications   185

  Running MapReduce Version 1 Existing Code   187
    Running Apache Pig Scripts on YARN   187
    Running Apache Hive Queries on YARN   187
    Running Apache Oozie Workflows on YARN   188
  Advanced Features   188
    Uber Jobs   188
    Pluggable Shuffle and Sort   188
  Wrap-up   190

10 Apache Hadoop YARN Application Example   191
  The YARN Client   191
  The ApplicationMaster   208
  Wrap-up   226

11 Using Apache Hadoop YARN Distributed-Shell   227
  Using the YARN Distributed-Shell   227
    A Simple Example   228
    Using More Containers   229
    Distributed-Shell Examples with Shell Arguments   230
  Internals of the Distributed-Shell   232
    Application Constants   232
    Client   233
    ApplicationMaster   236
    Final Containers   240
  Wrap-up   240

12 Apache Hadoop YARN Frameworks   241
  Distributed-Shell   241
  Hadoop MapReduce   241
  Apache Tez   242
  Apache Giraph   242
  Hoya: HBase on YARN   243
  Dryad on YARN   243
  Apache Spark   244
  Apache Storm   244

  REEF: Retainable Evaluator Execution Framework   245
  Hamster: Hadoop and MPI on the Same Cluster   245
  Wrap-up   245

A Supplemental Content and Code Downloads   247
  Available Downloads   247

B YARN Installation Scripts   256
  hadoop-xml-conf.sh   258

C YARN Administration Scripts   263
  configure-hadoop2.sh   263

D Nagios Modules   269
  check_resource_manager.sh   269
  check_data_node.sh   271
  check_resource_manager_old_space_pct.sh   272

E Resources and Additional Information   277

F HDFS Quick Reference   279
  Quick Command Reference   279
  Starting HDFS and the HDFS Web GUI   280
  Get an HDFS Status Report   280
  Perform an FSCK on HDFS   281
  General HDFS Commands   281
  List Files in HDFS   282
  Make a Directory in HDFS   283
  Copy Files to HDFS   283
  Copy Files from HDFS   284
  Copy Files within HDFS   284
  Delete a File within HDFS   284
  Delete a Directory in HDFS   284
  Decommissioning HDFS Nodes   284

Index   287

Foreword by Raymie Stata

William Gibson was fond of saying: "The future is already here—it's just not very evenly distributed." Those of us who have been in the web search industry have had the privilege—and the curse—of living in the future of Big Data when it wasn't distributed at all. What did we learn? We learned to measure everything. We learned to experiment. We learned to mine signals out of unstructured data. We learned to drive business value through data science. And we learned that, to do these things, we needed a new data-processing platform fundamentally different from the business intelligence systems being developed at the time.

The future of Big Data is rapidly arriving for almost all industries. This is driven in part by widespread instrumentation of the physical world—vehicles, buildings, and even people are spitting out log streams not unlike the weblogs we know and love in cyberspace. Less obviously, digital records—such as digitized government records, digitized insurance policies, and digital medical records—are creating a trove of information not unlike the webpages crawled and parsed by search engines. It's no surprise, then, that the tools and techniques pioneered first in the world of web search are finding currency in more and more industries. And the leading such tool, of course, is Apache Hadoop.

But Hadoop is close to ten years old. Computing infrastructure has advanced significantly in this decade. If Hadoop was to maintain its relevance in the modern Big Data world, it needed to advance as well. YARN represents just the advancement needed to keep Hadoop relevant.

As described in the historical overview provided in this book, for the majority of Hadoop's existence, it supported a single computing paradigm: MapReduce. On the compute servers we had at the time, horizontal scaling—throwing more server nodes at a problem—was the only way the web search industry could hope to keep pace with the growth of the web.
