Apache Hadoop YARN
The Addison-Wesley Data and Analytics Series

Visit informit.com/awdataseries for a complete list of available publications.

The Addison-Wesley Data and Analytics Series provides readers with practical knowledge for solving problems and answering questions with data. Titles in this series primarily focus on three areas:

1. Infrastructure: how to store, move, and manage data
2. Algorithms: how to mine intelligence or make predictions based on data
3. Visualizations: how to represent data and insights in a meaningful and compelling way

The series aims to tie all three of these areas together to help the reader build end-to-end systems for fighting spam; making recommendations; building personalization; detecting trends, patterns, or problems; and gaining insight from the data exhaust of systems and user interactions.

Make sure to connect with us!
informit.com/socialconnect
Apache Hadoop YARN
Moving beyond MapReduce and Batch Processing with Apache Hadoop 2

Arun C. Murthy
Vinod Kumar Vavilapalli
Doug Eadline
Joseph Niemiec
Jeff Markham

Upper Saddle River, NJ • Boston • Indianapolis • San Francisco
New York • Toronto • Montreal • London • Munich • Paris • Madrid
Capetown • Sydney • Tokyo • Singapore • Mexico City
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of a trademark claim, the designations have been printed with initial capital letters or in all capitals.

The authors and publisher have taken care in the preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein.

For information about buying this title in bulk quantities, or for special sales opportunities (which may include electronic versions; custom cover designs; and content particular to your business, training goals, marketing focus, or branding interests), please contact our corporate sales department at corpsales@pearsoned.com or (800) 382-3419.

For government sales inquiries, please contact governmentsales@pearsoned.com.

For questions about sales outside the United States, please contact international@pearsoned.com.

Visit us on the Web: informit.com/aw

Library of Congress Cataloging-in-Publication Data

Murthy, Arun C.
Apache Hadoop YARN : moving beyond MapReduce and batch processing with Apache Hadoop 2 / Arun C. Murthy, Vinod Kumar Vavilapalli, Doug Eadline, Joseph Niemiec, Jeff Markham.
pages cm
Includes index.
ISBN 978-0-321-93450-5 (pbk. : alk. paper)
1. Apache Hadoop. 2. Electronic data processing—Distributed processing. I. Title.
QA76.9.D5M97 2014
004'.36—dc23
2014003391

Copyright © 2014 Hortonworks Inc.

Apache, Apache Hadoop, Hadoop, and the Hadoop elephant logo are trademarks of The Apache Software Foundation. Used with permission. No endorsement by The Apache Software Foundation is implied by the use of these marks.

Hortonworks is a trademark of Hortonworks, Inc., registered in the U.S. and other countries.

All rights reserved.
Printed in the United States of America. This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. To obtain permission to use material from this work, please submit a written request to Pearson Education, Inc., Permissions Department, One Lake Street, Upper Saddle River, New Jersey 07458, or you may fax your request to (201) 236-3290.

ISBN-13: 978-0-321-93450-5
ISBN-10: 0-321-93450-4

Text printed in the United States on recycled paper at RR Donnelley in Crawfordsville, Indiana.
First printing, March 2014
Contents

Foreword by Raymie Stata
Foreword by Paul Dix
Preface
Acknowledgments
About the Authors

1 Apache Hadoop YARN: A Brief History and Rationale
    Introduction
    Apache Hadoop
    Phase 0: The Era of Ad Hoc Clusters
    Phase 1: Hadoop on Demand
        HDFS in the HOD World
        Features and Advantages of HOD
        Shortcomings of Hadoop on Demand
    Phase 2: Dawn of the Shared Compute Clusters
        Evolution of Shared Clusters
        Issues with Shared MapReduce Clusters
    Phase 3: Emergence of YARN
    Conclusion

2 Apache Hadoop YARN Install Quick Start
    Getting Started
    Steps to Configure a Single-Node YARN Cluster
        Step 1: Download Apache Hadoop
        Step 2: Set JAVA_HOME
        Step 3: Create Users and Groups
        Step 4: Make Data and Log Directories
        Step 5: Configure core-site.xml
        Step 6: Configure hdfs-site.xml
        Step 7: Configure mapred-site.xml
        Step 8: Configure yarn-site.xml
        Step 9: Modify Java Heap Sizes
        Step 10: Format HDFS
        Step 11: Start the HDFS Services
        Step 12: Start YARN Services
        Step 13: Verify the Running Services Using the Web Interface
    Run Sample MapReduce Examples
    Wrap-up

3 Apache Hadoop YARN Core Concepts
    Beyond MapReduce
    The MapReduce Paradigm
    Apache Hadoop MapReduce
    The Need for Non-MapReduce Workloads
        Addressing Scalability
        Improved Utilization
        User Agility
    Apache Hadoop YARN
    YARN Components
        Resource Model
        ResourceRequests and Containers
        Container Specification
    Wrap-up

4 Functional Overview of YARN Components
    Architecture Overview
    ResourceManager
    YARN Scheduling Components
        FIFO Scheduler
        Capacity Scheduler
        Fair Scheduler
    ApplicationMaster
    YARN Resource Model
        Client Resource Request
        ApplicationMaster Container Allocation
        ApplicationMaster–ContainerManager Communication
    Managing Application Dependencies
        LocalResources Definitions
        LocalResource Timestamps
        LocalResource Types
        LocalResource Visibilities
        Lifetime of LocalResources
    Wrap-up

5 Installing Apache Hadoop YARN
    The Basics
    System Preparation
        Step 1: Install EPEL and pdsh
        Step 2: Generate and Distribute ssh Keys
    Script-based Installation of Hadoop 2
        JDK Options
        Step 1: Download and Extract the Scripts
        Step 2: Set the Script Variables
        Step 3: Provide Node Names
        Step 4: Run the Script
        Step 5: Verify the Installation
    Script-based Uninstall
    Configuration File Processing
    Configuration File Settings
        -site.xml
        yarn-site.xml
    Start-up Scripts
    Installing Hadoop with Apache Ambari
    Performing an Ambari-based Hadoop Installation
        Step 1: Check Requirements
        Step 2: Install the Ambari Server
        Step 3: Install and Start Ambari Agents
        Step 4: Start the Ambari Server
        Step 5: Install an HDP 2.X Cluster
    Wrap-up
6 Apache Hadoop YARN Administration
    Script-based Configuration
    Monitoring Cluster Health: Nagios
        Monitoring Basic Hadoop Services
        Monitoring the JVM
    Real-time Monitoring: Ganglia
    Administration with Ambari
    JVM Analysis
    Basic YARN Administration
        YARN Administrative Tools
        Adding and Decommissioning YARN Nodes
        Capacity Scheduler Configuration
        YARN WebProxy
        Using the JobHistoryServer
        Refreshing User-to-Groups Mappings
        Refreshing Superuser Proxy Groups Mappings
        Refreshing ACLs for Administration of ResourceManager
        Reloading the Service-level Authorization Policy File
        Managing YARN Jobs
        Setting Container Memory
        Setting Container Cores
        Setting MapReduce Properties
    User Log Management
    Wrap-up

7 Apache Hadoop YARN Architecture Guide
    Overview of the ResourceManager Components
        Client Interaction with the ResourceManager
        Application Interaction with the ResourceManager
        Interaction of Nodes with the ResourceManager
        Core ResourceManager Components
        Security-related Components in the ResourceManager
    Overview of the NodeManager Components
        NodeManager Components
        NodeManager Security Components
        Important NodeManager Functions
    Liveliness
    Resource Requirements
    Scheduling
        Scheduling Protocol and Locality
    Launching Containers
    Completed Containers
    ApplicationMaster Failures and Recovery
    Coordination and Output Commit
    Information for Clients
    Security
    Cleanup on ApplicationMaster Exit
    YARN Containers
        Container Environment
        Communication with the ApplicationMaster
    Summary for Application-writers
    Wrap-up

8 Capacity Scheduler in YARN
    Introduction to the Capacity Scheduler
        Elasticity with Multitenancy
        Security
        Resource Awareness
        Granular Scheduling
        Locality
        Scheduling Policies
    Capacity Scheduler Configuration
    Queues
    Hierarchical Queues
        Key Characteristics
        Scheduling Among Queues
        Defining Hierarchical Queues
    Queue Access Control
    Capacity Management with Queues
    User Limits
    Reservations
    State of the Queues
    Limits on Applications
    User Interface
    Wrap-up

9 MapReduce with Apache Hadoop YARN
    Running Hadoop YARN MapReduce Examples
        Listing Available Examples
        Running the Pi Example
        Using the Web GUI to Monitor Examples
        Running the Terasort Test
        Run the TestDFSIO Benchmark
    MapReduce Compatibility
    The MapReduce ApplicationMaster
        Enabling Application Master Restarts
        Enabling Recovery of Completed Tasks
        The JobHistory Server
    Calculating the Capacity of a Node
    Changes to the Shuffle Service
    Running Existing Hadoop Version 1 Applications
        Binary Compatibility of org.apache.hadoop.mapred APIs
        Source Compatibility of org.apache.hadoop.mapreduce APIs
        Compatibility of Command-line Scripts
        Compatibility Tradeoff Between MRv1 and Early MRv2 (0.23.x) Applications
    Running MapReduce Version 1 Existing Code
        Running Apache Pig Scripts on YARN
        Running Apache Hive Queries on YARN
        Running Apache Oozie Workflows on YARN
    Advanced Features
        Uber Jobs
        Pluggable Shuffle and Sort
    Wrap-up

10 Apache Hadoop YARN Application Example
    The YARN Client
    The ApplicationMaster
    Wrap-up

11 Using Apache Hadoop YARN Distributed-Shell
    Using the YARN Distributed-Shell
        A Simple Example
        Using More Containers
        Distributed-Shell Examples with Shell Arguments
    Internals of the Distributed-Shell
        Application Constants
        Client
        ApplicationMaster
        Final Containers
    Wrap-up

12 Apache Hadoop YARN Frameworks
    Distributed-Shell
    Hadoop MapReduce
    Apache Tez
    Apache Giraph
    Hoya: HBase on YARN
    Dryad on YARN
    Apache Spark
    Apache Storm
    REEF: Retainable Evaluator Execution Framework
    Hamster: Hadoop and MPI on the Same Cluster
    Wrap-up

A Supplemental Content and Code Downloads
    Available Downloads

B YARN Installation Scripts
    hadoop-xml-conf.sh

C YARN Administration Scripts
    configure-hadoop2.sh

D Nagios Modules
    check_resource_manager.sh
    check_data_node.sh
    check_resource_manager_old_space_pct.sh

E Resources and Additional Information

F HDFS Quick Reference
    Quick Command Reference
    Starting HDFS and the HDFS Web GUI
    Get an HDFS Status Report
    Perform an FSCK on HDFS
    General HDFS Commands
        List Files in HDFS
        Make a Directory in HDFS
        Copy Files to HDFS
        Copy Files from HDFS
        Copy Files within HDFS
        Delete a File within HDFS
        Delete a Directory in HDFS
    Decommissioning HDFS Nodes

Index
Foreword by Raymie Stata

William Gibson was fond of saying: "The future is already here—it's just not very evenly distributed." Those of us who have been in the web search industry have had the privilege—and the curse—of living in the future of Big Data when it wasn't distributed at all. What did we learn? We learned to measure everything. We learned to experiment. We learned to mine signals out of unstructured data. We learned to drive business value through data science. And we learned that, to do these things, we needed a new data-processing platform fundamentally different from the business intelligence systems being developed at the time.

The future of Big Data is rapidly arriving for almost all industries. This is driven in part by widespread instrumentation of the physical world—vehicles, buildings, and even people are spitting out log streams not unlike the weblogs we know and love in cyberspace. Less obviously, digital records—such as digitized government records, digitized insurance policies, and digital medical records—are creating a trove of information not unlike the webpages crawled and parsed by search engines. It's no surprise, then, that the tools and techniques pioneered first in the world of web search are finding currency in more and more industries. And the leading such tool, of course, is Apache Hadoop.

But Hadoop is close to ten years old. Computing infrastructure has advanced significantly in this decade. If Hadoop was to maintain its relevance in the modern Big Data world, it needed to advance as well. YARN represents just the advancement needed to keep Hadoop relevant.

As described in the historical overview provided in this book, for the majority of Hadoop's existence, it supported a single computing paradigm: MapReduce. On the compute servers we had at the time, horizontal scaling—throwing more server nodes at a problem—was the only way the web search industry could hope to keep pace with the growth of the web.