Hadoop 2 Quick-Start Guide

Eadline FM.indd 1    10/7/15 4:24 AM
Hadoop 2 Quick-Start Guide
Learn the Essentials of Big Data Computing in the Apache Hadoop 2 Ecosystem

Douglas Eadline

New York • Boston • Indianapolis • San Francisco • Toronto • Montreal • London • Munich • Paris • Madrid • Capetown • Sydney • Tokyo • Singapore • Mexico City
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of a trademark claim, the designations have been printed with initial capital letters or in all capitals.

The author and publisher have taken care in the preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein.

For information about buying this title in bulk quantities, or for special sales opportunities (which may include electronic versions; custom cover designs; and content particular to your business, training goals, marketing focus, or branding interests), please contact our corporate sales department at corpsales@pearsoned.com or (800) 382-3419.

For government sales inquiries, please contact governmentsales@pearsoned.com.

For questions about sales outside the United States, please contact international@pearsoned.com.

Visit us on the Web: informit.com/aw

Library of Congress Cataloging-in-Publication Data
Eadline, Doug, 1956- author.
Learn the essential aspects of big data computing in the Apache Hadoop 2 ecosystem / Doug Eadline.
pages cm
Includes bibliographical references and index.
ISBN 978-0-13-404994-6 (pbk. : alk. paper) — ISBN 0-13-404994-2 (pbk. : alk. paper)
1. Big data. 2. Data mining. 3. Apache Hadoop. I. Title.
QA76.9.B45E24 2016
006.3'12—dc23
2015030746

Copyright © 2016 Pearson Education, Inc.

Apache®, Apache Hadoop®, and Hadoop® are trademarks of The Apache Software Foundation. Used with permission. No endorsement by The Apache Software Foundation is implied by the use of these marks.

All rights reserved. Printed in the United States of America.
This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. To obtain permission to use material from this work, please submit a written request to Pearson Education, Inc., Permissions Department, 200 Old Tappan Road, Old Tappan, New Jersey 07675, or you may fax your request to (201) 236-3290.

ISBN-13: 978-0-13-404994-6
ISBN-10: 0-13-404994-2

Text printed in the United States on recycled paper at RR Donnelley in Crawfordsville, Indiana.
First printing, November 2015
Contents

Acknowledgments xix
About the Author xxi

1 Background and Concepts 1
    Defining Apache Hadoop 1
    A Brief History of Apache Hadoop 3
    Defining Big Data 4
    Hadoop as a Data Lake 5
    Using Hadoop: Administrator, User, or Both 6
    First There Was MapReduce 7
        Apache Hadoop Design Principles 7
        Apache Hadoop MapReduce Example 8
        MapReduce Advantages 10
        Apache Hadoop V1 MapReduce Operation 11
    Moving Beyond MapReduce with Hadoop V2 13
        Hadoop V2 YARN Operation Design 13
    The Apache Hadoop Project Ecosystem 15
    Summary and Additional Resources 18

2 Installation Recipes 19
    Core Hadoop Services 19
        Hadoop Configuration Files 20
    Planning Your Resources 21
        Hardware Choices 21
        Software Choices 22
    Installing on a Desktop or Laptop 23
        Installing Hortonworks HDP 2.2 Sandbox 23
        Installing Hadoop from Apache Sources 29
    Installing Hadoop with Ambari 40
        Performing an Ambari Installation 42
        Undoing the Ambari Install 55
    Installing Hadoop in the Cloud Using Apache Whirr 56
        Step 1: Install Whirr 57
        Step 2: Configure Whirr 57
        Step 3: Launch the Cluster 59
        Step 4: Take Down Your Cluster 61
    Summary and Additional Resources 62

3 Hadoop Distributed File System Basics 63
    Hadoop Distributed File System Design Features 63
    HDFS Components 64
    HDFS Block Replication 67
    HDFS Safe Mode 68
    Rack Awareness 68
    NameNode High Availability 69
    HDFS Namespace Federation 70
    HDFS Checkpoints and Backups 71
    HDFS Snapshots 71
    HDFS NFS Gateway 72
    HDFS User Commands 72
        Brief HDFS Command Reference 72
        General HDFS Commands 73
        List Files in HDFS 75
        Make a Directory in HDFS 76
        Copy Files to HDFS 76
        Copy Files from HDFS 76
        Copy Files within HDFS 76
        Delete a File within HDFS 76
        Delete a Directory in HDFS 77
        Get an HDFS Status Report 77
        HDFS Web GUI 77
    Using HDFS in Programs 77
        HDFS Java Application Example 78
        HDFS C Application Example 82
    Summary and Additional Resources 83

4 Running Example Programs and Benchmarks 85
    Running MapReduce Examples 85
        Listing Available Examples 86
        Running the Pi Example 87
        Using the Web GUI to Monitor Examples 89
    Running Basic Hadoop Benchmarks 95
        Running the Terasort Test 95
        Running the TestDFSIO Benchmark 96
    Managing Hadoop MapReduce Jobs 97
    Summary and Additional Resources 98

5 Hadoop MapReduce Framework 101
    The MapReduce Model 101
    MapReduce Parallel Data Flow 104
    Fault Tolerance and Speculative Execution 107
        Speculative Execution 108
    Hadoop MapReduce Hardware 108
    Summary and Additional Resources 109

6 MapReduce Programming 111
    Compiling and Running the Hadoop WordCount Example 111
    Using the Streaming Interface 116
    Using the Pipes Interface 119
    Compiling and Running the Hadoop Grep Chaining Example 121
    Debugging MapReduce 124
        Listing, Killing, and Job Status 125
        Hadoop Log Management 125
    Summary and Additional Resources 128

7 Essential Hadoop Tools 131
    Using Apache Pig 131
        Pig Example Walk-Through 132
    Using Apache Hive 134
        Hive Example Walk-Through 134
        A More Advanced Hive Example 136
    Using Apache Sqoop to Acquire Relational Data 139
        Apache Sqoop Import and Export Methods 139
        Apache Sqoop Version Changes 140
        Sqoop Example Walk-Through 142
    Using Apache Flume to Acquire Data Streams 148
        Flume Example Walk-Through 151
    Manage Hadoop Workflows with Apache Oozie 154
        Oozie Example Walk-Through 156
    Using Apache HBase 163
        HBase Data Model Overview 164
        HBase Example Walk-Through 164
    Summary and Additional Resources 169

8 Hadoop YARN Applications 171
    YARN Distributed-Shell 171
        Using the YARN Distributed-Shell 172
        A Simple Example 174
        Using More Containers 175
        Distributed-Shell Examples with Shell Arguments 176
    Structure of YARN Applications 178
    YARN Application Frameworks 179
        Distributed-Shell 180
        Hadoop MapReduce 181
        Apache Tez 181
        Apache Giraph 181
        Hoya: HBase on YARN 181
        Dryad on YARN 182
        Apache Spark 182
        Apache Storm 182
        Apache REEF: Retainable Evaluator Execution Framework 182
        Hamster: Hadoop and MPI on the Same Cluster 183
        Apache Flink: Scalable Batch and Stream Data Processing 183
        Apache Slider: Dynamic Application Management 183
    Summary and Additional Resources 184

9 Managing Hadoop with Apache Ambari 185
    Quick Tour of Apache Ambari 186
        Dashboard View 186
        Services View 189
        Hosts View 191
        Admin View 193
        Views View 193
        Admin Pull-Down Menu 194
    Managing Hadoop Services 194
    Changing Hadoop Properties 198
    Summary and Additional Resources 204

10 Basic Hadoop Administration Procedures 205
    Basic Hadoop YARN Administration 206
        Decommissioning YARN Nodes 206
        YARN WebProxy 206
        Using the JobHistoryServer 207
        Managing YARN Jobs 207
        Setting Container Memory 207
        Setting Container Cores 208
        Setting MapReduce Properties 208
    Basic HDFS Administration 208
        The NameNode User Interface 208
        Adding Users to HDFS 211
        Perform an FSCK on HDFS 212
        Balancing HDFS 213
        HDFS Safe Mode 214
        Decommissioning HDFS Nodes 214
        SecondaryNameNode 214
        HDFS Snapshots 215
        Configuring an NFSv3 Gateway to HDFS 217
    Capacity Scheduler Background 220
    Hadoop Version 2 MapReduce Compatibility 222
        Enabling ApplicationMaster Restarts 222
        Calculating the Capacity of a Node 222
        Running Hadoop Version 1 Applications 224
    Summary and Additional Resources 225

A Book Webpage and Code Download 227

B Getting Started Flowchart and Troubleshooting Guide 229
    Getting Started Flowchart 229
    General Hadoop Troubleshooting Guide 229
        Rule 1: Don't Panic 229
        Rule 2: Install and Use Ambari 234
        Rule 3: Check the Logs 234
        Rule 4: Simplify the Situation 235
        Rule 5: Ask the Internet 235
        Other Helpful Tips 235

C Summary of Apache Hadoop Resources by Topic 243
    General Hadoop Information 243
    Hadoop Installation Recipes 244
    MapReduce Programming 245
    Essential Tools 245
    YARN Application Frameworks 246
    Ambari Administration 246
    Basic Hadoop Administration 247

D Installing the Hue Hadoop GUI 249
    Hue Installation 249
        Steps Performed with Ambari 250
        Install and Configure Hue 252
    Starting Hue 253
    Hue User Interface 253

E Installing Apache Spark 257
    Spark Installation on a Cluster 257
    Starting Spark across the Cluster 258
    Installing and Starting Spark on the Pseudo-distributed Single-Node Installation 260
    Run Spark Examples 260

Index 261
Foreword

Apache Hadoop 2 introduced new methods of processing and working with data that moved beyond the basic MapReduce paradigm of the original Hadoop implementation. Whether you are a newcomer to Hadoop or a seasoned professional who has worked with the previous version, this book provides a fantastic introduction to the concepts and tools within Hadoop 2.

Over the past few years, many projects have fallen under the umbrella of the original Hadoop project to make storing, processing, and collecting large quantities of data easier while integrating with the original Hadoop project. This book introduces many of these projects in the larger Hadoop ecosystem, giving readers the high-level basics to get them started using tools that fit their needs.

Doug Eadline adapted much of this material from his very popular video series Hadoop Fundamentals LiveLessons. However, his qualifications don't stop there. As a coauthor on the in-depth book Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2, few are as well qualified to deliver coverage of Hadoop 2 and the new features it brings to users.

I'm excited about the great wealth of knowledge that Doug has brought to the series with his books covering Hadoop and its related projects. This book will be a great resource for both newcomers looking to learn more about the problems that Hadoop can help them solve and for existing users looking to learn about the benefits of upgrading to the new version.

—Paul Dix, Series Editor
Preface

Apache Hadoop 2 has changed the data analytics landscape. The Hadoop 2 ecosystem has moved beyond a single MapReduce data processing methodology and framework. That is, Hadoop version 2 offers the Hadoop version 1 methodology to almost any type of data processing and provides full backward compatibility with the venerable MapReduce paradigm from version 1.

This change has already had a dramatic effect on many areas of data processing and data analytics. The increased volume of online data has invited new and scalable approaches to data analytics. As discussed in Chapter 1, the concept of the Hadoop data lake represents a paradigm shift away from many established approaches to online data usage and storage. A Hadoop version 2 installation is an extensible platform that can grow and adapt as both data volumes increase and new processing models become available.

For this reason, the "Hadoop approach" is important and should not be dismissed as a simple "one-trick pony" for Big Data applications. In addition, the open source nature of Hadoop and much of the surrounding ecosystem provides an important incentive for adoption. Thanks to the Apache Software Foundation (ASF), Hadoop has always been an open source project whose inner workings are available to anyone. The open model has allowed vendors and users to share a common goal without lock-in or legal barriers that might otherwise splinter a huge and important project such as Hadoop. All software used in this book is open source and is freely available. Links leading to the software are provided at the end of each chapter and in Appendix C.

Focus of the Book

As the title implies, this book is a quick-start guide to Hadoop version 2. By design, most topics are summarized, illustrated with an example, and left a bit unfinished. Indeed, many of the tools and subjects covered here are treated elsewhere as completely independent books.
Thus, the biggest hurdle in creating a quick-start guide is deciding what not to include while simultaneously giving the reader a sense of what is important.

To this end, all topics are designed with what I call the hello-world.c experience. That is, provide some background on what the tool or service does, then provide a beginning-to-end example that allows the reader to get started quickly, and finally, provide resources where additional information and more nitty-gritty details can be
found. This approach allows the reader to make changes and implement variations that move away from the simple working example to something that solves the reader's particular problem. For most of us, our programming experience started from applying incremental changes to working examples—so the approach in this book should be a familiar one.

Who Should Read This Book

The book is intended for those readers w