Hadoop 2 Quick-Start Guide




Hadoop 2 Quick-Start Guide
Learn the Essentials of Big Data Computing in the Apache Hadoop 2 Ecosystem

Douglas Eadline

New York  Boston  Indianapolis  San Francisco  Toronto  Montreal  London  Munich  Paris  Madrid  Capetown  Sydney  Tokyo  Singapore  Mexico City

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of a trademark claim, the designations have been printed with initial capital letters or in all capitals.

The author and publisher have taken care in the preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein.

For information about buying this title in bulk quantities, or for special sales opportunities (which may include electronic versions; custom cover designs; and content particular to your business, training goals, marketing focus, or branding interests), please contact our corporate sales department at corpsales@pearsoned.com or (800) 382-3419.

For government sales inquiries, please contact governmentsales@pearsoned.com.

For questions about sales outside the United States, please contact international@pearsoned.com.

Visit us on the Web: informit.com/aw

Library of Congress Cataloging-in-Publication Data
Eadline, Doug, 1956- author.
Learn the essential aspects of big data computing in the Apache Hadoop 2 ecosystem / Doug Eadline.
pages cm
Includes bibliographical references and index.
ISBN 978-0-13-404994-6 (pbk. : alk. paper) — ISBN 0-13-404994-2 (pbk. : alk. paper)
1. Big data. 2. Data mining. 3. Apache Hadoop. I. Title.
QA76.9.B45E24 2016
006.3'12—dc23
2015030746

Copyright © 2016 Pearson Education, Inc.

Apache, Apache Hadoop, and Hadoop are trademarks of The Apache Software Foundation. Used with permission. No endorsement by The Apache Software Foundation is implied by the use of these marks.

All rights reserved. Printed in the United States of America. This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. To obtain permission to use material from this work, please submit a written request to Pearson Education, Inc., Permissions Department, 200 Old Tappan Road, Old Tappan, New Jersey 07675, or you may fax your request to (201) 236-3290.

ISBN-13: 978-0-13-404994-6
ISBN-10: 0-13-404994-2

Text printed in the United States on recycled paper at RR Donnelley in Crawfordsville, Indiana.
First printing, November 2015

Contents

Acknowledgments
About the Author

1 Background and Concepts
    Defining Apache Hadoop
    A Brief History of Apache Hadoop
    Defining Big Data
    Hadoop as a Data Lake
    Using Hadoop: Administrator, User, or Both
    First There Was MapReduce
    Apache Hadoop Design Principles
    Apache Hadoop MapReduce Example
    MapReduce Advantages
    Apache Hadoop V1 MapReduce Operation
    Moving Beyond MapReduce with Hadoop V2
    Hadoop V2 YARN Operation Design
    The Apache Hadoop Project Ecosystem
    Summary and Additional Resources

2 Installation Recipes
    Core Hadoop Services
    Hadoop Configuration Files
    Planning Your Resources
    Hardware Choices
    Software Choices
    Installing on a Desktop or Laptop
    Installing Hortonworks HDP 2.2 Sandbox
    Installing Hadoop from Apache Sources
    Installing Hadoop with Ambari
    Performing an Ambari Installation
    Undoing the Ambari Install
    Installing Hadoop in the Cloud Using Apache Whirr
    Step 1: Install Whirr
    Step 2: Configure Whirr
    Step 3: Launch the Cluster
    Step 4: Take Down Your Cluster
    Summary and Additional Resources

3 Hadoop Distributed File System Basics
    Hadoop Distributed File System Design Features
    HDFS Components
    HDFS Block Replication
    HDFS Safe Mode
    Rack Awareness
    NameNode High Availability
    HDFS Namespace Federation
    HDFS Checkpoints and Backups
    HDFS Snapshots
    HDFS NFS Gateway
    HDFS User Commands
    Brief HDFS Command Reference
    General HDFS Commands
    List Files in HDFS
    Make a Directory in HDFS
    Copy Files to HDFS
    Copy Files from HDFS
    Copy Files within HDFS
    Delete a File within HDFS
    Delete a Directory in HDFS
    Get an HDFS Status Report
    HDFS Web GUI
    Using HDFS in Programs
    HDFS Java Application Example
    HDFS C Application Example
    Summary and Additional Resources

4 Running Example Programs and Benchmarks
    Running MapReduce Examples
    Listing Available Examples
    Running the Pi Example
    Using the Web GUI to Monitor Examples
    Running Basic Hadoop Benchmarks
    Running the Terasort Test
    Running the TestDFSIO Benchmark
    Managing Hadoop MapReduce Jobs
    Summary and Additional Resources

5 Hadoop MapReduce Framework
    The MapReduce Model
    MapReduce Parallel Data Flow
    Fault Tolerance and Speculative Execution
    Speculative Execution
    Hadoop MapReduce Hardware
    Summary and Additional Resources

6 MapReduce Programming
    Compiling and Running the Hadoop WordCount Example
    Using the Streaming Interface
    Using the Pipes Interface
    Compiling and Running the Hadoop Grep Chaining Example
    Debugging MapReduce
    Listing, Killing, and Job Status
    Hadoop Log Management
    Summary and Additional Resources

7 Essential Hadoop Tools
    Using Apache Pig
    Pig Example Walk-Through
    Using Apache Hive
    Hive Example Walk-Through
    A More Advanced Hive Example
    Using Apache Sqoop to Acquire Relational Data
    Apache Sqoop Import and Export Methods
    Apache Sqoop Version Changes
    Sqoop Example Walk-Through
    Using Apache Flume to Acquire Data Streams
    Flume Example Walk-Through
    Manage Hadoop Workflows with Apache Oozie
    Oozie Example Walk-Through
    Using Apache HBase
    HBase Data Model Overview
    HBase Example Walk-Through
    Summary and Additional Resources

8 Hadoop YARN Applications
    YARN Distributed-Shell
    Using the YARN Distributed-Shell
    A Simple Example
    Using More Containers
    Distributed-Shell Examples with Shell Arguments
    Structure of YARN Applications
    YARN Application Frameworks
    Distributed-Shell
    Hadoop MapReduce
    Apache Tez
    Apache Giraph
    Hoya: HBase on YARN
    Dryad on YARN
    Apache Spark
    Apache Storm
    Apache REEF: Retainable Evaluator Execution Framework
    Hamster: Hadoop and MPI on the Same Cluster
    Apache Flink: Scalable Batch and Stream Data Processing
    Apache Slider: Dynamic Application Management
    Summary and Additional Resources

9 Managing Hadoop with Apache Ambari
    Quick Tour of Apache Ambari
    Dashboard View
    Services View
    Hosts View
    Admin View
    Views View
    Admin Pull-Down Menu
    Managing Hadoop Services
    Changing Hadoop Properties
    Summary and Additional Resources

10 Basic Hadoop Administration Procedures
    Basic Hadoop YARN Administration
    Decommissioning YARN Nodes
    YARN WebProxy
    Using the JobHistoryServer
    Managing YARN Jobs
    Setting Container Memory
    Setting Container Cores
    Setting MapReduce Properties
    Basic HDFS Administration
    The NameNode User Interface
    Adding Users to HDFS
    Perform an FSCK on HDFS
    Balancing HDFS
    HDFS Safe Mode
    Decommissioning HDFS Nodes
    SecondaryNameNode
    HDFS Snapshots
    Configuring an NFSv3 Gateway to HDFS
    Capacity Scheduler Background
    Hadoop Version 2 MapReduce Compatibility
    Enabling ApplicationMaster Restarts
    Calculating the Capacity of a Node
    Running Hadoop Version 1 Applications
    Summary and Additional Resources

A Book Webpage and Code Download

B Getting Started Flowchart and Troubleshooting Guide
    Getting Started Flowchart
    General Hadoop Troubleshooting Guide
    Rule 1: Don’t Panic
    Rule 2: Install and Use Ambari
    Rule 3: Check the Logs
    Rule 4: Simplify the Situation
    Rule 5: Ask the Internet
    Other Helpful Tips

C Summary of Apache Hadoop Resources by Topic
    General Hadoop Information
    Hadoop Installation Recipes
    MapReduce Programming
    Essential Tools
    YARN Application Frameworks
    Ambari Administration
    Basic Hadoop Administration

D Installing the Hue Hadoop GUI
    Hue Installation
    Steps Performed with Ambari
    Install and Configure Hue
    Starting Hue
    Hue User Interface

E Installing Apache Spark
    Spark Installation on a Cluster
    Starting Spark across the Cluster
    Installing and Starting Spark on the Pseudo-distributed Single-Node Installation
    Run Spark Examples

Index

Foreword

Apache Hadoop 2 introduced new methods of processing and working with data that moved beyond the basic MapReduce paradigm of the original Hadoop implementation. Whether you are a newcomer to Hadoop or a seasoned professional who has worked with the previous version, this book provides a fantastic introduction to the concepts and tools within Hadoop 2.

Over the past few years, many projects have fallen under the umbrella of the original Hadoop project to make storing, processing, and collecting large quantities of data easier while integrating with the original Hadoop project. This book introduces many of these projects in the larger Hadoop ecosystem, giving readers the high-level basics to get them started using tools that fit their needs.

Doug Eadline adapted much of this material from his very popular video series Hadoop Fundamentals LiveLessons. However, his qualifications don’t stop there. As a coauthor of the in-depth book Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2, few are as well qualified to deliver coverage of Hadoop 2 and the new features it brings to users.

I’m excited about the great wealth of knowledge that Doug has brought to the series with his books covering Hadoop and its related projects. This book will be a great resource both for newcomers looking to learn more about the problems that Hadoop can help them solve and for existing users looking to learn about the benefits of upgrading to the new version.

—Paul Dix, Series Editor


Preface

Apache Hadoop 2 has changed the data analytics landscape. The Hadoop 2 ecosystem has moved beyond a single MapReduce data processing methodology and framework. That is, Hadoop version 2 offers the Hadoop version 1 methodology to almost any type of data processing and provides full backward compatibility with the venerable MapReduce paradigm from version 1.

This change has already had a dramatic effect on many areas of data processing and data analytics. The increased volume of online data has invited new and scalable approaches to data analytics. As discussed in Chapter 1, the concept of the Hadoop data lake represents a paradigm shift away from many established approaches to online data usage and storage. A Hadoop version 2 installation is an extensible platform that can grow and adapt as both data volumes increase and new processing models become available.

For this reason, the “Hadoop approach” is important and should not be dismissed as a simple “one-trick pony” for Big Data applications. In addition, the open source nature of Hadoop and much of the surrounding ecosystem provides an important incentive for adoption. Thanks to the Apache Software Foundation (ASF), Hadoop has always been an open source project whose inner workings are available to anyone. The open model has allowed vendors and users to share a common goal without lock-in or legal barriers that might otherwise splinter a huge and important project such as Hadoop. All software used in this book is open source and is freely available. Links leading to the software are provided at the end of each chapter and in Appendix C.

Focus of the Book

As the title implies, this book is a quick-start guide to Hadoop version 2. By design, most topics are summarized, illustrated with an example, and left a bit unfinished. Indeed, many of the tools and subjects covered here are treated elsewhere as completely independent books. Thus, the biggest hurdle in creating a quick-start guide is deciding what not to include while simultaneously giving the reader a sense of what is important.

To this end, all topics are designed with what I call the hello-world.c experience. That is, provide some background on what the tool or service does, then provide a beginning-to-end example that allows the reader to get started quickly, and finally, provide resources where additional information and more nitty-gritty details can be found. This approach allows the reader to make changes and implement variations that move away from the simple working example to something that solves the reader’s particular problem. For most of us, our programming experience started from applying incremental changes to working examples—so the approach in this book should be a familiar one.

Who Should Read This Book

The book is intended for those readers w
