Hadoop: The Definitive Guide - Meetup

Transcription

Hadoop: The Definitive Guide
Tom White
foreword by Doug Cutting
Beijing · Cambridge · Farnham · Köln · Sebastopol · Taipei · Tokyo

Hadoop: The Definitive Guide
by Tom White

Copyright 2009 Tom White. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: (800) 998-9938 or corporate@oreilly.com.

Editor: Mike Loukides
Production Editor: Loranah Dimant
Proofreader: Nancy Kotary
Indexer: Ellen Troutman Zaig
Cover Designer: Karen Montgomery
Interior Designer: David Futato
Illustrator: Robert Romano

Printing History:
June 2009: First Edition.

Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly Media, Inc. Hadoop: The Definitive Guide, the image of an African elephant, and related trade dress are trademarks of O’Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and author assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

This book uses RepKover™, a durable and flexible lay-flat binding.

ISBN: 978-0-596-52197-4

For Eliane, Emilia, and Lottie

Table of Contents

Foreword
Preface

1. Meet Hadoop
    Data!
    Data Storage and Analysis
    Comparison with Other Systems
    RDBMS
    Grid Computing
    Volunteer Computing
    A Brief History of Hadoop
    The Apache Hadoop Project

2. MapReduce
    A Weather Dataset
    Data Format
    Analyzing the Data with Unix Tools
    Analyzing the Data with Hadoop
    Map and Reduce
    Java MapReduce
    Scaling Out
    Data Flow
    Combiner Functions
    Running a Distributed MapReduce Job
    Hadoop Streaming
    Ruby
    Python
    Hadoop Pipes
    Compiling and Running

3. The Hadoop Distributed Filesystem
    The Design of HDFS
    HDFS Concepts
    Blocks
    Namenodes and Datanodes
    The Command-Line Interface
    Basic Filesystem Operations
    Hadoop Filesystems
    Interfaces
    The Java Interface
    Reading Data from a Hadoop URL
    Reading Data Using the FileSystem API
    Writing Data
    Directories
    Querying the Filesystem
    Deleting Data
    Data Flow
    Anatomy of a File Read
    Anatomy of a File Write
    Coherency Model
    Parallel Copying with distcp
    Keeping an HDFS Cluster Balanced
    Hadoop Archives
    Using Hadoop Archives

4. Hadoop I/O
    Data Integrity
    Data Integrity in HDFS
    Compression
    Codecs
    Compression and Input Splits
    Using Compression in MapReduce
    Serialization
    The Writable Interface
    Writable Classes
    Implementing a Custom Writable
    Serialization Frameworks
    File-Based Data Structures
    SequenceFile
    MapFile

5. Developing a MapReduce Application
    The Configuration API
    Combining Resources
    Variable Expansion
    Configuring the Development Environment
    Managing Configuration
    GenericOptionsParser, Tool, and ToolRunner
    Writing a Unit Test
    Mapper
    Reducer
    Running Locally on Test Data
    Running a Job in a Local Job Runner
    Testing the Driver
    Running on a Cluster
    Packaging
    Launching a Job
    The MapReduce Web UI
    Retrieving the Results
    Debugging a Job
    Using a Remote Debugger
    Tuning a Job
    Profiling Tasks
    MapReduce Workflows
    Decomposing a Problem into MapReduce Jobs
    Running Dependent Jobs

6. How MapReduce Works
    Anatomy of a MapReduce Job Run
    Job Submission
    Job Initialization
    Task Assignment
    Task Execution
    Progress and Status Updates
    Job Completion
    Failures
    Task Failure
    Tasktracker Failure
    Jobtracker Failure
    Job Scheduling
    The Fair Scheduler
    Shuffle and Sort
    The Map Side
    The Reduce Side

    Configuration Tuning
    Task Execution
    Speculative Execution
    Task JVM Reuse
    Skipping Bad Records
    The Task Execution Environment

7. MapReduce Types and Formats
    MapReduce Types
    The Default MapReduce Job
    Input Formats
    Input Splits and Records
    Text Input
    Binary Input
    Multiple Inputs
    Database Input (and Output)
    Output Formats
    Text Output
    Binary Output
    Multiple Outputs
    Lazy Output
    Database Output

8. MapReduce Features
    Counters
    Built-in Counters
    User-Defined Java Counters
    User-Defined Streaming Counters
    Sorting
    Preparation
    Partial Sort
    Total Sort
    Secondary Sort
    Joins
    Map-Side Joins
    Reduce-Side Joins
    Side Data Distribution
    Using the Job Configuration
    Distributed Cache
    MapReduce Library Classes

9. Setting Up a Hadoop Cluster
    Cluster Specification

    Network Topology
    Cluster Setup and Installation
    Installing Java
    Creating a Hadoop User
    Installing Hadoop
    Testing the Installation
    SSH Configuration
    Hadoop Configuration
    Configuration Management
    Environment Settings
    Important Hadoop Daemon Properties
    Hadoop Daemon Addresses and Ports
    Other Hadoop Properties
    Post Install
    Benchmarking a Hadoop Cluster
    Hadoop Benchmarks
    User Jobs
    Hadoop in the Cloud
    Hadoop on Amazon EC2

10. Administering Hadoop
    HDFS
    Persistent Data Structures
    Safe Mode
    Audit Logging
    Tools
    Monitoring
    Logging
    Metrics
    Java Management Extensions
    Maintenance
    Routine Administration Procedures
    Commissioning and Decommissioning Nodes

11. Pig
    Installing and Running Pig
    Execution Types
    Running Pig Programs
    Grunt
    Pig Latin Editors
    An Example
    Generating Examples

    Comparison with Databases
    Pig Latin
    Functions
    User-Defined Functions
    A Filter UDF
    An Eval UDF
    A Load UDF
    Data Processing Operators
    Loading and Storing Data
    Filtering Data
    Grouping and Joining Data
    Sorting Data
    Combining and Splitting Data
    Pig in Practice
    Parallelism
    Parameter Substitution

12. HBase
    HBasics
    Backdrop
    Concepts
    Whirlwind Tour of the Data Model
    Implementation
    Installation
    Test Drive
    Clients
    Java
    REST and Thrift
    Example
    Schemas
    Loading Data
    Web Queries
    HBase Versus RDBMS
    Successful Service
    HBase
    Use Case: HBase at streamy.com
    Praxis
    Versions

    Love and Hate: HBase and HDFS
    UI
    Metrics
    Schema Design

13. ZooKeeper
    Installing and Running ZooKeeper
    An Example
    Group Membership in ZooKeeper
    Creating the Group
    Joining a Group
    Listing Members in a Group
    Deleting a Group
    The ZooKeeper Service
    Data Model
    Operations
    Implementation
    Consistency
    Sessions
    States
    Building Applications with ZooKeeper
    A Configuration Service
    The Resilient ZooKeeper Application
    A Lock Service
    More Distributed Data Structures and Protocols
    ZooKeeper in Production
    Resilience and Performance

14. Case Studies
    Hadoop Usage at Last.fm
    Last.fm: The Social Music Revolution
    Hadoop at Last.fm
    Generating Charts with Hadoop
    The Track Statistics Program
    Summary
    Hadoop and Hive at Facebook
    Introduction
    Hadoop at Facebook
    Hypothetical Use Case Studies
    Hive
    Problems and Future Work
    Nutch Search Engine

    Background
    Data Structures
    Selected Examples of Hadoop Data Processing in Nutch
    Summary
    Log Processing at Rackspace
    Requirements/The Problem
    Brief History
    Choosing Hadoop
    Collection and Storage
    MapReduce for Logs
    Cascading
    Fields, Tuples, and Pipes
    Operations
    Taps, Schemes, and Flows
    Cascading in Practice
    Flexibility
    Hadoop and Cascading at ShareThis
    Summary
    TeraByte Sort on Apache Hadoop

A. Installing Apache Hadoop
B. Cloudera’s Distribution for Hadoop
C. Preparing the NCDC Weather Data

Index

Foreword

Hadoop got its start in Nutch. A few of us were attempting to build an open source web search engine and having trouble managing computations running on even a handful of computers. Once Google published its GFS and MapReduce papers, the route became clear. They’d devised systems to solve precisely the problems we were having with Nutch. So we started, two of us, half-time, to try to recreate these systems as a part of Nutch.

We managed to get Nutch limping along on 20 machines, but it soon became clear that to handle the Web’s massive scale, we’d need to run it on thousands of machines and, moreover, that the job was bigger than two half-time developers could handle.

Around that time, Yahoo! got interested, and quickly put together a team that I joined. We split off the distributed computing part of Nutch, naming it Hadoop. With the help of Yahoo!, Hadoop soon grew into a technology that could truly scale to the Web.

In 2006, Tom White started contributing to Hadoop. I already knew Tom through an excellent article he’d written about Nutch, so I knew he could present complex ideas in clear prose. I soon learned that he could also develop software that was as pleasant to read as his prose.

From the beginning, Tom’s contributions to Hadoop showed his concern for users and for the project. Unlike most open source contributors, Tom is not primarily interested in tweaking the system to better meet his own needs, but rather in making it easier for anyone to use.

Initially, Tom specialized in making Hadoop run well on Amazon’s EC2 and S3 services. Then he moved on to tackle a wide variety of problems, including improving the MapReduce APIs, enhancing the website, and devising an object serialization framework. In all cases, Tom presented his ideas precisely. In short order, Tom earned the role of Hadoop committer and soon thereafter became a member of the Hadoop Project Management Committee.

Tom is now a respected senior member of the Hadoop developer community. Though he’s an expert in many technical corners of the project, his specialty is making Hadoop easier to use and understand.

Given this, I was very pleased when I learned that Tom intended to write a book about Hadoop. Who could be better qualified? Now you have the opportunity to learn about Hadoop from a master—not only of the technology, but also of common sense and plain talk.

—Doug Cutting
Shed in the Yard, California

Preface

Martin Gardner, the mathematics and science writer, once said in an interview:

    Beyond calculus, I am lost. That was the secret of my column’s success. It took me so long to understand what I was writing about that I knew how to write in a way most readers would understand.*

In many ways, this is how I feel about Hadoop. Its inner workings are complex, resting as they do on a mixture of distributed systems theory, practical engineering, and common sense. And to the uninitiated, Hadoop can appear alien.

But it doesn’t need to be like this. Stripped to its core, the tools that Hadoop provides for building distributed systems—for data storage, data analysis, and coordination—are simple. If there’s a common theme, it is about raising the level of abstraction—to create building blocks for programmers who just happen to have lots of data to store, or lots of data to analyze, or lots of machines to coordinate, and who don’t have the time, the skill, or the inclination to become distributed systems experts to build the infrastructure to handle it.

With such a simple and generally applicable feature set, it seemed obvious to me when I started using it that Hadoop deserved to be widely used. However, at the time (in early 2006), setting up, configuring, and writing programs to use Hadoop was an art. Things have certainly improved since then: there is more documentation, there are more examples, and there are thriving mailing lists to go to when you have questions. And yet the biggest hurdle for newcomers is understanding what this technology is capable of, where it excels, and how to use it. That is why I wrote this book.

The Apache Hadoop community has come a long way. Over the course of three years, the Hadoop project has blossomed and spun off half a dozen subprojects. In this time, the software has made great leaps in performance, reliability, scalability, and manageability. To gain even wider adoption, however, I believe we need to make Hadoop even easier to use.
This will involve writing more tools; integrating with more systems; and writing new, improved APIs. I’m looking forward to being a part of this, and I hope this book will encourage and enable others to do so, too.

* “The science of fun,” Alex Bellos, The Guardian, May 31, 2008.

Administrative Notes

During discussion of a particular Java class in the text, I often omit its package name, to reduce clutter. If you need to know which package a class is in, you can easily look it up in Hadoop’s Java API documentation for the relevant subproject, linked to from the Apache Hadoop home page at http://hadoop.apache.org/. Or if you’re using an IDE, it can help using its auto-complete mechanism.

Similarly, although it deviates from usual style guidelines, program listings that import multiple classes from the same package may use the asterisk wildcard character to save space (for example: import org.apache.hadoop.io.*).

The sample programs in this book are available for download from the website that accompanies this book: http://www.hadoopbook.com/. You will also find instructions there for obtaining the datasets that are used in examples throughout the book, as well as further notes for running the programs in the book, and links to updates, additional resources, and my blog.

What’s in This Book?

The rest of this book is organized as follows. Chapter 2 provides an introduction to MapReduce. Chapter 3 looks at Hadoop filesystems, and in particular HDFS, in depth. Chapter 4 covers the fundamentals of I/O in Hadoop: data integrity, compression, serialization, and file-based data structures.

The next four chapters cover MapReduce in depth. Chapter 5 goes through the practical steps needed to develop a MapReduce application. Chapter 6 looks at how MapReduce is implemented in Hadoop, from the point of view of a user. Chapter 7 is about the MapReduce programming model, and the various data formats that MapReduce can work with. Chapter 8 is on advanced MapReduce topics, including sorting and joining data.

Chapters 9 and 10 are for Hadoop administrators, and describe how to set up and maintain a Hadoop cluster running HDFS and MapReduce.

Chapters 11, 12, and 13 present Pig, HBase, and ZooKeeper, respectively.

Finally, Chapter 14 is a collection of case studies contributed by members of the Apache Hadoop community.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
    Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
    Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
    Shows commands or other text that should be typed literally by the user.

Constant width italic
    Shows text that should be replaced with user-supplied values or by values determined by context.

This icon signifies a tip, suggestion, or general note.

This icon indicates a warning or caution.

Using Code Examples

This book is here to help you get your job done. In general, you may use the code in this book in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Hadoop: The Definitive Guide, by Tom White. Copyright 2009 Tom White, 978-0-596-52197-4.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

Safari Books Online

When you see a Safari Books Online icon on the cover of your favorite technology book, that means the book is available online through the O’Reilly Network Safari Bookshelf.

Safari offers a solution that’s better than e-books. It’s a virtual library that lets you easily search thousands of top tech books, cut and paste code samples, download chapters, and find quick answers when you need the most accurate, current information. Try it for free at http://my.safaribooksonline.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at:

http://www.oreilly.com/catalog/9780596521974

The author also has a site for this book at:

http://www.hadoopbook.com/

To comment or ask technical questions about this book, send email to:

bookquestions@oreilly.com

For more information about our books, conferences, Resource Centers, and the O’Reilly Network
