Optimizing Hadoop for MapReduce


Optimizing Hadoop for MapReduce

Learn how to configure your Hadoop cluster to run optimal MapReduce jobs

Khaled Tannir

BIRMINGHAM - MUMBAI

Optimizing Hadoop for MapReduce

Copyright © 2014 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either expressed or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: February 2014

Production Reference: 1140214

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78328-565-5

www.packtpub.com

Cover Image by Khaled Tannir (contact@khaledtannir.net)

Credits

Author: Khaled Tannir

Reviewers: Włodzimierz Bzyl, Craig Henderson, Mark Kerzner

Acquisition Editor: Joanne Fitzpatrick

Commissioning Editor: Manasi Pandire

Technical Editors: Manu Joseph, Rosmy George, Pramod Kumavat, Arwa Manasawala, Adrian Raposo

Copy Editors: Kirti Pai, Laxmi Subramanian

Project Coordinator: Aboli Ambardekar

Proofreaders: Simran Bhogal, Ameesha Green

Indexer: Rekha Nair

Graphics: Yuvraj Mannari

Production Coordinators: Mario D'Souza, Alwin Roy

Cover Work: Alwin Roy

About the Author

Khaled Tannir has been working with computers since 1980. He began programming with the legendary Sinclair ZX81 and later with Commodore home computer products (VIC-20, Commodore 64, Commodore 128D, and Amiga 500).

He has a Bachelor's degree in Electronics and a Master's degree in System Information Architectures, in which he graduated with a professional thesis, and he completed his education with a Master of Research degree.

He is a Microsoft Certified Solution Developer (MCSD) and has more than 20 years of technical experience leading the development and implementation of software solutions and giving technical presentations. He now works as an independent IT consultant and has worked as an infrastructure engineer, senior developer, and enterprise/solution architect for many companies in France and Canada.

With significant experience in Microsoft .Net, Microsoft Server Systems, and Oracle Java technologies, he has extensive skills in online/offline application design, system conversions, and multilingual applications in both the Internet and desktop domains.

He is always researching new technologies, learning about them, and looking for new adventures in France, North America, and the Middle East. He owns an IT and electronics laboratory with many servers, monitors, open electronic boards such as Arduino, Netduino, Raspberry Pi, and .Net Gadgeteer, and some smartphone devices based on the Windows Phone, Android, and iOS operating systems.

In 2012, he contributed to EGC 2012 (the International Complex Data Mining forum at Bordeaux University, France) and presented, in a workshop session, his work on how to optimize data distribution in a cloud computing environment. This work aims to define an approach to optimizing the use of data mining algorithms such as k-means and Apriori in a cloud computing environment.

He is the author of RavenDB 2.x Beginner's Guide, Packt Publishing.

He aims to get a PhD in Cloud Computing and Big Data and wants to learn more and more about these technologies.

He enjoys taking landscape and night-time photos, travelling, playing video games, creating funny electronic gadgets with Arduino/.Net Gadgeteer, and of course, spending time with his wife and family.

You can reach him at contact@khaledtannir.net.

Acknowledgments

All praise is due to Allah, the Lord of the Worlds. First, I must thank Allah for giving me the ability to think and write.

Next, I would like to thank my wife, Laila, for her great support, encouragement, and patience throughout this project. I would also like to thank my family in Canada and Lebanon for their support during the writing of this book.

I would like to thank everyone at Packt Publishing for their help and guidance, and for giving me the opportunity to share my experience and knowledge in technology with others in the Hadoop and MapReduce community.

Thank you as well to the technical reviewers, who provided great feedback to ensure that every tiny technical detail was accurate and rich in content.

About the Reviewers

Włodzimierz Bzyl works at the University of Gdańsk, Poland. His current interests include web-related technologies and NoSQL databases. He has a passion for new technologies and introduces his students to them. He enjoys contributing to open source software and spending time trekking in the Tatra mountains.

Craig Henderson graduated in 1995 with a degree in Computing for Real-time Systems and has spent his career working on large-scale data processing and distributed systems. He is the author of an open source C++ MapReduce library for single-server application scalability, which is available at https://github.com/cdmh/mapreduce, and he currently researches image and video processing techniques for person identification.

Mark Kerzner holds degrees in Law, Mathematics, and Computer Science. He has been designing software for many years and Hadoop-based systems since 2008. He is the President of SHMsoft, a provider of Hadoop applications for various verticals, a co-founder of the Hadoop Illuminated training and consulting company, and the co-author of the open source book, Hadoop Illuminated. He has also authored and co-authored other books and patents.

I would like to acknowledge the help of my colleagues, in particular Sujee Maniyam, and last but not least, my multitalented family.

www.PacktPub.com

Support files, eBooks, discount offers, and more

You might want to visit www.PacktPub.com for support files and downloads related to your book.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at service@packtpub.com for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

http://PacktLib.PacktPub.com

Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can access, read, and search across Packt's entire library of books.

Why Subscribe?
- Fully searchable across every book published by Packt
- Copy and paste, print, and bookmark content
- On demand and accessible via web browser

Free Access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.

Table of Contents

Preface
Chapter 1: Understanding Hadoop MapReduce
    The MapReduce model
    An overview of Hadoop MapReduce
    Hadoop MapReduce internals
    Factors affecting the performance of MapReduce
    Summary
Chapter 2: An Overview of the Hadoop Parameters
    Investigating the Hadoop parameters
        The mapred-site.xml configuration file
            The CPU-related parameters
            The disk I/O related parameters
            The memory-related parameters
            The network-related parameters
        The hdfs-site.xml configuration file
        The core-site.xml configuration file
    Hadoop MapReduce metrics
    Performance monitoring tools
        Using Chukwa to monitor Hadoop
        Using Ganglia to monitor Hadoop
        Using Nagios to monitor Hadoop
        Using Apache Ambari to monitor Hadoop
    Summary
Chapter 3: Detecting System Bottlenecks
    Performance tuning
    Creating a performance baseline
    Identifying resource bottlenecks
        Identifying RAM bottlenecks
        Identifying CPU bottlenecks
        Identifying storage bottlenecks
        Identifying network bandwidth bottlenecks
    Summary
Chapter 4: Identifying Resource Weaknesses
    Identifying cluster weakness
        Checking the Hadoop cluster node's health
        Checking the input data size
        Checking massive I/O and network traffic
        Checking for insufficient concurrent tasks
        Checking for CPU contention
    Sizing your Hadoop cluster
    Configuring your cluster correctly
    Summary
Chapter 5: Enhancing Map and Reduce Tasks
    Enhancing map tasks
        Input data and block size impact
        Dealing with small and unsplittable files
        Reducing spilled records during the Map phase
        Calculating map tasks' throughput
    Enhancing reduce tasks
        Calculating reduce tasks' throughput
        Improving Reduce execution phase
    Tuning map and reduce parameters
    Summary
Chapter 6: Optimizing MapReduce Tasks
    Using Combiners
    Using compression
    Using appropriate Writable types
    Reusing types smartly
    Optimizing mappers and reducers code
    Summary
Chapter 7: Best Practices and Recommendations
    Hardware tuning and OS recommendations
        The Hadoop cluster checklist
        The Bios tuning checklist
        OS configuration recommendations
    Hadoop best practices and recommendations
        Deploying Hadoop
        Hadoop tuning recommendations
        Using a MapReduce template class code
    Summary
Index

Preface

MapReduce is an important parallel processing model for large-scale, data-intensive applications such as data mining and web indexing. Hadoop, an open source implementation of MapReduce, is widely applied to support cluster computing jobs that require low response time.

Most MapReduce programs are written for data analysis and usually take a long time to finish. Many companies are embracing Hadoop for advanced data analytics over large datasets that require completion-time guarantees. Efficiency, especially the I/O costs of MapReduce, still needs to be addressed for successful implementations. Experience shows that a misconfigured Hadoop cluster can significantly downgrade the performance of MapReduce jobs.

In this book, we address the MapReduce optimization problem: how to identify shortcomings and what to do to use all of the Hadoop cluster's resources to process input data optimally. This book starts off with an introduction to MapReduce, explaining how it works internally, and discusses the factors that can affect its performance. It then moves on to investigate Hadoop metrics and performance tools, and to identify resource weaknesses such as CPU contention, memory usage, massive I/O storage, and network traffic.

This book will teach you, in a step-by-step manner based on real-world experience, how to eliminate your job bottlenecks and fully optimize your MapReduce jobs in a production environment. You will also learn to calculate the right number of cluster nodes to process your data, to define the right number of mapper and reducer tasks based on your hardware resources, and to optimize mapper and reducer task performance using compression techniques and combiners.

Finally, you will learn the best practices and recommendations to tune your Hadoop cluster, and what a MapReduce template class looks like.
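Before diving into the chapters, it may help to see the MapReduce model in miniature. The following is a hedged, Hadoop-free sketch in Python (the book's own examples use the Hadoop Java API; the function names here are illustrative only) that simulates the three phases the model is built on: map emits key/value pairs, the framework shuffles and groups them by key, and reduce aggregates each group.

```python
from collections import defaultdict

def map_phase(line):
    # Mapper: emit a (word, 1) pair for every word in one input record
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle/sort: group all emitted values by key,
    # as the MapReduce framework does between the two phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reducer: aggregate the values for one key (here, sum the counts)
    return (key, sum(values))

lines = ["Hadoop MapReduce jobs", "tuning Hadoop jobs"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts["hadoop"])  # -> 2
```

Because each phase only depends on its input pairs, map tasks and reduce tasks can run in parallel across a cluster, which is exactly what makes their configuration and tuning, the subject of this book, so consequential.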

What this book covers

Chapter 1, Understanding Hadoop MapReduce, explains how MapReduce works internally and the factors that affect MapReduce performance.

Chapter 2, An Overview of the Hadoop Parameters, introduces Hadoop configuration files and MapReduce performance-related parameters. It also explains Hadoop metrics and several performance monitoring tools that you can use to monitor Hadoop MapReduce activities.

Chapter 3, Detecting System Bottlenecks, explores the Hadoop MapReduce performance tuning cycle and explains how to create a performance baseline. Then you will learn to identify resource bottlenecks and weaknesses based on Hadoop counters.

Chapter 4, Identifying Resource Weaknesses, explains how to check the Hadoop cluster's health and identify CPU and memory usage, massive I/O storage, and network traffic. You will also learn how to scale correctly when configuring your Hadoop cluster.

Chapter 5, Enhancing Map and Reduce Tasks, shows you how to enhance map and reduce task execution. You will learn the impact of block size, how to reduce spilled records, how to determine map and reduce throughput, and how to tune MapReduce configuration parameters.

Chapter 6, Optimizing MapReduce Tasks, explains when you need to use combiners and compression techniques to optimize map and reduce tasks, and introduces several techniques to optimize your application code.

Chapter 7, Best Practices and Recommendations, introduces miscellaneous hardware and software checklists, recommendations, and tuning properties to help you use your Hadoop cluster optimally.

What you need for this book

The Apache Hadoop framework (http://hadoop.apache.org/), with access to a computer running Hadoop on a Linux operating system.

Who this book is for

If you are an experienced MapReduce user or developer, this book will be great for you.
The book can also be a very helpful guide if you are a MapReduce beginner or a user who wants to try new things and learn techniques to optimize your applications. Knowledge of creating a MapReduce application is not required, but it will help you to grasp some of the concepts quicker and become more familiar with the snippets of MapReduce class template code.

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "W
