CS-495/595 Hadoop (part 2) Lecture #5 Dr. Chuck Cartledge Dr. Chuck .

Transcription

Miscellanea The Book Chapter 3 Chapter 4 Chapter 6 Chapter 8 Break Assignment #2 Exam Conclusion ReferencesCS-495/595Hadoop (part 2)Lecture #5Dr. Chuck Cartledge11 Feb. 20151/32

Miscellanea The Book Chapter 3 Chapter 4 Chapter 6 Chapter 8 Break Assignment #2 Exam Conclusion ReferencesTable of contents I1Miscellanea2The Book7Break3Chapter 38Assignment #24Chapter 49Exam5Chapter 610Conclusion6Chapter 811References2/32

Miscellanea The Book Chapter 3 Chapter 4 Chapter 6 Chapter 8 Break Assignment #2 Exam Conclusion ReferencesCorrections and additions since last lecture.Updated assignment #2(check the assignmentwrite-up)“Google is serious abouttaking on telecom”[3]White House working onSafeguarding AmericanConsumers and Families[5, 11]Samsung Smart TV islistening to you [10, 9]3/32

Miscellanea The Book Chapter 3 Chapter 4 Chapter 6 Chapter 8 Break Assignment #2 Exam Conclusion ReferencesHadoop, The Definitive GuideVersion 3 is specified in thesyllabus [12]Version 4 came out inNovember 2015We’ll use Version 3 as muchas possible4/32

Miscellanea The Book Chapter 3 Chapter 4 Chapter 6 Chapter 8 Break Assignment #2 Exam Conclusion ReferencesThe HDFSThings that HDFS is good at:LARGE files ( terabyte)Streaming access (WORM)Commodity hardware(failures are common)Image from [7].HDFS is a robust, reliable distributed file system.5/32

Miscellanea The Book Chapter 3 Chapter 4 Chapter 6 Chapter 8 Break Assignment #2 Exam Conclusion ReferencesThe HDFSThings that HDFS is not good at:Low-latency data access(HDFS is based on an RPCmodel)Lots of small files (overheadper file is constant)Multiple writers, writes onlyat the end of a fileImage from [7].File access can be slow because of RPC overhead.6/32

Miscellanea The Book Chapter 3 Chapter 4 Chapter 6 Chapter 8 Break Assignment #2 Exam Conclusion ReferencesThe HDFSNamenode and datanodesNamenode containsmeta-data about filesDatanodes contain blocksFile blocks are replicatedacross datanodesLoss of a datanode can bedetected and a replicatecreatedLoss of a namenode iscatastrophicImage from [4].All communication is via RPC.7/32

Miscellanea The Book Chapter 3 Chapter 4 Chapter 6 Chapter 8 Break Assignment #2 Exam Conclusion ReferencesThe HDFSA better view8/32

Miscellanea The Book Chapter 3 Chapter 4 Chapter 6 Chapter 8 Break Assignment #2 Exam Conclusion ReferencesThe HDFSWhat happens if you lose your namenode?Namenode is a single point of failure.Have secondary namenodeavailableNamenode keeps edit log ofdatanode actionsNamenode monitors health ofdatanodesNew primary namenode has to readedit log, get state of all datanodesLarge cluster can take up to 30 minutesto become fully functional.Namenode should be run on a highly reliable hardware suite.9/32

Miscellanea The Book Chapter 3 Chapter 4 Chapter 6 Chapter 8 Break Assignment #2 Exam Conclusion ReferencesThe HDFSRemote Procedure Call (RPC)Client makes a procedurecallData is serializedData is sent to serverData is deserializedData is processedAny returned data ishandled in the same wayProgrammers write to proceduresand network messiness is hidden.Image from [6].Attributed to Birrell and Nelson [1].10/32

Miscellanea The Book Chapter 3 Chapter 4 Chapter 6 Chapter 8 Break Assignment #2 Exam Conclusion ReferencesHadoop I/OHadoop supports file I/OHDFS’s primary concernsData integrity – ensuringthat data is complete andintact123Checks CRCsBit rotCreates new replicationswhen and where necessaryData compression123Minimizing data sizeNetwork activityAdds processing timeHadoop ensure data integrity, minimizes data size, is not fast.11/32

Miscellanea The Book Chapter 3 Chapter 4 Chapter 6 Chapter 8 Break Assignment #2 Exam Conclusion ReferencesHadoop I/ONot all data compression algorithms are the sameAll compression routines have the same basic goal.File extensions areassumed/expectedNot all algorithms are CLIcompatibleNot all compressed files aresplittableSplittable is important. If afile can not be split, therecan only be one reader.Low level routines are available, some work better than others.12/32

Miscellanea The Book Chapter 3 Chapter 4 Chapter 6 Chapter 8 Break Assignment #2 Exam Conclusion ReferencesHadoop I/OComparing compression algorithmsAll algorithms trade off space and timeCompressing 145,293,291 1,913,688Comp.33161Dec.910gzip is a good, middle of the road performer.13/32

Miscellanea The Book Chapter 3 Chapter 4 Chapter 6 Chapter 8 Break Assignment #2 Exam Conclusion ReferencesHadoop I/ODesign considerationsRaw (uncompressed) can besplit on 64M boundariesCompressed and unsplittablefile will support one readerCompressed and splittablefile can support multiplereadersStore file uncompressedUse compression thatsupports splittingUse Mapper to split fileUnsplittable files result in a single Mapper instance.14/32

Miscellanea The Book Chapter 3 Chapter 4 Chapter 6 Chapter 8 Break Assignment #2 Exam Conclusion ReferencesHadoop I/OSerializationThe process of turning structured objects into a byte stream.Used extensively in Hadoopinter-process ableHadoop serialization is not Java serialization.15/32

Miscellanea The Book Chapter 3 Chapter 4 Chapter 6 Chapter 8 Break Assignment #2 Exam Conclusion ReferencesHadoop I/OSummaryThe HDFS is:Optimized for LARGE filesDistributed, robust, andresilientSupports multiple readersLimited support for writersHas native support for rawand compressed filesMost file operations are RPCbased.The HDFS should be considered a WORM system.16/32

Miscellanea The Book Chapter 3 Chapter 4 Chapter 6 Chapter 8 Break Assignment #2 Exam Conclusion ReferencesHow MapReduce worksClassic organizationClient – our CLINodes may be on differentmachinesCommunication between machinesis via HFDSHeart beat messages12345Tasktracker – every 5 secondsNo heartbeat after 10 minutes –node is down and won’t use itChild process – every few secondsJobtracker – every secondProgress – every second (justindicates nothing is “stuck”)Remember: Hadoop is in Java, Mapperand Reducers may not beLots of timers and periodics to monitor activity and detect whensomething is “hung” or dead.17/32

Miscellanea The Book Chapter 3 Chapter 4 Chapter 6 Chapter 8 Break Assignment #2 Exam Conclusion ReferencesHow MapReduce worksProgressWhat is it and how is it reported? Not possible to show absolute progressbecause there may not be anyway to know ahead of time how much workneeds to be done.Have to report something:Reading by a Mapper orReducerWriting by a Mapper orReducerSetting the status descriptionIncrementing a counter(expensive operation)Using the progress() functionIf progress is not being made, Hadoop will terminate the processes.18/32

Miscellanea The Book Chapter 3 Chapter 4 Chapter 6 Chapter 8 Break Assignment #2 Exam Conclusion ReferencesHow MapReduce worksClassic Hadoop vs. YARN HadoopHadoop version 1 vs. Yet Another Resource Negotiator (YARN)Hadoop version 2Architectural differences:Job tracker and Task trackerreplaced by Resource Manager,Application Master, and NodemanagerMappers and Reducers written thesame wayInterfaces allow things to beswapped out with minimal impact.Image from [8]. .Hadoop administrators are intimately concerned with thedifferences between classic and YARN installations.19/32

Miscellanea The Book Chapter 3 Chapter 4 Chapter 6 Chapter 8 Break Assignment #2 Exam Conclusion ReferencesHow MapReduce worksSchedulersWhen will MY job be run?Different types of schedulers:FIFO – default in Hadoopver. 1 – first come firstservedFair – also available – jobsplaced in user pool, one jobper pool is scheduledCapacity – default inHadoop ver. 2 – similar toFair, but adds priorities andrelationships between poolsImage from [13].Different schedulers do things differently.20/32

Miscellanea The Book Chapter 3 Chapter 4 Chapter 6 Chapter 8 Break Assignment #2 Exam Conclusion ReferencesHow MapReduce worksA scheduler example.The best scheduler is the one that serves you best.21/32

Miscellanea The Book Chapter 3 Chapter 4 Chapter 6 Chapter 8 Break Assignment #2 Exam Conclusion ReferencesMapReduce FeaturesCountersHadoop has its own counters, and supports global user definedcounters.Applications can access andincrement countersCounters are global, acrossall Mappers and Reducers,so incrementing them canbe expensiveDetails, counters are definedby Java enumHadoop supplies counters, you can create counters, you can usecounters.22/32

Miscellanea The Book Chapter 3 Chapter 4 Chapter 6 Chapter 8 Break Assignment #2 Exam Conclusion ReferencesMapReduce FeaturesSortingSorting is based on the “Key”Keys can be:Simple – RawComparator()will workCompound – comparator()and partitioner() need towork on the correct part ofthe “key”Image from [2].Sorting can be very complex, depending on your application.23/32

Miscellanea The Book Chapter 3 Chapter 4 Chapter 6 Chapter 8 Break Assignment #2 Exam Conclusion ReferencesMapReduce FeaturesJoinsThese are exactly the same as traditional SQL database joins.Depending on your application, joinscan happen:Mapper – inputs have to bestrictly partitionedReducer – inputs have to be“tagged” to be processedcorrectlyImage from [2].“MapReduce can perform joins between large datasets, but writing the code todo joins from scratch is fairly involved. Rather than writing MapReduceprograms, you might consider using a higher-level framework such as Pig, Hive,or Cascading, in which join operations are a core part of the implementation.”[12]24/32

Miscellanea The Book Chapter 3 Chapter 4 Chapter 6 Chapter 8 Break Assignment #2 Exam Conclusion ReferencesBreak time.Take about 10 minutes.25/32

Miscellanea The Book Chapter 3 Chapter 4 Chapter 6 Chapter 8 Break Assignment #2 Exam Conclusion ReferencesAn inverted word list.Looking at where words are used.A simply stated problem: whereare certain words used?Undergrad students – whichlines have the word “loue”Grad student – which linesof have the word “loue”,which have the word“course”, and which havebothInterested in the line numbersand the line itself.An example: 1408: And wonnethy loue, doing thee iniuries:26/32

Miscellanea The Book Chapter 3 Chapter 4 Chapter 6 Chapter 8 Break Assignment #2 Exam Conclusion ReferencesMechanics of the examClosed bookNo “cheat” sheetsAnything from the lectures(and supporting material) isfair gameAnything that was discussedin class is fair gameAnything that should havebeen experienced, orencountered in theassignments is fair gameEach question will have twoparts12An undergrad partA graduate partUndergrads can attempt graduate part without penality.27/32

Miscellanea The Book Chapter 3 Chapter 4 Chapter 6 Chapter 8 Break Assignment #2 Exam Conclusion ReferencesWhat have we covered?Spent time on the HDFS,identifying its strengths andweaknessesDiscussed importance of HDFSname and data nodesWent over RPCs and its strengthsand weaknessesTalked about Hadoop file I/O andcompressionTalked about Hadoop ver. 1 andver. 2Talked about assignment #2Talked about the examChapter 11 will NOT be on theexamNext lecture: Hadoop book, Chapter 11 and exam28/32

Miscellanea The Book Chapter 3 Chapter 4 Chapter 6 Chapter 8 Break Assignment #2 Exam Conclusion ReferencesReferences I[1]Andrew D Birrell and Bruce Jay Nelson, Implementing remoteprocedure calls, ACM Transactions on Computer Systems(TOCS) 2 (1984), no. 1, 39–59.[2]Iv’an de Prado Alonso, Mapreduce & hadoop api hadoopproblems/, 2012.[3]Brian Fung, Google is serious about taking on ontelecom-heres-why-itll-win/.[4]Pramod Kumar Gampa, Hdfs 08/hdfs-indetail.html, 2013.29/32

Miscellanea The Book Chapter 3 Chapter 4 Chapter 6 Chapter 8 Break Assignment #2 Exam Conclusion ReferencesReferences II[5]Paul Hastings and Mathew Gibson, In visit to ftc, presidentoutlines broad privacy agenda, offers scant x?g 73df90aa-bcfa-4b49-9523-73c372f74695.[6]Jan Newmarch, Web services, es/tutorial.html.[7]Sreenivas Pasam, Apache hadoop, mputing/, 2011.[8]Tavish Srivastava, Hadoop beyond traditional mapreducesimplified, -mapreduce/, 2014.30/32

Miscellanea The Book Chapter 3 Chapter 4 Chapter 6 Chapter 8 Break Assignment #2 Exam Conclusion ReferencesReferences III[9]BBC staff, Not in front of the telly: Warning over ’listening’tv, 10] Samsung staff, Samsung global privacy policy - privacy-SmartTV.html?CID AFL-hq-mul-0813-11000170, 2015.[11] White House staff, Big data: Seizing opportunities, preservingvalues interim progress report, Tech. report, White House,2015.[12] Tom White, Hadoop: The definitive guide, 3rd edition,O’Reilly Media, Inc., 2012.31/32

Miscellanea The Book Chapter 3 Chapter 4 Chapter 6 Chapter 8 Break Assignment #2 Exam Conclusion ReferencesReferences IV[13], Hadoop: The definitive guide, 4th edition, O’ReillyMedia, Inc., 2015.32/32

MiscellaneaThe Book Chapter 3 Chapter 4 Chapter 6 Chapter 8Break Assignment #2 ExamConclusionReferences MapReduce Features Counters Hadoop has its own counters, and supports global user de ned counters. Applications can access and increment counters Counters are global, across all Mappers and Reducers, so incrementing them can be expensive