Hadoop MapReduce


Hadoop MapReduce
Felipe Meneses Besson
IME-USP, Brazil
July 7, 2010

Agenda
– What is Hadoop?
– Hadoop subprojects
– MapReduce
– HDFS
– Development and tools

What is Hadoop?
A framework for large-scale data processing (Tom White, 2009):
– A project of the Apache Software Foundation
– Mostly written in Java
– Inspired by Google's MapReduce and GFS (Google File System)

A brief history
– 2004: Google published a paper introducing MapReduce and GFS as an alternative for handling the growing volume of data to be processed
– 2005: Doug Cutting integrated MapReduce into Hadoop
– 2006: Doug Cutting joined Yahoo!
– 2008: Cloudera¹ was founded
– 2009: a Hadoop cluster sorted 100 terabytes in 173 minutes (on 3,400 nodes)²
Nowadays Cloudera is an active contributor to the Hadoop project and provides Hadoop consulting and commercial products.
[1] Cloudera: http://www.cloudera.com
[2] Sort Benchmark: http://sortbenchmark.org/

Hadoop characteristics
– A scalable and reliable system for shared storage and analysis
– Automatically handles data replication and node failure
– Does the hard work, so developers can focus on the data-processing logic
– Enables applications to work on petabytes of data in parallel

Who's using Hadoop
Source: Hadoop wiki, September 2009

Hadoop subprojects
Apache Hadoop is a collection of related subprojects that fall under the umbrella of infrastructure for distributed computing. All projects are hosted by the Apache Software Foundation.

MapReduce
MapReduce is a programming model and an associated implementation for processing and generating large data sets (Jeffrey Dean and Sanjay Ghemawat, 2004).
– Based on a functional programming model
– A batch data-processing system
– A clean abstraction for programmers
– Automatic parallelization and distribution
– Fault tolerance

MapReduce: programming model
Users implement two functions:

    map (in_key, in_value) → list of (out_key, intermediate_value)
    reduce (out_key, list of intermediate_value) → list of out_value

MapReduce: map function
Input:
– Records from some data source (e.g., lines of files, rows of a database, ...) arrive as (key, value) pairs. Example: (filename, content)
Output:
– One or more intermediate values in the (key, value) format. Example: (word, number of occurrences)

MapReduce: map function

    map (in_key, in_value) → list of (out_key, intermediate_value)

Source: (Cloudera, 2010)

MapReduce: map function
Example:

    map (k, v):
        if isPrime(v) then emit(k, v)

    ("foo", 7)   → ("foo", 7)
    ("test", 10) → (nothing)
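The same filter can be written against Hadoop's Java Mapper API. A minimal sketch, assuming Text keys and IntWritable values; the class name and the isPrime helper are illustrative, not from the slides:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits only the (key, value) pairs whose value is prime.
    public class PrimeFilterMapper extends Mapper<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void map(Text key, IntWritable value, Context context)
                throws IOException, InterruptedException {
            if (isPrime(value.get())) {
                context.write(key, value);  // emit (k, v) unchanged
            }
            // a non-prime value emits nothing at all
        }

        private static boolean isPrime(int n) {
            if (n < 2) return false;
            for (int i = 2; i * i <= n; i++) {
                if (n % i == 0) return false;
            }
            return true;
        }
    }

Note that a single map call may emit zero, one, or many pairs; the framework collects whatever is written to the context.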

MapReduce: reduce function
After the map phase is over, all the intermediate values for a given output key are combined together into a list.
Input:
– Intermediate values. Example: ("A", [42, 100, 312])
Output:
– Usually only one final value per key. Example: ("A", 454)

MapReduce: reduce function

    reduce (out_key, list of intermediate_value) → list of out_value

Source: (Cloudera, 2010)

MapReduce: reduce function
Example:

    reduce (k, vals):
        sum = 0
        foreach int v in vals:
            sum += v
        emit(k, sum)

    ("A", [42, 100, 312]) → ("A", 454)
    ("B", [12, 6, -2])    → ("B", 16)
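In the Java API the same summation looks like this; again a minimal sketch with an assumed class name and Text/IntWritable types:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sums all intermediate values collected for a key.
    public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));  // one final (key, sum) pair
        }
    }

The framework guarantees that all values for a given key reach the same reduce call, which is what makes per-key aggregation correct.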

MapReduce: terminology
– Job: unit of work that the client wants performed; input data + MapReduce program + configuration information
– Task: part of a job; there are map tasks and reduce tasks
– Jobtracker: node that coordinates all the jobs in the system by scheduling tasks to run on tasktrackers

MapReduce: terminology
– Tasktracker: node that runs tasks and sends progress reports to the jobtracker
– Split: fixed-size piece of the input data

MapReduce: data flow
Source: (Cloudera, 2010)

MapReduce: real example

    map (String key, String value):
        // key: document name
        // value: document contents
        for each word w in value:
            EmitIntermediate(w, "1")

MapReduce: real example

    reduce (String key, Iterator values):
        // key: a word
        // values: a list of counts
        int result = 0
        for each v in values:
            result += ParseInt(v)
        Emit(AsString(result))

MapReduce: combiner function
– Compresses the intermediate values
– Runs locally on mapper nodes after the map phase
– Works like a "mini-reduce"
– Used to save bandwidth before sending data to the reducer

MapReduce: combiner function
Applied on the mapper machine
Source: (Cloudera, 2010)

HDFS: Hadoop Distributed Filesystem
– Inspired by GFS
– Designed to work with very large files
– Runs on commodity hardware
– Streaming data access
– Replication and locality
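Streaming access is visible in the client API: a file is opened once and read sequentially. A minimal read sketch against the FileSystem Java API, assuming a hypothetical namenode address and a path passed as the first argument:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Streams a file out of HDFS line by line.
    public class HdfsCat {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.default.name", "hdfs://namenode:9000");  // assumed address
            FileSystem fs = FileSystem.get(conf);
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(new Path(args[0]))));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);  // records arrive sequentially
            }
            in.close();
        }
    }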

HDFS: nodes
– A namenode (the master): manages the filesystem namespace and knows the location of every block
– Datanodes (workers): keep blocks of data and periodically report their block lists back to the namenode

HDFS: duplication
Input data copied into HDFS is split into blocks, and each data block is replicated to multiple machines.
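Splitting and replication are transparent to the writer; copying a file in is a single call. A minimal write sketch, with the local and HDFS paths passed as arguments (names are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Copies a local file into HDFS; HDFS splits it into blocks and
    // replicates each block to multiple datanodes behind the scenes.
    public class HdfsPut {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            fs.copyFromLocalFile(new Path(args[0]), new Path(args[1]));
        }
    }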

HDFS: MapReduce data flow
Source: (Tom White, 2009)

Hadoop filesystems
Source: (Tom White, 2009)

Development and tools: Hadoop operation modes
Hadoop supports three modes of operation:
– Standalone
– Pseudo-distributed
– Fully distributed
More: pts/hadooptdg/installing-apache-hadoop.html
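In pseudo-distributed mode all daemons run on a single machine against a local HDFS. A sketch of a minimal conf/core-site.xml from that era; the port is an assumption, not from the slides:

    <!-- conf/core-site.xml: point clients at a local HDFS (assumed port) -->
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>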

Development and tools: Java example
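The code on the original slides did not survive transcription. As a stand-in, here is a minimal word-count job in the spirit of the pseudocode above, written against the org.apache.hadoop.mapreduce API of the Hadoop 0.20 era; a sketch, not the slides' exact code:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: emit (word, 1) for every word in the input line.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: sum the counts collected for each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            // The reducer doubles as a combiner: summing is associative and
            // commutative, so running it early on mapper nodes is safe.
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }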

Development and tools: guidelines to get started
The basic steps for running a Hadoop job are listed below; a command-line sketch for the HDFS and execution steps follows the list:
– Compile your job into a JAR file
– Copy the input data into HDFS
– Execute hadoop, passing the JAR and the relevant arguments
– Monitor the tasks via the web interface (optional)
– Examine the output when the job is complete
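Assuming the WordCount JAR sketched earlier and placeholder HDFS paths, those steps might look like:

    hadoop fs -put input.txt /user/me/input                              # copy input into HDFS
    hadoop jar wordcount.jar WordCount /user/me/input /user/me/output    # run the job
    hadoop fs -cat /user/me/output/part-r-00000                          # examine the output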

Development and tools: APIs, tools and training
Do you want to use a scripting language?
– http://wiki.apache.org/hadoop/HadoopStreaming
Eclipse plugin for MapReduce development:
– http://wiki.apache.org/hadoop/EclipsePlugIn
Hadoop training (videos, exercises, ...)

Bibliography
– Tom White (2009). Hadoop: The Definitive Guide. O'Reilly, San Francisco, 1st edition.
– Jeffrey Dean and Sanjay Ghemawat (2004). MapReduce: Simplified Data Processing on Large Clusters.
– Tom Wheeler. Hadoop In 45 Minutes or Less: Large-Scale Data Processing for Everyone.
– Cloudera videos and training: http://www.cloudera.com/resources/?type=Training
