Apache Hadoop - IRISA

Transcription

Apache Hadoop
Alexandru Costan

Big Data Landscape
- No one-size-fits-all solution: SQL, NoSQL, MapReduce, ...
- No standard, except Hadoop

Outline
- What is Hadoop?
- Who uses it?
- Architecture
- HDFS
- MapReduce
- Open Issues
- Examples


What is Hadoop?
- Hadoop is a top-level Apache project
- Open source implementation of MapReduce
- Developed in Java
- Platform for data storage and processing:
  - Scalable
  - Fault tolerant
  - Distributed
  - Any type of complex data

Why?

What for?
- Advocated by the industry's premier Web players (Google, Yahoo!, Microsoft, Facebook) as the engine to power the cloud
- Used for batch data processing, not real-time / user-facing workloads:
  - Web search
  - Log processing
  - Document analysis and indexing
  - Web graphs and crawling
- Highly parallel, data-intensive distributed applications

Who uses Hadoop?

Components: the Hadoop stack

HDFS
- Distributed storage system:
  - Files are divided into large blocks (128 MB)
  - Blocks are distributed across the cluster
  - Blocks are replicated to help against hardware failure
  - Data placement is exposed so that computation can be migrated to data
- Master / Slave architecture
- Notable differences from mainstream DFS work:
  - Single 'storage + compute' cluster vs. separate clusters
  - Simple, I/O-centric API
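
To give a feel for that simple, I/O-centric API, here is a minimal sketch using the public org.apache.hadoop.fs.FileSystem interface; the path is a hypothetical placeholder and the configuration is assumed to come from the usual core-site.xml / hdfs-site.xml on the classpath:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsIoSketch {
        public static void main(String[] args) throws Exception {
            // Picks up the cluster configuration from the classpath
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Write a small file; HDFS splits it into blocks and replicates them
            Path path = new Path("/user/demo/hello.txt");   // hypothetical path
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.writeUTF("hello HDFS");
            }

            // Read it back through the same API
            try (FSDataInputStream in = fs.open(path)) {
                System.out.println(in.readUTF());
            }
        }
    }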

HDFS Architecture
HDFS Master: NameNode
- Manages all file system metadata in memory:
  - List of files
  - For each file name: a set of blocks
  - For each block: a set of DataNodes
  - File attributes (creation time, replication factor)
- Controls read/write access to files
- Manages block replication
- Transaction log: registers file creation, deletion, etc.
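
A purely illustrative sketch (not the actual NameNode code) of the kind of in-memory maps the slide describes; all class and field names here are invented for illustration:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Conceptual view of the NameNode's in-memory metadata (illustrative only)
    class NameNodeMetadataSketch {
        // For each file name: its ordered list of block IDs
        Map<String, List<Long>> fileToBlocks = new HashMap<>();
        // For each file name: its attributes
        Map<String, FileAttrs> fileAttrs = new HashMap<>();
        // For each block ID: the DataNodes currently holding a replica
        Map<Long, List<String>> blockToDataNodes = new HashMap<>();

        static class FileAttrs {
            long creationTime;
            short replicationFactor;
        }
    }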

HDFS Architecture
HDFS Slaves: DataNodes
- A DataNode is a block server:
  - Stores data in the local file system (e.g. ext3)
  - Stores meta-data of a block (e.g. CRC)
  - Serves data and meta-data to clients
- Block report: periodically sends a report of all existing blocks to the NameNode
- Pipelining of data: forwards data to other specified DataNodes
- Performs replication tasks upon instruction by the NameNode
- Rack-aware

HDFS Architecture

Fault tolerance in HDFS
- NameNode uses heartbeats to detect DataNode failures:
  - Once every 3 seconds
  - Chooses new DataNodes for new replicas
  - Balances disk usage
  - Balances communication traffic to DataNodes
- Multiple copies of a block are stored:
  - Default replication: 3
  - Copy #1 on another node on the same rack
  - Copy #2 on another node on a different rack
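
The default replication factor of 3 is a configuration value and can also be changed per file; a hedged sketch using the public FileSystem API (the path is a hypothetical placeholder, and dfs.replication would normally be set in hdfs-site.xml rather than in code):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Default used for newly created files (usually set in hdfs-site.xml)
            conf.setInt("dfs.replication", 3);
            FileSystem fs = FileSystem.get(conf);
            // Request 5 replicas for one particular file (hypothetical path)
            fs.setReplication(new Path("/user/demo/important.dat"), (short) 5);
        }
    }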

Data Correctness
- Use checksums to validate data: CRC32
- File creation:
  - Client computes a checksum per 512 bytes
  - DataNode stores the checksum
- File access:
  - Client retrieves the data and checksum from the DataNode
  - If validation fails, the client tries other replicas
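
To make the per-chunk checksumming concrete, here is a minimal sketch (not HDFS's actual implementation) that computes one CRC32 value per 512-byte chunk of a buffer, which is conceptually what the client-side write path does:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.zip.CRC32;

    public class ChunkChecksums {
        static final int CHUNK = 512;  // bytes covered by each checksum, as on the slide

        // Returns one CRC32 value per 512-byte chunk of the data
        static List<Long> checksums(byte[] data) {
            List<Long> crcs = new ArrayList<>();
            for (int off = 0; off < data.length; off += CHUNK) {
                int len = Math.min(CHUNK, data.length - off);
                CRC32 crc = new CRC32();
                crc.update(data, off, len);
                crcs.add(crc.getValue());
            }
            return crcs;
        }
    }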

NameNode failures
- A single point of failure
- Transaction Log stored in multiple directories:
  - A directory on the local file system
  - A directory on a remote file system (NFS/CIFS)
- The Secondary NameNode holds a backup of the NameNode data
  - On the same machine :(
- Need to develop a really highly available solution!

Data pipelining
- Client retrieves a list of DataNodes on which to place replicas of a block
- Client writes the block to the first DataNode
- The first DataNode forwards the data to the second
- The second DataNode forwards the data to the third DataNode in the pipeline
- When all replicas are written, the client moves on to write the next block of the file

User interface
Commands for HDFS users:
– hadoop dfs -put /localsrc /dest
– hadoop dfs -mkdir /foodir
– hadoop dfs -cat /foodir/myfile.txt
– hadoop dfs -rm /foodir/myfile.txt
– hadoop dfs -cp /src /dest
Commands for HDFS Administrator:
– hadoop dfsadmin -report
– hadoop dfsadmin -decommission datanodename
Web Interface:
– http://host:port/dfshealth.jsp

Hadoop MapReduce
- Parallel processing for large datasets
- Relies on HDFS
- Master-Slave architecture:
  - JobTracker
  - TaskTrackers

Hadoop MapReduce Architecture
MapReduce Master: JobTracker
- Accepts MapReduce jobs submitted by users
- Assigns Map and Reduce tasks to TaskTrackers
- Monitors task and TaskTracker status
- Re-executes tasks upon failure

Hadoop MapReduce Architecture
MapReduce Slaves: TaskTrackers
- Run Map and Reduce tasks upon instruction from the JobTracker
- Manage storage and transmission of intermediate output
- Generic, reusable framework supporting pluggable user code (file system, I/O format)

Putting everything together: HDFS and MapReduce deployment

Hadoop MapReduce Client
- Define Mapper and Reducer classes and a "launching" program (see the driver sketch below)
- Language support: Java, C++, Python
- Special case: map-only jobs
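
A minimal sketch of such a "launching" program using the classic org.apache.hadoop.mapreduce Job API; the class names WordCountMapper and WordCountReducer refer to the word-count sketch later in this deck, and the input/output paths are placeholders taken from the command line:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "word count");            // Hadoop 1.x-style constructor
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);        // defined in the later sketch
            job.setReducerClass(WordCountReducer.class);      // defined in the later sketch
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // A map-only job would simply call job.setNumReduceTasks(0)
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }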

Zoom on the Map phase

Zoom on the Reduce phase

Data locality
- Data locality is exposed in map task scheduling: put tasks where the data is
- Data are replicated:
  - Fault tolerance
  - Performance: divide the work among nodes
- JobTracker schedules map tasks considering:
  - Node-aware
  - Rack-aware
  - Non-local map tasks

Fault tolerance
- TaskTrackers send heartbeats to the JobTracker
  - Once every 3 seconds
  - A node is labeled as failed if no heartbeat is received for a defined expiry time (default: 10 minutes)
- Re-execute all the ongoing and completed tasks of the failed node
- Need to develop a more efficient policy to prevent re-executing completed tasks (e.g. storing their output in HDFS)!

Speculation in Hadoop
- Slow nodes (stragglers) → run backup tasks
[Figure: the same task executed on Node 1 and, as a backup, on Node 2]

Life cycle of Map/Reduce tasks

Open Issues - 1
- All the metadata are handled through one single master in HDFS (the NameNode)
- Performs badly when:
  - Handling many files
  - Under heavy concurrency

Open Issues - 2
- Data locality is crucial for Hadoop's performance
- How can we expose Hadoop's data locality in the Cloud efficiently?
- Hadoop in the Cloud:
  - Unaware of network topology
  - Node-aware or non-local map tasks

Open Issues - 2
Data locality in the Cloud
[Figure: block replicas spread across Node 1 to Node 6, plus several empty nodes]
- The simplicity of map task scheduling leads to non-local map execution (25%)

Open Issues - 3
Data skew
- The current Hadoop hash partitioning works well when the keys are equally frequent and uniformly stored across the data nodes
- In the presence of partitioning skew:
  - Variation in intermediate key frequencies
  - Variation in intermediate key distribution among different DataNodes
- The native, blind hash partitioning is inadequate and will lead to:
  - Network congestion
  - Unfairness in reducer inputs - reduce computation skew
  - Performance degradation
(A sketch of the default hash partitioning follows.)
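
For reference, the default partitioning the slide criticizes is essentially Hadoop's HashPartitioner; a minimal re-implementation looks like this, routing each key purely by its hash code regardless of key frequency or where its records are stored:

    import org.apache.hadoop.mapreduce.Partitioner;

    // Equivalent in spirit to Hadoop's default HashPartitioner:
    // ReduceID = hashCode(intermediate key) modulo number of reducers,
    // independent of how frequent the key is or where its records live.
    public class BlindHashPartitioner<K, V> extends Partitioner<K, V> {
        @Override
        public int getPartition(K key, V value, int numReduceTasks) {
            // Mask the sign bit so the modulo result is non-negative
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }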

Open Issues - 3
Hash code: (intermediate key) modulo ReduceID
[Figure: intermediate keys K1-K6 partitioned across Data Node 1, Data Node 2, Data Node 3]
- Total outgoing data transfer per node: 11, 15, 18
- Reduce input per node: 29, 17, 8
- Total: 44/54

Job scheduling in Hadoop
- Considerations:
  - Job priority: deadline
  - Capacity: cluster resources available, resources needed for the job
- FIFO
  - The first job to arrive at the JobTracker is processed first
- Capacity Scheduling
  - Organizes jobs into queues
  - Queue shares as percentages of the cluster
- Fair Scheduler
  - Groups jobs into "pools"
  - Divides excess capacity evenly between pools
  - Delay scheduling to expose data locality

Hadoop at work!
Cluster of machines running Hadoop at Yahoo! (credit: Yahoo!)

Running Hadoop
Multiple options:
- On your local machine (standalone or pseudo-distributed)
- Locally with a Virtual Machine
- On the cloud (e.g. Amazon EC2)
- In your own datacenter (e.g. Grid'5000)

Word Count Example in Hadoop
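
The original slide shows the canonical word-count code; a minimal sketch of the Mapper and Reducer classes (matching the WordCountDriver sketched earlier) looks roughly like this:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: emit (word, 1) for every word in the input line
    class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts for each word
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }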

