Introduction To Hadoop - KTH


Introduction to Hadoop
ID2210
Jim Dowling

Large Scale Distributed Computing
- In #Nodes: BitTorrent (millions) - Peer-to-Peer
- In #Instructions/sec: Teraflops, Petaflops, Exascale - Super-Computing
- In #Bytes stored: Facebook: 300 Petabytes (April 2014)* - Hadoop
- In #Bytes processed/time: Google processed 24 petabytes of data per day in 2013 - Colossus, Spanner, BigQuery, BigTable, Borg, Omega, ...
*http://www.adweek.com/socialtimes/orcfile/434041

Where does Big Data Come From?
- On-line services: PBs per day
- Scientific instruments: PBs per minute
- Whole genome sequencing: 250 GB per person
- Internet-of-Things: will be lots!

What is Big Data?
[Figure contrasting Small Data with Big Data]

Why is Big Data “hot”?
Companies like Google and Facebook have shown how to extract value from Big Data.
Orbitz looks for higher prices from Safari users [WSJ’12].

Why is Big Data “hot”?
Big Data helped Obama win the 2012 election through data-driven decision making.*
Data said: middle-aged females like contests, dinners and …
*…/who-helped-obama-win/

Why is Big Data Important in Science?
In a wide array of academic fields, the ability to effectively process data is superseding other more classical modes of research.
“More data trumps better algorithms”*
*“The Unreasonable Effectiveness of Data” [Halevy et al., ’09]

4 Vs of Big Data
- Volume
- Velocity
- Variety
- Veracity/Variability/Value

A quick historical tour of data systems

Batch Sequential Processing
Scan, Sort (IBM 082 Punch Card Sorter)
No Fault Tolerance

1960s

First Database Management Systems
[Diagram: COBOL programs accessing a DBMS]

Hierarchical and Network Database Management Systems

You had to know what data you wanted, and how to find it.

Early DBMSs were Disk-Aware

Codd's Relational Model
“Just tell me the data you want, and the system will find it.”

System R
CREATE TABLE Students (
  id INT PRIMARY KEY,
  firstname VARCHAR(96),
  lastname VARCHAR(96)
);
SELECT * FROM Students WHERE id = 10;
[Stack: Structured Query Language, Views, Relations, Indexes, Disk Access Methods, Disk]

Finding the Data using a Query Optimizer
[Plan diagram: each color represents a program/plan.]
Each program produces the same result for the query, but each has different performance characteristics depending on changes in the data characteristics.

What if I have lots of Concurrent Queries?
Data Integrity using Transactions*
ACID: Atomicity, Consistency, Isolation, Durability
*Jim Gray, “The Transaction Concept: Virtues and Limitations”
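A minimal sketch of what ACID atomicity looks like from application code, using plain JDBC from Scala; the connection URL, credentials, and the accounts table are hypothetical:

    import java.sql.DriverManager

    object TransferExample extends App {
      // Hypothetical database and schema: accounts(id INT, balance INT).
      val conn = DriverManager.getConnection(
        "jdbc:postgresql://localhost/bank", "app", "secret")
      conn.setAutoCommit(false) // start an explicit transaction
      try {
        val st = conn.createStatement()
        st.executeUpdate("UPDATE accounts SET balance = balance - 100 WHERE id = 1")
        st.executeUpdate("UPDATE accounts SET balance = balance + 100 WHERE id = 2")
        conn.commit()           // Atomicity + Durability: both updates or neither
      } catch {
        case e: Exception =>
          conn.rollback()       // on failure, neither update becomes visible
          throw e
      } finally conn.close()
    }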

In the 1990s, Data Read Rates Increased Dramatically

Distribute within a Data Center
Master-Slave Replication
Data-location awareness is back: clients read from slaves, write to the master.
Possibility of reading stale data.
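A toy Scala sketch of the read/write split (the in-memory "replicas" are hypothetical; this only illustrates where stale reads come from):

    object MasterSlaveSketch extends App {
      var master = Map.empty[String, String] // authoritative copy
      var slave  = Map.empty[String, String] // asynchronously replicated copy

      def write(k: String, v: String): Unit = master += (k -> v) // writes go to the master
      def read(k: String): Option[String]   = slave.get(k)       // reads go to a slave
      def replicate(): Unit = slave = master // in reality runs asynchronously

      write("x", "1")
      println(read("x")) // None: stale read, replication hasn't caught up yet
      replicate()
      println(read("x")) // Some(1)
    }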

In the 2000s, Data Write Rates Increased Dramatically

Unstructured Data explodes
Source: IDC whitepaper, “As the Economy Contracts, the Digital Universe Expands”, 2009

Key-Value stores don’t do Big Data yet.
Existing Big Data systems currently only work for a single Data Centre.*
*The usual Google Exception applies

Storage and Processing of Big Data

What is Apache Hadoop?
- Huge data sets and large files: gigabyte files, petabyte data sets
- Scales to thousands of nodes on commodity hardware
- No schema required: data can be just copied in, extract required columns later
- Fault tolerant: network topology-aware, data location-aware
- Optimized for analytics: high-throughput file access

Hadoop (version 1)
[Stack: Application, MapReduce, Hadoop Filesystem]

HDFS: Hadoop Filesystem
[Diagram: a client writes “/crawler/bot/jd.io/1”; the NameNode tracks block locations and re-replicates under-replicated blocks across the DataNodes.]
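As a rough client-side sketch of this write path, here is how a file might be copied into HDFS via Hadoop's FileSystem API, called from Scala; the NameNode address and both paths are assumptions:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object HdfsPut extends App {
      val conf = new Configuration()
      conf.set("fs.defaultFS", "hdfs://namenode:8020") // hypothetical NameNode address
      val fs = FileSystem.get(conf)

      // The client asks the NameNode where to place blocks; the bytes
      // themselves stream directly to (and between) the DataNodes.
      fs.copyFromLocalFile(new Path("/tmp/page-1.html"),
                           new Path("/crawler/bot/jd.io/1"))
      fs.close()
    }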

HDFS v2 Architecture
- Active-Standby Replication of the NameNode (NN) Log
- Agreement on the Active NameNode
- Faster Recovery: cut the NN Log
[Diagram: HDFS Client, NameNode, Standby NameNode, SnapshotNode, Journal Nodes, Zookeeper Nodes, DataNodes]

HopsFS Architecture
[Diagram: HopsFS and HDFS Clients, Load Balancer, NameNodes (one elected Leader), NDB metadata store, DataNodes]

Processing Big Data

Big Data Processing with No Data Locality
[Diagram: a flow manager ships data blocks over the network to the compute grid nodes that run the job.]
This doesn’t scale: bandwidth is the bottleneck.

MapReduce – Data Locality
[Diagram: a client calls submitJob for Job(“/crawler/bot/jd.io/1”); the JobTracker schedules tasks on TaskTrackers co-located with the DataNodes that store the file’s blocks.]

MapReduce*
1. Programming Paradigm
2. Processing Pipeline (moving computation to data)
*Dean et al., OSDI’04

MapReduce Programming Paradigm
map(record) -> {(key1, value1), ..., (keyn, valuen)}
reduce(key, {value1, ..., valuem}) -> output
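To make the paradigm concrete, here is a minimal single-machine simulation in plain Scala (a hypothetical helper, not Hadoop's actual API): the "framework" applies map to every record, groups the emitted pairs by key, and applies reduce to each group:

    object LocalMapReduce {
      def run[R, K, V, O](records: Seq[R])
                         (map: R => Seq[(K, V)])
                         (reduce: (K, Seq[V]) => O): Seq[O] =
        records
          .flatMap(map)   // map phase: emit (key, value) pairs
          .groupBy(_._1)  // shuffle phase: group pairs by key
          .map { case (k, kvs) => reduce(k, kvs.map(_._2)) } // reduce phase
          .toSeq
    }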

MapReduce Programming Paradigm
Also found in:
- Functional programming languages
- MongoDB
- Cassandra

Example: Building a Web Search Index
map(url, doc) -> {(term1, url), ..., (termm, url)}
reduce(term, {url1, ..., urlk}) -> (term, (posting list of urls, count))

Example: Building a Web Search Index
map((“jd.io”, “A hipster website with news”)) ->
{emit(“a”, “jd.io”), emit(“hipster”, “jd.io”), emit(“website”, “jd.io”), emit(“with”, “jd.io”), emit(“news”, “jd.io”)}

Example: Building a Web Search Index
map((“hn.io”, “Hacker hipster news”)) ->
{emit(“hacker”, “hn.io”), emit(“hipster”, “hn.io”), emit(“news”, “hn.io”)}

Example: Building a Web Search Index
reduce(“hipster”, {“jd.io”, “hn.io”}) -> (“hipster”, ([“jd.io”, “hn.io”], 2))

Example: Building a Web Search Index
reduce(“website”, {“jd.io”}) -> (“website”, ([“jd.io”], 1))

Example: Building a Web Search Index
reduce(“news”, {“jd.io”, “hn.io”}) -> (“news”, ([“jd.io”, “hn.io”], 2))
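Putting the example together, a plain-Scala sketch that runs these map and reduce functions over the two sample pages (a local simulation only, not a Hadoop job):

    object InvertedIndexExample extends App {
      val docs = Seq("jd.io" -> "A hipster website with news",
                     "hn.io" -> "Hacker hipster news")

      // map(url, doc) -> {(term, url), ...}
      val pairs = docs.flatMap { case (url, doc) =>
        doc.toLowerCase.split(" ").map(term => (term, url))
      }

      // reduce(term, {url, ...}) -> (term, (posting list of urls, count))
      val index = pairs.groupBy(_._1).map { case (term, termUrls) =>
        val urls = termUrls.map(_._2).distinct
        (term, (urls, urls.size))
      }

      index.toSeq.sortBy(_._1).foreach(println)
      // e.g. (hipster,(List(jd.io, hn.io),2)), (website,(List(jd.io),1)), ...
    }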

Map Phase
map(url, doc) -> {(term1, url), ..., (terml, url)}
[Diagram: Mappers 1-6, each running on a DataNode and reading locally stored blocks.]

Shuffle Phase
Group by term: the (term, url) pairs emitted by the mappers are shuffled over the network to the reducers.
[Diagram: pairs routed from mapper DataNodes to reducer DataNodes, partitioned by term.]

Reduce Phase
reduce(term, {url1, ..., urlk}) -> (term, (posting list of urls, count))
[Diagram: reducers on DataNodes, each responsible for an alphabetic range of terms, e.g. M-P, Q-T, U-Z.]

Hadoop 2.x
Hadoop 1.x - single processing framework (batch apps):
- MapReduce (resource mgmt, job scheduler, data processing)
- HDFS (distributed storage)
Hadoop 2.x - multiple processing frameworks (batch, interactive, streaming):
- MapReduce and others (Spark, MPI, Giraph, etc.) (data processing)
- YARN (resource mgmt, job scheduler)
- HDFS (distributed storage)

MapReduce and MPI as YARN Applications
[Murthy et al., “Apache Hadoop YARN: Yet Another Resource Negotiator”, SOCC’13]

Data Locality in Hadoop v2

Limitations of MapReduce [Zaharia’11]
MapReduce is based on an acyclic data flow from stable storage to stable storage.
- Slow: writes data to HDFS at every stage in the pipeline
Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:
- Iterative algorithms (machine learning, graphs)
- Interactive data mining tools (R, Excel, Python)
[Diagram: Input -> Map, Map, Map -> Reduce, Reduce -> Output]

Iterative Data Processing Frameworks
val input = TextFile(textInput)
val words = input.flatMap { line => line.split(" ") }
val counts = words.groupBy { word => word }.count()
val output = counts.write(wordsOutput, RecordDataSinkFormat())
val plan = new ScalaPlan(Seq(output))

Spark – Resilient Distributed Datasets
- Allow apps to keep working sets in memory for efficient reuse
- Retain the attractive properties of MapReduce: fault tolerance, data locality, scalability
Resilient distributed datasets (RDDs):
- Immutable, partitioned collections of objects
- Created through parallel transformations (map, filter, groupBy, join, ...) on data in stable storage
- Can be cached for efficient reuse
Actions on RDDs:
- count, reduce, collect, save, ...

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:
lines = spark.textFile(“hdfs://...”)         // base RDD
errors = lines.filter(_.startsWith(“ERROR”))
messages = errors.map(_.split(‘\t’)(2))
cachedMsgs = messages.cache()
cachedMsgs.filter(_.contains(“foo”)).count   // action
cachedMsgs.filter(_.contains(“bar”)).count
[Diagram: the driver ships tasks to workers; each worker caches its blocks of the data.]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data).
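For reference, a self-contained sketch of the same log-mining pattern against the Spark RDD API; the input path and the tab-separated field layout are assumptions:

    import org.apache.spark.{SparkConf, SparkContext}

    object LogMining extends App {
      val sc = new SparkContext(
        new SparkConf().setAppName("log-mining").setMaster("local[*]"))

      val lines      = sc.textFile("hdfs://namenode:8020/logs/app.log") // base RDD
      val errors     = lines.filter(_.startsWith("ERROR"))              // transformed RDD
      val messages   = errors.map(_.split('\t')(2)) // assumes message in 3rd tab field
      val cachedMsgs = messages.cache()             // keep the working set in memory

      // Actions trigger evaluation; the second count reuses cached partitions.
      println(cachedMsgs.filter(_.contains("foo")).count())
      println(cachedMsgs.filter(_.contains("bar")).count())

      sc.stop()
    }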

Apache Flink – DataFlow Operators
[Diagram of a Flink dataflow composed of operators such as Source, Map, Project, Join, Reduce, Iterate, Delta-Iterate, and Sink.]
*Alexandrov et al.: “The Stratosphere Platform for Big Data Analytics”, VLDB Journal 5/2014

Built-in vs. driver-based looping
- Loop outside the system, in the driver program: the client submits step after step, so an iterative program looks like many independent jobs.
- Dataflows with feedback edges (Flink, e.g. join/map/reduce inside the loop): the system is iteration-aware and can optimize the job. (A minimal sketch follows.)
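A minimal sketch of a feedback-edge iteration, assuming Flink's (now legacy) Scala DataSet API: the ten map steps run inside a single dataflow job rather than as ten driver-submitted jobs:

    import org.apache.flink.api.scala._

    object IterationSketch extends App {
      val env = ExecutionEnvironment.getExecutionEnvironment

      val initial: DataSet[Int] = env.fromElements(0)

      // Feedback edge: the step function's output is fed back as the
      // next iteration's input, for 10 iterations, inside one job.
      val result = initial.iterate(10) { current => current.map(_ + 1) }

      result.print() // prints 10
    }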

Hadoop on the Cloud
Cloud Computing traditionally separates storage and computation.
Amazon Web Services: EC2 (Compute), Elastic Block Storage, S3 (Object Storage)
OpenStack: Nova (Compute), Swift (Object Storage), Glance (VM Images)

Data Locality for Hadoop on the Cloud
- Cloud hardware configurations should support data locality
- Hadoop’s original topology awareness breaks
- Placing more than one VM containing block replicas for the same file on the same physical host increases correlated failures
- VMware introduced a NodeGroup-aware topology (HADOOP-8468)

Conclusions
- Hadoop is the open-source enabling technology for Big Data
- YARN is rapidly becoming the operating system for the Data Center
- Apache Spark and Flink are in-memory processing frameworks for Hadoop

References
- Dean et al., “MapReduce: Simplified Data Processing on Large Clusters”, OSDI’04.
- Shvachko, “HDFS Scalability: The Limits to Growth”, USENIX ;login:, April 2010.
- Murthy et al., “Apache Hadoop YARN: Yet Another Resource Negotiator”, SOCC’13.
- “Processing a Trillion Cells per Mouse Click”, VLDB’12.
