Transcription
Introduction to Hadoop, ID2210, Jim Dowling
Large Scale Distributed Computing
- In #Nodes: BitTorrent (millions) - Peer-to-Peer
- In #Instructions/sec: Teraflops, Petaflops, Exascale - Super-Computing
- In #Bytes stored: Facebook: 300 Petabytes (April 2014)* - Hadoop
- In #Bytes processed/time: Google processed 24 petabytes of data per day in 2013 - Colossus, Spanner, BigQuery, BigTable, Borg, Omega, ...
*http://www.adweek.com/socialtimes/orcfile/434041
Where does Big Data Come From?
- On-line services: PBs per day
- Scientific instruments: PBs per minute
- Whole genome sequencing: 250 GB per person
- Internet-of-Things: will be lots!
What is Big Data? [figure contrasting Small Data and Big Data]
Why is Big Data “hot”?
- Companies like Google and Facebook have shown how to extract value from Big Data
- Example: Orbitz looks for higher prices from Safari users [WSJ’12]
Why is Big Data “hot”?
- Big Data helped Obama win the 2012 election through data-driven decision making*
- Data said: middle-aged females like contests and dinners
*…who-helped-obama-win/
Why is Big Data Important in Science?
- In a wide array of academic fields, the ability to effectively process data is superseding other more classical modes of research.
- “More data trumps better algorithms”*
*“The Unreasonable Effectiveness of Data” [Halevy et al., 09]
4 Vs of Big Data
- Volume
- Velocity
- Variety
- Veracity/Variability/Value
A quick historical tour of data systems
Batch Sequential Processing (Scan, Sort)
- IBM 082 Punch Card Sorter
- No fault tolerance
1960s
First Database Management Systems: COBOL + DBMS
Hierarchical and Network Database Management Systems
You had to know what data you want, and how to find it
Early DBMSs were Disk-Aware
Codd's Relational Model: “Just tell me the data you want; the system will find it.”
System R
  CREATE TABLE Students (
    id INT PRIMARY KEY,
    firstname VARCHAR(96),
    lastname VARCHAR(96)
  );
  SELECT * FROM Students WHERE id = 10;
Layers: Structured Query Language -> Views, Relations -> Indexes -> Disk Access Methods -> Disk
Finding the Data using a Query Optimizer
- Each color represents a program (plan) in this plan diagram
- Each program produces the same result for the query
- Each program has different performance characteristics, depending on changes in the data characteristics
What if I have lots of Concurrent Queries?
- Data integrity using Transactions*
- ACID: Atomicity, Consistency, Isolation, Durability
*Jim Gray, “The Transaction Concept: Virtues and Limitations”
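The atomicity property can be demonstrated in a few lines with Python's built-in sqlite3 module. This is a minimal sketch, not how a production DBMS is used: the `accounts` table, the balances, and the invariant check are all invented for illustration. The transfer below violates the invariant, so the whole transaction rolls back and neither update survives.

```python
import sqlite3

# A toy bank: the table and amounts are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INT)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 150 "
                     "WHERE name = 'alice'")
        # Enforce an invariant mid-transaction: balances must not go negative.
        (bal,) = conn.execute("SELECT balance FROM accounts "
                              "WHERE name = 'alice'").fetchone()
        if bal < 0:
            raise ValueError("insufficient funds")
        conn.execute("UPDATE accounts SET balance = balance + 150 "
                     "WHERE name = 'bob'")
except ValueError:
    pass  # the whole transfer was rolled back atomically

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 0}
```

Because the debit and the credit sit in one transaction, a failure between them cannot leave the database in a half-updated state.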
In the 1990s, Data Read Rates Increased Dramatically
Distribute within a Data Center
- Master-Slave Replication
- Data-location awareness is back: clients read from slaves, write to the master
- Possibility of reading stale data
In the 2000s, Data Write Rates Increased Dramatically
Unstructured Data Explodes
Source: IDC whitepaper, “As the Economy Contracts, the Digital Universe Explodes”, 2009
Key-Value stores don’t do Big Data yet.
Existing Big Data systems currently only work within a single Data Centre.*
*The usual Google exception applies
Storage and Processing of Big Data
What is Apache Hadoop?
- Huge data sets and large files: gigabyte files, petabyte data sets
- Scales to thousands of nodes on commodity hardware
- No schema required: data can be just copied in, extract required columns later
- Fault tolerant
- Network topology-aware, data location-aware
- Optimized for analytics: high-throughput file access
Hadoop (version 1) stack: Application / MapReduce / Hadoop Filesystem
HDFS: Hadoop Filesystem
[diagram: a client writes “/crawler/bot/jd.io/1”; the NameNode tracks under-replicated blocks and directs replication across the DataNodes]
HDFS v2 Architecture
- Active-Standby replication of the NN log
- Agreement on the active NameNode
- Faster recovery: cut the NN log
[diagram: HDFS Client, NameNode, Standby NameNode, SnapshotNode, Journal Nodes, Zookeeper Nodes, DataNodes]
HopsFS Architecture
[diagram: HopsFS/HDFS Clients reach the NameNodes (one Leader) through a LoadBalancer; metadata is stored in NDB; DataNodes hold the blocks]
Processing Big Data
Big Data Processing with No Data Locality
[diagram: a job manager dispatches jobs to compute grid nodes, which fetch data blocks over the network]
- This doesn’t scale: bandwidth is the bottleneck
MapReduce – Data Locality
[diagram: submitJob(“/crawler/bot/jd.io/1”) goes to the JobTracker, which starts TaskTrackers on the DataNodes holding the file’s blocks]
MapReduce*
1. Programming Paradigm
2. Processing Pipeline (moving computation to the data)
*Dean et al., OSDI’04
MapReduce Programming Paradigm
map(record) -> {(key1, value1), …, (keyn, valuen)}
reduce(key, {value1, …, valuem}) -> output
MapReduce Programming Paradigm
Also found in:
- Functional programming languages
- MongoDB
- Cassandra
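The map/reduce signatures above can be sketched in plain Python without any Hadoop machinery. This is an illustrative single-machine simulation (the function names and the word-count use case are chosen for the demo, not taken from Hadoop's API):

```python
from collections import defaultdict

def map_fn(record):
    """map(record) -> sequence of (key, value) pairs: one (word, 1) per word."""
    for word in record.split():
        yield (word.lower(), 1)

def reduce_fn(key, values):
    """reduce(key, values) -> output: total count for the word."""
    return (key, sum(values))

def run_mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for record in records:                 # map phase
        for key, value in map_fn(record):
            groups[key].append(value)      # shuffle: group values by key
    return dict(reduce_fn(k, vs) for k, vs in groups.items())  # reduce phase

counts = run_mapreduce(["a hipster website", "hipster news"], map_fn, reduce_fn)
print(counts)  # {'a': 1, 'hipster': 2, 'website': 1, 'news': 1}
```

The framework (here, `run_mapreduce`) owns the grouping and scheduling; the programmer supplies only the two pure functions, which is what lets a real system parallelize them across thousands of nodes.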
Example: Building a Web Search Index
map(url, doc) -> {(term1, url), …, (termm, url)}
reduce(term, {url1, …, urln}) -> (term, (posting list of urls, count))

map((“jd.io”, “A hipster website with news”)) -> { emit(“a”, “jd.io”), emit(“hipster”, “jd.io”), emit(“website”, “jd.io”), emit(“with”, “jd.io”), emit(“news”, “jd.io”) }
map((“hn.io”, “Hacker hipster news”)) -> { emit(“hacker”, “hn.io”), emit(“hipster”, “hn.io”), emit(“news”, “hn.io”) }

reduce(“hipster”, { “jd.io”, “hn.io” }) -> (“hipster”, ([“jd.io”, “hn.io”], 2))
reduce(“website”, { “jd.io” }) -> (“website”, ([“jd.io”], 1))
reduce(“news”, { “jd.io”, “hn.io” }) -> (“news”, ([“jd.io”, “hn.io”], 2))
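The worked inverted-index example can be run end-to-end as a small Python simulation. The helper names (`map_doc`, `reduce_term`) are invented; the emit shapes and the (posting list, count) output follow the slides:

```python
from collections import defaultdict

def map_doc(url, doc):
    # Emit (term, url) for each term in the document, as in the slides.
    return [(term.lower(), url) for term in doc.split()]

def reduce_term(term, urls):
    # Produce (term, (posting list of urls, count)).
    return (term, (urls, len(urls)))

docs = [("jd.io", "A hipster website with news"),
        ("hn.io", "Hacker hipster news")]

postings = defaultdict(list)           # the shuffle: group urls by term
for url, doc in docs:
    for term, u in map_doc(url, doc):
        postings[term].append(u)

index = dict(reduce_term(t, us) for t, us in postings.items())
print(index["hipster"])  # (['jd.io', 'hn.io'], 2)
print(index["website"])  # (['jd.io'], 1)
```

Querying the index is then a single dictionary lookup per term, which is exactly why search engines precompute it as a batch job.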
Map Phase
map(url, doc) -> {(term1, url), …, (termm, url)}
[diagram: six mappers, each running on a DataNode that holds one of the input blocks]
Shuffle Phase
group by term
[diagram: intermediate (term, url) pairs are shuffled over the network to the reducer responsible for each term]
Reduce Phase
reduce(term, {url1, …, urln}) -> (term, (posting list of urls, count))
[diagram: each reducer handles a key range, e.g. M’-P’, Q’-T’, U’-Z’, and writes its output to the DataNodes]
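How does the shuffle decide which reducer gets which key? Hadoop's default is hash partitioning (partition = hash(key) mod numReduceTasks); the slides' diagram shows the equivalent idea with alphabetical key ranges. The sketch below is a simulation, not Hadoop's implementation, and uses a deliberately simple character-sum hash so the demo is reproducible (Python's built-in `hash()` on strings is salted per process):

```python
# Sketch of a shuffle partitioner: route each key to one of N reducers.
def partition(key, num_reducers):
    # Toy deterministic hash: sum of character codes.
    h = sum(ord(c) for c in key)
    return h % num_reducers

num_reducers = 3
keys = ["hipster", "website", "news", "hacker"]
routing = {k: partition(k, num_reducers) for k in keys}
print(routing)

# Every occurrence of the same key lands on the same reducer, so that
# reducer sees the complete value list for the key.
assert partition("hipster", num_reducers) == routing["hipster"]
```

The only property the partitioner must guarantee is determinism per key; beyond that, a good hash just balances load across reducers.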
Hadoop 1.x vs Hadoop 2.x
- Hadoop 1.x: single processing framework (batch apps); MapReduce does resource management, job scheduling and data processing; HDFS (distributed storage)
- Hadoop 2.x: multiple processing frameworks (batch, interactive, streaming); MapReduce and others (Spark, MPI, Giraph, etc.) do data processing; YARN (resource management, job scheduling); HDFS (distributed storage)
MapReduce and MPI as YARN Applications
[Murthy et al., “Apache Hadoop YARN: Yet Another Resource Negotiator”, SOCC’13]
Data Locality in Hadoop v2
Limitations of MapReduce [Zaharia’11]
- MapReduce is based on an acyclic data flow from stable storage to stable storage.
  - Slow: writes data to HDFS at every stage in the pipeline
- Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:
  - Iterative algorithms (machine learning, graphs)
  - Interactive data mining tools (R, Excel, Python)
[diagram: Input -> Map -> Reduce -> Output, with Map/Reduce stages chained through stable storage]
Iterative Data Processing Frameworks
  val input  = TextFile(textInput)
  val words  = input.flatMap { line => line.split(" ") }
  val counts = words.groupBy { word => word }.count()
  val output = counts.write(wordsOutput, RecordDataSinkFormat())
  val plan   = new ScalaPlan(Seq(output))
Spark – Resilient Distributed Datasets
- Allow apps to keep working sets in memory for efficient reuse
- Retain the attractive properties of MapReduce: fault tolerance, data locality, scalability
- Resilient Distributed Datasets (RDDs)
  - Immutable, partitioned collections of objects
  - Created through parallel transformations (map, filter, groupBy, join, …) on data in stable storage
  - Can be cached for efficient reuse
- Actions on RDDs: count, reduce, collect, save, …
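The RDD idea (immutable datasets, lazy transformations, explicit caching, actions that force evaluation) can be imitated on a single machine. The class below is a toy sketch with invented names, not Spark's API, and it simplifies in one visible way: its `cache()` materializes eagerly, whereas Spark's cache is lazy until the first action.

```python
class ToyRDD:
    """A toy, single-machine imitation of an RDD: transformations build a
    new (immutable) dataset lazily; actions force evaluation."""
    def __init__(self, compute):
        self._compute = compute        # zero-arg function producing a list
        self._cached = None

    @staticmethod
    def parallelize(data):
        return ToyRDD(lambda: list(data))

    def map(self, f):                  # transformation: returns a new RDD
        return ToyRDD(lambda: [f(x) for x in self._materialize()])

    def filter(self, p):               # transformation
        return ToyRDD(lambda: [x for x in self._materialize() if p(x)])

    def cache(self):                   # memoize (eagerly, unlike Spark)
        self._cached = self._materialize()
        return self

    def _materialize(self):
        return self._cached if self._cached is not None else self._compute()

    def count(self):                   # action: forces evaluation
        return len(self._materialize())

    def collect(self):                 # action
        return self._materialize()

lines = ToyRDD.parallelize(["ERROR\tdisk\tfull", "INFO\tok", "ERROR\tnet\tdown"])
errors = lines.filter(lambda l: l.startswith("ERROR"))
messages = errors.map(lambda l: l.split("\t")[2]).cache()
print(messages.count())                                # 2
print(messages.filter(lambda m: "down" in m).count())  # 1
```

After `cache()`, repeated queries over `messages` reuse the in-memory list instead of recomputing the filter/map chain, which is the whole point of keeping working sets in memory.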
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:
  val lines = spark.textFile("hdfs://...")
  val errors = lines.filter(_.startsWith("ERROR"))
  val messages = errors.map(_.split('\t')(2))
  val cachedMsgs = messages.cache()
  cachedMsgs.filter(_.contains("foo")).count
  cachedMsgs.filter(_.contains("bar")).count
[diagram: the driver ships tasks to workers, which cache blocks of the dataset in memory and return results]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
Apache Flink – DataFlow Operators
Operators include: Map, Reduce, Join, Project, Iterate, Delta Iterate, Source, Sink
*Alexandrov et al., “The Stratosphere Platform for Big Data Analytics”, VLDB Journal 5/2014
Built-in vs. Driver-based Looping
- Driver-based: the loop lives outside the system, in the driver program; an iterative program looks like many independent jobs
- Dataflows with feedback edges (Flink): the system is iteration-aware and can optimize the job
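The driver-based style on the left can be sketched in a few lines. This is an illustrative simulation: `stable_storage` stands in for HDFS, `step` for one batch job, and the rank-halving update is invented for the demo. The cost being criticized is visible in the loop body: every iteration re-reads its input from storage and writes its output back, because each step is an independent job.

```python
# Driver-based looping: the driver runs the loop; each "step" behaves like
# an independent batch job reading from and writing to stable storage.
stable_storage = {"ranks": {"a": 1.0, "b": 1.0}}   # stands in for HDFS

def step(ranks):
    # One hypothetical iteration: damp every rank by half (illustrative only).
    return {node: rank * 0.5 for node, rank in ranks.items()}

for i in range(5):                         # the loop lives in the driver
    ranks = stable_storage["ranks"]        # each step re-reads its input
    stable_storage["ranks"] = step(ranks)  # ...and writes its output back

print(stable_storage["ranks"])  # {'a': 0.03125, 'b': 0.03125}
```

An iteration-aware system instead keeps the loop state inside the dataflow (a feedback edge), avoiding the per-iteration round trip through stable storage.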
Hadoop on the Cloud
Cloud computing traditionally separates storage and computation:
- Amazon Web Services: EC2 (compute), S3 (object storage), Elastic Block Storage
- OpenStack: Nova (compute), Swift (object storage), Glance (VM images)
Data Locality for Hadoop on the Cloud
- Cloud hardware configurations should support data locality
- Hadoop’s original topology awareness breaks
- Placing >1 VM containing block replicas for the same file on the same physical host increases correlated failures
- VMware introduced a NodeGroup-aware topology (HADOOP-8468)
Conclusions
- Hadoop is the open-source enabling technology for Big Data
- YARN is rapidly becoming the operating system for the Data Center
- Apache Spark and Flink are in-memory processing frameworks for Hadoop
References
- Dean et al., “MapReduce: Simplified Data Processing on Large Clusters”, OSDI’04.
- Shvachko, “HDFS Scalability: The Limits to Growth”, USENIX ;login:, April 2010.
- Murthy et al., “Apache Hadoop YARN: Yet Another Resource Negotiator”, SOCC’13.
- “Processing a Trillion Cells per Mouse Click”, VLDB’12.