Hadoop - Old.in2p3.fr

Transcription

Hadoopa quick walk-throughHadoop tutorial — école informatique IN2P3 2013Created by Foudil BRÉTEL / cc.in2p3.fr1 of 23

ResourcesTutorialsYahoo! tutorial — 2007 outdatedHadoop: The Definitive Guide, 3rd Edition , O'ReillyOther resourcesHadoop Operations , O'ReillyApache Hadoop documention (getting started, clustersetup, .)Cloudera demo VM , Cloudera managerWeb Data Management , Serge Abiteboul, Ioana Manolescu,Philippe Rigaux, Marie-Christine Rousset, Pierre SenellartHadoop tutorial — école informatique IN2P3 20132 of 23

Versions2005, Doug Cutting (Lucene) travaille sur NutchHadoop tutorial — école informatique IN2P3 20133 of 23

Hadoop tutorial — école informatique IN2P3 2013Versions4 of 23

Characteristics“ Hadoop is a large-scale, distributed,batch processing infrastructure ” parallelism, horizontal-scale1. simplified programming model2. distribution of work and data across machines (“datalocality”)“ Grid scheduling of computers can be done with existing systemssuch as Condor. But Condor does not automatically distributedata: a separate SAN must be managed in addition to thecompute cluster. Furthermore, collaboration between multiplecompute nodes must be managed with a communication systemsuch as MPI. This programming model is challenging to workwith and can lead to the introduction of subtle errors. ”Hadoop tutorial — école informatique IN2P3 20135 of 23

Vocabularyvertical scaling“scale up”add more power (CPU, RAM) to anexisting machinehorizontal scaling“scale out”add more machines *Hadoop tutorial — école informatique IN2P3 20136 of 23

MapReduce DesignJobs TasksTasks are run in isolationlimited implicit communicationreliability (e.g. node failures), speculative executionHadoop tutorial — école informatique IN2P3 20137 of 23

Vocabularyflat scalabilityHadoop tutorial — école informatique IN2P3 2013write program once, scale out *8 of 23

HDFS Designdata is distributed to all the nodeslarge data files are split into chunkschunks are replicated * across several machines (failureresistent)single namespace (unix-like)* backups for freeData is conceptually record-orientedHadoop tutorial — école informatique IN2P3 20139 of 23

HDFS ConsiderationsProsvery large amount of information, reliability, fast access,availability (many clients), Conslong sequential streaming reads (no random access)write once, read several times Hadoop tutorial — école informatique IN2P3 201310 of 23

HDFSchunks blocks of a fixed size (64MB *)chunks are stored on DataNodesThe catalog of chunks is a single service called NameNodeDataNodes store data on top of the OS file system (not rawdevices) *** 64MB 4KB (ext4, NTFS) list of blocks per file smaller** FS tuning welcome (ext4 options mount options, seeHadoop Operations, O'Reilly )Hadoop tutorial — école informatique IN2P3 201311 of 23

Hadoop tutorial — école informatique IN2P3 2013Commandeshadoophdfsmapred12 of 23

MapReduce Exercices0057332130 #99999#19500101 #0300#4 51317 # 028783 #FM-12 0171#99999V020320#1#N0072100450#1#CN010000 #1#N9-0128#1#-0139#1#10268#1#USAF weather station identifierWBAN weather station identifierobservation dateobservation timelatitude (degrees x 1000)longitude (degrees x 1000)elevation (meters)wind direction (degrees)quality codesky ceiling height (meters)quality codevisibility distance (meters)quality codeair temperature (degrees Celsius x 10)quality codedew point temperature (degrees Celsius x 10)quality codeatmospheric pressure (hectopascals x 10)quality codeHadoop tutorial — école informatique IN2P3 201313 of 23

Use casesLast.fm50 nodes, 300 cores, 100 TB diskUsageslogfile analysis, evaluation of A/B tests, ad hoc processing,charts generationAdoption motivationsdistributed filesystem redundant backups (web logs, userlistening data, ) at no extra cost.cheap commodity hardware scalabilityflexible, easy distributed computingopen source advantages (no cost, customizable)Hadoop tutorial — école informatique IN2P3 201314 of 23

Facebookworldwide second-largest Hadoop cluster2 PB disk ( 10TB/day), 2,400 cores, 9 TB RAMUsagesdaily and hourly reports/analyses about growth of the users, pageviews, average time spent on the site, advertisement performance, backend processing for site features (ex: suggestions)de facto long-term archival store for logsHadoop tutorial — école informatique IN2P3 201315 of 23

Facebook (continued)Adoption motivationsBeforeinitial data warehousing didn't scale (entirely on an Oracle instance)favourable preconception (Yahoo! using it, Google MapReduce)After prototypeability to use your favorite programming language (Hadoopstreaming)datasets published in one centralized data storecustomizable HiveHadoop tutorial — école informatique IN2P3 201316 of 23

Rackspace18 nodes, 22 TBUsagescale at aggregating and indexing (Lucene) Postfix andExchange logsAdoption motivationslog processing previously based on MySQL, but RDBMS sharding lose advantages of SQLHadoop tutorial — école informatique IN2P3 201317 of 23

HBaseGoogle's BigTable clone: distributed, versioned,column-oriented on top of HDFSProvides fast record lookups/updates for large tables(hundreds of millions rows)Built with very large scale and distribution in mind.Production users include Facebook (messaging system)Adobe, StumbleUpon, Twitter (people search), and groups atYahoo!“A Bigtable is a sparse, distributed, persistentmultidimensional sorted map. ”Hadoop tutorial — école informatique IN2P3 201318 of 23

lly, data saved by column familiesconceptualy, data organized in tablesrow/key entry point, byte array, unique,lexicographically sortedcolumn family rigid* columns stored in same HFile, sharesame options (e.g. compression)column (members) versioned (timestamp) memberscell {row, column family:column (, version)}.empty cells are not storedregions automatic shardingHadoop tutorial — école informatique IN2P3 201319 of 23

* must be define at table creation, cannot be addedHadoop tutorial — école informatique IN2P3 201320 of 23

multi-dimensional{"1" : {"A" :"B" :},"aaaaa""A" :"B" :},"xyz" :"A" :"B" :}"x","z": {"y","w"{"hello","there"}Hadoop tutorial — école informatique IN2P3 201321 of 23

multi-dimensional (continued){// ."aaaaa" : {"A" : {"foo" : "y","bar" : "d"},"B" : {"" : "w"}},"aaaab" : {"A" : {"foo" : "world","bar" : "domination"},"B" : {Row aaaaa has 3 rows: A:foo, A:bar and B:To know all columns in all rows full table scanHadoop tutorial — école informatique IN2P3 201322 of 23

THE ENDBY Foudil BRÉTEL / cc.in2p3.frHadoop tutorial — école informatique IN2P3 201323 of 23

Hadoop Operations, O'Reilly) Hadoop tutorial — école informatique IN2P3 2013 11 of 23. Commandes hadoop hdfs mapred Hadoop tutorial — école informatique IN2P3 2013 12 of 23. MapReduce Exercices 0057 332130 # USAF weather station identifier 99999 # WBAN weather station identifier