Introduction To Cloud Computing


Introduction to Cloud Computing
Jiaheng Lu
Department of Computer Science, Renmin University of China
www.jiahenglu.net

Hadoop/Hive: Open-Source Solution for Huge Data Sets

Data Scalability Problems
Search Engine: 10 KB/doc * 20 B docs = 200 TB; reindex every 30 days: 200 TB / 30 days = 6 TB/day
Log Processing / Data Warehousing: 0.5 KB/event * 3 B pageview events/day = 1.5 TB/day; 100 M users * 5 events * 100 feeds/event * 0.1 KB/feed = 5 TB/day
Multipliers: 3 copies of data, 3-10 passes over the raw data
Processing Speed (single machine): 2-20 MB/second * 100 K seconds/day = 0.2-2 TB/day

Google’s Solution
Google File System – SOSP 2003
Map-Reduce – OSDI 2004
Sawzall – Scientific Programming Journal 2005
Big Table – OSDI 2006
Chubby – OSDI 2006

Open Source World’s Solution
Google File System – Hadoop Distributed FS
Map-Reduce – Hadoop Map-Reduce
Sawzall – Pig, Hive, JAQL
Big Table – Hadoop HBase, Cassandra
Chubby – ZooKeeper

Simplified Search Engine Architecture
[Diagram: an Internet-facing Spider, Search Log Storage, a Batch Processing System on top of Hadoop, a Runtime and the SE Web Server.]

Simplified Data Warehouse Architecture
[Diagram: a Web Server, View/Click/Events Log Storage, a Batch Processing System on top of Hadoop fed with Domain Knowledge, a Database and Business Intelligence.]

Hadoop History
Jan 2006 – Doug Cutting joins Yahoo
Feb 2006 – Hadoop splits out of Nutch and Yahoo starts using it
Dec 2006 – Yahoo creating 100-node Webmap with Hadoop
Apr 2007 – Yahoo on 1000-node cluster
Dec 2007 – Yahoo creating 1000-node Webmap with Hadoop
Jan 2008 – Hadoop made a top-level Apache project
Sep 2008 – Hive added to Hadoop as a contrib project

Hadoop Introduction
Open source Apache project: http://hadoop.apache.org/
Written in Java; does work with other languages
Runs on Linux, Windows and more
Commodity hardware with high failure rate

Current Status of Hadoop
Largest cluster: 2000 nodes (8 cores, 4 TB disk)
Used by 40 companies / universities over the world: Yahoo, Facebook, etc.
Cloud computing donation from Google and IBM
Startup focusing on providing services for Hadoop: Cloudera

Hadoop Components
Hadoop Distributed File System (HDFS)
Hadoop Map-Reduce
Contribs: Hadoop Streaming; Pig / JAQL / Hive; HBase

Hadoop Distributed File System

Goals of HDFS
Very large distributed file system: 10K nodes, 100 million files, 10 PB
Convenient cluster management: load balancing, node failures, cluster expansion
Optimized for batch processing: allow moving computation to the data; maximize throughput

HDFS Details
Data coherency: write-once-read-many access model; clients can only append to existing files
Files are broken up into blocks: typically 128 MB block size; each block replicated on multiple DataNodes
Intelligent client: the client can find the location of blocks and accesses data directly from the DataNode

HDFS User Interface
Java API
Command line:
  hadoop dfs -mkdir /foodir
  hadoop dfs -cat /foodir/myfile.txt
  hadoop dfs -rm /foodir/myfile.txt
  hadoop dfsadmin -report
  hadoop dfsadmin -decommission datanodename
Web interface: http://host:port/dfshealth.jsp
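The Java API item above is only a pointer in the slides; as a rough illustration (not from the deck, with paths mirroring the CLI examples), a minimal client sketch against the HDFS Java API could look like this:

  // Minimal sketch, assuming a configured Hadoop client; paths are illustrative only.
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
      FileSystem fs = FileSystem.get(conf);

      Path dir = new Path("/foodir");
      fs.mkdirs(dir);                             // same effect as: hadoop dfs -mkdir /foodir

      // Write-once: create a new file and write some text.
      try (FSDataOutputStream out = fs.create(new Path(dir, "myfile.txt"))) {
        out.writeUTF("hello hdfs");
      }

      // Read-many: the client locates the blocks and reads directly from the DataNodes.
      try (FSDataInputStream in = fs.open(new Path(dir, "myfile.txt"))) {
        System.out.println(in.readUTF());
      }

      fs.close();
    }
  }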

Hadoop Map-Reduce and Hadoop Streaming

Hadoop Map-Reduce Introduction
Map/Reduce works like a parallel Unix pipeline:
  cat input | grep | sort | uniq -c | cat > output
  Input -> Map -> Shuffle & Sort -> Reduce -> Output
Framework does the inter-node communication: failure recovery, consistency, etc.; load balancing, scalability, etc.
Fits a lot of batch processing applications: log processing, web index building

(Simplified) Map Reduce Review
[Diagram: each machine applies a Local Map to its input pairs, e.g. (k1, v1), (k2, v2), (k3, v3) on Machine 1; the Global Shuffle routes every intermediate key to one machine; a Local Sort groups equal keys, e.g. (nk1, nv1), (nk1, nv6), (nk3, nv3); and the Local Reduce emits aggregates such as (nk1, 2), (nk3, 1) on Machine 1 and (nk2, 3) on Machine 2.]

Physical Flow

Example Code
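The slide's code did not survive transcription; as a stand-in, here is a hedged sketch of the classic word-count job using the org.apache.hadoop.mapreduce API (class names and paths are illustrative, not the slide's):

  // Word count: map emits (word, 1), reduce sums the counts per word.
  import java.io.IOException;
  import java.util.StringTokenizer;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCount {

    // Map: emit (word, 1) for every token in the input line.
    public static class WordCountMapper
        extends Mapper<Object, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      public void map(Object key, Text value, Context context)
          throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, ONE);
        }
      }
    }

    // Reduce: sum the counts for each word after the shuffle/sort phase.
    public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      private final IntWritable result = new IntWritable();

      @Override
      public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
          sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
      }
    }

    public static void main(String[] args) throws Exception {
      Job job = Job.getInstance(new Configuration(), "word count");
      job.setJarByClass(WordCount.class);
      job.setMapperClass(WordCountMapper.class);
      job.setReducerClass(IntSumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

Packaged into a jar, it would be launched with something like: hadoop jar wordcount.jar WordCount /input /output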

Hadoop Streaming
Allows writing the Map and Reduce functions in any language (Hadoop Map/Reduce itself only accepts Java)
Example: word count
  hadoop streaming
    -input /user/zshao/articles
    -mapper 'tr " " "\n"'
    -reducer 'uniq -c'
    -output /user/zshao/
    -numReduceTasks 32

Hive - SQL on top of Hadoop

Map-Reduce and SQL
Map-Reduce is scalable; SQL has a huge user base; SQL is easy to code
Solution: combine SQL and Map-Reduce
  Hive on top of Hadoop (open source)
  Aster Data (proprietary)
  Greenplum (proprietary)

Hive
A database/data warehouse on top of Hadoop
Rich data types (structs, lists and maps)
Efficient implementations of SQL filters, joins and group-bys on top of map-reduce
Allows users to access Hive data without using Hive
Link: http://svn.apache.org/repos/asf/hadoop/hive/trunk/

Dealing with Structured Data
Type system: primitive types; recursively built up using composition, maps and lists
Generic (de)serialization interface (SerDe): to recursively list the schema; to recursively access fields within a row object
Serialization families implement the interface: Thrift-DDL-based SerDe; delimited-text-based SerDe; you can write your own SerDe
Schema evolution

MetaStore
Stores table/partition properties: table schema and SerDe library; table location on HDFS; logical partitioning keys and types; other information
Thrift API: current clients in PHP (web interface), Python (old CLI), Java (query engine and CLI), Perl (tests)
Metadata can be stored as text files or even in a SQL backend

Hive CLI
DDL: create table / drop table / rename table; alter table add column
Browsing: show tables; describe table; cat table
Loading data
Queries

Web UI for Hive
MetaStore UI: browse and navigate all tables in the system; comment on each table and each column; also captures data dependencies
HiPal: interactively construct SQL queries by mouse clicks; supports projection, filtering, group by and joining; also supports

Hive Query Language
Philosophy: SQL; Map-Reduce with custom scripts (Hadoop Streaming)
Query operators: projections, equi-joins, group by, sampling, order by

Hive QL – Custom Map/Reduce Scripts
Extended SQL:
  FROM (
    FROM pv_users
    MAP pv_users.userid, pv_users.date
    USING 'map_script' AS (dt, uid)
    CLUSTER BY dt) map
  INSERT INTO TABLE pv_users_reduced
    REDUCE map.dt, map.uid
    USING 'reduce_script' AS (date, count);
Map-Reduce: similar to Hadoop Streaming

Hive Architecture
[Diagram: clients (Web UI, Hive CLI, Mgmt, etc.) issue browsing, queries and DDL in Hive QL; the Parser, Planner and Execution components compile queries into Map Reduce jobs; the MetaStore is reached over a Thrift API, and SerDe (Thrift, Jute, JSON) handles row formats; data lives in HDFS.]

Hive QL – Join
page_view:
  pageid  userid  time
  1       111     9:08:01
  2       111     9:08:13
  1       222     9:08:14
user:
  userid  age  gender
  111     25   female
  222     32   male
pv_users:
  pageid  age
  1       25
  2       25
  1       32
SQL:
  INSERT INTO TABLE pv_users
  SELECT pv.pageid, u.age
  FROM page_view pv JOIN user u ON (pv.userid = u.userid);

Hive QL – Join in Map Reduce
[Diagram: the map phase emits userid as the key; page_view rows become values tagged with table 1, e.g. (111, <1, 1>), (111, <1, 2>), (222, <1, 1>), and user rows become values tagged with table 2, e.g. (111, <2, 25>), (222, <2, 32>). Shuffle and sort bring all values for the same userid to one reducer, which combines the tagged values from the two tables to produce the pv_users rows.]

Hive QL – Group By
pv_users:
  pageid  age
  1       25
  2       25
  1       32
  2       25
SQL:
  INSERT INTO TABLE pageid_age_sum
  SELECT pageid, age, count(1)
  FROM pv_users
  GROUP BY pageid, age;
pageid_age_sum:
  pageid  age  count
  1       25   1
  2       25   2
  1       32   1

Hive QL – Group By in Map Reduce
[Diagram: each mapper emits (pageid, age) as the key with a partial count, e.g. <1,25> -> 1 and <2,25> -> 1 from one split, <1,32> -> 1 and <2,25> -> 1 from the other; shuffle and sort send equal keys to the same reducer, which adds the partial counts to produce (1, 25, 1), (1, 32, 1) and (2, 25, 2).]

Hive QL – Group By with Distinct
page_view:
  pageid  userid  time
  1       111     9:08:01
  2       111     9:08:13
  1       222     9:08:14
  2       111     9:08:20
SQL:
  SELECT pageid, COUNT(DISTINCT userid)
  FROM page_view GROUP BY pageid
Result:
  pageid  count_distinct_userid
  1       2
  2       1

Hive QL – Group By with Distinct in Map Reduce
[Diagram: mappers emit (pageid, userid) pairs, e.g. (1, 111), (2, 111), (1, 222), (2, 111); the shuffle partitions on pageid while the sort key is (pageid, userid), so the shuffle key is a prefix of the sort key; each reducer scans its sorted run and counts distinct userids per pageid, giving (1, 2) and (2, 1).]

Hive QL: Order By
[Diagram: the page_view rows are spread over the mappers and shuffled randomly; the sort key is (pageid, userid), and each reducer emits its rows in sorted order, e.g. (1, 111, 9:08:01), (2, 111, 9:08:13) on one reducer and (1, 222, 9:08:14), (2, 111, 9:08:20) on the other.]

Hive Optimizations: Efficient Execution of SQL on top of Map-Reduce

(Simplified) Map Reduce Revisit
[Diagram: the same picture as the earlier review slide: Local Map, Global Shuffle, Local Sort and Local Reduce over the (k, v) and (nk, nv) pairs on each machine.]

Merge Sequential Map Reduce Jobs
A: (key, av) = (1, 111); B: (key, bv) = (1, 222); C: (key, cv) = (1, 333)
Naive plan: one Map Reduce joins A and B into (key, av, bv) = (1, 111, 222), and a second Map Reduce joins in C to give (key, av, bv, cv) = (1, 111, 222, 333)
Because both joins use a.key, the two jobs can be merged into one
SQL:
  FROM (a JOIN b ON a.key = b.key) JOIN c ON a.key = c.key
  SELECT

Share Common Read Operations
pv_users: (pageid, age) = (1, 25), (2, 32)
Map Reduce job 1 produces pageid counts (1, 1), (2, 1); Map Reduce job 2 produces age counts (25, 1), (32, 1)
Extended SQL reads the table once for both inserts:
  FROM pv_users
  INSERT INTO TABLE pv_pageid_sum
    SELECT pageid, count(1)
    GROUP BY pageid
  INSERT INTO TABLE pv_age_sum
    SELECT age, count(1)
    GROUP BY age;

Load Balance Problem
[Diagram: pv_users is skewed, e.g. many rows share (pageid 1, age 25), so one reducer receives most of the data; computing pageid_age_partial_sum first and then pageid_age_sum in a second Map-Reduce spreads that work.]

Map-side Aggregation / Combiner
[Diagram: each machine's map output is pre-aggregated locally, e.g. Machine 1 emits (male, 343), (female, 128) and Machine 2 emits (male, 123), (female, 244); after the global shuffle and local sort, the reducers only add a few partial sums, producing (male, 466) and (female, 372).]
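To connect the diagram to the earlier word-count sketch: enabling map-side aggregation in Hadoop is typically a one-line change in the job driver, setting a combiner class. The driver below is a hedged sketch that reuses the illustrative WordCount classes from the example-code section (CombinerDriver is a made-up name):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class CombinerDriver {
    public static void main(String[] args) throws Exception {
      Job job = Job.getInstance(new Configuration(), "word count with combiner");
      job.setJarByClass(CombinerDriver.class);
      job.setMapperClass(WordCount.WordCountMapper.class);
      // The combiner runs the reduce function over each mapper's local output,
      // so only pre-aggregated partial counts cross the network (map-side aggregation).
      job.setCombinerClass(WordCount.IntSumReducer.class);
      job.setReducerClass(WordCount.IntSumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

This works because the word-count reducer is associative and commutative, so it can double as the combiner.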

Query Rewrite
Predicate push-down: select * from (select * from t) where col1 = '2008';
Column pruning: select col1, col3 from (select * from t);
