Jerrin Joseph Hadoop PPT - cis.csuohio.edu

Transcription

APACHE HADOOP
JERRIN JOSEPH
CSU ID #2578741

CONTENTS
- Hadoop
- Hadoop Distributed File System (HDFS)
- Hadoop MapReduce
- Introduction
- Architecture
- Operations
- Conclusion
- References

ABSTRACT
- Hadoop is an efficient Big Data handling tool.
- It reduced data processing time from 'days' to 'hours'.
- Hadoop Distributed File System (HDFS) is the data storage unit of Hadoop.
- Hadoop MapReduce is the data processing unit, which works on the distributed processing principle.

INTRODUCTION
- What is Big Data? Bulk amount, unstructured.
- Many applications need to handle huge amounts of data (on the order of 500 TB per day).
- If a regular machine needs to transmit 1 TB of data through 4 channels: about 43 minutes. What if it is 500 TB?
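
The slide's 43-minute figure can be sanity-checked with simple arithmetic. The per-channel rate below (~100 MB/s, roughly a commodity disk or NIC) is an assumption, not stated on the slide:

```python
# Back-of-envelope check of the slide's transfer-time figures.
# Assumptions: 1 TB = 10**12 bytes, ~100 MB/s per channel (not from the slide).

TB = 10**12           # bytes
CHANNEL_RATE = 100e6  # bytes per second per channel (assumed)
CHANNELS = 4

def transfer_minutes(total_bytes, channels=CHANNELS, rate=CHANNEL_RATE):
    """Time to move total_bytes split evenly across parallel channels."""
    return total_bytes / (channels * rate) / 60

print(f"1 TB over {CHANNELS} channels: {transfer_minutes(1 * TB):.1f} minutes")
print(f"500 TB over {CHANNELS} channels: {transfer_minutes(500 * TB) / 60 / 24:.1f} days")
```

At those assumed rates, 1 TB takes roughly 42 minutes (close to the slide's 43), while 500 TB would take about two weeks on a single machine, which is the motivation for distributing the work.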

HADOOP
- "The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models" [1]
- Core components:
  - HDFS: stores large data sets across clusters of computers.
  - Hadoop MapReduce: distributed processing using simple programming models.

HADOOP: KEY FEATURES
- High scalability
- Highly tolerant to software & hardware failures
- High throughput
- Best suited for a small number of large files
- Performs fast, parallel execution of jobs
- Provides streaming access to data
- Can be built out of commodity hardware

HADOOP: DRAWBACKS
- Not good for low-latency data access
- Not good for a large number of small files
- Not good for files with multiple writers
- No encryption at the storage or network level
- Has a highly complex security model
- Hadoop is not a database: hence a file cannot be altered in place.

HADOOP ARCHITECTURE

HADOOP DISTRIBUTED FILE SYSTEM (HDFS)

HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
- Storage unit of Hadoop
- Relies on the principles of a Distributed File System
- HDFS has a Master-Slave architecture
- Main components: Name Node (Master), Data Node (Slave)
- 3 replicas for each block
- Default block size: 64 MB
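
The block size and replication factor above translate directly into how a file is laid out. A minimal arithmetic sketch (the helper name is ours, not an HDFS API):

```python
import math

# How HDFS's defaults from the slide (64 MB blocks, 3 replicas) split a file.
BLOCK_SIZE_MB = 64
REPLICATION = 3

def hdfs_layout(file_size_mb):
    """Return (number of blocks, raw cluster storage consumed in MB)."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # The last block may be partial; HDFS only consumes the bytes it needs,
    # so raw usage is the file size times the replication factor.
    raw_storage = file_size_mb * REPLICATION
    return blocks, raw_storage

blocks, raw = hdfs_layout(1000)  # a 1000 MB file
print(f"1000 MB file -> {blocks} blocks, {raw} MB of raw cluster storage")
```

So a 1000 MB file becomes 16 blocks and occupies 3000 MB across the cluster, which is why HDFS favors a few large files over many small ones (each small file still costs a block's worth of Name Node metadata).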

HDFS: KEY FEATURES
- Highly fault tolerant (automatic failure recovery system)
- High throughput
- Designed for systems with very large files (sizes in TB) that are few in number
- Provides streaming access to file system data; specifically good for write-once, read-many files (for example, log files)
- Can be built out of commodity hardware; HDFS doesn't need highly expensive storage devices

HDFS ARCHITECTURE

NAME NODE
- Master of HDFS
- Maintains and manages data on the Data Nodes
- High-reliability machine (can even be RAID)
- Expensive hardware
- Stores NO data; just holds metadata!
- Secondary Name Node: periodically reads from the RAM of the Name Node and stores it to hard disks.
- Active & Passive Name Nodes from Gen2 Hadoop

DATA NODES
- Slaves in HDFS
- Provide data storage
- Deployed on independent machines
- Responsible for serving read/write requests from clients
- Data processing is done on the Data Nodes

HDFS OPERATION

HDFS OPERATION
- Client makes a Write request to the Name Node.
- Name Node responds with information about the available Data Nodes and where the data is to be written.
- Client writes the data to the addressed Data Node.
- Replicas of all blocks are automatically created by the Data Pipeline.
- If a write fails, the Data Node notifies the Client, which gets a new location to write.
- If the write completes successfully, an acknowledgement is given to the Client.
- Hadoop uses Non-Posted Writes.

HDFS: FILE WRITE

HDFS: FILE READ

HADOOPMAPREDUCE

HADOOP MAPREDUCESimple programming model Hadoop Processing Unit MapReduce also have Master-Slave architecture Main Components: Job Tracker : Master Task Tracker : Slave From Google’s MapReduce Do not fetch data to Master Node; Processed dataat Slave Node and returns output to Master

HADOOP MAPREDUCEImplemented using Maps and Reduces Split by FileInputFormat Maps Inheriting Mapper Class Produces (key, value) pair as intermediate resultfrom data. ReducesInheriting Reducer Class Produces required output from intermediate resultproduced by Maps.

JOB TRACKERMaster in MapReduce Receives the job request from Client Governs execution of jobs Makes the task scheduling decision TASK TRACKERSlave in MapReduce Governs execution of Tasks Periodically reports the progress of tasks

MAPREDUCE ARCHITECTURE

MAPREDUCE OPERATIONS

MAPREDUCE OPERATIONS

MAPREDUCE OPERATIONS

MAPREDUCE OPERATIONS

APACHE HIVE

HIVEBuilt on top of Hadoop Supports SQL like Query Language : Hive-QL Data in Hive is organized into tables Provides structure for unstructured Big Data Work with data inside HDFS Tables Data : File or Group of Files in HDFS Schema : In the form of metadata stored in Relational Database Have a corresponding HDFS directory Data in a table is Serialized Supports Primitive Column Types and NestableCollection Types (Array and Map)

HIVE QUERY LANGUAGESQL like language DDL : to create tables with specific serializationformats DML : to load data from external sources andinsert query results into Hive tables Do not support updating and deleting rows inexisting tables Supports Multi-Table insert Supports custom map-reduce scripts written inany language Can be extended with custom functions (UDFs) User Defined Transformation Function(UDTF) User Defined Aggregation Function (UDAF)

HIVE ARCHITECTUREExternal Interfaces: Web UI : Management Hive CLI : Run Queries, Browse Tables, etc API : JDBC, ODBC Metastore : Driver : System catalog which contains metadata about Hivetablesmanages the life cycle of a Hive-QL statement duringcompilation, optimization and executionCompiler : translates Hive-QL statement into a plan whichconsists of a DAG of map-reduce jobs

HIVE ARCHITECTURE

HIVE ACHIEVEMENTS & FUTUREPLANSFirst step to provide warehousing layer forHadoop(Web-based Map-Reduce data processingsystem) Accepts only sub-set of SQL: Working to subsumeSQL syntax Working on Rule-based optimizer : Plans to buildCost-based optimizer Enhancing JDBC and ODBC drivers for makingthe interactions with commercial BI tools. Working on making it perform better

APACHE HBASE

H-BASEDistributed Column-oriented database on top ofHadoop/HDFS Provides low-latency access to single rows frombillions of records Column oriented: OLAP Best for aggregation High compression rate: Few distinct values Do not have a Schema or data type Built for Wide tables : Millions of columnsBillions of rows Denormalized data Master-Slave architecture

H-BASE ARCHITECTURE

HMASTER SERVERLike Name Node in HDFS Manages and Monitors HBase ClusterOperations Assign Region to Region Servers Handling Load-balancing and Splitting REGION SERVERLike Data Node in HDFS Highly Scalable Handle Read/ Write Requests Direct communication with Clients

INTERNAL ARCHITECTURETablesRegions Store MemStore FileStoreBlocks Column Families

APACHEZOOKEEPER

ZOOKEEPER What is ZooKeeper?Distributed coordination service for distributedapplications Like a Centralized Repository Challenges for Distributed Applications ZooKeeper Goals

ZOOKEEPER ARCHITECTURE

ZOOKEEPER ARCHITECTUREAlways Odd number of nodes. Leader is elected by voting. Leader and Follower can get connected to Clientsand Perform Read Operations Write Operation is done only by the Leader. Observer nodes to address scaling problems

ZOOKEEPER DATA MODEL

ZOOKEEPER DATA MODEL Z Nodes:Similar to Directory in File system Container for data and other nodes Stores Statistical information and User data up to1MB Used to store and share configuration informationbetween applications Z Node TypesPersistent Nodes Ephemeral Nodes Sequential Nodes Watch : Event system for client notification

PROJECTS & TOOLS ONHADOOPHBase Hive Pig Jaql ZooKeeper AVRO UIMA Sqoop

CONCLUSIONHadoop is a successful solution for Big DataHandling Hadoop expanded from a simple project to thelevel of a platform The projects and tools on Hadoop are proof forthe successfulness of Hadoop.

REFERENCES[1] "Apache Hadoop", http://hadoop.apache.org/[2] “Apache Hive”, http://hive.apache.org/[3] “Apache HBase”, https://hbase.apache.org/[4] “Apache ZooKeeper”, http://zookeeper.apache.org/[5] Jason Venner, "Pro Hadoop", Apress Books, 2009[6] "Hadoop Wiki", http://wiki.apache.org/hadoop/[7] Jiong Xie, Shu Yin, Xiaojun Ruan, Zhiyang Ding,Yun Tian, James Majors, Adam Manzanares, XiaoQin, " Improving MapReduce Performance throughData Placement in Heterogeneous HadoopClusters", 19th International Heterogeneity inComputing Workshop, Atlanta, Georgia, April 2010

REFERENCES[8] Dhruba Borthakur, The Hadoop DistributedFile System: Architecture and Design, TheApache Software Foundation 2007.[9] "Apache Hadoop",http://en.wikipedia.org/wiki/Apache Hadoop[10] "Hadoop Overview",http://www.revelytix.com/?q content/hadoopoverview[11] Konstantin Shvachko, Hairong Kuang, SanjayRadia, Robert Chansler, The Hadoop DistributedFile System, Yahoo!, Sunnyvale, California USA,Published in: Mass Storage Systems andTechnologies (MSST), 2010 IEEE 26thSymposium.

REFERENCES[12] Vinod Kumar Vavilapalli, Arun C Murthy, ChrisDouglas, Sharad Agarwal, Mahadev Konar, RobertEvans, Thomas Graves, Jason Lowe, Hitesh Shah,Siddharth Seth, Bikas Saha, Carlo Curino, OwenO’Malley, Sanjay Radia, Benjamin Reed, EricBaldeschwieler, Apache Hadoop YARN: Yet AnotherResource Negotiator, ACM Symposium on CloudComputing 2013, Santa Clara, California.[13] Raja Appuswamy, Christos Gkantsidis, DushyanthNarayanan, Orion Hodson, and Antony Rowstron,Scale-up vs Scale-out for Hadoop: Time to rethink?,Microsoft Research, ACM Symposium on CloudComputing 2013, Santa Clara, California.

ABSTRACT Hadoop is an efficient Big data handling tool. Reduced the data processing time from ‘days’to ‘hours’.