Transcription
APACHE HADOOP
Jerrin Joseph
CSU ID #2578741
CONTENTS
- Hadoop
- Hadoop Distributed File System (HDFS)
- Hadoop MapReduce
- Introduction
- Architecture
- Operations
- Conclusion
- References
ABSTRACT
- Hadoop is an efficient Big Data handling tool.
- It reduced data processing time from days to hours.
- The Hadoop Distributed File System (HDFS) is the data storage unit of Hadoop.
- Hadoop MapReduce is the data processing unit, which works on the distributed processing principle.
INTRODUCTION
- What is Big Data? Bulk amounts of unstructured data.
- Many applications need to handle huge amounts of data (on the order of 500 TB per day).
- If a regular machine transmits 1 TB of data through 4 channels, it takes about 43 minutes. What if it were 500 TB?
HADOOP
- "The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models" [1]
- Core components:
  - HDFS: stores large data sets across clusters of computers.
  - Hadoop MapReduce: distributed processing using simple programming models.
HADOOP: KEY FEATURES
- High scalability
- Highly tolerant to software and hardware failures
- High throughput
- Best for a small number of large files
- Performs fast, parallel execution of jobs
- Provides streaming access to data
- Can be built out of commodity hardware
HADOOP: DRAWBACKS
- Not good for low-latency data access
- Not good for a large number of small files
- Not good for files with multiple writers
- No encryption at the storage or network level
- Has a highly complex security model
- Hadoop is not a database: hence a file cannot be altered in place.
HADOOP ARCHITECTURE
HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
- Storage unit of Hadoop
- Relies on the principles of a Distributed File System
- HDFS has a Master-Slave architecture
- Main components:
  - Name Node: Master
  - Data Node: Slave
- 3 replicas of each block by default
- Default block size: 64 MB
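As a rough illustration (a toy calculation, not Hadoop's actual code), the 64 MB default block size and 3x replication translate into block counts and raw storage footprint like this:

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024   # HDFS default block size (64 MB)
REPLICATION = 3                 # default number of replicas per block

def hdfs_footprint(file_size_bytes):
    """Return (number of blocks, total raw bytes stored across the cluster)."""
    blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    raw_bytes = file_size_bytes * REPLICATION
    return blocks, raw_bytes

# A 1 GB file splits into 16 blocks and occupies 3 GB of raw cluster storage.
blocks, raw = hdfs_footprint(1024 * 1024 * 1024)
```

This is also why HDFS favors a small number of large files: every block, however small, costs the Name Node a metadata entry.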
HDFS: KEY FEATURES
- Highly fault tolerant (automatic failure recovery system)
- High throughput
- Designed to work with very large files (sizes in terabytes) that are few in number
- Provides streaming access to file system data; specifically good for write-once, read-many files (for example, log files)
- Can be built out of commodity hardware; HDFS doesn't need highly expensive storage devices.
HDFS ARCHITECTURE
NAME NODE
- Master of HDFS
- Maintains and manages data on the Data Nodes
- Runs on a highly reliable machine (can even use RAID) with expensive hardware
- Stores no data; holds just metadata!
- Secondary Name Node: periodically reads from the RAM of the Name Node and stores it to hard disks.
- Active and passive Name Nodes are available from Gen-2 Hadoop.
DATA NODES
- Slaves in HDFS
- Provide data storage
- Deployed on independent machines
- Responsible for serving read/write requests from clients
- The data processing is done on the Data Nodes.
HDFS OPERATION
HDFS OPERATION
- The client makes a write request to the Name Node.
- The Name Node responds with information about the available Data Nodes and where the data is to be written.
- The client writes the data to the addressed Data Node.
- Replicas of all blocks are automatically created by the data pipeline.
- If a write fails, the Data Node notifies the client, which gets a new location to write to.
- If the write completes successfully, an acknowledgement is given to the client.
- Writes are non-posted (acknowledged) in Hadoop.
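The write path above can be sketched as a toy simulation. All class and method names here are illustrative, not Hadoop's real API; the point is the division of labor: the Name Node hands out locations, the client writes once, and the Data Nodes replicate among themselves.

```python
import random

class DataNode:
    """Toy Data Node: stores blocks and forwards them down the pipeline."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, block_id, data, pipeline):
        self.blocks[block_id] = data          # store locally
        if pipeline:                          # forward to the next replica
            pipeline[0].write(block_id, data, pipeline[1:])

class NameNode:
    """Toy Name Node: holds only metadata (which Data Nodes hold each block)."""
    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = replication
        self.block_map = {}                   # block id -> list of Data Nodes

    def allocate(self, block_id):
        targets = random.sample(self.datanodes, self.replication)
        self.block_map[block_id] = targets
        return targets

def client_write(namenode, block_id, data):
    targets = namenode.allocate(block_id)           # 1. ask the Name Node where to write
    targets[0].write(block_id, data, targets[1:])   # 2. write once; pipeline replicates
    # 3. non-posted write: success is reported only after all replicas exist
    return all(block_id in dn.blocks for dn in targets)

nodes = [DataNode(f"dn{i}") for i in range(5)]
nn = NameNode(nodes)
ok = client_write(nn, "blk_0001", b"some bytes")    # True once all 3 replicas are stored
```

Note that the client talks to only one Data Node; replication traffic stays inside the cluster.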
HDFS: FILE WRITE
HDFS: FILE READ
HADOOP MAPREDUCE
HADOOP MAPREDUCE
- Simple programming model
- The processing unit of Hadoop
- MapReduce also has a Master-Slave architecture
- Main components:
  - Job Tracker: Master
  - Task Tracker: Slave
- Derived from Google's MapReduce
- Data is not fetched to the master node; it is processed at the slave nodes and the output is returned to the master.
HADOOP MAPREDUCE
- Implemented using Maps and Reduces
- Input is split by FileInputFormat
- Maps (inheriting the Mapper class) produce (key, value) pairs as intermediate results from the data.
- Reduces (inheriting the Reducer class) produce the required output from the intermediate results produced by the Maps.
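The Mapper/Reducer division of labor can be sketched with the classic word count, simulated here in plain Python rather than the Hadoop Java API (the slides note that map-reduce scripts can be written in any language):

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit one (key, value) pair per word as the intermediate result.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(key, values):
    # Reduce phase: aggregate all values collected for one key.
    return (key, sum(values))

def map_reduce(lines):
    # Shuffle step: group intermediate pairs by key before reducing.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())

counts = map_reduce(["Hadoop MapReduce", "hadoop HDFS"])
# counts["hadoop"] == 2
```

In real Hadoop the shuffle/group step is done by the framework between the map and reduce phases; here it is inlined for clarity.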
JOB TRACKER
- Master in MapReduce
- Receives job requests from clients
- Governs the execution of jobs
- Makes the task scheduling decisions

TASK TRACKER
- Slave in MapReduce
- Governs the execution of tasks
- Periodically reports the progress of tasks
MAPREDUCE ARCHITECTURE
MAPREDUCE OPERATIONS
APACHE HIVE
HIVE
- Built on top of Hadoop
- Supports an SQL-like query language: Hive-QL
- Data in Hive is organized into tables
- Provides structure for unstructured Big Data
- Works with data inside HDFS
- Tables:
  - Data: a file or group of files in HDFS
  - Schema: metadata stored in a relational database
  - Each table has a corresponding HDFS directory
  - Data in a table is serialized
  - Supports primitive column types and nestable collection types (Array and Map)
HIVE QUERY LANGUAGE
- SQL-like language
- DDL: to create tables with specific serialization formats
- DML: to load data from external sources and insert query results into Hive tables
- Does not support updating or deleting rows in existing tables
- Supports multi-table insert
- Supports custom map-reduce scripts written in any language
- Can be extended with custom functions:
  - User Defined Functions (UDFs)
  - User Defined Transformation Functions (UDTFs)
  - User Defined Aggregation Functions (UDAFs)
HIVE ARCHITECTURE
- External interfaces:
  - Web UI: management
  - Hive CLI: run queries, browse tables, etc.
  - APIs: JDBC, ODBC
- Metastore: system catalog which contains metadata about Hive tables
- Driver: manages the life cycle of a Hive-QL statement during compilation, optimization, and execution
- Compiler: translates a Hive-QL statement into a plan which consists of a DAG of map-reduce jobs
HIVE ARCHITECTURE
HIVE: ACHIEVEMENTS & FUTURE PLANS
- First step in providing a warehousing layer for Hadoop (a web-based map-reduce data processing system)
- Accepts only a subset of SQL; working to subsume full SQL syntax
- Has a rule-based optimizer; plans to build a cost-based optimizer
- Enhancing the JDBC and ODBC drivers for interaction with commercial BI tools
- Working on improving performance
APACHE HBASE
H-BASE
- Distributed, column-oriented database on top of Hadoop/HDFS
- Provides low-latency access to single rows from billions of records
- Column-oriented, hence suited to OLAP:
  - Best for aggregation
  - High compression rate (few distinct values per column)
- Does not have a schema or data types
- Built for wide tables: millions of columns, billions of rows
- Stores denormalized data
- Master-Slave architecture
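The column-oriented layout can be illustrated with a toy store (not HBase's API; the class and column names are made up). Keeping each column's values together means an aggregation scans only that column, and columns with few distinct values compress well:

```python
from collections import defaultdict

class ColumnStore:
    """Toy column-oriented table: one {row_key: value} map per column."""
    def __init__(self):
        self.columns = defaultdict(dict)      # column qualifier -> row data

    def put(self, row_key, column, value):
        self.columns[column][row_key] = value

    def aggregate(self, column, fn):
        # Only the requested column is touched; others are never read.
        return fn(self.columns[column].values())

store = ColumnStore()
store.put("row1", "sales:amount", 10)
store.put("row2", "sales:amount", 25)
store.put("row1", "info:region", "west")
total = store.aggregate("sales:amount", sum)   # 35
```

A row-oriented store would have to read every row (including the unused `info:region` values) to compute the same sum.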
H-BASE ARCHITECTURE
HMASTER SERVER
- Like the Name Node in HDFS
- Manages and monitors HBase cluster operations
- Assigns regions to Region Servers
- Handles load-balancing and splitting

REGION SERVER
- Like a Data Node in HDFS
- Highly scalable
- Handles read/write requests
- Communicates directly with clients
INTERNAL ARCHITECTURE
- Tables → Regions → Stores (one per Column Family) → MemStore and StoreFiles → Blocks
APACHE ZOOKEEPER
ZOOKEEPER
- What is ZooKeeper? A distributed coordination service for distributed applications; like a centralized repository.
- Challenges for distributed applications
- ZooKeeper goals
ZOOKEEPER ARCHITECTURE
ZOOKEEPER ARCHITECTURE
- Always an odd number of nodes.
- The Leader is elected by voting.
- Both the Leader and Followers can connect to clients and perform read operations.
- Write operations are performed only by the Leader.
- Observer nodes address scaling problems.
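The reason for the odd node count is majority voting: elections and write commits need a strict majority of the ensemble. A toy calculation (illustrative only) shows that growing an odd-sized ensemble by one node buys no extra failure tolerance:

```python
def quorum(ensemble_size):
    # A leader election or write commit needs a strict majority.
    return ensemble_size // 2 + 1

def tolerated_failures(ensemble_size):
    # How many nodes may fail while a quorum remains reachable.
    return ensemble_size - quorum(ensemble_size)

# 5 nodes tolerate 2 failures, but 6 nodes still tolerate only 2,
# which is why ensembles use odd sizes.
for n in (3, 4, 5, 6):
    print(n, quorum(n), tolerated_failures(n))
```

Observers sidestep this trade-off: they serve reads but do not vote, so adding them scales read capacity without enlarging the quorum.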
ZOOKEEPER DATA MODEL
ZOOKEEPER DATA MODEL
- ZNodes:
  - Similar to directories in a file system
  - Containers for data and other nodes
  - Store statistical information and user data up to 1 MB
  - Used to store and share configuration information between applications
- ZNode types: persistent nodes, ephemeral nodes, sequential nodes
- Watch: an event system for client notification
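A minimal sketch of the znode hierarchy, as a toy in-memory model rather than the ZooKeeper client API (all names here are made up), assuming the 1 MB cap on user data per node:

```python
class ZNode:
    """Toy znode: a named container for user data and child znodes."""
    MAX_DATA = 1024 * 1024          # ZooKeeper limits znode data to 1 MB

    def __init__(self, name, data=b"", ephemeral=False):
        if len(data) > self.MAX_DATA:
            raise ValueError("znode data exceeds the 1 MB limit")
        self.name = name
        self.data = data
        self.ephemeral = ephemeral  # ephemeral nodes vanish with their session
        self.children = {}          # child znodes by name, like directory entries

    def create_child(self, name, data=b"", ephemeral=False):
        child = ZNode(name, data, ephemeral)
        self.children[name] = child
        return child

# Shared configuration as a persistent node; a lock as an ephemeral child.
root = ZNode("/")
cfg = root.create_child("config", data=b"shared settings between apps")
lock = cfg.create_child("lock-0000000001", ephemeral=True)
```

In real ZooKeeper the sequential suffix (as in `lock-0000000001`) is assigned by the server, and watches fire events when a node's data or children change.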
PROJECTS & TOOLS ON HADOOP
- HBase
- Hive
- Pig
- Jaql
- ZooKeeper
- Avro
- UIMA
- Sqoop
CONCLUSION
- Hadoop is a successful solution for Big Data handling.
- Hadoop has expanded from a simple project to the level of a platform.
- The projects and tools built on Hadoop are proof of its success.
REFERENCES
[1] "Apache Hadoop", http://hadoop.apache.org/
[2] "Apache Hive", http://hive.apache.org/
[3] "Apache HBase", https://hbase.apache.org/
[4] "Apache ZooKeeper", http://zookeeper.apache.org/
[5] Jason Venner, "Pro Hadoop", Apress Books, 2009.
[6] "Hadoop Wiki", http://wiki.apache.org/hadoop/
[7] Jiong Xie, Shu Yin, Xiaojun Ruan, Zhiyang Ding, Yun Tian, James Majors, Adam Manzanares, Xiao Qin, "Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters", 19th International Heterogeneity in Computing Workshop, Atlanta, Georgia, April 2010.
[8] Dhruba Borthakur, "The Hadoop Distributed File System: Architecture and Design", The Apache Software Foundation, 2007.
[9] "Apache Hadoop", http://en.wikipedia.org/wiki/Apache_Hadoop
[10] "Hadoop Overview", http://www.revelytix.com/?q=content/hadoopoverview
[11] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler, "The Hadoop Distributed File System", Yahoo!, Sunnyvale, California, USA; published in: Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium.
[12] Vinod Kumar Vavilapalli, Arun C Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino, Owen O'Malley, Sanjay Radia, Benjamin Reed, Eric Baldeschwieler, "Apache Hadoop YARN: Yet Another Resource Negotiator", ACM Symposium on Cloud Computing 2013, Santa Clara, California.
[13] Raja Appuswamy, Christos Gkantsidis, Dushyanth Narayanan, Orion Hodson, Antony Rowstron, "Scale-up vs Scale-out for Hadoop: Time to rethink?", Microsoft Research, ACM Symposium on Cloud Computing 2013, Santa Clara, California.