About This Tutorial


Hadoop

Hadoop is an open-source framework that allows big data to be stored and processed in a distributed environment across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

This brief tutorial provides a quick introduction to Big Data, the MapReduce algorithm, and the Hadoop Distributed File System.

Audience

This tutorial has been prepared for professionals aspiring to learn the basics of Big Data Analytics using the Hadoop Framework and become a Hadoop Developer. Software Professionals, Analytics Professionals, and ETL developers are the key beneficiaries of this course.

Prerequisites

Before you proceed with this tutorial, we assume that you have prior exposure to Core Java, database concepts, and any of the Linux operating system flavors.

Copyright & Disclaimer

Copyright 2014 by Tutorials Point (I) Pvt. Ltd.

All the content and graphics published in this e-book are the property of Tutorials Point (I) Pvt. Ltd. The user of this e-book is prohibited from reusing, retaining, copying, distributing or republishing any contents or a part of contents of this e-book in any manner without written consent of the publisher.

We strive to update the contents of our website and tutorials as timely and as precisely as possible; however, the contents may contain inaccuracies or errors. Tutorials Point (I) Pvt. Ltd. provides no guarantee regarding the accuracy, timeliness or completeness of our website or its contents including this tutorial. If you discover any errors on our website or in this tutorial, please notify us at contact@tutorialspoint.com

Table of Contents

About this tutorial
Audience
Prerequisites
Copyright & Disclaimer
Table of Contents

1. HADOOP BIG DATA OVERVIEW
   What is Big Data?
   What Comes Under Big Data?
   Benefits of Big Data
   Big Data Technologies
   Operational vs. Analytical
   Big Data Challenges

2. HADOOP BIG DATA SOLUTIONS
   Traditional Enterprise Approach
   Google’s Solution
   Hadoop

3. HADOOP INTRODUCTION
   Hadoop Architecture
   MapReduce
   Hadoop Distributed File System
   How Does Hadoop Work?
   Advantages of Hadoop

4. HADOOP ENVIRONMENT SETUP
   Pre-installation Setup
   Installing Java
   Downloading
   Hadoop Operation Modes
   Installing Hadoop in Standalone Mode
   Installing Hadoop in Pseudo Distributed Mode
   Verifying Hadoop Installation

5. HADOOP HDFS
   Features of HDFS
   HDFS Architecture
   Goals of HDFS

6. HADOOP HDFS OPERATIONS
   Starting HDFS
   Listing Files in HDFS
   Inserting Data into HDFS
   Retrieving Data from HDFS
   Shutting Down the HDFS

7. HADOOP COMMAND
   HDFS Command Reference

8. HADOOP
   What is MapReduce?
   The Algorithm
   Inputs and Outputs (Java Perspective)
   Terminology
   Example Scenario
   Compilation and Execution of Process Units Program
   Important Commands
   How to Interact with MapReduce Jobs

9. HADOOP STREAMING
   Example using Python
   How Streaming
   Important Commands

10. HADOOP MULTI-NODE CLUSTER
   Installing Java
   Creating User Account
   Mapping the nodes
   Configuring Key Based Login
   Installing Hadoop
   Configuring Hadoop
   Installing Hadoop on Slave
   Configuring Hadoop on Master Server
   Starting Hadoop Services
   Adding a New DataNode in the Hadoop Cluster
   Adding a User and SSH Access
   Set Hostname of New Node
   Start the DataNode on New Node
   Removing a DataNode from the Hadoop Cluster

1. HADOOP BIG DATA OVERVIEW

“90% of the world’s data was generated in the last few years.”

Due to the advent of new technologies, devices, and communication means such as social networking sites, the amount of data produced by mankind is growing rapidly every year. The amount of data produced from the beginning of time until 2003 was 5 billion gigabytes. Piled up in the form of disks, it could fill an entire football field. The same amount was created every two days in 2011, and every ten minutes in 2013. This rate is still growing enormously. Though all this information is meaningful and can be useful when processed, it is being neglected.

What is Big Data?

Big Data is a collection of large datasets that cannot be processed using traditional computing techniques. It is not a single technique or tool; rather, it involves many areas of business and technology.

What Comes Under Big Data?

Big data involves the data produced by different devices and applications. Given below are some of the fields that come under the umbrella of Big Data.

• Black Box Data: The black box is a component of helicopters, airplanes, jets, etc. It captures voices of the flight crew, recordings of microphones and earphones, and the performance information of the aircraft.
• Social Media Data: Social media such as Facebook and Twitter hold information and the views posted by millions of people across the globe.
• Stock Exchange Data: The stock exchange data holds information about the ‘buy’ and ‘sell’ decisions that customers make on shares of different companies.
• Power Grid Data: The power grid data holds information about the power consumed by a particular node with respect to a base station.
• Transport Data: Transport data includes model, capacity, distance and availability of a vehicle.
• Search Engine Data: Search engines retrieve lots of data from different databases.

Thus Big Data includes huge volume, high velocity, and an extensible variety of data. The data in it will be of three types.

• Structured data: Relational data.
• Semi Structured data: XML data.
• Unstructured data: Word, PDF, Text, Media Logs.

Benefits of Big Data

• Using the information kept in social networks like Facebook, marketing agencies are learning about the response to their campaigns, promotions

Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.

Hadoop Architecture

At its core, Hadoop has two major layers, namely: (a) Processing/Computation layer (MapReduce), and (b) Storage layer
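To give a feel for the processing layer, the sketch below mimics the general MapReduce pattern (map each record to key/value pairs, group the values by key, reduce each group) as a word count in plain, single-process Python. It is only an illustration under stated assumptions: it does not use Hadoop or any Hadoop API, and the function names map_phase, shuffle, reduce_phase and word_count are hypothetical names chosen for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group step: collect all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence inside one process.

    In a real cluster these steps would be spread over many machines;
    here they run locally purely to show the data flow.
    """
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, not taken from the tutorial.
    print(word_count(["big data big clusters", "data nodes"]))
```

Keeping the map and reduce steps as small independent functions reflects why the model can be distributed: each input line can be mapped, and each key reduced, without looking at the others.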