Introduction to Hadoop, Hive, and Apache Spark

Transcription

Introduction to Hadoop, Hive, and Apache Spark: Concepts and Tools, September 2018

Outline
– Overview: What and why?
  – MapReduce framework
  – HDFS framework
– How? Hadoop cluster mechanisms
– Relevant technologies: Hive, Sqoop, Pig, and NoSQL
– Apache Spark

Overview of Hadoop

Why Hadoop? Hadoop is a platform for storing and processing huge datasets distributed across clusters of commodity machines. Two core components of Hadoop:
– Processing engine (traditionally MapReduce, more recently Spark)
– HDFS (Hadoop Distributed File System)

Why Hadoop (cont.)? Hadoop addresses "big data" challenges. "Big data" creates large business value today, and various industries face "big data" challenges. Without an efficient data processing approach, the data cannot create business value.
– Many industries end up creating large amounts of data that they are unable to gain any insight from.*
*http://wikibon.org/

Big Data!! What is "big data"?
– One SKA survey will generate a data product of 4 EB.
– The DINGO uv grid dataset is 4 PB.
– General requirements for SKA Phase 1:
  – Initial bare minimum of 600 PB
  – Annual increase of at least 1 EB per year

Core Components of Hadoop

Core Components of Hadoop
– MapReduce/Spark: an efficient programming framework for processing parallelizable problems across huge datasets using a large number of commodity machines.
– HDFS: a distributed file system designed to efficiently allocate data across multiple commodity machines, and provide self-healing functions when some of them go wrong.
(Commodity machines: low cost, readily available. Supercomputers: high cost, hard to obtain.)

MapReduce Framework: a general framework.
– Map: extract something of interest from each chunk of records.
– Reduce: aggregate the intermediate outputs from the Map process.
Map and Reduce have different instantiations in different problems.

MapReduce Framework (diagram)

MapReduce Framework
– Inputs and outputs of Mappers and Reducers are key/value pairs (k, v).
– Programmers must code according to the MapReduce model:
  – Specify the Map method.
  – Specify the Reduce method.
  – Define the intermediate outputs in (k, v) format.
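As a concrete illustration of the model, here is a minimal word-count sketch in the Hadoop Streaming style, where a Python mapper and reducer read records from stdin and emit tab-separated (k, v) pairs on stdout. The script names and setup are hypothetical and not taken from the slides.

    # mapper.py - Map: emit (word, 1) for every word in the input chunk.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

    # reducer.py - Reduce: aggregate the intermediate (word, count) pairs.
    # Hadoop Streaming delivers the mapper output sorted by key, so all
    # counts for one word arrive consecutively.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))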

HDFS Framework
– The Hadoop Distributed File System (HDFS) is a highly fault-tolerant distributed file system for Hadoop.
– It is the data storage infrastructure of a Hadoop cluster and is specifically designed to work with the Hadoop processing engine (MapReduce or Spark).
– Major assumptions:
  – Large data sets.
  – Hardware failure.
  – Write once, read many.

HDFS Framework
Key features of HDFS:
– Fault tolerance: automatically and seamlessly recover from failures.
– Data replication: provide redundancy.
– Load balancing: place data intelligently for maximum efficiency and utilization.
– Scalability: add servers to increase capacity.
– "Moving computation is cheaper than moving data."
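As a hedged illustration of how a client program talks to HDFS, the sketch below uses the third-party Python hdfs package (a WebHDFS client); the NameNode address, user name, and paths are placeholders, not part of the slides.

    # Minimal sketch with the third-party "hdfs" (WebHDFS) Python package.
    # NameNode address, user name, and paths are placeholders.
    from hdfs import InsecureClient

    client = InsecureClient("http://namenode-host:9870", user="hadoop")

    # Write once ("upload"), then read many times.
    client.upload("/data/raw/events.csv", "events.csv")
    print(client.list("/data/raw"))
    with client.read("/data/raw/events.csv") as reader:
        first_kb = reader.read(1024)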

HDFS Framework
Components of HDFS:
– DataNodes: store the data with optimized redundancy.
– JournalNodes: coordinate the DataNodes with the NameNode.
– NameNode(s): manage the DataNodes. A Secondary NameNode provides failover capability.

Hadoop vs RDBMS
Many businesses are turning from RDBMS to Hadoop-based systems for data management.
– Hadoop: structured and unstructured data; very high scalability; fast for large-scale data; powerful analytical tools for big data.
– RDBMS: mostly structured data; limited scalability; very fast for small-to-medium size data; some limited built-in analytics.
In a word, if businesses need to process and analyze large-scale, real-time data, choose Hadoop. Otherwise, staying with RDBMS is still a wise choice.

Hadoop vs Other Distributed Systems
Common challenges in distributed systems:
– Component failure: individual computer nodes may overheat, crash, experience hard drive failures, or run out of memory or disk space.
– Network congestion: data may not arrive at a particular point in time.
– Communication failure: multiple implementations or versions of client software may speak slightly different protocols from one another.
– Security: data may be corrupted, or maliciously or improperly transmitted.
– Failover modes: in Hadoop these are automated.

Hadoop vs Other Distributed Systems
Hadoop:
– Uses an efficient programming model.
– Efficient, automatic distribution of data and work across machines.
– Good at handling component failure and network congestion problems.
– Weak on security issues.

Hadoop Cluster Architecture

Cloudera: a platform that integrates many Hadoop-based products and services.

Hadoop Architecture (1)
Hadoop has a master/slave architecture.
– Typically one machine in the cluster is designated as the NameNode and another machine as the JobTracker, exclusively. These are the masters.
– The rest of the machines in the cluster act as both DataNode and TaskTracker. These are the slaves.

Hadoop Architecture (2)
NameNode (master):
– Manages the file system namespace.
– Executes file system namespace operations like opening, closing, and renaming files and directories.
– Determines the mapping of data chunks to DataNodes.
– Monitors DataNodes by receiving heartbeats.
DataNodes (slaves):
– Manage storage attached to the nodes that they run on.
– Serve read and write requests from the file system's clients.
– Perform block creation, deletion, and replication upon instruction from the NameNode.

Hadoop Architecture (3)
JobTracker (master):
– Receives jobs from clients.
– Talks to the NameNode to determine the location of the data.
– Manages and schedules the entire job.
– Splits and assigns tasks to slaves (TaskTrackers).
– Monitors the slave nodes by receiving heartbeats.
TaskTrackers (slaves):
– Manage individual tasks assigned by the JobTracker, including Map operations and Reduce operations.
– Every TaskTracker is configured with a set of slots; these indicate the number of tasks that it can accept.
– Send out heartbeat messages to the JobTracker to confirm that they are still alive.
– Notify the JobTracker when a task succeeds or fails.

Hadoop Architecture Example 1 (diagram: the master nodes host the JobTracker and NameNode)

Cluster service layout


Hadoop Resource Management
– Apache YARN (Yet Another Resource Negotiator) decouples resource management from data processing requirements.
– Provides resources for any processing framework compatible with Hadoop.
– ResourceManager: a dedicated scheduler that resides on one of the service nodes.
– NodeManager: daemons on each of the worker nodes in the cluster.

YARN (Yet Another Resource Negotiator): diagram of the ResourceManager and its scheduler.

Zookeeper
Zookeeper: a cluster management tool that supports coordination between nodes in a distributed system.
– When designing a Hadoop-based application, a lot of coordination work needs to be considered, and writing these functionalities is difficult. Zookeeper provides services that can be used to develop distributed applications.
– Zookeeper provides services such as:
  – Configuration management
  – Synchronization
  – Group services
  – Leader election
  – Etc.
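As a hedged sketch of what those services look like from application code, the example below uses the kazoo Python client for Zookeeper; the host and znode paths are made up for illustration.

    # Sketch using the "kazoo" Python client for Zookeeper.
    # The host and znode paths are placeholders.
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zookeeper-host:2181")
    zk.start()

    # Configuration management: keep a shared setting in a znode.
    zk.ensure_path("/app/config")
    zk.set("/app/config", b"batch_size=128")

    # Group services: each worker registers an ephemeral, sequential node
    # that disappears automatically if the worker dies.
    zk.create("/app/workers/worker-", b"", ephemeral=True, sequence=True,
              makepath=True)

    config_value, stat = zk.get("/app/config")
    zk.stop()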

Hadoop evolution

Relevant Technologies

Technologies relevant to Hadoop (diagram: Zookeeper, Pig, and others)

Hadoop Ecosystem

Hive
Hive: a data warehousing application based on Hadoop.
– Its query language is HiveQL, which looks similar to SQL.
– Translates HiveQL into MapReduce or Spark jobs.
– Stores and manages data on HDFS.
– Can be used as an interface for HBase, MongoDB, Cassandra, etc.
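For example, a HiveQL query can be issued from PySpark with Hive support enabled and is translated into a distributed job behind the scenes; the table and column names below are invented for illustration.

    # Hedged example: HiveQL issued from PySpark with Hive support enabled.
    # The table and column names are illustrative only.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hive-demo").enableHiveSupport().getOrCreate()

    top_sources = spark.sql("""
        SELECT source_id, COUNT(*) AS n_obs
        FROM observations
        GROUP BY source_id
        ORDER BY n_obs DESC
        LIMIT 10
    """)
    top_sources.show()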

Hive – a data warehouse for HDFS
Simply put, Hive is a metadata layer on HDFS datasets.
– Different table types: internal or external tables.
– Different table formats: Sequence, text, Parquet, ORC, RCFile.
– Compression: codecs such as Snappy, gzip, zlib.
– What Hive gives us: SQL, partitioning, indexes.
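To make the table types, formats, and partitioning above concrete, here is a hedged HiveQL sketch, run through the same spark.sql entry point as above; the location, columns, and partition key are placeholders.

    # Illustrative only: an external, partitioned table stored as Parquet
    # with Snappy compression. Paths and column names are placeholders.
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS observations (
            source_id STRING,
            flux      DOUBLE
        )
        PARTITIONED BY (obs_date STRING)
        STORED AS PARQUET
        LOCATION '/data/observations'
        TBLPROPERTIES ('parquet.compression' = 'SNAPPY')
    """)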


Hive Partitioning

Sqoop: provides a simple interface for importing data straight from a relational database into Hadoop.

NoSQL
HDFS is an append-only file system:
– A file once created, written, and closed need not be changed.
– To modify any portion of a file that is already written, one must rewrite the entire file and replace the old file.
– Not efficient for random reads/writes.
– Use a relational database instead? Not scalable.
Solution: NoSQL
– Stands for Not Only SQL.
– A class of non-relational data storage systems.
– Usually do not require a pre-defined table schema in advance.
– Scale horizontally rather than vertically.

NoSQL
NoSQL data store models:
– Document store
– Wide-column store
– Key-value store
– Graph store
Example NoSQL systems: MongoDB, CouchDB, Redis, Riak, Neo4J.

HBase
– Hadoop Database; good integration with Hadoop.
– A data store on HDFS that supports random reads and writes.
– A distributed database modeled after Google BigTable.
– Best fit for very large Hadoop projects.
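As a hedged sketch of the random read/write access HBase adds on top of HDFS, the example below uses the happybase Python client; the host, table name, and column family are placeholders.

    # Sketch using the "happybase" Python client for HBase.
    # Host, table name, and column family are placeholders.
    import happybase

    conn = happybase.Connection("hbase-host")
    table = conn.table("observations")

    # Random write and read by row key - the access pattern plain HDFS is poor at.
    table.put(b"source-0001", {b"meta:ra": b"187.70", b"meta:dec": b"12.39"})
    row = table.row(b"source-0001")
    print(row[b"meta:ra"])
    conn.close()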

Pig
– A high-level platform for creating the MapReduce programs used in Hadoop.
– Translates scripts into efficient sequences of one or more MapReduce jobs, then executes those jobs.

Need for High-Level Languages
Hadoop is great for large data processing!
– But writing Mappers and Reducers in Java for everything is verbose and slow.
Solution: develop higher-level data processing languages on top of the processing engines.
– Hive: HiveQL is like SQL.
– Pig: Pig Latin is similar to Perl.
– Or use Python!
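To illustrate the point, the word count that needs a separate mapper, reducer, and job configuration in plain MapReduce collapses to a few lines in a higher-level API. The sketch below uses PySpark (introduced later); the input path is a placeholder.

    # Word count in a high-level API (PySpark); the input path is a placeholder,
    # and "sc" is an existing SparkContext.
    counts = (sc.textFile("hdfs:///data/text/*.txt")
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
    counts.take(10)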

Apache Spark

Apache Spark Background
– Many of the aforementioned big data technologies (HBase, Hive, Pig, Mahout, etc.) are not integrated with each other, which can lead to reduced performance and integration difficulties.
– Apache Spark is a state-of-the-art big data technology that integrates many of the core functions of each of these technologies under one framework.
– The biggest advantage Spark offers over MapReduce is in-memory processing.

Apache Spark Background
– Apache Spark is a fast and general engine for large-scale data processing built upon distributed file systems; the most common is the Hadoop Distributed File System (HDFS).
– Claims to be up to 100 times faster than MapReduce and supports Java, Python, and Scala APIs.
– Spark is good for distributed computing tasks, and can handle batch, interactive, and real-time data within a single framework.
– Spark can also run independently of Hadoop.

Resilient Distributed Datasets (RDDs)
– The core abstraction for working with data in Spark.
– Spark automatically distributes the data across the cluster and parallelizes the operations.
– An RDD is simply a distributed collection of objects.
– RDDs are split into partitions, which can be computed on different cluster nodes.
– RDDs can contain any type of Python, Java, or Scala object, including user-defined classes.
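A minimal PySpark sketch of these ideas, assuming a SparkContext named sc is already available (for example in the pyspark shell):

    # Assuming a SparkContext "sc" is available (e.g. in the pyspark shell).
    data = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)  # distribute a local collection
    squares = data.map(lambda x: x * x)                  # transformation, run in parallel
    evens = squares.filter(lambda x: x % 2 == 0)         # another lazy transformation
    print(evens.collect())                               # action: bring results to the driver
    print(data.getNumPartitions())                       # RDDs are split into partitions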

Aside - Spark and Machine Learning
Why it's important! Libraries available (Python on Spark):
– MLlib
– astroML
– Astropy
– Theano
– TensorFlow
– The usual suspects (NumPy, SciPy)
– More...

Spark Deployment Options
– Standalone: Spark occupies the place on top of HDFS. Spark and MapReduce run side by side for all jobs.
– Hadoop YARN: Spark runs on YARN without any pre-installation or root access required. This helps integrate Spark into the Hadoop ecosystem or Hadoop stack, and allows other components to run on top of the stack.
– Spark in MapReduce (SIMR): used to launch Spark jobs in addition to standalone deployment. With SIMR, users can start Spark and use its shell without any administrative access.

Spark on YARN
– Resource management, scheduling, and security are controlled by YARN.
– Each Spark executor runs as a YARN container.
– Spark vs MapReduce:
  – MapReduce schedules a container and starts a JVM for each task.
  – Spark hosts multiple tasks within the same container.

Spark on YARN (continued)
– Each application has an ApplicationMaster process.
– The ApplicationMaster requests resources from the ResourceManager.
– When resources are allocated, it instructs the NodeManagers to start containers on its behalf.
– Deployment modes:
  – Cluster: the driver runs in the ApplicationMaster on a cluster host chosen by YARN.
  – Client: the driver runs on the host where the job is submitted.
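A hedged sketch of pointing a PySpark application at YARN follows; the executor settings are illustrative, and in practice the deploy mode (cluster or client) is usually chosen when the job is submitted.

    # Hedged sketch: a PySpark application configured to run on YARN.
    # Executor settings are illustrative; deploy mode (cluster vs client)
    # is normally selected at submission time.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("yarn-demo")
             .master("yarn")
             .config("spark.executor.instances", "4")  # YARN containers to request
             .config("spark.executor.cores", "2")      # tasks hosted per executor container
             .config("spark.executor.memory", "4g")
             .getOrCreate())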

Spark Components
Regardless of deployment, Spark provides four standard libraries:
– Spark SQL: allows for SQL-like queries of data.
– Spark Streaming: allows real-time processing of data.
– GraphX: allows graph analytics.
– MLlib: provides machine learning tools.

Spark Components – Spark SQL
Spark SQL introduces a new data abstraction called SchemaRDD, which provides support for structured and semi-structured data. Consider the examples below.
From Hive:
    c = HiveContext(sc)
    rows = c.sql("select text, year from hivetable")
    rows.filter(lambda r: r.year > 2013).collect()
From JSON (registered as a table named "tweets"):
    c.jsonFile("tweets.json").registerTempTable("tweets")
    c.sql("select text, user.name from tweets")

Hadoop is powerful. But where do we find so many commodity machines?

Amazon Elastic MapReduce
Setting up Hadoop clusters on the cloud: Amazon Elastic MapReduce (EMR).
– Powered by Hadoop.
– Uses EC2 instances as virtual servers for the master and slave nodes.
Key features:
– No need to do server maintenance.
– Resizable clusters.
– Hadoop application support, including HBase, Pig, Hive, etc.
– Easy to use, monitor, and manage.

Spark Components – MLlib
– MLlib (Machine Learning Library) is a distributed machine learning framework on top of Spark.
– Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).
– Spark MLlib provides a variety of classic machine learning algorithms.

Spark Components – MLlib Algorithms
– Classification: logistic regression, linear SVM, naïve Bayes, classification trees.
– Regression: generalized linear models (GLMs), regression trees.
– Collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF).
– Clustering: k-means.
– Decomposition: SVD, PCA.
– Optimization: stochastic gradient descent, L-BFGS.
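A small clustering sketch with the RDD-based MLlib API; the data points are made up, and sc is assumed to be an existing SparkContext.

    # k-means with the RDD-based MLlib API; the data points are made up.
    from pyspark.mllib.clustering import KMeans

    points = sc.parallelize([
        [0.0, 0.0], [1.0, 1.0],   # one cluster near the origin
        [9.0, 8.0], [8.0, 9.0],   # another cluster far away
    ])
    model = KMeans.train(points, k=2, maxIterations=10)
    print(model.clusterCenters)
    print(model.predict([0.5, 0.5]))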

Interfaces!
How do we access Hadoop and Spark? Command-line interfaces exist for:
– Hive
– Pig
– Spark
– PySpark
– spark-submit

Better interfaces!
Web interfaces – we will be using:
– Hue for Hive and Pig;
– Jupyter for PySpark;
but others exist as well.
