EECS E6893 Big Data Analytics
Spark Dataframe, Spark SQL, Hadoop metrics
Guoshiwen Han, gh2567@columbia.edu
10/1/2021

Agenda
- Spark Dataframe
- Spark SQL
- Hadoop metrics

Spark Dataframe
- An abstraction: an immutable, distributed collection of data, like an RDD
- Data is organized into named columns, like a table in a database
- Create from an RDD, a Hive table, or other data sources
- Easy conversion to and from a Pandas Dataframe
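As a minimal sketch of the points above, the following PySpark snippet builds a small DataFrame from local data and from an RDD, filters it, and converts the result to pandas. The sample rows, column names, app name, and the function name `build_dataframes` are illustrative assumptions, not taken from the slides. It assumes `pyspark` and `pandas` are installed and runs Spark in local mode.

```python
# Hypothetical sample data; names and values are illustrative only.
SAMPLE_ROWS = [("alice", 34), ("bob", 45), ("carol", 29)]
COLUMNS = ["name", "age"]


def build_dataframes():
    """Create the same DataFrame two ways, filter it, and convert to pandas."""
    # Imported lazily so the module can be loaded without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[1]")          # assumption: single-threaded local mode
        .appName("dataframe-sketch")
        .getOrCreate()
    )
    try:
        # 1) Directly from local Python data, with named columns.
        df = spark.createDataFrame(SAMPLE_ROWS, COLUMNS)

        # 2) From an existing RDD.
        rdd = spark.sparkContext.parallelize(SAMPLE_ROWS)
        df_from_rdd = rdd.toDF(COLUMNS)

        # DataFrames are immutable: filter() returns a new DataFrame
        # and leaves df unchanged.
        older = df.filter(df.age > 30)

        # toPandas() collects the rows to the driver, so it is only
        # suitable for small results.
        return df_from_rdd.columns, older.toPandas()
    finally:
        spark.stop()
```

Calling `build_dataframes()` returns the column names of the RDD-based DataFrame and a pandas DataFrame holding the filtered rows.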

Spark Dataframe: read from csv file

Spark Dataframe: common operations

Spark Dataframe: conversion with Pandas

Work with Spark SQL

Hadoop metrics

Collecting HDFS metrics
- Collecting NameNode metrics via API
- Collecting DataNode metrics via API
- Collecting HDFS metrics via JMX

Collecting HDFS metrics: NameNode HTTP API
The NameNode offers a summary of health and performance metrics through an easy-to-use web UI. By default, the UI is accessible via port 50070, so point a web browser at: http://namenodehost:50070
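The same port also serves metrics as JSON through the `/jmx` servlet, which can be filtered with a `qry` parameter. The sketch below builds that URL and picks one bean out of the response. The helper names, the choice of the `FSNamesystem` bean, and the sample payload values are illustrative assumptions; `namenodehost` is the placeholder host from the slide. Hadoop 3.x moved the default NameNode HTTP port to 9870, so the port is left as a parameter.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Bean chosen for illustration; other NameNode beans can be queried the same way.
FS_BEAN = "Hadoop:service=NameNode,name=FSNamesystem"


def jmx_url(host, port=50070, query=None):
    """Build the JMX servlet URL, optionally filtered to one bean."""
    base = f"http://{host}:{port}/jmx"
    return f"{base}?{urlencode({'qry': query})}" if query else base


def find_bean(payload, bean_name):
    """Return the bean dict with the given name, or None if absent."""
    for bean in payload.get("beans", []):
        if bean.get("name") == bean_name:
            return bean
    return None


def fetch_bean(host, bean_name, port=50070, timeout=5):
    """Query a live NameNode (requires network access to the cluster)."""
    with urlopen(jmx_url(host, port, bean_name), timeout=timeout) as resp:
        return find_bean(json.load(resp), bean_name)


# Hypothetical response fragment; the numbers are made up for illustration.
SAMPLE_RESPONSE = {
    "beans": [
        {"name": FS_BEAN, "CapacityRemaining": 1024, "MissingBlocks": 0},
    ]
}

bean = find_bean(SAMPLE_RESPONSE, FS_BEAN)
print(bean["MissingBlocks"])
```

Against a real cluster, `fetch_bean("namenodehost", FS_BEAN)` would replace the sample payload.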

Collecting HDFS metrics: DataNode HTTP API
A high-level overview of the health of your DataNodes is available in the NameNode dashboard, under the Datanodes tab.

Collecting MapReduce counters
MapReduce counters provide information on MapReduce task execution, like CPU time and memory used. They are dumped to the console when invoking Hadoop jobs from the command line, which is great for spot-checking as jobs run, but more detailed analysis requires monitoring counters over time.
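One simple way to track counters over time is to capture the console dump of each job and parse it into a structure that can be stored. The sketch below assumes the usual layout of that dump (a `Counters:` line, then group headings, then `name=value` lines); the function name and the sample text, including its numbers, are illustrative assumptions rather than real job output.

```python
def parse_counters(console_text):
    """Parse a MapReduce counter dump into {group: {counter: int}}."""
    counters = {}
    group = None
    started = False
    for raw in console_text.splitlines():
        line = raw.strip()
        if not started:
            # Ignore log lines until the counter section begins.
            started = "Counters:" in line
            continue
        if not line:
            continue
        if "=" in line:
            # Split on the last '=' because counter names may contain one.
            name, _, value = line.rpartition("=")
            value = value.strip()
            if group is not None and value.lstrip("-").isdigit():
                counters.setdefault(group, {})[name.strip()] = int(value)
        else:
            # A line without '=' is a group heading.
            group = line
    return counters


# Hypothetical console excerpt; values are made up for illustration.
SAMPLE_OUTPUT = """\
INFO mapreduce.Job: Job completed successfully
INFO mapreduce.Job: Counters: 3
\tMap-Reduce Framework
\t\tMap input records=3
\t\tCPU time spent (ms)=1230
\tFile Output Format Counters
\t\tBytes Written=83
"""

parsed = parse_counters(SAMPLE_OUTPUT)
print(parsed["Map-Reduce Framework"]["CPU time spent (ms)"])
```

Storing each parsed dictionary with a timestamp gives a basic history of counters across job runs.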

Collecting Hadoop YARN metrics
By default, YARN exposes all of its metrics on port 8088, via the jmx endpoint. Querying this endpoint on your ResourceManager returns all of the YARN metrics.
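A minimal sketch of reading that endpoint: fetch the JSON from port 8088 and keep only the numeric attributes of ResourceManager beans. The function names, the bean-name prefix used for filtering, and the sample payload (including `NumActiveNMs` and `NumLostNMs` values) are illustrative assumptions; the host name is a placeholder.

```python
import json
from urllib.request import urlopen

# Assumed prefix of ResourceManager bean names in the JMX output.
RM_PREFIX = "Hadoop:service=ResourceManager"


def numeric_metrics(payload, prefix=RM_PREFIX):
    """Flatten numeric attributes of matching beans into one dict."""
    metrics = {}
    for bean in payload.get("beans", []):
        name = bean.get("name", "")
        if not name.startswith(prefix):
            continue
        for key, value in bean.items():
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                metrics[f"{name}/{key}"] = value
    return metrics


def fetch_yarn_metrics(host, port=8088, timeout=5):
    """Query a live ResourceManager (requires network access to the cluster)."""
    with urlopen(f"http://{host}:{port}/jmx", timeout=timeout) as resp:
        return numeric_metrics(json.load(resp))


# Hypothetical response fragment; the numbers are made up for illustration.
SAMPLE_PAYLOAD = {
    "beans": [
        {
            "name": "Hadoop:service=ResourceManager,name=ClusterMetrics",
            "modelerType": "ClusterMetrics",
            "NumActiveNMs": 3,
            "NumLostNMs": 0,
        },
        {"name": "java.lang:type=Memory", "Verbose": False},
    ]
}

for key, value in sorted(numeric_metrics(SAMPLE_PAYLOAD).items()):
    print(key, value)
```

Against a real cluster, `fetch_yarn_metrics("resourcemanagerhost")` would replace the sample payload.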

Third-party tools
- Apache Ambari
- Cloudera Manager

Third-party tools: Apache Ambari
- ambari-server setup
- service ambari-server start
- Point your browser to AmbariHost:8080 and log in with the default user admin and password admin.

Third-party tools: Apache Ambari
Select “Launch Install Wizard”. In the series of screens that follow, you will be prompted for the hosts to be monitored and the credentials to connect to each host in your cluster, and then to configure application-specific settings.

Third-party tools: Cloudera Manager
- service cloudera-scm-server start
- Point your browser to ClouderaHost:7180 and log in with the default user admin and password admin.

Hadoop metrics: take wordcount as an example

files    content
aa.txt   this is data and intelligence
bb.txt   hello everyone
cc.txt   welcome

Task: sort alphabetically and count the number of occurrences of each word in the three files.

Input:
this is data and intelligence
hello everyone
welcome

Output:
and 1
data 1
everyone 1
hello 1
intelligence 1
is 1
this 1
welcome 1
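To make the expected result concrete, here is a small plain-Python simulation of the map, shuffle, and reduce steps on the three files. It is only a sketch of the idea and does not use the Hadoop API; the function names are illustrative, and the file contents are the ones listed in the example.

```python
from collections import defaultdict

# File contents from the wordcount example.
FILES = {
    "aa.txt": "this is data and intelligence",
    "bb.txt": "hello everyone",
    "cc.txt": "welcome",
}


def map_phase(text):
    """Emit a (word, 1) pair for every word in the text."""
    return [(word, 1) for word in text.split()]


def shuffle(pairs):
    """Group the emitted counts by word."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Sum the counts per word, with keys in alphabetical order."""
    return [(word, sum(counts)) for word, counts in sorted(grouped.items())]


def word_count(files):
    """Run map, shuffle, and reduce over all files."""
    pairs = [pair for text in files.values() for pair in map_phase(text)]
    return reduce_phase(shuffle(pairs))


for word, count in word_count(FILES):
    print(word, count)
```

The sorted keys mirror the way MapReduce delivers keys to the reducer in sorted order, which is why the output comes out alphabetical.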

Hadoop metrics: take wordcount as an example
The parameters after wordcount are the input files, except for the last one. The last parameter, “newwordcount”, is the output folder.
