EECS E6893 Big Data Analytics
Spark Dataframe, Spark SQL, Hadoop metrics
Guoshiwen Han, gh2567@columbia.edu
10/1/2021
Agenda
- Spark Dataframe
- Spark SQL
- Hadoop metrics
Spark Dataframe
- An abstraction: an immutable, distributed collection of data, like an RDD
- Data is organized into named columns, like a table in a database
- Can be created from an RDD, a Hive table, or other data sources
- Easy conversion to and from a Pandas Dataframe
Spark Dataframe: read from CSV file
Spark Dataframe: common operations
Spark Dataframe: conversion with Pandas
Work with Spark SQL
Hadoop metrics
Collecting HDFS metrics
- Collecting NameNode metrics via API
- Collecting DataNode metrics via API
- Collecting HDFS metrics via JMX
Collecting HDFS metrics: NameNode HTTP API
The NameNode offers a summary of health and performance metrics through an easy-to-use web UI. By default, the UI is accessible via port 50070 (9870 in Hadoop 3.x), so point a web browser at: http://<namenodehost>:50070
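Beyond the web UI, the same port serves a JSON `/jmx` endpoint. The sketch below shows one way to query it for the `FSNamesystem` bean and pick out a few attributes. The host name is a placeholder, and the sample payload uses made-up values so that the parsing can be tried without a cluster.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def jmx_url(host, port=50070, query="Hadoop:service=NameNode,name=FSNamesystem"):
    """Build the /jmx URL for one bean; host and port are deployment-specific."""
    return f"http://{host}:{port}/jmx?{urlencode({'qry': query})}"


def pick_metrics(payload, names):
    """Extract the named attributes from the first bean of a /jmx JSON payload."""
    beans = json.loads(payload).get("beans", [])
    bean = beans[0] if beans else {}
    return {n: bean.get(n) for n in names}


def fetch_namenode_metrics(host, names, port=50070):
    """Fetch and parse live metrics (needs a reachable NameNode)."""
    with urlopen(jmx_url(host, port), timeout=5) as resp:
        return pick_metrics(resp.read().decode("utf-8"), names)


# Illustrative payload with made-up values, not real cluster output.
sample = (
    '{"beans": [{"name": "Hadoop:service=NameNode,name=FSNamesystem", '
    '"CapacityRemaining": 1024, "MissingBlocks": 0}]}'
)
print(pick_metrics(sample, ["CapacityRemaining", "MissingBlocks"]))
```

On a real cluster, `fetch_namenode_metrics("namenodehost", ["CapacityRemaining", "MissingBlocks"])` would be called on a schedule and the values stored for trending.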
Collecting HDFS metrics: DataNode HTTP API
A high-level overview of the health of your DataNodes is available in the NameNode dashboard, under the Datanodes tab.
Collecting MapReduce counters
MapReduce counters provide information on MapReduce task execution, like CPU time and memory used. They are dumped to the console when invoking Hadoop jobs from the command line, which is great for spot-checking as jobs run, but more detailed analysis requires monitoring counters over time.
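To monitor counters over time, the console dump has to be turned into structured data first. The sketch below parses a counter dump into a nested dictionary, assuming the usual layout of a group title followed by `name=value` lines. The sample text is illustrative, not output captured from a real job.

```python
def parse_counters(text):
    """Parse a MapReduce counter dump into {group: {counter_name: int_value}}."""
    counters, group = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Counter lines look like "name=value"; anything else is a group title.
        name, sep, value = line.rpartition("=")
        if sep and value.isdigit():
            counters.setdefault(group, {})[name] = int(value)
        else:
            group = line
    return counters


# Illustrative dump with made-up values.
sample = """
    Map-Reduce Framework
        Map input records=3
        CPU time spent (ms)=1230
    File System Counters
        FILE: Number of bytes read=256
"""

parsed = parse_counters(sample)
print(parsed["Map-Reduce Framework"]["CPU time spent (ms)"])
```

Storing the parsed dictionary with a timestamp after each job run gives the time series that spot-checking on the console cannot provide.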
Collecting Hadoop YARN metrics
By default, YARN exposes all of its metrics on port 8088, via the jmx endpoint. Hitting this API endpoint on your ResourceManager gives you all of the YARN metrics.
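A minimal sketch of reading the ResourceManager's jmx endpoint and locating the `ClusterMetrics` bean in the response. The host name is a placeholder, and the sample payload is invented so the lookup can be exercised offline.

```python
import json
from urllib.request import urlopen


def find_bean(payload, name_suffix):
    """Return the first bean whose name ends with name_suffix, or None."""
    for bean in json.loads(payload).get("beans", []):
        if bean.get("name", "").endswith(name_suffix):
            return bean
    return None


def fetch_cluster_metrics(host, port=8088):
    """Fetch all beans from a live ResourceManager and pick ClusterMetrics."""
    with urlopen(f"http://{host}:{port}/jmx", timeout=5) as resp:
        return find_bean(resp.read().decode("utf-8"), "name=ClusterMetrics")


# Illustrative payload with made-up values, not real cluster output.
sample = json.dumps({
    "beans": [
        {"name": "java.lang:type=Memory"},
        {"name": "Hadoop:service=ResourceManager,name=ClusterMetrics",
         "NumActiveNMs": 3, "NumLostNMs": 0},
    ]
})

cluster = find_bean(sample, "name=ClusterMetrics")
print(cluster["NumActiveNMs"])
```

The unfiltered endpoint returns every bean the JVM exposes, so filtering by bean name keeps only the metrics of interest.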
Third-party tools
- Apache Ambari
- Cloudera Manager
Third-party tools: Apache Ambari
- Run: ambari-server setup
- Run: service ambari-server start
- Point your browser to AmbariHost:8080 and log in with the default user admin and password admin.
Third-party tools: Apache Ambari
Select “Launch Install Wizard”. In the series of screens that follow, you will be prompted for the hosts to be monitored and the credentials to connect to each host in your cluster. You will then be prompted to configure application-specific settings.
Third-party tools: Cloudera Manager
- Run: service cloudera-scm-server start
- Point your browser to ClouderaHost:7180 and log in with the default user admin and password admin.
Hadoop metrics: take wordcount as an example
Files and content:
  aa.txt: this is data and intelligence
  bb.txt: hello everyone
  cc.txt: welcome
Task: count the number of occurrences of each word in the three files, with the result sorted alphabetically.
Input:
  this is data and intelligence
  hello everyone
  welcome
Output:
  and 1
  data 1
  everyone 1
  hello 1
  intelligence 1
  is 1
  this 1
  welcome 1
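The expected output can be reproduced locally with a few lines of plain Python. This is only a single-process sketch of the counting logic, not the Hadoop job itself: splitting each line into words plays the role of the map step, and summing per word plays the role of the reduce step.

```python
from collections import Counter

# File contents as given in the example.
files = {
    "aa.txt": "this is data and intelligence",
    "bb.txt": "hello everyone",
    "cc.txt": "welcome",
}


def word_count(documents):
    """Count whitespace-separated words and return (word, count) pairs sorted by word."""
    counts = Counter()
    for text in documents:
        counts.update(text.split())  # "map": emit words; "reduce": sum per word
    return sorted(counts.items())


for word, n in word_count(files.values()):
    print(word, n)
```

Like the Hadoop example job, this count is case-sensitive, so "This" and "this" would be counted as different words.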
All parameters after wordcount except the last one are input files. The last one, “newwordcount”, is the output folder.
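The argument convention can be sketched as a small helper that assembles the command line and another that splits the arguments back into inputs and output. The jar name `hadoop-mapreduce-examples.jar` is an assumption; the actual file name and path depend on the Hadoop installation.

```python
def wordcount_command(examples_jar, inputs, output_dir):
    """Build the argument list: every input path first, the output folder last."""
    return ["hadoop", "jar", examples_jar, "wordcount", *inputs, output_dir]


def split_wordcount_args(args):
    """Split the arguments after 'wordcount' into (input paths, output folder)."""
    *inputs, output_dir = args
    return inputs, output_dir


# Hypothetical jar name; the input files and output folder come from the example.
cmd = wordcount_command(
    "hadoop-mapreduce-examples.jar",
    ["aa.txt", "bb.txt", "cc.txt"],
    "newwordcount",
)
print(" ".join(cmd))
```

The output folder must not already exist when the job starts, otherwise Hadoop refuses to run the job.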