Basic Hadoop Programming Skills - NUS Computing

Basic Hadoop Programming Skills
1/23/2016

Useful Resources
  Official MapReduce Tutorial
  Instructions on how to install a configured virtual image of Ubuntu with a Hadoop deployment

Configure Your VM Before Usage
  Follow the instructions on Slide 2 (Useful Resources)
  Set the memory
  If you are running VirtualBox on Windows 8 and it is extremely slow, please refer to box-is-really-slow-on-windows-8

Basic Commands of Ubuntu
  Open terminal

Basic Shell Commands (in the terminal)
  List directory
    hadoop@ubuntu-VirtualBox:~$ ls
  Create directory
    hadoop@ubuntu-VirtualBox:~$ mkdir project
  Browse into directory
    hadoop@ubuntu-VirtualBox:~$ cd project
  Download file
    hadoop@ubuntu-VirtualBox:~/project$ wget URL

Basic Shell Commands (in the terminal)
  Remove directory
    hadoop@ubuntu-VirtualBox:~$ rm -r project
  Remove file
    hadoop@ubuntu-VirtualBox:~$ rm filename
  Move file
    hadoop@ubuntu-VirtualBox:~$ mv old_directory/filename new_directory
  Print working directory
    hadoop@ubuntu-VirtualBox:~$ pwd
  More information

Format HDFS
  Format HDFS (all data will be deleted)
    hadoop@ubuntu-VirtualBox:~$ hadoop namenode -format
  Attention! You only need to run this command ONCE, when you start the Ubuntu VM for the very first time. Unless the HDFS is corrupted, you should not run this command again.

Start/Stop Hadoop
  Start Hadoop
    hadoop@ubuntu-VirtualBox:~$ start-all.sh
  See if Hadoop is running
    hadoop@ubuntu-VirtualBox:~$ jps
    hadoop@ubuntu-VirtualBox:~$ hadoop dfsadmin -report
  Stop Hadoop (when you are done)
    hadoop@ubuntu-VirtualBox:~$ stop-all.sh

Hadoop Web Interfaces
  Open the following in a web browser:
  HDFS status: http://localhost:50070
  Hadoop job status: http://localhost:50030

Basic Hadoop Commands
  HDFS shell commands:
  Create/remove folder:
    hadoop fs -mkdir/-rmr FOLDER_NAME
    hadoop@ubuntu-VirtualBox:~$ hadoop fs -mkdir /data
    hadoop@ubuntu-VirtualBox:~$ hadoop fs -mkdir /data/input
  List folder:
    hadoop fs -ls PATH
    hadoop@ubuntu-VirtualBox:~$ hadoop fs -ls /data

Basic Hadoop Commands
  HDFS shell commands:
  Data transfer:
    hadoop fs -cp/-mv/-put/-get src dest
    hadoop@ubuntu-VirtualBox:~$ hadoop fs -put my_local_text_file /data/input

Compile source code
  Compile Hadoop code:
    javac -classpath `hadoop classpath` -d destination_dir source_dir/Filename.java
    (the quotation mark is the backtick, on the key above Tab)
  Generate a jar file:
    jar -cvf WordCount.jar -C destination_dir .
    (there is a dot at the end)

Basic Hadoop Commands
  Launch job commands:
    hadoop jar PATH_TO_JAR_FILE classname parameters
  E.g., launch the WordCount compiled above:
    hadoop jar WordCount.jar mylab0.WordCount /data/input /data/output
  Display the job results:
    hadoop@ubuntu-VirtualBox:~$ hadoop fs -ls /data/output
    hadoop fs -cat /data/output/part-r-00000
  Note: if you run a job multiple times, you need to delete the output folder every time before you launch the job:
    hadoop@ubuntu-VirtualBox:~$ hadoop fs -rmr /data/output

Hadoop Programming Using Eclipse IDE
  Start from Terminal:
  Start from GUI:
    Click Home Folder
    Double click the “eclipse” folder
    Double click the “eclipse” executable

Create MapReduce Project
  Click “File” – “New” – “Project”

Select “Map/Reduce” – “Map/Reduce Project”, then click “Next”

Name your project here, then click “Browse” to select the Hadoop location (/home/hadoop/hadoop/hadoop-1.2.1)

Then click OK

Finally, click “Finish”

Create a package
  1. On the left panel, unfold the “cs5344” project
  2. Right click on “src”
  3. Select “New” – “Package”
  4. Name your package “mylab0”

Programming using Eclipse
  Download the sample code from
    http://www.comp.nus.edu.sg/ shilei/download/cs5344/wordcount.zip
  Unzip the Java files and copy them to package mylab0
  Double click the Java files and start editing

Compile and make jar in Eclipse
  Right click the Java source file and choose “Export”
  Select “Java” – “JAR File”, then click “Next”
  Specify the directory where the jar file is saved, then click “Next”
  Click the “Finish” button to complete

Notes on MultipleInputs
  MultipleInputs lets a single MapReduce job read more than one input file, e.g. for a join operation.
  Each input file may require its own processing, so users can specify a dedicated Mapper for each input.
  In Lab 1, you may use MultipleInputs (other approaches are also okay).
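To make the join use case concrete, here is a minimal plain-Java sketch of a reduce-side join. It does not use the Hadoop API: the "shuffle" is simulated with a sorted map, and the class name, method names, record layout ("key,value" lines) and sample data are all assumptions made up for illustration. The point is only to show why each input gets its own mapper: each mapper tags its records so the reducer can tell the two sources apart.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReduceSideJoinSketch {

    // Stand-in for the mapper of input A (hypothetical "id,name" lines):
    // emits key = id, value = name tagged with "A:".
    static String[] mapA(String line) {
        String[] f = line.split(",", 2);
        return new String[] { f[0], "A:" + f[1] };
    }

    // Stand-in for the mapper of input B (hypothetical "id,item" lines):
    // emits key = id, value = item tagged with "B:".
    static String[] mapB(String line) {
        String[] f = line.split(",", 2);
        return new String[] { f[0], "B:" + f[1] };
    }

    // Stand-in for the reducer: for one key, pair every A value with every B value.
    static List<String> reduce(String key, List<String> values) {
        List<String> left = new ArrayList<>();
        List<String> right = new ArrayList<>();
        for (String v : values) {
            if (v.startsWith("A:")) {
                left.add(v.substring(2));
            } else {
                right.add(v.substring(2));
            }
        }
        List<String> out = new ArrayList<>();
        for (String l : left) {
            for (String r : right) {
                out.add(key + "," + l + "," + r);
            }
        }
        return out;
    }

    // Runs both mappers, groups values by key (simulated shuffle), then reduces.
    static List<String> join(List<String> inputA, List<String> inputB) {
        Map<String, List<String>> shuffle = new TreeMap<>();
        for (String line : inputA) {
            String[] kv = mapA(line);
            shuffle.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        for (String line : inputB) {
            String[] kv = mapB(line);
            shuffle.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : shuffle.entrySet()) {
            out.addAll(reduce(e.getKey(), e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> joined = join(
                Arrays.asList("1,alice", "2,bob"),
                Arrays.asList("1,book", "1,pen", "3,cup"));
        joined.forEach(System.out::println);
    }
}
```

Keys that appear in only one input produce no output here, so this sketch behaves like an inner join. In a real Hadoop job the tagging would happen inside the two Mapper classes registered through MultipleInputs.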

MultipleInputs Example
  Calculate the total occurrences of each word in two text files

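Code structure
  package mylab0;

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;

  public class WordCountMultipleInputs {

    // Nested Mapper/Reducer classes must be static so Hadoop can instantiate them.
    static class Mapper1 extends Mapper<K1, V1, K2, V2> {
      // process file1
    }

    static class Mapper2 extends Mapper<K1, V1, K2, V2> {
      // process file2
    }

    // The reducer's input types are the mappers' output types (K2, V2).
    static class MultiInputReducer extends Reducer<K2, V2, K3, V3> {
      // aggregate results from file1 and file2
    }

    public static void main(String[] args) {
      // ...
      MultipleInputs.addInputPath(job, path1, InputFormat1.class, Mapper1.class);
      MultipleInputs.addInputPath(job, path2, InputFormat2.class, Mapper2.class);
      // ...
    }
  }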
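The structure above can be tried out without a cluster. The following is a minimal plain-Java sketch of the same idea: two mapper stand-ins, one per input, feeding one reducer stand-in that sums the counts. It deliberately avoids the Hadoop API; the class and method names are hypothetical, and the two input formats (space-separated text for file 1, comma-separated text for file 2) are assumptions chosen only to show why each input may need its own mapper.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TwoInputWordCountSketch {

    // Stand-in for Mapper1: assumes file 1 holds whitespace-separated words.
    static void mapper1(String line, Map<String, List<Integer>> shuffle) {
        for (String word : line.trim().split("\\s+")) {
            emit(word, shuffle);
        }
    }

    // Stand-in for Mapper2: assumes file 2 holds comma-separated words.
    static void mapper2(String line, Map<String, List<Integer>> shuffle) {
        for (String word : line.split(",")) {
            emit(word.trim(), shuffle);
        }
    }

    // Emits (word, 1) into the simulated shuffle, skipping empty tokens.
    static void emit(String word, Map<String, List<Integer>> shuffle) {
        if (!word.isEmpty()) {
            shuffle.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
    }

    // Stand-in for the reducer: sums all counts collected for one word.
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Runs both mappers over their own input, then reduces every key.
    static Map<String, Integer> run(List<String> file1, List<String> file2) {
        Map<String, List<Integer>> shuffle = new TreeMap<>();
        for (String line : file1) {
            mapper1(line, shuffle);
        }
        for (String line : file2) {
            mapper2(line, shuffle);
        }
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle.entrySet()) {
            totals.put(e.getKey(), reduce(e.getValue()));
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Integer> totals = run(
                Arrays.asList("a b a"),
                Arrays.asList("b,c"));
        System.out.println(totals); // prints {a=2, b=2, c=1}
    }
}
```

Both mapper stand-ins emit the same kind of pair (word, 1), which is what allows a single reducer to aggregate across the two files. In a Hadoop job the same constraint applies: every mapper registered with MultipleInputs must produce the key and value types the reducer expects.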
