Hadoop: The Definitive Guide




Table of Contents

Foreword
Preface
1. Meet Hadoop
2. MapReduce
3. The Hadoop Distributed Filesystem
4. Hadoop I/O
5. Developing a MapReduce Application
6. How MapReduce Works
7. MapReduce Types and Formats
8. MapReduce Features
9. Setting Up a Hadoop Cluster
10. Administering Hadoop
11. Pig
12. HBase
13. ZooKeeper
14. Case Studies
A. Installing Apache Hadoop
B. Cloudera's Distribution for Hadoop
C. Preparing the NCDC Weather Data
Index

Foreword

Preface

Administrative Notes

Program listings that import multiple classes from the same package may use the asterisk wildcard to save space (for example, import org.apache.hadoop.io.*).

What's in This Book?

Conventions Used in This Book

Constant width
Constant width bold
Constant width italic

Using Code Examples

Safari Books Online

How to Contact Us

Acknowledgments


CHAPTER 1
Meet Hadoop

Data!


Data Storage and Analysis

Comparison with Other Systems

RDBMS

             Traditional RDBMS            MapReduce
Data size    Gigabytes                    Petabytes
Access       Interactive and batch        Batch
Updates      Read and write many times    Write once, read many times
Structure    Static schema                Dynamic schema

Grid Computing


Volunteer Computing

A Brief History of Hadoop

The Origin of the Name "Hadoop"

Smaller components, by contrast, are given more descriptive (and therefore more mundane) names: the JobTracker, for example, keeps track of MapReduce jobs.


Hadoop at Yahoo!

The Apache Hadoop Project


CHAPTER 2
MapReduce

A Weather Dataset

Data Format

The format of a record, with a selection of the fields annotated:

0057
332130    # USAF weather station identifier
99999     # WBAN weather station identifier
19500101  # observation date
0300      # observation time
4
+51317    # latitude (degrees x 1000)
+028783   # longitude (degrees x 1000)
FM-12
+0171     # elevation (meters)
99999
V020
320       # wind direction (degrees)
1         # quality code
N
0072
1
00450     # sky ceiling height (meters)
1         # quality code
C
N
010000    # visibility distance (meters)
1         # quality code
N
9
-0128     # air temperature (degrees Celsius x 10)
1         # quality code
-0139     # dew point temperature (degrees Celsius x 10)
1         # quality code
10268     # atmospheric pressure (hectopascals x 10)
1         # quality code

Data files are organized by year, with a gzipped file per weather station:

% ls raw/1990 | head
010010-99999-1990.gz
...
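To make the fixed-width offsets concrete before the scripted examples, here is a minimal standalone sketch (not from the book; the class name ParseNcdc is ours) that slices each record on standard input into the three fields every example in this chapter uses: the year (columns 15-19), the air temperature (columns 87-92), and the quality code (column 92):

import java.io.BufferedReader;
import java.io.InputStreamReader;

// Minimal sketch (not from the book): applies the chapter's fixed-width
// offsets to NCDC records read on standard input.
public class ParseNcdc {
  public static void main(String[] args) throws Exception {
    BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
    String line;
    while ((line = reader.readLine()) != null) {
      String year = line.substring(15, 19);     // observation year
      String temp = line.substring(87, 92);     // signed temperature, tenths of a degree
      String quality = line.substring(92, 93);  // quality code
      System.out.println(year + "\t" + temp + "\t" + quality);
    }
  }
}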

Analyzing the Data with Unix Tools

#!/usr/bin/env bash
for year in all/*
do
  echo -ne `basename $year .gz`"\t"
  gunzip -c $year | \
    awk '{ temp = substr($0, 88, 5) + 0;
           q = substr($0, 93, 1);
           if (temp != 9999 && q ~ /[01459]/ && temp > max) max = temp }
         END { print max }'
done

% ./max_temperature.sh

Analyzing the Data with Hadoop

Map and Reduce

0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...

The map input is keyed by the offset of each line within the file:

(0,   0067011990999991950051507004...9999999N9+00001+99999999999...)
(106, 0043011990999991950051512004...9999999N9+00221+99999999999...)
(212, 0043011990999991950051518004...9999999N9-00111+99999999999...)
(318, 0043012650999991949032412004...0500001N9+01111+99999999999...)
(424, 0043012650999991949032418004...0500001N9+00781+99999999999...)

The map function extracts (year, temperature) pairs:

(1950, 0)
(1950, 22)
(1950, -11)
(1949, 111)
(1949, 78)

The framework sorts and groups the pairs by key before the reduce:

(1949, [111, 78])
(1950, [0, 22, -11])

and the reduce function picks the maximum for each year:

(1949, 111)
(1950, 22)
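The same flow can be simulated in plain Java, without any Hadoop machinery, which makes the grouping step explicit. This sketch is ours, not from the book:

import java.util.*;

// Minimal sketch (not from the book): simulates the map -> shuffle -> reduce
// flow shown above in plain Java.
public class MaxTemperatureSimulation {
  public static void main(String[] args) {
    // Map output, as in the example: (year, temperature) pairs
    int[][] pairs = {{1950, 0}, {1950, 22}, {1950, -11}, {1949, 111}, {1949, 78}};

    // Shuffle: group values by key, as the framework does before the reduce
    SortedMap<Integer, List<Integer>> grouped = new TreeMap<>();
    for (int[] p : pairs) {
      grouped.computeIfAbsent(p[0], k -> new ArrayList<>()).add(p[1]);
    }

    // Reduce: pick the maximum value for each key
    for (Map.Entry<Integer, List<Integer>> e : grouped.entrySet()) {
      int max = Collections.max(e.getValue());
      System.out.println("(" + e.getKey() + ", " + max + ")"); // (1949, 111) (1950, 22)
    }
  }
}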

Java MapReduce

The mapper is an implementation of the Mapper interface, which declares a map() method:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {

    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      output.collect(new Text(year), new IntWritable(airTemperature));
    }
  }
}

Rather than using built-in Java types, the Mapper works with Hadoop's own basic types from the org.apache.hadoop.io package: LongWritable corresponds to a Java Long, Text to a String, and IntWritable to an Integer. The reducer is defined similarly, using the Reducer interface:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {

    int maxValue = Integer.MIN_VALUE;
    while (values.hasNext()) {
      maxValue = Math.max(maxValue, values.next().get());
    }
    output.collect(key, new IntWritable(maxValue));
  }
}

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MaxTemperature {

  public static void main(String[] args) throws IOException {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }

    JobConf conf = new JobConf(MaxTemperature.class);
    conf.setJobName("Max temperature");

    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapperClass(MaxTemperatureMapper.class);
    conf.setReducerClass(MaxTemperatureReducer.class);

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);
  }
}

An input path is specified by calling the static addInputPath() method on FileInputFormat; it can be called more than once to use input from multiple paths.

The job uses the default input format, TextInputFormat, and is submitted with the static runJob() method on JobClient, which waits for the job to finish.

A test run

% export HADOOP_CLASSPATH=build/classes
% hadoop MaxTemperature input/ncdc/sample.txt output
09/04/07 12:34:35 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
09/04/07 12:34:35 WARN mapred.JobClient: Use GenericOptionsParser for parsing the
arguments. Applications should implement Tool for the same.
09/04/07 12:34:35 WARN mapred.JobClient: No job jar file set. User classes may not
be found. See JobConf(Class) or JobConf#setJar(String).
09/04/07 12:34:35 INFO mapred.FileInputFormat: Total input paths to process : 1
09/04/07 12:34:35 INFO mapred.JobClient: Running job: job_local_0001
09/04/07 12:34:35 INFO mapred.FileInputFormat: Total input paths to process : 1
09/04/07 12:34:35 INFO mapred.MapTask: numReduceTasks: 1
09/04/07 12:34:35 INFO mapred.MapTask: io.sort.mb = 100
09/04/07 12:34:35 INFO mapred.MapTask: data buffer = 79691776/99614720
09/04/07 12:34:35 INFO mapred.MapTask: record buffer = 262144/327680
09/04/07 12:34:35 INFO mapred.MapTask: Starting flush of map output
09/04/07 12:34:36 INFO mapred.MapTask: Finished spill 0
09/04/07 12:34:36 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is
done. And is in the process of commiting
09/04/07 12:34:36 INFO mapred.LocalJobRunner: file:/Users/tom/workspace/htdg/input/ncdc/sample.txt:0+529
09/04/07 12:34:36 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.

09/04/07 12:34:36 INFO mapred.LocalJobRunner:
09/04/07 12:34:36 INFO mapred.Merger: Merging 1 sorted segments
09/04/07 12:34:36 INFO mapred.Merger: Down to the last merge-pass, with 1 segments
left of total size: 57 bytes
09/04/07 12:34:36 INFO mapred.LocalJobRunner:
09/04/07 12:34:36 INFO mapred.TaskRunner: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
09/04/07 12:34:36 INFO mapred.LocalJobRunner:
09/04/07 12:34:36 INFO mapred.TaskRunner: Task attempt_local_0001_r_000000_0 is
allowed to commit now
09/04/07 12:34:36 INFO mapred.FileOutputCommitter: Saved output of task
'attempt_local_0001_r_000000_0' to file:/Users/tom/workspace/htdg/output
09/04/07 12:34:36 INFO mapred.LocalJobRunner: reduce > reduce
09/04/07 12:34:36 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
09/04/07 12:34:36 INFO mapred.JobClient:  map 100% reduce 100%
09/04/07 12:34:36 INFO mapred.JobClient: Job complete: job_local_0001
09/04/07 12:34:36 INFO mapred.JobClient: Counters: 13
09/04/07 12:34:36 INFO mapred.JobClient:   FileSystemCounters
09/04/07 12:34:36 INFO mapred.JobClient:     FILE_BYTES_READ=27571
09/04/07 12:34:36 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=53907
09/04/07 12:34:36 INFO mapred.JobClient:   Map-Reduce Framework
09/04/07 12:34:36 INFO mapred.JobClient:     Reduce input groups=2
09/04/07 12:34:36 INFO mapred.JobClient:     Combine output records=0
09/04/07 12:34:36 INFO mapred.JobClient:     Map input records=5
09/04/07 12:34:36 INFO mapred.JobClient:     Reduce shuffle bytes=0
09/04/07 12:34:36 INFO mapred.JobClient:     Reduce output records=2
09/04/07 12:34:36 INFO mapred.JobClient:     Spilled Records=10
09/04/07 12:34:36 INFO mapred.JobClient:     Map output bytes=45
09/04/07 12:34:36 INFO mapred.JobClient:     Map input bytes=529
09/04/07 12:34:36 INFO mapred.JobClient:     Combine input records=0
09/04/07 12:34:36 INFO mapred.JobClient:     Map output records=5
09/04/07 12:34:36 INFO mapred.JobClient:     Reduce input records=5

When the hadoop command is run with a classname as its first argument, it launches a JVM (as java would) with Hadoop's libraries, plus anything on HADOOP_CLASSPATH, added to the classpath. The output shows the job ID (job_local_0001) and the ID of the single map task attempt (attempt_local_0001_m_000000_0).

The reduce side of the job ran as task attempt attempt_local_0001_r_000000_0, and the result is in the output directory:

% cat output/part-00000
1949    111
1950    22

The new Java MapReduce API

In the new API, the map() and reduce() methods pass output and status through a Context object, rather than the OutputCollector and Reporter of the old API:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NewMaxTemperature {

  static class NewMaxTemperatureMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int MISSING = 9999;

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {

      String line = value.toString();
      String year = line.substring(15, 19);
      int airTemperature;
      if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
        airTemperature = Integer.parseInt(line.substring(88, 92));
      } else {
        airTemperature = Integer.parseInt(line.substring(87, 92));
      }
      String quality = line.substring(92, 93);
      if (airTemperature != MISSING && quality.matches("[01459]")) {
        context.write(new Text(year), new IntWritable(airTemperature));
      }
    }
  }

  static class NewMaxTemperatureReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values,
        Context context)
        throws IOException, InterruptedException {

      int maxValue = Integer.MIN_VALUE;
      for (IntWritable value : values) {
        maxValue = Math.max(maxValue, value.get());
      }
      context.write(key, new IntWritable(maxValue));
    }
  }

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: NewMaxTemperature <input path> <output path>");
      System.exit(-1);
    }

    Job job = new Job();
    job.setJarByClass(NewMaxTemperature.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(NewMaxTemperatureMapper.class);
    job.setReducerClass(NewMaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Scaling Out

Data Flow


Combiner Functions


Suppose the first map produced the output:

(1950, 0)
(1950, 20)
(1950, 10)

and the second produced:

(1950, 25)
(1950, 15)

Without a combiner, the reduce function is called with a list of all the values:

(1950, [0, 20, 10, 25, 15])

and produces:

(1950, 25)

With a combiner that finds the maximum of each map's output, the reduce function is called with only the per-map maxima:

(1950, [20, 25])

and produces the same result. Expressed as function calls:

max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25

Not all functions have this property. If we were calculating mean temperatures, we couldn't use the mean as the combiner function, since:

mean(0, 20, 10, 25, 15) = 14

but:

mean(mean(0, 20, 10), mean(25, 15)) = mean(10, 20) = 15
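The mean example can be checked in a few lines of plain Java. The sketch below (ours, not the book's) also shows the standard fix: have the combiner carry (sum, count) pairs so the reducer can compute an exact mean:

import java.util.Arrays;

// Minimal sketch (not from the book): why mean cannot be its own combiner,
// and how (sum, count) pairs keep the result exact.
public class MeanCombinerSketch {
  public static void main(String[] args) {
    int[] shard1 = {0, 20, 10};  // output of the first map
    int[] shard2 = {25, 15};     // output of the second map

    // Wrong: take the mean of the per-shard means -> mean(10, 20) = 15
    double wrong = (mean(shard1) + mean(shard2)) / 2.0;

    // Right: combine (sum, count) per shard, divide only at the reducer
    long sum = Arrays.stream(shard1).sum() + Arrays.stream(shard2).sum();
    long count = shard1.length + shard2.length;
    double right = (double) sum / count;  // 70 / 5 = 14

    System.out.println("wrong = " + wrong + ", right = " + right);
  }

  static double mean(int[] xs) {
    return Arrays.stream(xs).average().orElse(0);
  }
}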

Specifying a combiner function

The combiner function is defined using the Reducer interface; for finding the maximum temperature, it is the same implementation as MaxTemperatureReducer. The only change is to set the combiner class on the JobConf:

public class MaxTemperatureWithCombiner {

  public static void main(String[] args) throws IOException {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperatureWithCombiner <input path> " +
          "<output path>");
      System.exit(-1);
    }

    JobConf conf = new JobConf(MaxTemperatureWithCombiner.class);
    conf.setJobName("Max temperature");

    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapperClass(MaxTemperatureMapper.class);
    conf.setCombinerClass(MaxTemperatureReducer.class);
    conf.setReducerClass(MaxTemperatureReducer.class);

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);
  }
}

Running a Distributed MapReduce Job

Hadoop Streaming

Ruby

#!/usr/bin/env ruby

STDIN.each_line do |line|
  val = line
  year, temp, q = val[15,4], val[87,5], val[92,1]
  puts "#{year}\t#{temp}" if (temp != "+9999" && q =~ /[01459]/)
end

% cat input/ncdc/sample.txt | src/main/ch02/ruby/max_temperature_map.rb
1950    +0000
1950    +0022
1950    -0011
1949    +0111
1949    +0078

#!/usr/bin/env ruby

last_key, max_val = nil, 0
STDIN.each_line do |line|
  key, val = line.split("\t")
  if last_key && last_key != key
    puts "#{last_key}\t#{max_val}"
    last_key, max_val = key, val.to_i
  else
    last_key, max_val = key, [max_val, val.to_i].max
  end
end
puts "#{last_key}\t#{max_val}" if last_key

The test last_key && last_key != key detects when the (sorted) input moves on to a new key, at which point the maximum for the previous key is emitted. Chaining the scripts together through sort mimics the whole MapReduce pipeline:

% cat input/ncdc/sample.txt | src/main/ch02/ruby/max_temperature_map.rb | \
  sort | src/main/ch02/ruby/max_temperature_reduce.rb
1949    111
1950    22

The hadoop command's jar option launches the Streaming job:

% hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
  -input input/ncdc/sample.txt \
  -output output \
  -mapper src/main/ch02/ruby/max_temperature_map.rb \
  -reducer src/main/ch02/ruby/max_temperature_reduce.rb

When running on the full dataset on a cluster, the effect of a -combiner can be achieved by doing the combining in the mapper itself, piping the map script through sort and the reduce script:

% hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
  -input input/ncdc/all \
  -output output \
  -mapper "ch02/ruby/max_temperature_map.rb | sort | ch02/ruby/max_temperature_reduce.rb" \
  -reducer src/main/ch02/ruby/max_temperature_reduce.rb \
  -file src/main/ch02/ruby/max_temperature_map.rb \
  -file src/main/ch02/ruby/max_temperature_reduce.rb

Note the -file option, which ships the scripts to the cluster.

Python

#!/usr/bin/env python

import re
import sys

for line in sys.stdin:
  val = line.strip()
  (year, temp, q) = (val[15:19], val[87:92], val[92:93])

if (temp ! " 9999" and re.match("[01459]", q)):print "%s\t%s" % (year, temp)#!/usr/bin/env pythonimport sys(last key, max val) (None, 0)for line in sys.stdin:(key, val) line.strip().split("\t")if last key and last key ! key:print "%s\t%s" % (last key, max val)(last key, max val) (key, int(val))else:(last key, max val) (key, max(max val, int(val)))if last key:print "%s\t%s" % (last key, max val)% cat input/ncdc/sample.txt src/main/ch02/python/max temperature map.py \sort src/main/ch02/python/max temperature reduce.py1949111195022Hadoop Pipes#include algorithm #include limits #include string #include "hadoop/Pipes.hh"#include "hadoop/TemplateFactory.hh"#include "hadoop/StringUtils.hh"class MaxTemperatureMapper : public HadoopPipes::Mapper ext& context) {36 Chapter 2: MapReduce

  void map(HadoopPipes::MapContext& context) {
    std::string line = context.getInputValue();
    std::string year = line.substr(15, 4);
    std::string airTemperature = line.substr(87, 5);
    std::string q = line.substr(92, 1);
    if (airTemperature != "+9999" &&
        (q == "0" || q == "1" || q == "4" || q == "5" || q == "9")) {
      context.emit(year, airTemperature);
    }
  }
};

class MapTemperatureReducer : public HadoopPipes::Reducer {
public:
  MapTemperatureReducer(HadoopPipes::TaskContext& context) {
  }
  void reduce(HadoopPipes::ReduceContext& context) {
    int maxValue = INT_MIN;
    while (context.nextValue()) {
      maxValue = std::max(maxValue, HadoopUtils::toInt(context.getInputValue()));
    }
    context.emit(context.getInputKey(), HadoopUtils::toString(maxValue));
  }
};

int main(int argc, char *argv[]) {
  return HadoopPipes::runTask(HadoopPipes::TemplateFactory<MaxTemperatureMapper,
                              MapTemperatureReducer>());
}

Unlike the Java version, the map() method of MaxTemperatureMapper does not convert airTemperature to an integer; the value is passed on as a string, and the conversion happens in the reducer using the HadoopUtils helpers.

The main() method is the application entry point: it calls HadoopPipes::runTask with a factory (created by TemplateFactory) that the framework uses to instantiate the Mapper or Reducer, depending on which side of the job the process is running.

Compiling and Running

CC = g++
CPPFLAGS = -m32 -I$(HADOOP_INSTALL)/c++/$(PLATFORM)/include

max_temperature: max_temperature.cpp
        $(CC) $(CPPFLAGS) $< -Wall -L$(HADOOP_INSTALL)/c++/$(PLATFORM)/lib -lhadooppipes \
        -lhadooputils -lpthread -g -O2 -o $@

The Makefile expects HADOOP_INSTALL to be set in the environment, and PLATFORM to name the operating system, architecture, and data model:

% export PLATFORM=Linux-i386-32
% make
max_temperature

The executable and the sample data are then copied to HDFS:

% hadoop fs -put max_temperature bin/max_temperature
% hadoop fs -put input/ncdc/sample.txt sample.txt

The job is run with the hadoop pipes command; the -program option gives the URI of the executable in HDFS:

% hadoop pipes \
  -D hadoop.pipes.java.recordreader=true \
  -D hadoop.pipes.java.recordwriter=true \
  -input sample.txt \
  -output output \
  -program bin/max_temperature

Setting hadoop.pipes.java.recordreader (and the corresponding writer property) to true says that we have not written our own C++ record reader or writer, so the default Java ones are used.

CHAPTER 3
The Hadoop Distributed Filesystem

The Design of HDFS

HDFS Concepts

Blocks

Why Is a Block in HDFS So Large?
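In outline: HDFS blocks are large compared to disk blocks in order to keep the cost of seeking small relative to the cost of transferring data. A quick calculation: if seek time is around 10 ms and the transfer rate is 100 MB/s, then a 100 MB block takes about 1,000 ms to transfer, so the seek accounts for only about 1% of the time to read the block. That is why HDFS blocks default to tens of megabytes rather than kilobytes.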

Hadoop's fsck command understands blocks; running:

% hadoop fsck / -files -blocks

will list the blocks that make up each file in the filesystem.

Namenodes and Datanodes

The Command-Line Interface

Basic Filesystem Operations

You can get detailed help on every command with hadoop fs -help. Start by copying a file from the local filesystem to HDFS:

% hadoop fs -copyFromLocal input/docs/quangle.txt hdfs://localhost/user/tom/quangle.txt

The -copyFromLocal subcommand of hadoop fs copies the local file to HDFS. The hdfs://localhost scheme and host could have been omitted, since they are picked up from the default filesystem in the configuration:

% hadoop fs -copyFromLocal input/docs/quangle.txt /user/tom/quangle.txt

We could also have used a relative path, which copies the file to our HDFS home directory:

% hadoop fs -copyFromLocal input/docs/quangle.txt quangle.txt

Let's copy the file back to the local filesystem and check that it is unchanged:

% hadoop fs -copyToLocal quangle.txt quangle.copy.txt
% md5 input/docs/quangle.txt quangle.copy.txt
MD5 (input/docs/quangle.txt) = a16f231da6b05e2ba7a339320e7dacd9
MD5 (quangle.copy.txt) = a16f231da6b05e2ba7a339320e7dacd9

The MD5 digests are the same, showing that the file survived its trip to HDFS and back. Finally, create a directory and list the contents of the home directory:

% hadoop fs -mkdir books
% hadoop fs -ls .
Found 2 items
drwxr-xr-x   - tom supergroup          0 2009-04-02 22:41 /user/tom/books
-rw-r--r--   1 tom supergroup        118 2009-04-02 22:29 /user/tom/quangle.txt

The format of the output is similar to Unix's ls -l.
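The shell commands above have direct Java equivalents on FileSystem. For instance, this sketch (ours, not the book's; the class name CopyFromLocal is hypothetical) performs the same copy as hadoop fs -copyFromLocal:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Our sketch (not from the book): the Java equivalent of
// "hadoop fs -copyFromLocal input/docs/quangle.txt quangle.txt".
public class CopyFromLocal {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://localhost/"), conf);
    fs.copyFromLocalFile(new Path("input/docs/quangle.txt"),
                         new Path("quangle.txt")); // relative to the HDFS home directory
  }
}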

File Permissions in HDFS

HDFS has a permissions model much like POSIX, with read, write, and execute (rwx) permissions for each file and directory; permission checking is controlled by the dfs.permissions property.

Hadoop Filesystems

Hadoop has an abstract notion of filesystem, of which HDFS is just one implementation. The Java abstract class org.apache.hadoop.fs.FileSystem represents a filesystem in Hadoop, and there are several concrete implementations:

The Java implementations below are all under org.apache.hadoop.

Local (URI scheme: file; Java implementation: fs.LocalFileSystem)
    A filesystem for a locally connected disk with client-side checksums. Use RawLocalFileSystem for a local filesystem with no checksums. See "LocalFileSystem" on page 76.

HDFS (URI scheme: hdfs; Java implementation: hdfs.DistributedFileSystem)
    Hadoop's distributed filesystem. HDFS is designed to work efficiently in conjunction with MapReduce.

HFTP (URI scheme: hftp; Java implementation: hdfs.HftpFileSystem)
    A filesystem providing read-only access to HDFS over HTTP. (Despite its name, HFTP has no connection with FTP.) Often used with distcp ("Parallel Copying with distcp" on page 70) to copy data between HDFS clusters running different versions.

HSFTP (URI scheme: hsftp; Java implementation: hdfs.HsftpFileSystem)
    A filesystem providing read-only access to HDFS over HTTPS. (Again, this has no connection with FTP.)

HAR (URI scheme: har; Java implementation: fs.HarFileSystem)
    A filesystem layered on another filesystem for archiving files. Hadoop Archives are typically used for archiving files in HDFS to reduce the namenode's memory usage. See "Hadoop Archives" on page 71.

KFS (CloudStore) (URI scheme: kfs; Java implementation: fs.kfs.KosmosFileSystem)
    CloudStore (formerly Kosmos filesystem) is a distributed filesystem like HDFS or Google's GFS, written in C++. Find more information about it at http://kosmosfs.sourceforge.net/.

FTP (URI scheme: ftp; Java implementation: fs.ftp.FTPFileSystem)
    A filesystem backed by an FTP server.

S3 (native) (URI scheme: s3n; Java implementation: fs.s3native.NativeS3FileSystem)
    A filesystem backed by Amazon S3. See http://wiki.apache.org/hadoop/AmazonS3.

S3 (block-based) (URI scheme: s3; Java implementation: fs.s3.S3FileSystem)
    A filesystem backed by Amazon S3, which stores files in blocks (much like HDFS) to overcome S3's 5 GB file size limit.

For example, you can list the files in the root of the local filesystem by naming its URI scheme explicitly:

% hadoop fs -ls file:///
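The same listing can be done from Java, since any registered URI scheme maps to one of the FileSystem implementations above. This sketch (ours, not the book's; the class name ListRoot is hypothetical) lists the root of the local filesystem:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Our sketch (not from the book): the Java equivalent of
// "hadoop fs -ls file:///".
public class ListRoot {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("file:///"), conf);
    for (FileStatus status : fs.listStatus(new Path("/"))) {
      System.out.println(status.getPath());
    }
  }
}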

Interfaces

Thrift

C

FUSE

WebDAV

Other HDFS Interfaces

These include read-only HTTP access via HftpFileSystem and HsftpFileSystem, and an FTP interface (FTPFileSystem).

The Java Interface

Reading Data from a Hadoop URL

One of the simplest ways to read a file from a Hadoop filesystem is by using a java.net.URL object to open a stream:

InputStream in = null;
try {
  in = new URL("hdfs://host/path").openStream();
  // process in
} finally {
  IOUtils.closeStream(in);
}

There's a little more work required to make Java recognize Hadoop's hdfs URL scheme: call the setURLStreamHandlerFactory method on URL with an instance of FsUrlStreamHandlerFactory. This can be done only once per JVM, which is why FileSystem-based reads (shown later) are usually preferred.

The program below cats a Hadoop file to standard output:

import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class URLCat {

  static {
    URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
  }

  public static void main(String[] args) throws Exception {
    InputStream in = null;
    try {
      in = new URL(args[0]).openStream();
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}

The copyBytes() utility from IOUtils copies the stream to System.out; the last two arguments are the buffer size and whether to close the streams when the copy completes.

% hadoop URLCat hdfs://localhost/user/tom/quangle.txt
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.

Reading Data Using the FileSystem API

A file in a Hadoop filesystem is represented by a Hadoop Path object, rather than a java.io.File object, whose semantics are too closely tied to the local filesystem.

FileSystem is a general filesystem API, so the first step is to retrieve an instance for the filesystem we want to use. There are two static factory methods for getting a FileSystem instance:

public static FileSystem get(Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf) throws IOException

With a FileSystem instance in hand, we invoke an open() method to get the input stream for a file:

public FSDataInputStream open(Path f) throws IOException
public abstract FSDataInputStream open(Path f, int bufferSize) throws IOException

Putting this together, the program below displays files from Hadoop filesystems on standard output:

public class FileSystemCat {

  public static void main(String[] args) throws Exception {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    InputStream in = null;
    try {
      in = fs.open(new Path(uri));
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}

% hadoop FileSystemCat hdfs://localhost/user/tom/quangle.txt
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.

FSDataInputStream

The open() method on FileSystem actually returns a FSDataInputStream rather than a standard java.io class. FSDataInputStream is a specialization of java.io.DataInputStream with support for random access, so you can read from any part of the stream:

package org.apache.hadoop.fs;

public class FSDataInputStream extends DataInputStream
    implements Seekable, PositionedReadable {
  // implementation elided
}

The Seekable interface permits seeking to a position in the file and querying the current offset from the start of the file with getPos():

public interface Seekable {
  void seek(long pos) throws IOException;
  long getPos() throws IOException;
  boolean seekToNewSource(long targetPos) throws IOException;
}

Calling seek() on the stream with a position greater than the length of the file results in an IOException. The seekToNewSource() method attempts to find another copy of the data at targetPos; it is used internally by HDFS, and most applications will not call it. The program below extends FileSystemCat to write a file to standard output twice, by using seek() to rewind to the start after the first pass:

public class FileSystemDoubleCat {

  public static void main(String[] args) throws Exception {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    FSDataInputStream in = null;
    try {
      in = fs.open(new Path(uri));
      IOUtils.copyBytes(in, System.out, 4096, false);
      in.seek(0); // go back to the start of the file
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }

  }
}

% hadoop FileSystemDoubleCat hdfs://localhost/user/tom/quangle.txt
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.

FSDataInputStream also implements the PositionedReadable interface, for reading parts of a file at a given offset:

public interface PositionedReadable {

  public int read(long position, byte[] buffer, int offset, int length)
      throws IOException;

  public void readFully(long position, byte[] buffer, int offset, int length)
      throws IOException;

  public void readFully(long position, byte[] buffer) throws IOException;
}

All of these methods preserve the current offset in the file and are thread-safe; conceptually, an implementation in terms of Seekable saves and restores the position:

long oldPos = getPos();
try {
  seek(position);
  // read data
} finally {
  seek(oldPos);
}

Finally, bear in mind that calling seek() is a relatively expensive operation and should be used sparingly.
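As a quick illustration of the offset-preserving contract, here is a sketch of ours (not from the book; PositionedReadSketch is a hypothetical name, and the file named on the command line is assumed to be at least 116 bytes long):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Our sketch (not from the book): a positioned readFully() does not move
// the stream's current offset, so getPos() is unchanged afterwards.
public class PositionedReadSketch {
  public static void main(String[] args) throws Exception {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    FSDataInputStream in = null;
    try {
      in = fs.open(new Path(uri));
      byte[] buf = new byte[16];
      long before = in.getPos();
      in.readFully(100, buf);                     // read 16 bytes at offset 100
      System.out.println(in.getPos() == before);  // true: offset preserved
    } finally {
      if (in != null) { in.close(); }
    }
  }
}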

Writing Data

The FileSystem class has a number of methods for creating a file. The simplest takes a Path object for the file to be created and returns an output stream to write to:

public FSDataOutputStream create(Path f) throws IOException

The create() methods create any parent directories of the file that don't already exist; if that is not the behavior you want, check for the parent directory first with exists(). There is also an overloaded method that takes a callback interface, Progressable, so your application can be notified of the progress of the data being written to the datanodes:

package org.apache.hadoop.util;

public interface Progressable {
  public void progress();
}

As an alternative to creating a new file, you can append to an existing file with the append() method:

public FSDataOutputStream append(Path f) throws IOException

The program below copies a local file to a Hadoop filesystem, printing a period every time Hadoop calls the progress() method:

public class FileCopyWithProgress {
  public static void main(String[] args) throws Exception {
    String localSrc = args[0];
    String dst = args[1];

    InputStream in = new BufferedInputStream(new FileInputStream(localSrc));

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(dst), conf);
    OutputStream out = fs.create(new Path(dst), new Progressable() {
      public void progress() {
        System.out.print(".");
      }
    });

    IOUtils.copyBytes(in, out, 4096, true);
  }
}

% hadoop FileCopyWithProgress input/docs/1400-8.txt hdfs://localhost/user/tom/1400-8.txt
...............

FSDataOutputStream

The create() method on FileSystem returns an FSDataOutputStream, which, like FSDataInputStream, has a method for querying the current position in the file:

package org.apache.hadoop.fs;

public class FSDataOutputStream extends DataOutputStream implements Syncable {

  public long getPos() throws IOException {
    // implementation elided
  }

  // implementation elided
}

However, unlike FSDataInputStream, FSDataOutputStream does not permit seeking: HDFS allows only sequential writes to an open file, or appends to an already written file.

Directories

FileSystem provides a method to create a directory:

public boolean mkdirs(Path f) throws IOException

Like java.io.File's mkdirs() method, it creates all of the necessary parent directories if they don't already exist, and it returns true if the directory (and all parent directories) was successfully created.
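A minimal usage sketch (ours, not the book's; MakeDirs is a hypothetical name):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Our sketch (not from the book): mkdirs() creates missing parent
// directories, like java.io.File.mkdirs().
public class MakeDirs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://localhost/"), conf);
    boolean created = fs.mkdirs(new Path("/user/tom/books/fiction"));
    System.out.println(created); // true if the directory (and parents) now exist
  }
}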

Often, though, you don't need to explicitly create a directory, because writing a file with create() will create any parent directories automatically.

Querying the Filesystem

File metadata: FileStatus

The FileStatus class encapsulates filesystem metadata for files and directories, including file length, block size, replication, modification time, ownership, and permission information. The getFileStatus() method on FileSystem provides a way of getting a FileStatus object for a single file or directory, as this test shows:

public class ShowFileStatusTest {

  private MiniDFSCluster cluster; // use an in-process HDFS cluster for testing
  private FileSystem fs;

  @Before
  public void setUp() throws IOException {
    Configuration conf = new Configuration();
    if (System.getProperty("test.build.data") == null) {
      System.setProperty("test.build.data", "/tmp");
    }
    cluster = new MiniDFSCluster(conf, 1, true, null);
    fs = cluster.getFileSystem();
    OutputStream out = fs.create(new Path("/dir/file"));
    out.write("content".getBytes("UTF-8"));
    out.close();
  }

  @After
  public void tearDown() throws IOException {
    if (fs != null) { fs.close(); }
    if (cluster != null) { cluster.shutdown(); }
  }

  @Test(expected = FileNotFoundException.class)
  public void throwsFileNotFoundForNonExistentFile() throws IOException {
    fs.getFileStatus(new Path("no-such-file"));
  }

  @Test
  public void fileStatusForFile() throws IOException {
    Path file = new Path("/dir/file");
    FileStatus stat = fs.getFileStatus(file);
    assertThat(stat.getPath().toUri().getPath(), is("/dir/file"));
    assertThat(stat.isDir(), is(false));
    assertThat(stat.getLen(), is(7L));

    assertThat(stat.getModificationTime(),
        is(lessThanOrEqualTo(System.currentTimeMillis())));
    // further assertions elided
  }
}

Hadoop: The Definitive Guide
Tom White
foreword by Doug Cutting