PIG: A Big Data Processor - MITU


PIG: A Big Data Processor
Tushar B. Kute, http://tusharkute.com

What is Pig?
Apache Pig is an abstraction over MapReduce. It is a tool/platform used to analyze large data sets by representing them as data flows. Pig is generally used with Hadoop; we can perform all the data manipulation operations in Hadoop using Apache Pig.
To write data analysis programs, Pig provides a high-level language known as Pig Latin. This language provides various operators with which programmers can develop their own functions for reading, writing, and processing data.

Apache Pig
To analyze data using Apache Pig, programmers need to write scripts using the Pig Latin language. All these scripts are internally converted to Map and Reduce tasks. Apache Pig has a component known as Pig Engine that accepts the Pig Latin scripts as input and converts those scripts into MapReduce jobs.

Why do we need Apache Pig?
Using Pig Latin, programmers can perform MapReduce tasks easily without having to type complex code in Java.
Apache Pig uses a multi-query approach, thereby reducing the length of code. For example, an operation that would require you to type 200 lines of code (LoC) in Java can be done by typing as few as 10 LoC in Apache Pig. Ultimately, Apache Pig reduces the development time by almost 16 times.
Pig Latin is a SQL-like language, and it is easy to learn Apache Pig when you are familiar with SQL.
Apache Pig provides many built-in operators to support data operations like joins, filters, ordering, etc. In addition, it also provides nested data types like tuples, bags, and maps that are missing from MapReduce.

Features of Pig
Rich set of operators: It provides many operators to perform operations like join, sort, filter, etc.
Ease of programming: Pig Latin is similar to SQL, and it is easy to write a Pig script if you are good at SQL.
Optimization opportunities: The tasks in Apache Pig optimize their execution automatically, so the programmers need to focus only on the semantics of the language.
Extensibility: Using the existing operators, users can develop their own functions to read, process, and write data.
UDFs: Pig provides the facility to create user-defined functions in other programming languages such as Java and to invoke or embed them in Pig scripts (see the sketch after this list).
Handles all kinds of data: Apache Pig analyzes all kinds of data, both structured and unstructured. It stores the results in HDFS.
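A minimal sketch of calling a Java UDF from Pig Latin. The jar name myudfs.jar and the class myudfs.ToUpper are hypothetical placeholders (not from these slides), and it assumes a relation movies with a chararray field name, as loaded later in this deck:
REGISTER 'myudfs.jar';                                  -- jar containing the Java UDF (hypothetical)
DEFINE ToUpper myudfs.ToUpper();                        -- shorthand alias for the UDF class
names_upper = FOREACH movies GENERATE ToUpper(name);    -- apply the UDF to each record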

Pig vs. MapReduce

Pig vs. SQL

Pig vs. Hive

Applications of Apache Pig
To process huge data sources such as web logs.
To perform data processing for search platforms.
To process time-sensitive data loads.

Apache Pig – History
In 2006, Apache Pig was developed as a research project at Yahoo, especially to create and execute MapReduce jobs on every dataset.
In 2007, Apache Pig was open sourced via the Apache incubator.
In 2008, the first release of Apache Pig came out.
In 2010, Apache Pig graduated as an Apache top-level project.

Pig Architecture

Apache Pig – Components
Parser: Initially the Pig scripts are handled by the Parser. It checks the syntax of the script, does type checking, and other miscellaneous checks. The output of the parser is a DAG (directed acyclic graph), which represents the Pig Latin statements and logical operators.
Optimizer: The logical plan (DAG) is passed to the logical optimizer, which carries out logical optimizations such as projection and pushdown.
Compiler: The compiler compiles the optimized logical plan into a series of MapReduce jobs.
Execution engine: Finally, the MapReduce jobs are submitted to Hadoop in sorted order and executed on Hadoop, producing the desired results.
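To see the plans these components produce for a relation, Grunt's EXPLAIN command prints the logical, physical, and MapReduce plans. A minimal sketch, assuming a relation named movies has already been loaded as in the examples later in this deck:
grunt> EXPLAIN movies;    -- prints the logical, physical, and MapReduce execution plans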

Apache Pig – Data Model

Apache Pig – Elements
Atom
– Any single value in Pig Latin, irrespective of its data type, is known as an Atom.
– It is stored as a string and can be used as a string or a number. int, long, float, double, chararray, and bytearray are the atomic values of Pig.
– A piece of data or a simple atomic value is known as a field.
– Example: 'raja' or '30'

Apache Pig – Elements
Tuple
– A record that is formed by an ordered set of fields is known as a tuple; the fields can be of any type. A tuple is similar to a row in a table of an RDBMS.
– Example: (Raja, 30)

Apache Pig – Elements
Bag
– A bag is an unordered set of tuples. In other words, a collection of tuples (non-unique) is known as a bag. Each tuple can have any number of fields (flexible schema). A bag is represented by '{}'. It is similar to a table in an RDBMS, but unlike a table in an RDBMS, it is not necessary that every tuple contain the same number of fields or that the fields in the same position (column) have the same type.
– Example: {(Raja, 30), (Mohammad, 45)}
– A bag can be a field in a relation; in that context, it is known as an inner bag.
– Example: {Raja, 30, {9848022338, raja@gmail.com}}

Apache Pig – Elements
Relation
– A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee that tuples are processed in any particular order).
Map
– A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and should be unique. The value might be of any type. It is represented by '[]'.
– Example: [name#Raja, age#30]
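A small sketch of how a map field is used in a script (not from these slides). It assumes a hypothetical tab-separated file users.txt whose second column is written in Pig's map format, e.g. a line such as: Raja	[name#Raja,age#30]
users = LOAD 'users.txt' AS (uname:chararray, info:map[]);   -- second field is loaded as a map
ages  = FOREACH users GENERATE uname, info#'age';            -- '#' dereferences a key in the map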

Installation of PIG

Download
Download the tar.gz file of Apache Pig 0.15.0 (pig-0.15.0.tar.gz) from the Apache Pig website.

Extract and copy
Extract this file using the right-click - 'Extract here' option or the tar -xzvf command.
Rename the created folder 'pig-0.15.0' to 'pig'.
Now, move this folder to /usr/lib using the following command:
sudo mv pig /usr/lib

Edit the bashrc file
Open the bashrc file:
sudo gedit ~/.bashrc
Go to the end of the file and add the following lines:
export PIG_HOME=/usr/lib/pig
export PATH=$PATH:$PIG_HOME/bin
Type the following command to make it take effect:
source ~/.bashrc
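The same installation can also be done entirely from the terminal. A minimal sketch; the download URL is an assumption (use whichever mirror and version you actually downloaded):
# download and unpack Pig 0.15.0 (assumed archive URL)
wget https://archive.apache.org/dist/pig/pig-0.15.0/pig-0.15.0.tar.gz
tar -xzvf pig-0.15.0.tar.gz
mv pig-0.15.0 pig
sudo mv pig /usr/lib
# append the environment variables to ~/.bashrc and reload it
echo 'export PIG_HOME=/usr/lib/pig' >> ~/.bashrc
echo 'export PATH=$PATH:$PIG_HOME/bin' >> ~/.bashrc
source ~/.bashrc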

Start the Pig
Start Pig in local mode:
pig -x local
Start Pig in MapReduce mode (needs the Hadoop datanode started):
pig -x mapreduce

Grunt shell

Data Processing with PIG

Example: movies_data.csv
A comma-separated file with one record per line in the form id,name,year,rating,duration (duration in seconds). Sample records recoverable from this transcription (the leading records are truncated):
3,Ashi hi banva ya Gharat Gharoba,1991,3.4,5420
6,Navra Maza Navsacha,2004,3.9,4904
7,De danadan,1987,3.4,5623
8,Gammat Jammat,1987,3.4,7563
9,Eka peksha ek,1990,3.2,6244
10,Pachhadlela,2004,3.1,6956

Load data
pig -x local
grunt> movies = LOAD 'movies_data.csv' USING PigStorage(',') AS (id, name, year, rating, duration);
grunt> DUMP movies;
It displays the contents of the relation.

Filter data
grunt> movies_greater_than_35 = FILTER movies BY (float)rating > 3.5;
grunt> DUMP movies_greater_than_35;

Store the results
grunt> STORE movies_greater_than_35 INTO 'my_movies';
It stores the result in a local file system directory named 'my_movies'.
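If you want to control the delimiter of the stored output, STORE also accepts a storage function. A small sketch (not in the original slides), assuming tab-separated output is wanted and written to a new directory:
grunt> STORE movies_greater_than_35 INTO 'my_movies_tsv' USING PigStorage('\t');   -- writes tab-separated files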

Display the result
Now display the result from the local file system:
cat my_movies/part-m-00000

Load command
The load command specified only the column names. We can modify the statement as follows to include the data types of the columns:
grunt> movies = LOAD 'movies_data.csv' USING PigStorage(',') AS (id:int, name:chararray, year:int, rating:double, duration:int);

Check the filters
List the movies that were released between 1990 and 1995:
grunt> movies_between_90_95 = FILTER movies BY year > 1990 AND year < 1995;
List the movies that start with the alphabet D:
grunt> movies_starting_with_D = FILTER movies BY name MATCHES 'D.*';
List the movies that have a duration greater than 2 hours:
grunt> movies_duration_2_hrs = FILTER movies BY duration > 7200;

Output: movies between 1990 and 1995; movies starting with 'D'; movies with duration greater than 2 hours.

Describe
The schema of a relation/alias can be viewed using the DESCRIBE command:
grunt> DESCRIBE movies;
movies: {id: int, name: chararray, year: int, rating: double, duration: int}

Foreach
FOREACH gives a simple way to apply transformations based on columns. Let's understand this with an example.
List the movie names and their duration in minutes:
grunt> movie_duration = FOREACH movies GENERATE name, (double)(duration/60);
The above statement generates a new alias that has the list of movies and their duration in minutes.
You can check the results using the DUMP command.
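Note that because duration is an int, duration/60 is integer division, so the statement above yields whole minutes only. If fractional minutes are wanted, cast before dividing; a small variant (not in the original slides):
grunt> movie_duration = FOREACH movies GENERATE name, (double)duration/60;   -- keeps the fractional part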

Output

Group
The GROUP keyword is used to group fields in a relation.
List the years and the number of movies released each year:
grunt> grouped_by_year = GROUP movies BY year;
grunt> count_by_year = FOREACH grouped_by_year GENERATE group, COUNT(movies);
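GROUP produces one tuple per key, with the grouping key in a field called group and the matching input tuples collected in an inner bag named after the input relation. Running DESCRIBE on the grouped relation should report a schema along these lines (shown as an illustration, not copied from the slides):
grunt> DESCRIBE grouped_by_year;
grouped_by_year: {group: int, movies: {(id: int, name: chararray, year: int, rating: double, duration: int)}}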

Output

Order by
Let us query the data to illustrate the ORDER BY operation.
List all the movies in the ascending order of year:
grunt> asc_movies_by_year = ORDER movies BY year ASC;
grunt> DUMP asc_movies_by_year;
List all the movies in the descending order of year:
grunt> desc_movies_by_year = ORDER movies BY year DESC;
grunt> DUMP desc_movies_by_year;

Output: ascending by year, from 1985 to 2004.

Limit
Use the LIMIT keyword to get only a limited number of results from a relation:
grunt> top_5_movies = LIMIT movies 5;
grunt> DUMP top_5_movies;
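On its own, LIMIT returns an arbitrary subset of tuples; to get a meaningful "top N", order the relation first. A small sketch (aliases are illustrative, not from the original slides):
grunt> ordered_by_rating = ORDER movies BY rating DESC;
grunt> top_5_by_rating = LIMIT ordered_by_rating 5;
grunt> DUMP top_5_by_rating;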

Pig: Modes of Execution
Pig programs can be run in three ways, which work in both local and MapReduce mode. They are:
– Script Mode
– Grunt Mode
– Embedded Mode

Script mode
Script Mode or Batch Mode: In script mode, Pig runs the commands specified in a script file. The following example shows how to run a Pig program from a script file:
vim scriptfile.pig
A = LOAD 'script_file';
DUMP A;
pig -x local scriptfile.pig

Grunt mode
Grunt Mode or Interactive Mode: The grunt mode can also be called interactive mode. Grunt is Pig's interactive shell. It is started when no file is specified for Pig to run.
pig -x local
grunt> A = LOAD 'grunt_file';
grunt> DUMP A;
You can also run Pig scripts from grunt using the run and exec commands:
grunt> run scriptfile.pig
grunt> exec scriptfile.pig

Embedded mode
You can embed Pig programs in Java, Python, and Ruby, and run them from those languages.

Example: Word count program
Q) How do we find the number of occurrences of the words in a file using a Pig script?
You can find the famous word count example written as MapReduce programs on the Apache website. Here we will write a simple Pig script for the word count problem.
The Pig script given in the next slide finds the number of times each word is repeated in a file.

Example: text file - shivneri.txt

Example: Word count program (saved as forts.pig)
lines = LOAD 'shivneri.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
w_count = FOREACH grouped GENERATE group, COUNT(words);
DUMP w_count;

Output snapshot
pig -x local forts.pig

References
“Programming Pig” by Alan Gates, O'Reilly Publishers.
“Pig Design Patterns” by Pradeep Pasupuleti, PACKT Publishing.
Tutorials Point
http://github.com/rohitdens
http://pig.apache.org

Thank you
This presentation is created using LibreOffice Impress 4.2.8.2 and can be used freely as per the GNU General Public License.
Web: tusharkute.com
