Integrating R And Hadoop For Big Data Analysis

Transcription

INTEGRATING R AND HADOOP FOR BIGDATA ANALYSISBogdan Oancea"Nicolae Titulescu" University of BucharestRaluca Mariana DragoescuThe Bucharest University of Economic Studies,

BIG DATAThe term “big data” was defined as data sets ofincreasing volume, velocity and variety – 3V; Big data sizes are ranging from a few hundredsterabytes to many petabytes of data in a singledata set; Requires high computing power and largestorage devices.

BIG DATA AN OFFICIAL STATISTICSOfficial statistics need to harness the potentialof big data to derive more relevant and timelystatistics; Large data sources that can be used in officialstatistics are: Administrative data; Commercial or transactional data; Data provided by sensors; Data provided by tracking devices; Behavioral data (for example Internet searches); Data provided by social media.

BIG DATA AND OFFICIAL STATISTICS Challenges:legislative issues; maintaining the privacy of the data; financial problems regarding the cost of sourcingdata; data quality and suitability of statistical methods; technological challenges In this paper we will investigate a technologicalproblem: integrating R and Hadoop

RIs a free software package for statistics anddata visualization; Is available for UNIX, Windows and MacOS; R is used as a computational platform forregular statistics production in many officialstatistics agencies; It is used in many other sectors like finance,retail, manufacturing etc.

HADOOPIs a free software framework developed fordistributed processing of large data sets usingclusters of commodity hardware; It was developed in Java; Other languages could be used to: R, Python orRuby; Available at http://hadoop.apache.org/.

HADOOP The Hadoop framework includes: HadoopDistributed File System (HDFS); Hadoop YARN -a framework for job scheduling andcluster resource management; Hadoop MapReduce – a system for parallelprocessing of large data sets;

HADOOP The main features of the Hadoop frameworkcan be summarized as follows: Highdegree of scalability; Cost effective: it allows for massively parallelcomputing using commodity hardware; Flexibility: is able to use any type of data, structuredor not; Fault tolerance.

R AND HADOOPR and Streaming; Rhipe; Rhadoop;

R AND STREAMINGAllows users to run Map/Reduce jobs with anyscript or executable that can access standardinput/standard output; No client-side integration with R;

R AND STREAMING – AN EXAMPLE {HADOOP HOME}/bin/Hadoop jar {HADOOP HOME}/contrib/streaming/*.jar mat \-input input data.txt \-output output \-mapper /home/tst/src/map.R \-reducer /home/tst/src/reduce.R \-file /home/tst/src/map.R \-file /home/tst/src/reduce.R

R AND HADOOPThe integration of R and Hadoop usingStreaming is an easy task; Requires that R should be installed on everyDataNode of the Hadoop cluster ;

RHIPERhipe “R and Hadoop IntegratedProgramming Environment”; Provides a tight integration between R andHadoop; Allows the user to carry out data analysis of bigdata directly in R; Available at www.datadr.org.

RHIPERhipe is an R library which allows running aMapReduce job within R; Install requirements: Ron each Data Node; Protocol Buffers on each Data Node; Rhipe on each Data Node;

RHIPE – AN EXAMPLElibrary(Rhipe)rhinit(TRUE, TRUE);map -expression ( {lapply (map.values, function(mapper) )})reduce -expression(pre { },reduce { },post { },)x - rhmr(map map, reduce reduce,ifolder inputPath,ofolder outputPath,inout c('text', 'text'),jobname 'a job name'))rhex(z)

RHADOOPRHadoop is an open source project developed byRevolution Analytics; allows running a MapReduce jobs within R just likeRhipe; Consists in: plyrmr -providing common data manipulationoperations on very large data sets managed by Hadoop; rmr – a collEction of functions providing and integrationof R and MapReduce; rdfs – an interface between R and HDFS; rhbase - an interface between R and HBase;

RHADOOP – AN EXAMPLElibrary(rmr)map -function(k,v) { }reduce -function(k,vv) { }mapreduce( input ”data.txt”,output ”output”,textinputformat rawtextinputformat,map map,reduce reduce)

CONCLUSIONS Each of the approaches has benefits and limitations: R with Streaming raises no problems regarding installation;Rhipe and RHadoop requires some effort to set up thecluster;The integration with R from the client side is high for Rhipeand Rhadoop and is missing for R and Streaming.Rhipe and RHadoop allows users to define and call their ownmap and reduce functions within R;Streaming uses a command line approach where the mapand reduce functions are passed as arguments.There are other alternatives for large scale data analysis:Apache Mahout, Apache Hive, commercial versions of Rprovided by Revolution Analytics, Segue framework orORCH;

BIG DATA AN OFFICIAL STATISTICS Official statistics need to harness the potential of big data to derive more relevant and timely statistics; Large data sources that can be used in official statistics are: Administrative data; Commercial or transactional data; Data provided by sensors; Data provided by tracking devices; Behavioral data (for example Internet searches);