WORKING WITH SAS & HADOOP

Transcription

WORKING WITH SAS & HADOOP
DOUG GREEN
Copyright 2014, SAS Institute Inc. All rights reserved.

AGENDA
- Review of the "FROM, IN & WITHIN" Hadoop integration patterns
- Deployment patterns for SAS HPA/LASR with Hadoop
- SAS/ACCESS and SPDE on HDFS
- DS2 basics for Hadoop

HOW DOES SAS LEVERAGE HADOOP?
- FROM: Pulling data back to a SAS environment for processing. Moving the data out of Hadoop.
- IN: Moving the SAS workload to the data. Run SAS logic in the cluster and process big data with the MapReduce frameworks.
- WITHIN: Moving the SAS application to the data. SAS advanced analytics running natively inside Hadoop under the YARN resource management framework.

SAS WITHIN THE HADOOP ECOSYSTEM
- Tools: SAS Studio, SAS Enterprise Guide/Microsoft Office, SAS Enterprise Miner, SAS Data Loader for Hadoop, SAS HP Data Mining / Statistics, SAS Visual Analytics/Statistics
- Metadata: SAS Metadata
- Data access: SAS Server or SAS Grid; Base SAS & SAS/ACCESS to Hadoop, Impala, HAWQ
- Data processing, ingestion & advanced analytics: SAS Embedded Process accelerators, Oozie, Tez, HCatalog, Spark, SAS LASR Analytic Server (MPI based), SAS Grid Manager for Hadoop, SAS Event Stream Processing
- Distributed file system: HDFS

THE SAS LASR ANALYTIC SERVER
"It is an in-memory engine specifically engineered for the demands of interactive and iterative analytics"
- In-memory: fast, sub-second responses
- Multi-user: hundreds of concurrent users
- Stateless: don't pre-compute things
- Interactive: instantly visualise analytical output
- Deployment: MPP on HDFS (distributed) or SMP (single machine)

SAS HIGH PERFORMANCE ANALYTICS (HPA)

Traditional procedure:

    proc logistic data=HDP.mydata;
       class A B C;
       model y(event='1') = A B B*C;
    run;

- Single / multi-threaded
- Not aware of distributed computing environment
- Computes locally / where called
- Fetches data as required
- Memory still a constraint

High-performance procedure:

    proc hplogistic data=HDP.mydata;
       class A B C;
       model y(event='1') = A B B*C;
    run;

- Massively parallel (MPP)
- Two degrees of parallelism
- Uses distributed computing environment
- Computes in massively distributed mode
- Work is co-located with data
- In-memory analytics
- 40 nodes x 96 GB: almost 4 TB of memory

LASR VS HPA

Memory model
- LASR: Public (data persisted in-memory and shared)
- HPA: Private (each execution of the proc creates its own copy of the data in-memory; data is not persisted)

Concurrent users
- LASR: High
- HPA: Low

Key SAS products
- LASR: SAS Visual Analytics/Statistics, SAS In-Memory Statistics
- HPA: SAS High Performance Data Mining (via Enterprise Miner), SAS High Performance Statistics (via EG or SAS Studio)

ARCHITECTURE: THREE DEPLOYMENT OPTIONS FOR HPA/LASR
- Symmetric: SAS TKGrid on name and all data nodes
- Asymmetric (collocated subset): SAS TKGrid on subset of data nodes, YARN manages resources
- Asymmetric (separate SAS cluster)
Diagram layers: SAS In-Memory (TKGrid/LASR), YARN, Hadoop, HDFS (SASHDAT)

LOADING DATA INTO HPA/LASR
1. Load data from SASHDAT if available (fastest)
2. Load data in parallel from the Hadoop cluster via SAS EP
3. Serial loads via SAS/ACCESS are OK for small tables

SASHDAT: THE SASHDAT LIBNAME STATEMENT

Option     Notes
sashdat    The SAS engine which refers to HDFS
Path       The HDFS path
Host       The hostname of the TKGrid head node
Install    The path where the SAS TKGrid binaries are installed
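
The options in the table above might be combined as in the following minimal sketch. The libref, host name, install directory, and HDFS path are all hypothetical placeholders, not values from the slides.

```sas
/* Sketch only: every value below is a hypothetical placeholder. */
libname hdat sashdat                  /* sashdat: the engine that refers to HDFS   */
   path="/hps"                        /* Path: the HDFS directory for the tables   */
   host="grid-head.example.com"       /* Host: the TKGrid head node                */
   install="/opt/TKGrid";             /* Install: where the TKGrid binaries live   */
```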

SASHDAT: SASHDAT IS A UNIDIRECTIONAL ENGINE
You can create data using the SASHDAT engine but you cannot re-read it.
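
The example from the original slide did not survive extraction. The following sketch illustrates the behaviour described, assuming a hypothetical grid host, install path, and HDFS path; the exact log message on the failed read is not shown because it depends on the SAS release.

```sas
/* Sketch only: host, install path, and HDFS path are hypothetical. */
libname hdat sashdat host="grid-head.example.com"
                     install="/opt/TKGrid"
                     path="/hps";

/* Writing through the engine creates a SASHDAT table on HDFS. */
data hdat.cars;
   set sashelp.cars;
run;

/* Reading the table back with a DATA step is expected to fail,   */
/* because the engine does not support this kind of read access. */
data work.cars_copy;
   set hdat.cars;
run;
```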

SASHDAT: USE PROC HPDS2 TO MANIPULATE SASHDAT DATA
PROC HPDS2 can be used to create new SASHDAT files from an existing SASHDAT file.
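
As a rough illustration of that pattern, the sketch below reads one SASHDAT table and writes a derived one. The libref, table names, column names, and PERFORMANCE statement values are assumptions for illustration; DS2GTF.in and DS2GTF.out are the placeholder names PROC HPDS2 uses for its DATA= and OUT= tables.

```sas
/* Sketch only: librefs, tables, columns, and grid settings are hypothetical. */
libname hdat sashdat host="grid-head.example.com"
                     install="/opt/TKGrid"
                     path="/hps";

proc hpds2 data=hdat.obs out=hdat.obs_degf;
   performance host="grid-head.example.com" install="/opt/TKGrid";
   data DS2GTF.out;
      dcl double degf;
      method run();
         set DS2GTF.in;
         /* Derive a new column; the input table itself is not modified. */
         degf = (degc * 9 / 5) + 32;
      end;
   enddata;
run;
```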

SAS/ACCESS AND SPDE ON HDFS

SAS/ACCESS TO HADOOP
- Uses existing SAS interfaces
- Standard LIBNAME syntax
- PROC HADOOP
- DATA step and PROC SQL translated to Hive
- Filename support
- Execute Pig scripts and MapReduce
- Push-down of certain procedures
- Custom SerDe support
- SPDE formats

SAS/ACCESS TO HADOOP: HIVE
- Data types: avoid STRING, use VARCHAR for character fields
- Use native Hadoop file formats (ORC, PARQUET etc.) and partition data where appropriate
- Make use of supported in-database SAS procedures: FREQ, REPORT, SUMMARY/MEANS, TABULATE

Data integration:
- Use the standard SQL transformations in DI
- Generate explicit pass-through
- Create and manage SASHDAT and LASR tables using the DI transformations
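
A minimal sketch of using one of the in-database procedures named above against a Hive table follows. The server, user, table, and column names are hypothetical. Whether the work is actually pushed into Hive depends on the installation and on the SQLGENERATION= setting, so the SAS log should be checked.

```sas
/* Sketch only: server, user, table, and column names are hypothetical. */
libname hdp hadoop server="hive-host.example.com" user=sasdemo;

/* Ask SAS to generate SQL for in-database procedures where it can. */
options sqlgeneration=dbms;

/* PROC FREQ is one of the procedures that can run in-database. */
proc freq data=hdp.emp_donations;
   tables region;
run;
```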

MAKING USE OF YARN QUEUES
Setting the Hive queue: the PROPERTIES option can be added to the LIBNAME statement to add properties, like mapreduce.job.queuename, to the library (nqqpbys8n1j9lra8qa6q20.htm)

    libname hivetez hadoop server="gbrhadoop1-01"
       user=sasdemo
       password="{SAS002}1D57933958C580064BD3DCA81A33DFB2"
       port=10000
       properties='mapreduce.job.queuename=sas_user_queue'
       dbcreate_table_opts='STORED AS PARQUET';

PUSH DOWN THE SQL PROCESSING TO HIVE AS MUCH AS POSSIBLE
- Avoid joining SAS data with Hive data. It is recommended to move the SAS data set into Hive and execute the join inside Hadoop to leverage distributed processing.
- Avoid SAS functions that have no Hive equivalent (e.g. DATEPART), because they force the Hadoop data to be brought back to the SAS server for processing.
- Use the SASTRACE option to see the communication between SAS and Hadoop.
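
The following sketch shows one way to apply the advice above: turn on tracing, then compare an implicit query with an explicit pass-through query. The server, user, table, and column names are hypothetical.

```sas
/* Sketch only: server, user, table, and column names are hypothetical. */

/* Write the SQL that SAS sends to Hive into the SAS log. */
options sastrace=',,,d' sastraceloc=saslog nostsuffix;

libname hdp hadoop server="hive-host.example.com" user=sasdemo;

/* Implicit pass-through: SAS tries to translate the query to HiveQL. */
proc sql;
   select region, count(*) as n
      from hdp.emp_donations
      group by region;
quit;

/* Explicit pass-through: the inner query is sent to Hive as written. */
proc sql;
   connect to hadoop (server="hive-host.example.com" user=sasdemo);
   select * from connection to hadoop
      (select region, count(*) as n
          from emp_donations
          group by region);
   disconnect from hadoop;
quit;
```

With tracing on, the log shows which parts of each query were handed to Hive and which were processed on the SAS server.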

SPDE ON HDFS

SPDE ON HDFS
- Can sometimes be faster than Hive access when working with SAS, depending on the queries (no need to deal with Hive, direct access via HDFS)
- Can be faster than Hive when used as input to SAS HPA procedures

SPDE also provides some of the traditional SAS features, such as:
- Encryption
- File compression
- Member-level locking
- SAS indexes
- SAS password
- Special missing values
- Physical ordering of returned observations
- User-defined formats and informats

SAS PROGRAMMERS LEVERAGING HADOOP USING SPD ENGINE

1. Use PROC HADOOP to create the path on HDFS:

    proc hadoop
       username='Hadoop userid'
       password='Hadoop password'
       verbose;
       hdfs mkdir='/user/sasss1/spde';
    run;

2. SPD Engine LIBNAME statement:

    LIBNAME MYSPDE SPDE '/user/sasss1/spde'
       HDFSHOST=DEFAULT
       PARALLELWRITE=YES
       PARALLELREAD=YES
       ACCELWHERE=YES;

Notes on the LIBNAME statement:
1. MYSPDE is the libref we reference in our SAS code to process the SPD Engine data stored on HDFS.
2. SPDE is the engine SPD Engine uses to process SPD Engine tables.
3. '/user/sasss1/spde' is the path on HDFS where our SPD Engine data is stored.
4. HDFSHOST=DEFAULT: to connect to the Hadoop cluster, Hadoop configuration files must be copied from the specific Hadoop cluster to a physical location that the SAS client machine can access. The SAS environment variable SAS_HADOOP_CONFIG_PATH must be defined and set to the location of the Hadoop configuration files. For complete instructions, see the SAS Hadoop Configuration Guide for Base SAS and SAS/ACCESS.
5. PARALLELWRITE=YES tells SPD Engine to use parallel processing to write data to HDFS. Note: data must be uncompressed.
6. PARALLELREAD=YES tells SPD Engine to use parallel processing to read data stored in HDFS. Note: data can be uncompressed, compressed or encrypted.
7. ACCELWHERE=YES tells SPD Engine, when possible, to push all WHERE clauses down to Hadoop as MapReduce

HADOOP AND SAS FILE FORMATS

Engine: HIVE (ORC on HDP, Parquet on CDH, AVRO)
- Typically use for: data that needs to be available for processing by the broader Hadoop ecosystem; data to be processed by pushdown SQL queries or SAS DS2
- 3rd party Hadoop access: Yes

Engine: SASHDAT
- Typically use for: persisting data on HDFS and for the fast, parallel loading of data into LASR/HPA
- 3rd party Hadoop access: No

Engine: SPDE on HDFS
- Typically use for: migrating SAS data sets onto HDFS without code modification; optimised data retrieval back to SAS; input to LASR/HPA (faster than Hive); very wide analytical base tables
- 3rd party Hadoop access: Yes, read-only access via SAS supplied SerDe (only supported for symmetric deployments)

OVERVIEW OF PROCESSING OPTIONS

SAS programming method | SPDE             | HIVE                 | SASHDAT | LASR (SASIOLA)
Proc SQL implicit      | Yes              | Yes (via SAS/ACCESS) | No      | No*
Proc SQL explicit      | No               | Yes (via SAS/ACCESS) | No      | No
Data Step              | Yes (via SAS EP) | Yes (via SAS EP)     | No**    | Yes
Proc DS2               | Yes (via SAS EP) | Yes (via SAS EP)     | No      | No
Proc HPDS2             | Yes              | No                   | Yes     | Yes

*  Would work but will pull data to the SAS client for processing
** Can be used to create new SASHDAT data sets but not to modify data

SAS EMBEDDED PROCESS AND DS2

THE SAS EMBEDDED PROCESS: A WORD ON THE TECHNOLOGY
A portable, lightweight execution container for SAS code that makes SAS portable and deployable on a variety of platforms.

    proc ds2;
       /* thread: equivalent to a mapper */
       thread map_program;
          method run();
             set dbmslib.intab;
             /* program statements */
          end;
       endthread;
       run;

       /* program wrapper */
       data hdf.data_reduced;
          dcl thread map_program map_pgm;
          method run();
             set from map_pgm threads=N;
             /* reduce steps */
          end;
       enddata;
       run;
    quit;

Uses of the EP:
1. Data lifting
2. Data preparation
3. Data quality
4. Scoring

RUN FASTER. RUN EMBEDDED
- Efficient way to process data.
- Runs inside Hadoop's MPP architecture.
- Moves the computation to the data.
- Eliminates data movement.
- Decreases overall processing times.

DS2 IN 30 SECONDS
- Procedural programming language.
- Mainly focused around parallel execution.
- Supports ANSI SQL data types.
- Allows embedded SQL as input to the program.
- Allows modular programming: scope and methods.
- Supports packages and threads.

SAS EMBEDDED PROCESS FOR HADOOP
- Lightweight execution container for DS2.
- Written in C and Java.
- Runs inside a MapReduce task.
- Orchestrated by the Hadoop MapReduce framework.
- Resource allocation managed by YARN.

DS2: WHAT IS DS2?

DS2: WHAT IS DS2? DATA TYPES

DS2: DATA STEP SIMILARITIES/DIFFERENCES

RUNNING SAS DATA STEP & DS2 IN HADOOP THROUGH THE CODE ACCELERATOR
- Key SAS options: DSACCEL=ANY and DS2ACCEL=ANY
- DS2 in Hadoop supports both Hive and SAS SPDE tables
- Use PROC HPDS2 to manipulate SASHDAT tables
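
As a minimal sketch of the two options named above, the following sets both and submits a simple DATA step against a Hive table. The server, user, table, and column names are hypothetical. Whether the step actually runs inside the cluster depends on the Code Accelerator being installed and on the step meeting its restrictions, so check the SAS log.

```sas
/* Sketch only: server, user, table, and column names are hypothetical. */
libname hdp hadoop server="hive-host.example.com" user=sasdemo;

/* Allow eligible DATA step and DS2 code to run inside Hadoop. */
options dsaccel=any ds2accel=any;

/* One Hadoop input table and one Hadoop output table. */
data hdp.donations_flagged;
   set hdp.emp_donations;
   /* 1 when the donation exceeds the threshold, 0 otherwise */
   large_gift = (donation > 1000);
run;
```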

DS2 SYNTAX FRAMEWORK FOR HADOOP
1. Hadoop libname
2. SAS options
3. Create thread program
4. DS2 logic
5. Call thread program

DS2 IN HADOOP WITH CODE ACCELERATOR

    proc ds2 ds2accel=yes;
       thread compute;
          method run();
             set hdfs.emp_donations;
             total = sum(jan--dec);
          end;
       endthread;

       data hdfs.totals;
          dcl thread compute t;
          method run();
             set from t;
          end;
       enddata;
    run; quit;

Diagram: the employee donations table is divided into splits 1 to 4; each split is processed by a MAP task running the DS2 thread program, producing Totals parts 1 to 4.

DS2 IN HADOOP WITH BY GROUP PROCESSING

    proc ds2 ds2accel=yes;
       thread compute;
          method run();
             set hdfs.emp_donations;
             by region;
             if first.region then total = 0;
             total = sum(jan--dec);
             if last.region then output;
          end;
       endthread;

       data hdfs.totals;
          dcl thread compute t;
          method run();
             set from t;
          end;
       enddata;
    run; quit;

Diagram: the input data is divided into splits 1 to 4, each read by a MAP task; the MapReduce shuffle/sort groups the rows, and four REDUCE tasks running DS2 produce results parts 1 to 4.

ORIGINAL DATA STEP PROGRAM

DS2 EQUIVALENT FOR HADOOP

Notes on the DS2 program, statement by statement:
- DS2 is a SAS procedure and is therefore invoked through SAS procedure syntax.
- To run in-database, a thread program must be used. The SAS Code Accelerator enables you to publish a DS2 thread program and execute that thread program in parallel inside Hadoop.
- Unlike Base SAS, DS2 enables you to explicitly declare variables using the DECLARE statement. Here it is declared outside of a method so its scope is GLOBAL.
- DS2 has new data types, more akin to an RDBMS, and they should be explicitly declared, e.g. VARCHAR, DOUBLE, INT, BIGINT etc.
- DROP/KEEP/RETAIN/RENAME are only valid in global scope, i.e. outside of a method programming block.
- Method run() is a system method: it executes in an implicit loop for every row of the input data. Other system methods are init() and term().
- This block of code is identical to the original DATA step program.
- A BY statement is required to generate Hadoop REDUCE tasks. Without a BY statement, only MAP tasks are generated.
- END statement to close the run() method.
- ENDTHREAD statement to close the thread program.
- Now we reference the output data set to be created on Hadoop.
- Explicitly declare the thread program and specify a name that identifies an instance of the thread.
- Use method run() to allow the program to read from the thread program.
- Read the thread program by referencing the thread identifier.
- END statement to close the run() method.
- The ENDDATA statement marks the end of a DATA statement.
- The RUN statement submits the DS2 statements.
- As DS2 is a SAS procedure we must explicitly quit it.
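
The program shown on the original slides did not survive extraction. The following is a minimal sketch that follows the same sequence of statements, not the author's code. The libref, server, table, and column names are hypothetical.

```sas
/* Sketch only: libref, server, tables, and columns are hypothetical. */
libname hdp hadoop server="hive-host.example.com" user=sasdemo;

proc ds2 ds2accel=yes;                 /* DS2 is invoked as a procedure        */
   thread compute / overwrite=yes;     /* thread program runs inside Hadoop    */
      dcl double total;                /* declared outside a method: global    */
      retain total;                    /* RETAIN/KEEP are global-scope only    */
      keep region total;
      method run();                    /* implicit loop over every input row   */
         set hdp.emp_donations;
         by region;                    /* BY is what generates REDUCE tasks    */
         if first.region then total = 0;
         total = total + donation;
         if last.region then output;
      end;                             /* closes run()                         */
   endthread;                          /* closes the thread program            */
   run;

   data hdp.region_totals;             /* output table created on Hadoop       */
      dcl thread compute t;            /* t identifies an instance of compute  */
      method run();
         set from t;                   /* read rows produced by the thread     */
      end;
   enddata;
   run;                                /* submits the DS2 statements           */
quit;                                  /* ends the procedure                   */
```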

THE SAS LOG

WHAT'S HAPPENING ON THE HADOOP CLUSTER?
