Getting Started with SAS and Hadoop
Jeff Bailey
#analyticsx
Copyright 2016, SAS Institute Inc. All rights reserved.

Why Hadoop?

HOW MUCH DOES THIS DRIVE COST?

Price of a 3 TB drive, by year:

    Today   $69 (the slide also shows $92)
    2010    $270
    2005    $3,720
    2000    $33,000
    1995    $3,360,000
    1990    $33,600,000
    1985    $315,000,000
    1980    $1,312,500,000

Silly, you couldn't get a 3 TB drive in 1980!

That's $0.03 per GB!

Insight: Disk Space is FREE!
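The per-gigabyte figure on the slide can be checked with a little arithmetic. The sketch below is only an illustration: it assumes 1 TB = 1000 GB, assumes the prices are US dollars, and takes 92 as today's price because that is the value that rounds to the 0.03 per GB quoted on the slide (69 would give roughly 0.02). The names `PRICE_OF_3TB` and `cost_per_gb` are made up for this example.

```python
# Prices of a 3 TB drive as listed on the slide (assumed to be US dollars).
# Assumption: "today" uses the 92 figure, which matches the quoted 0.03 per GB.
PRICE_OF_3TB = {
    "1980": 1_312_500_000,
    "1985": 315_000_000,
    "1990": 33_600_000,
    "1995": 3_360_000,
    "2000": 33_000,
    "2005": 3_720,
    "2010": 270,
    "today": 92,
}

GB_PER_TB = 1000  # assumption: decimal terabytes


def cost_per_gb(price, terabytes=3):
    """Return the price of one gigabyte given the price of the whole drive."""
    return price / (terabytes * GB_PER_TB)


if __name__ == "__main__":
    for year, price in PRICE_OF_3TB.items():
        print(f"{year:>5}: ${cost_per_gb(price):,.2f} per GB")
```

Under these assumptions the 1980 price works out to 437,500 dollars per GB, against about 3 cents today.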

IT'S NOT JUST ABOUT COST!

How long does it take to read 3 TB of data? 4.17 hours.

What happens if you add more disks?

HOW LONG DOES IT TAKE TO READ A 3 TB FILE?

    1 disk       4.17 hr
    100 disks    2.5 min
    1000 disks   15 sec

Insight: More Disks are FASTER!
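The read times on the slide follow from dividing the work evenly across disks. The sketch below is a rough model, not a benchmark: it assumes 1 TB = 10^12 bytes, derives the per-disk throughput implied by the 4.17-hour figure (about 200 MB/s), and assumes perfect parallelism with no coordination overhead. The function name `read_time_seconds` is invented for this example.

```python
TB = 10**12  # assumption: decimal terabytes


def read_time_seconds(total_bytes, disks, bytes_per_second_per_disk):
    """Time to read total_bytes when the data is spread evenly over `disks`
    disks that are all read in parallel (idealized, no overhead)."""
    return total_bytes / (disks * bytes_per_second_per_disk)


# Per-disk throughput implied by the slide: 3 TB in 4.17 hours on one disk.
SINGLE_DISK_SECONDS = 4.17 * 3600
THROUGHPUT = 3 * TB / SINGLE_DISK_SECONDS  # roughly 200 MB/s

if __name__ == "__main__":
    print(f"implied throughput: {THROUGHPUT / 1e6:.0f} MB/s per disk")
    for disks in (1, 100, 1000):
        seconds = read_time_seconds(3 * TB, disks, THROUGHPUT)
        print(f"{disks:>5} disks: {seconds:,.0f} s ({seconds / 60:.1f} min)")
```

This idealized scaling is what HDFS and data-local processing aim for; a real cluster adds scheduling and network costs on top.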

What is Hadoop?

Hadoop is a Storage Platform

- Distributed Storage
- Performs Great
- Data is Replicated
- Reasonable Cost
- Sits on the OS File System

[Diagram: Hadoop stack showing HDFS]

Hadoop is a Processing Platform

- MapReduce/YARN
- Distributed Processing
- Data Locality
- Usually Java

[Diagram: Hadoop stack showing YARN / MapReduce and HDFS]

Apache Pig

- Scripting Language
- Higher level than programming Java MapReduce
- Pig Latin scripts are converted to MapReduce jobs
- Great for joining data
- Great for transforming data

[Diagram: Hadoop stack showing Pig, YARN / MapReduce, and HDFS]

Apache Pig: Example Program

    people = LOAD '/user/training/customers' AS (cust_id, name);
    orders = LOAD '/user/training/orders' AS (ord_id, cust_id, cost);
    groups = GROUP orders BY cust_id;
    totals = FOREACH groups GENERATE group, SUM(orders.cost) AS t;
    result = JOIN totals BY group, people BY cust_id;
    DUMP result;

- Distributed Processing
- Data Locality
- Map Phase
- Reduce Phase
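For readers who do not know Pig Latin, the same group, sum, and join steps can be sketched in plain Python on a single machine. This is only an illustration of the data flow, not how Pig executes it: the sample rows and the function names `total_cost_by_customer` and `join_totals_with_people` are hypothetical, and the join is assumed to be an inner join.

```python
from collections import defaultdict

# Hypothetical sample rows: (cust_id, name) and (ord_id, cust_id, cost).
PEOPLE = [(1, "Ann"), (2, "Bob")]
ORDERS = [(10, 1, 5.0), (11, 1, 7.5), (12, 2, 3.0)]


def total_cost_by_customer(orders):
    """GROUP orders BY cust_id, then SUM(orders.cost) per group."""
    totals = defaultdict(float)
    for _ord_id, cust_id, cost in orders:
        totals[cust_id] += cost
    return dict(totals)


def join_totals_with_people(totals, people):
    """JOIN totals BY group, people BY cust_id (inner join)."""
    names = dict(people)
    return [
        (cust_id, total, cust_id, names[cust_id])
        for cust_id, total in sorted(totals.items())
        if cust_id in names
    ]


if __name__ == "__main__":
    for row in join_totals_with_people(total_cost_by_customer(ORDERS), PEOPLE):
        print(row)
```

On a cluster, Pig compiles these steps into MapReduce jobs so the grouping and summing run in parallel next to the data.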

Apache Hive

- SQL on Hadoop
- Similar to traditional SQL
- Reduces development time
- Enables BI on Hadoop
- Schema-on-Read
- You choose underlying file format

[Diagram: Hadoop stack showing Pig, Hive2, YARN / MapReduce, and HDFS]

Apache Hive: Example Query

    SELECT zipcode, SUM(cost) AS total
    FROM customers
    JOIN orders
    ON (customers.cust_id = orders.cust_id)
    WHERE zipcode LIKE '63%'
    GROUP BY zipcode
    ORDER BY total DESC;
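Because HiveQL is close to traditional SQL, the shape of this query can be tried locally before pointing it at a cluster. The sketch below runs an equivalent query against an in-memory SQLite database through Python's standard `sqlite3` module. SQLite stands in for Hive here purely for illustration; the table layouts, the sample rows, and the function name `total_by_zipcode` are assumptions, not part of the presentation.

```python
import sqlite3

# Hypothetical sample data; zip codes are stored as text so LIKE '63%' works.
CUSTOMERS = [(1, "Ann", "63101"), (2, "Bob", "63005"), (3, "Cy", "90210")]
ORDERS = [(10, 1, 10.0), (11, 1, 20.0), (12, 2, 5.0), (13, 3, 99.0)]

QUERY = """
SELECT zipcode, SUM(cost) AS total
FROM customers
JOIN orders
ON (customers.cust_id = orders.cust_id)
WHERE zipcode LIKE '63%'
GROUP BY zipcode
ORDER BY total DESC;
"""


def total_by_zipcode(customers, orders):
    """Load the rows into an in-memory database and run the query."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute("CREATE TABLE customers (cust_id INTEGER, name TEXT, zipcode TEXT)")
        con.execute("CREATE TABLE orders (ord_id INTEGER, cust_id INTEGER, cost REAL)")
        con.executemany("INSERT INTO customers VALUES (?, ?, ?)", customers)
        con.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
        return con.execute(QUERY).fetchall()
    finally:
        con.close()


if __name__ == "__main__":
    for zipcode, total in total_by_zipcode(CUSTOMERS, ORDERS):
        print(zipcode, total)
```

With the sample rows above, only the two zip codes starting with 63 are returned, largest total first.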

Apache Impala is a SQL Engine

- High-performance SQL engine
- Handles concurrency well
- Does not rely on MapReduce
- Supports a dialect of SQL very similar to Hive's
- 100% open source, Apache License

[Diagram: Cloudera stack showing Pig, Hive2, Impala, YARN / MapReduce, and HDFS]

How can SAS Interact with Hadoop?

Using Base SAS 9.4 with Hadoop

[Diagram: ways Base SAS 9.4 works with Hadoop]
- #1 FILEREF: Data Files, HDFS Data Files

SAS FILENAME Statement for Hadoop

    options set=SAS_HADOOP_CONFIG_PATH="\\sashq\cdh45p1";
    options set=SAS_HADOOP_JAR_PATH="\\sashq\cdh45";

    FILENAME hdp1 hadoop 'test.txt';

    /* Write file to HDFS */
    data _null_;
       file hdp1;
       put ' Test Test Test';
    run;

    /* Read file from HDFS */
    data test;
       infile hdp1;
       input textline $15.;
    run;

[Diagram: SAS alongside the Cloudera Hadoop stack: Pig, Hive2, Impala, YARN / MapReduce, HDFS]

Using Base SAS 9.4 with Hadoop

[Diagram: ways Base SAS 9.4 works with Hadoop]
- #1 FILEREF: Data Files, HDFS Data Files
- #2 PROC HADOOP: MapReduce, HDFS commands

Hadoop Procedure

- Submit HDFS commands
- Submit MapReduce Jobs
- Submit Pig Latin programs

[Diagram: SAS alongside the Hadoop stack: Pig, Hive2, Impala, YARN / MapReduce, HDFS]

How Do I Submit HDFS Commands?

    filename cfg 'C:\Hadoop_cfg\cdh57.xml';

    /* Copy war_and_peace.txt to HDFS. */
    /* Copy moby_dick.txt to HDFS. */
    proc hadoop options=cfg username="sasxjb" verbose;
       HDFS mkdir='/user/sasxjb/Books';
       HDFS COPYFROMLOCAL="C:\Hadoop_data\moby_dick.txt"
            OUT='/user/sasxjb/Books/moby_dick.txt';
       HDFS COPYFROMLOCAL="C:\Hadoop_data\war_and_peace.txt"
            OUT='/user/sasxjb/Books/war_and_peace.txt';
    run;

How Do I Submit MapReduce Jobs?

    filename cfg 'C:\Hadoop_cfg\cdh57.xml';

    proc hadoop options=cfg user="sasxjb" verbose;
       mapreduce input='/user/sasxjb/Books/moby_dick.txt'
                 output='/user/sasxjb/outBook'
                 jar='C:\Hadoop…'
                 outputkey="org.apache.hadoop.io.Text"
                 outputvalue="org.apache.hadoop.io.IntWritable"
                 reduce="org.apache.hadoop.examples.WordCount$IntSumReducer"
                 combine="org.apache.hadoop.examples.WordCount$IntSumReducer"
                 map="org.apache.hadoop.examples.WordCount$TokenizerMapper";
    run;
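The job above runs Hadoop's stock WordCount example, whose mapper and reducer classes are named in the MAPREDUCE statement. As a rough single-process illustration of what those two phases do, here is a plain Python sketch. It is not the Hadoop implementation: the function names mirror the Java class names only for readability, the shuffle step is simulated with a dictionary, and splitting on whitespace is an assumption about how tokens are formed.

```python
from collections import defaultdict


def tokenizer_mapper(line):
    """Map phase: emit a (word, 1) pair for every whitespace-separated token."""
    for token in line.split():
        yield token, 1


def int_sum_reducer(word, counts):
    """Reduce phase: add up all the counts emitted for one word."""
    return word, sum(counts)


def run_word_count(lines):
    """Simulate map, shuffle (group by key), and reduce in one process."""
    grouped = defaultdict(list)
    for line in lines:
        for word, one in tokenizer_mapper(line):
            grouped[word].append(one)
    return dict(int_sum_reducer(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    sample = ["Call me Ishmael", "Call me"]  # hypothetical input lines
    print(run_word_count(sample))
```

The slide uses the same class as both combiner and reducer. That works because summing partial counts on each node and then summing those partial sums gives the same total.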

How Do I Submit Pig Latin Programs?

    filename cfg 'C:\Hadoop_cfg\cdh57.xml';

    proc hadoop options=cfg username="sasxjb" verbose;
       pig code=pigcode;
    run;

Using Base SAS 9.4 with Hadoop

[Diagram: ways Base SAS 9.4 works with Hadoop]
- #1 FILEREF: Data Files, HDFS Data Files
- #2 PROC HADOOP: MapReduce, HDFS commands
- #3 SAS/ACCESS: HiveQL to HiveServer2, Result sets

SAS/ACCESS Interface to Hadoop

- Generates HiveQL
- Connects via JDBC
- Makes Hive tables look like SAS data sets
- Bulk loads directly to HDFS
- Can read directly from HDFS

[Diagram: SAS alongside the Hadoop stack: Pig, Hive2, Impala, YARN / MapReduce, HDFS]

How Does SAS/ACCESS Talk to Hadoop?

Turn on SAS tracing to write the SQL that SAS/ACCESS sends to Hive to the SAS log:

    OPTIONS SASTRACE=',,,d' SASTRACELOC=SASLOG NOSTSUFFIX;

    proc sql;
       select count(*) from mycdh.customer_dim
       where loyalty_program = 'Chocolate Club';
    quit;

SAS Generated This SQL:

    select COUNT(*) from CUSTOMER_DIM TXT_1 WHERE TXT_1.loyalty_program = 'Chocolate Club'

We Can Write Our Own HiveQL!

Explicit Pass-Through:

    proc sql;
       connect to hadoop (server=quickstart user=cloudera);
       execute (create table store_cnt
                row format delimited
                fields terminated by '\001'
                stored as parquet
                as
                select customer_rk, count(*) as tot
                from order_fact
                group by customer_rk) by hadoop;
    quit;

What about Apache Impala?

#4 SAS/ACCESS Interface to Impala:
- Connects via ODBC
- Makes Hive tables look like SAS data sets
- Bulk loads directly to HDFS

[Diagram: SAS/ACCESS to Impala sends HiveQL to Impala and receives result sets; Cloudera stack showing Pig, Hive2, Impala, YARN / MapReduce, HDFS]

In-Database: Code Accelerator

What is SAS In-Database Code Accelerator?

SAS In-Database Code Accelerators let you run SAS code inside Hadoop. With this you get:
- DS2 processing (modern DATA Step)
- More Data Types
- Code Packages
- More Programming Structures
- Parallel Database Operations
- Thread Programs Run Inside Database

    proc ds2 indb=yes;
       thread tpgm / overwrite=yes;
          method run();
             set hdplib.intable;
             output;
          end;
       endthread;
       run;

       data hdplib.outdata(overwrite=yes);
          dcl thread tpgm hdpdata;
          method run();
             set from hdpdata;
          end;
       enddata;
       run;
    quit;

In-Database Code Accelerator Runs in Hadoop

[Diagram: Hadoop stack showing Pig, Hive2, SAS EP, YARN / MapReduce, and HDFS]

In-Database: Scoring Accelerator

What is SAS In-Database Scoring Accelerator?

SAS In-Database Scoring Accelerator lets you score models inside the cluster. With this you get:
- Uses the SAS Embedded Process
- Faster Scoring
- Less data movement: score data where it lives
- Uses fewer resources

[Diagram: Hadoop stack showing Pig, Hive2, SAS EP, YARN / MapReduce, and HDFS]

What Does the Scoring Process Look Like?

Data Loader for Hadoop

Data Loader for Hadoop: Self Service Big Data

- Easy to use UI
- Query Data
- Manage Data
- Transform Data
- Run Custom Code
- Move Data

SAS Grid Manager for Hadoop

What is SAS Grid Manager for Hadoop?

Servers

SAS Viya Hadoop Architecture

[Diagram labels:]
- HDFS as infrastructure
- Cloud Analytic Services (CAS)
- Microservices
- In-Memory Engine
- SAS Data Connector to Hadoop
- SAS Data Connect Accelerator for Hadoop
- Hadoop

Feel Free to Contact 1 SAS Hadoop
