Integrating SAS, Apache Hadoop And An Enterprise Data .

Transcription

Session: 3973 – Integrating SAS,Apache Hadoop, and anEnterprise Data Warehouse in aSingle SolutionBob Matsey – Teradata Senior Advanced Analytic Consultant1 2015 Teradata

Agenda SAS & Teradata Partnership Benefits of In Database– Coding Example Customer Improvement Examples VIYA Integration with Teradata Teradata’s UDA Agile Analytics with Data Labs In-DB Decision management with Decision Manager IoT Example – Wearables Questions?2

The SAS & Teradata Partnership Overview Teradata is an Authorized Global Reseller of SAS Solutions Partnership began in 2007 to improving analytic performance Focus on joint product collaboration and customer success More than 450 sales to over 240 customers already Teradata has dedicated R&D teams onsite at SAS Regular collaboration on Joint Product Roadmapto ensure seamless product integration3

Example of In Database with Proc FREQSAS SessionTraditionalTechniqueSQLPushdownProc Freq;table state*credit;SAS/Access to Teradata Request all rows Select state, creditfrom credit data; Calculatefrequency count Select count(*),state, credit from . . .group by state,credit; Return only countSQLSelectSQLSelectTeradataNode Node Node Node Node Node Node Node Node Node Node Node Node Node Node NodeAmps Amps Amps Amps Amps Amps Amps Amps Amps Amps Amps Amps Amps Amps Amps AmpsTraditionalSQL PushdownRows Returned9,000,000Time to Process55 seconds4512 seconds

In Database Coding ExampleTesting In-database FunctionalityNot Running In Database Example: ( SQLGENERATION NONE;) will tell the code to NOT run In database.Example 1 – Shows running a simple Proc Freq in a SAS program against a larger dataset ( at least 1- 2 million rows) without in-database capabilities turned on& with SAS log turned on. Then review the SAS log for duration and database performanceCode Example:12 libname tdXXXX teradata server "XXXserver" database XXXXP user &user password &password;1317 options sastrace (,,,ds) sastraceloc saslog nostsuffix;20 OPTIONS SQLGENERATION NONE;21PROC FREQ DATA tdxxxx.xxxxx;22TABLES XXXX XXXX;23RUN;Running In Database Example: ( SQLGENRATION DBMS; ) Will tell the code to run In database2nd Example is: Running the same Proc Freq code in a SAS program with the following options: options SQLGENERATION DBMS . This option says to runthe code In database whenever it can, so I highly recommend putting this on ALL your SAS code.12 libname tdXXXX teradata server "XXXserver" database XXXXP user &user password &password;1317 options sastrace (,,,ds) sastraceloc saslog nostsuffix;20 OPTIONS SQLGENERATION DBMS DBIDIRECTEXEC set truncate bigint 'yes' MSGLEVEL 1;21PROC FREQ DATA tdxxxx.xxxxx;22TABLES XXXX XXXX;23RUN;Running these two test will show,Example 1 – this will NOT run In database.Example 2 – will run IN database.5

In-Database FunctionalitySAS/Access to TeradataBase Procedures: PROC APPEND PROC CONTENTS PROC COPY PROC DATASETS PROC DELETE PROC FORMAT PROC FREQ PROC MEANS PROC PRINT PROC RANK PROC REPORT PROC SORT PROC SQL PROC SUMMARY PROC TABULATEDQ Accelerator for Teradata Match code Parsing/Casing Gender/Pattern/Identificationanalysis Standardization6SAS Code Accelerator for TeradataSAS Scoring Accelerator for Teradata PROC DS2 EM/STAT* ModelsSAS Analytics Accelerator for TeradataStatistical Analysis Procedures: PROC CANCORR PROC CORR PROC FACTOR PROC PRINCOMP PROC REG PROC SCORE PROC TIMESERIES PROC VARCLUSSAS Enterprise Miner PROC DMDB PROC DMINE PROC DMREG (Logistic Regression) Also nodes for Input, Sample, Partition, Filter,Merge, Expand PROC SCORE works withcoefficients from: PROC ACECLUS PROC CALIS PROC CANDISC PROC DISCRIM PROC FACTOR PROC PRINCOMP PROC TCALIS PROC VARCLUS PROC ORTHOREG PROC QUANTREG PROC REG PROC ROBUSTREG

X Faster#Process Name1Horizontalization18 hrs7 mins32 mins34 X15 hrs3 mins33 mins27 X27SAS 2 NodeTeradataSAS OracleHorizontalization3Variable Calculation6 hrs57 mins4 mins104 X4Scoring10 hrs56 mins11 mins60 X5Data Mart Generation27 hrs50 mins1 hour28 mins19 X

SAS Programs Results Highlights– GE – long running queries with sort- Execution in Teradata only took 3.75 minutes – 1600X – Old way 103 hours!– OSCAR – running against Commercial Market Scan data- Execution in Teradata was 1 hour 50 minutes against 3 times larger data set – Old way 231 hoursSAS OnlyBusiness#Line1 oscar2 GE3 ingenix4 humana5 ingenix6 ingenix7 ingenix8 ingenix9 ingenix10 ingenix11 ingenix12 ingenix13 pharmetrics14 pharmetrics15 pharmetrics16 pharmetrics17 pharmetrics18 pharmetrics8SAS Log Nameoscar mdcd v3.logmk text observation f sort.logdcf i3 qc.loghumana dups.loganalysis 100 indentifying initial patients.loganalysis 200 extracting mx claims.loganalysis 210 extracting rx claims.logdcf mk s2009 r12q2.logdcf mk s2010 r12q2.logdcf mk s2011 r12q2.logdcf mk m2011 r12q2.logdcf mk r2011 r12q2.log130 af all claims.log110 af claims.log183 table8d.log183 table8b.log162 table2b.log182 table8d.log# ofSteps94533,40128121112202020202012643393043Days Hours Minutes9.64.3231.6 30.420.623.8SAS Teradata% ofSAS X TimesDays Hours Minutes Only 24.712.721.519.39.120.37.413.2

Agile Analytics – Integrating Data into a Single SolutionUNIFIED DATA essIntelligenceFrontlineWorkersSCMINTEGRATED DATAWAREHOUSECRMImagesDATAPLATFORMorQuery GridData LabDataMiningAudioand Y PLATFORMTextMathand StatsLanguagesWeb RSANALYTICTOOLS

Dealing with All Types of DataEnabling Self Service Data Loading &Analytics with a Teradata’s Data Labs

Business Need for Agile AnalyticsFlexibility vs. IT Process Analyze quickly– Test New Theories– New Data Does the new data provide additional insight? Does the new insight cause a change in thinking or direction? Test Fast– Was the theory right? (Success or Failure) Productionize what works; discard what doesn’t!– Add the new application– Add the new data– Or delete and move on!11112018 Teradata Confidential & Proprietary

Don’t Just Use Production Data – Evolve It3rd Party Data Often rented, supplier data and/or format needs to change, valueneeds validation, only applies to some projectsTemporary & Research Data Exploratory metrics and aggregates, requirements not fully defined,short lived, early stage workPre-Production Data & Prototypes Excel Spreadsheets Oracle, SQL Server, SAS datasets, Access DB, others can be loaded Comma delimited, space delimited, other data types12122018 Teradata Confidential & Proprietary

Teradata Data Labs ArchitectureAnalytic Sandboxes with GovernanceR, Python, SPSS, SAS, SQL Data Lab(s) inside the EDW or DW Appliance to easilyjoin to production data via Views Load experimental, untested data from external sources Rapid prototyping, exploratory and experimentationanalysis Beyond a Sandbox– An architecture that enables governance Works within your current data warehouseenvironment– Data lab portlets for IT and Business analyst Self-provisioning system that simplifiesimplementation, management and useSASdataData LabscsvdataHadoopdataExternalData13Active WorkloadManagementTeradataDatabase

Teradata Data Lab HierarchyData Lab ObjectsData Lab hierarchy to manage user groups, space, and workload14DatabaseLab GroupData LabsTableDatabase wherethe lab groupresides NormalTeradata userdatabaseWorkspace allocated for agroup of users to createtheir own data labs.Groups can be arrangedby department or projectGroups can be madeprivateLab Group is a fixed sizethat’s shared by users.Workspacesallocated foranalysisDatabase tableto store thedata Can be for a singleuser or X number ofusers Data Labs expire Data Labs areallocated with a fixsize, but are elastic User can createtable and loaddata

Example: Lab Group HierarchyViewpointDW1MarketingLab Group15TeradataDatabaseSalesLab GroupData ScientistLab GroupCampaignLabDemand CurveLabRisk AnalyticsLabPromotionLabSales ForecastLabCustomerSegmentation LabCust RetentionLabHelen’s PersonalLab

SAS is Built into the Teradata Analytics PlatformTeradata’s strategy is to allow the customer to choose the tools they neSAS Viya*ApacheSpark,Tensor Flow*High-Speed FabricTeradataData StoreDataStorageS3 DataStorage*Teradata QueryGrid (Data Store Access)* future capabilities16CustomEngine*Teradata QueryGrid(Engine Access)Cross-Engine Analytic Orchestration

QUESTIONS ?17

Session: 3973 –Integrating SAS, Apache Hadoop, and an Enterprise Data Warehouse in a . Example 1 –Shows running a simple Proc Freq in a SAS program against a larger dataset ( at least 1- 2 million rows) without in-database capabilities turned on & with SAS log turned on. Then review the SAS log for duration and database performanceFile Size: 731KB