Impala First Stepsofa Oracle Expert - WordPress

Transcription

Big DataImpalaWhy do I need Impala?First steps of a Oracle ExpertAuthor: Jan Ott – Trivadis AGBASEL1BERNLAUSANNEZÜRICH2015 TrivadisBig Data - ImpalaDÜSSELDORFFRANKFURT A.M.FREIBURG I.BR.HAMBURGMÜNCHENSTUTTGARTWIEN

Our company.Trivadis is a market leader in IT consulting, system integration, solutionengineeringand the provision of IT services focusing onandtechnologiesin Switzerland, Germany, Austria and Denmark. We offer our services in thefollowing strategic business fields:OPERATIONTrivadis Services takes over the interacting operation of your IT systems.22015 Trivadis Trivadis – The Company21.11.15

With over 600 specialists and IT experts in your region.COPENHAGEN14 Trivadis branches and more than600 employees200 Service Level AgreementsHAMBURGOver 4,000 training participantsResearch and development budget:CHF 5.0 millionDÜSSELDORFFinancially self-supporting andsustainably GGZURICHBERNGENEVA3LAUSANNE2015 Trivadis Trivadis – The Company21.11.15VIENNAExperience from more than 1,900projects per year at over 800customers

Jan Ott – Who am I?25 years in IT25 years using Oracle15 years forTrivadis AG§§§§BI – DWH§§TuningSpeaker / Trainer§http://janottblog.wordpress.com/42015 TrivadisBig Data - Impala

Agenda1.2.3.4.5.6.5IntroductionFirst Steps in the Impala WorldProject – Hadoop as a file store for the DWHProject – Hadoop for archiving of a Oracle DBProject - TwitterSummary2015 TrivadisBig Data - Impala

IntroductionA few words about Big Data§§§Big DataHadoop§Impala – Why?Impala – my first steps§§§§6§§Get some data into HadoopTables in Impala§§Use SQLDiverseProject 1 – Data in Hadoop for a DWHProject 2 – Hadoop for ArchivingProject 3 – Twitter2015 TrivadisBig Data - Impala

Big Data: IntroductionBig Data - V’s – 3, 4 or 5§§§Volume – scale of dataVelocity – analysis of streaming data§Variety – different form of data§§Veracity – uncertainty of data (IBM)Value – business value (Microsoft)Hadoop and its Zoo§§§HDFS – MapReduceImpala, HBase, Hive, §ZookeeperNoSQL DatabasesArchitecture§§§7LAMBDA2015 TrivadisBig Data - ImpalaTurning Data into Insights

What is Hadoop§ a file system – HDFS§ Based on papers from Google§ Apache Open Source Project§ Goal§§§§§8FastHandles huge amount of dataHandles unstructured to fully structured dataHorizontally scalableReliable2015 TrivadisBig Data - Impala

What is Impala§ a SQL Query Engine onto HDFS – Hive§ Not an Apache Open Source Project§ Open Source – Cloudera, Oracle, Amazon§ Hive & Impala§ Impala uses the metadata store of Hive§ Goal§§§§§§9Easy to use - SQLFastHandles huge amount of dataHandles unstructured to fully structured dataHorizontally scalableReliable2015 TrivadisBig Data - Impala

Agenda1.2.3.4.5.6.10IntroductionFirst Steps in the Impala WorldProject – Hadoop as a file store for the DWHProject – Hadoop for archiving of a Oracle DBProject - TwitterSummary2015 TrivadisBig Data - Impala

First Steps§ Keep it simple§ Get some data into Hadoop§ Get some data into Impala§ Java – keep it to a minimum§ Get an environment that is setup§ Oracle VM – Big Data Light§ Pick one way to get the data into Impala§ Impala shell interface§ See SQL on a HDFS system112015 TrivadisBig Data - Impala

Pre-Requisite – Environment§ Oracle Big Data Lite§ VM§ Version 4.1§ -appliance/oraclebigdatalite-2104726.html§ Contains§§§§§§§Oracle Database 12c (12.1.0.2)Cloudera’s Distribution including Apache Hadoop (CDH5.1.2)Hadoop 2.3.0Hive 0.12.0Impala 2.1Oracle Big Data Connectors 4.0Oracle SQL Developer 4.0.3§ Oracle Virtual Box122015 TrivadisBig Data - Impala

Information about the VM§ Login§ oracle/welcome1§ Start here§ � Start§§§§Oracle DBHiveImpalaHDFS§ Your done preparing§ Oracle has a Movie example132015 TrivadisBig Data - Impala

The Steps – simple – focusImpalaTable142015 TrivadisBig Data - ImpalaSQLQuery

Impala§ Sources can be§ A delimited text file in the host file system- which will be copied into the Impala Store (HDFS)§ A delimited text file in HDFS§ Limits§§§§§§Read only or addNo UpdateNo DeleteNo Commit / RollbackNo IndexesAlways full table/file scan – or partition scan§ Uses Hive Metastore152015 TrivadisBig Data - Impala

HDFS – Command Shell§§§§§§§catchmodcplsput ct-dist/hadoopcommon/FileSystemShell.html§ Example hadoop fs –lsFound 6 itemsdrwx------ oracledrwx------ oracledrwxr-x--- oracledrwx------ oracledrwxr-xr-x- oracledrwxr-xr-x- oracle162015 TrivadisBig Data - etemp out session

Step 1 – The Data§ /home/oracle/Desktop/impala test/t10.txt§ Comma delimited§ Flat file§ Format the date so it fits Impalas date format- YYYY-MM-DD HH24:MI:SS.XXXX1,Hans,Meier,3000,1968-02-02 00:00:00,2000-01-01 00:00:00,12,Stefan,Müller,5000,1970-10-15 00:00:00,2001-07-01 00:00:00,13,Susanne,Kieser,3500,1972-03-14 00:00:00,2005-05-01 00:00:00,24,Paul,Steiner,4000,1960-07-28 00:00:00,2000-01-01 00:00:00,25,Monika,Hausmann,7000,1975-03-29 00:00:00,2000-01-01 00:00:00,36,Manuela,Ziegler,3700,1980-11-05 00:00:00,2010-01-01 00:00:00,47,Anna,Bosshard,4100,1984-11-08 00:00:00,2012-04-01 00:00:00,58,Armin,Studer,4900,1988-12-17 00:00:00,2013-05-22 00:00:00,39,Thomas,Bergmann,6000,1976-07-24 00:00:00,2012-08-15 00:00:00,510,Heiko,Zimmermann,4800,1955-04-21 00:00:00,2012-10-01 00:00:00,4172015 TrivadisBig Data - Impala

Step 2 – Get the data into HDFS§ File copy – Reference it in create table§ hive/warehouse/impala test.db/t 10'§ Create Table – Copy file to right directory§ Load Data182015 TrivadisBig Data - Impala

Impala – SQL – general§ Impala-shell§ SQL§§§§no DUALANSI SQL92 sort offNo delete/update1 file per insert§ Different Data Types§ Data Dictionary§ show tables§ describe192015 TrivadisBig Data - Impala

Hive - Metastore§ Show Tableshive SHOW tables LIKE 'dept';OKdeptTime taken: 0.03 seconds, Fetched: 1 row(s)hive § Describehive DESCRIBE me taken: 0.069 seconds, Fetched: 3 row(s)hive 202015 TrivadisBig Data - Impala

Impala – Performance§ Statistics§ Compute stats§ Show table stats§ Show column stats§ Explain Plan§ Explain212015 TrivadisBig Data - Impala

Impala – Miscellaneous§ Data Types§ BOOLEAN§ VARCHAR2 not but VARCHAR§ Oracle Connectors§ Uses Hive§ Partition aware§ Parquet§ No Index not even Hive Indexes§ Schema possible§ UDF – User Defined Functions§ C § Self written§ Impala written in C § ODBC / JDBC§§§§Business ObjectsCognosOther ToolsEveryone with SQL knowledge§ BIG§ 1 GB default file size (parquet)222015 TrivadisBig Data - Impala

Agenda1.2.3.4.5.6.23IntroductionFirst Steps in the Impala WorldProject – Hadoop as a file store for the DWHProject – Hadoop for archiving of a Oracle DBProject - TwitterSummary2015 TrivadisBig Data - Impala

Project – Hadoop as a file store for the DWH§ Move to Hadoop for delivered files§ Start collecting§ Files get copied into HDFS one to one§ No decision had to be taken- Schema – schema-less- Table design - non- § Immutable Data Store - Create and Read- No update / No delete§ Add External Tables with ORACLE SQL Connector§ Data useable§ Build Hadoop infrastructure – use Impala242015 TrivadisBig Data - Impala

Agenda1.2.3.4.5.6.25IntroductionFirst Steps in the Impala WorldProject – Hadoop as a file store for the DWHProject – Hadoop for archiving of a Oracle DBProject - TwitterSummary2015 TrivadisBig Data - Impala

Project – Hadoop for archiving of a Oracle DB§ Move to Hadoop for Archiving§ Possible to use the data§ Immutable Data Store - Create and Read- No update / No delete§ Add External Tables with ORACLE SQL Connector§ Data useable in Oracle too§ Build Hadoop infrastructure – use Impala§ Next§ Analyze Oracle Big Data Appliance- Exadata – Big Data Appliance combined “use same block structure”262015 TrivadisBig Data - Impala

Agenda1.2.3.4.5.6.27IntroductionFirst Steps in the Impala WorldProject – Hadoop as a file store for the DWHProject – Hadoop for archiving of a Oracle DBProject - TwitterSummary2015 TrivadisBig Data - Impala

Project – Twitter§ 400 – 500 Mio tweets per day§ 1 tweet contains§ Around 50 metadata pieces- Geo-location- Re-tweets- Followers§ That is about 2 A4 pages§ Twitter Sample Stream§ 1%§ 4-5 Mio tweets per day§ 50 tweets per second§ 20 other streams with defined key words§ HDFS§ 1 TB every 2 months including replication282015 TrivadisBig Data - Impala

The Lambda Architecture - adoptedBatch layerAll Data(HDFS)TwitterAPIJava APPQFD MessagingKafkaBatch (re) computeQueryFocusedDataQFD 1 QFD 2Batch viewsQFD 2Realtime IncrementQFD nImpalaCassandraRealtime viewsQFD 1ProcessStreamHadoopPre-computedViews(MapReduce) QFD nIncrementedViewsStormSpeed layer292015 TrivadisBig Data - ntWebApp

Agenda1.2.3.4.5.30IntroductionFirst Steps in the Impala WorldProject 1Project 2Summary2015 TrivadisBig Data - Impala

Summary§ Big Data Hadoop, Impala§ Impala§§§§SQL Extension for HadoopBlock Size – 1 GBNothing for small filesNo optimization with indexes§ A new World§ Impala, Hive, Hadoop and its Zoo§ Lots can be done with RDBMS§ Start to collect now312015 TrivadisBig Data - Impala

Why Impala§ SQL§ ANSI SQL 92§ No programming needed§ Speed !§ Adhoc§ Hive – batch§ It is IN MEMORY - Limit§ Not like Oracle – pin a object to memory§ Loaded during execution322015 TrivadisBig Data - Impala

Questions?Trivadis AGJan OttEuropa-Strasse 5CH-8152 Glattbrugg-ZurichTel. 41-44-808 70 20 (reception)Fax 41-44-808 70 21info@trivadis.comwww.trivadis.comTHANK YOU.BASEL33BERNLAUSANNEZÜRICH2015 TrivadisBig Data - ImpalaDÜSSELDORFFRANKFURT A.M.FREIBURG I.BR.HAMBURGMÜNCHENSTUTTGARTWIEN

Trivadis @ DOAG 20153rd Floor – next to the escalatorWe look forward to your visit.Because with Trivadis you always win J342015 Trivadis Trivadis – Das Unternehmen21.11.15

Sources§ Impala§ http://impala.io§ ntation/core/latest/topics/impala impala shell.html§ Oracle Connection to Hadoop with Oracle§ ow to load oracle tables§ Books:§ Big Data – MEAP by Nathan Marz§ Getting Started with Impala by John Russell§ Learning Cloudera Impala by Avkash Chauhan§ loudera.com2015 TrivadisBig Data - Impala

§ A few words about Big Data § Big Data § Hadoop § Impala -Why? § Impala -my first steps § Get some data into Hadoop § Tables in Impala § Use SQL § Diverse § Project 1 -Data in Hadoop for a DWH § Project 2 -Hadoop for Archiving § Project 3 -Twitter Big Data -Impala 6