@JamesPlusPlus Phoenix-hbase.blogspot /

Transcription

PhoenixWe put the SQL back in NoSQLhttps://github.com/forcedotcom/phoenixJames com/

Agendal What/why HBase?Completed

AgendaWhat/why HBase?l What/why Phoenix?l Completed

AgendaWhat/why HBase?l What/why Phoenix?l How does PhoenixCompleted work?l

AgendaWhat/why HBase?l What/why Phoenix?l How does PhoenixCompleted work?l Demol

AgendaWhat/why HBase?l What/why Phoenix?l How does PhoenixCompleted work?l Demol Roadmapl

AgendaWhat/why HBase?l What/why Phoenix?l How does PhoenixCompleted work?l Demol Roadmapl Q&Al

What is HBase?l Developed as part of Apache HadoopCompleted

What is HBase?Developed as part of Apache Hadoopl Runs on top of HDFSl Completed

What is HBase?Developed as part of Apache Hadoopl Runs on top of HDFSl Key/value storeCompletedl

What is HBase?Developed as part of Apache Hadoopl Runs on top of HDFSl Key/value storeCompletedl Map

What is HBase?Developed as part of Apache Hadoopl Runs on top of HDFSl Key/value storeCompletedl MapDistributed

What is HBase?Developed as part of Apache Hadoopl Runs on top of HDFSl Key/value storeCompletedl MapDistributedSparse

What is HBase?Developed as part of Apache Hadoopl Runs on top of HDFSl Key/value storeCompletedl MapSortedDistributedSparse

What is HBase?Developed as part of Apache Hadoopl Runs on top of HDFSl Key/value storeCompletedl MapSortedDistributed ConsistentSparse

What is HBase?Developed as part of Apache Hadoopl Runs on top of HDFSl Key/value storeCompletedl MapSortedDistributed ConsistentSparseMultidimensional

Cluster Architecture

Sharding

Why Use HBase?l If you have lots of dataCompleted

Why Use HBase?l If you have lots of datal Scales linearlyCompleted

Why Use HBase?l If you have lots of datal Scales linearlyl Shards automaticallyCompleted

Why Use HBase?If you have lots of datal Scales linearlyl Shards automaticallyCompletedl If you can live without transactionsl

Why Use HBase?If you have lots of datal Scales linearlyl Shards automaticallyCompletedl If you can live without transactionsl If your data changesl

Why Use HBase?If you have lots of datal Scales linearlyl Shards automaticallyCompletedl If you can live without transactionsl If your data changesl If you need strict consistencyl

What is Phoenix?Completed

What is Phoenix?l SQL skin for HBaseCompleted

What is Phoenix?SQL skin for HBasel Alternate client APIl Completed

What is Phoenix?SQL skin for HBasel Alternate client APIl Embedded JDBCCompleteddriverl

What is Phoenix?SQL skin for HBasel Alternate client APIl Embedded JDBCCompleteddriverl Runs at HBase native speedl

What is Phoenix?SQL skin for HBasel Alternate client APIl Embedded JDBCCompleteddriverl Runs at HBase native speedl Compiles SQL into native HBase callsl

What is Phoenix?SQL skin for HBasel Alternate client APIl Embedded JDBCCompleteddriverl Runs at HBase native speedl Compiles SQL into native HBase callsl So you don’t have to!l

Cluster Architecture

Cluster ArchitecturePhoenix

Cluster ArchitecturePhoenixPhoenix

Phoenix Performance

Why Use Phoenix?

Why Use Phoenix?l Give folks an API they already knowCompleted

Why Use Phoenix?Give folks an API they already knowl Reduce the amount of code neededl Completed

Why Use Phoenix?Give folks an API they already knowl Reduce the amount of code neededl CompletedSELECT TRUNC(date,'DAY’),AVG(cpu)FROM web statWHERE domain LIKE 'Salesforce%’GROUP BY TRUNC(date,'DAY’)

Why Use Phoenix?Give folks an API they already knowl Reduce the amount of code neededl Perform optimizations transparentlyl Completed

Why Use Phoenix?Give folks an API they already knowl Reduce the amount of code neededl Perform optimizations transparentlyCompletedl Aggregationl Skip Scanl Secondary indexing (soon!)l

Why Use Phoenix?Give folks an API they already knowl Reduce the amount of code neededl Perform optimizations transparentlyCompletedl Leverage existing toolingl

Why Use Phoenix?Give folks an API they already knowl Reduce the amount of code neededl Perform optimizations transparentlyCompletedl Leverage existing toolingl SQL client/terminall OLAP enginel

How Does Phoenix Work?Overlays on top of HBase Data Modell Keeps Versioned Schema Respositoryl Query ProcessorCompletedl

Phoenix Data ModelPhoenix maps HBase data model to the relational worldHBase Table

Phoenix Data ModelPhoenix maps HBase data model to the relational worldHBase TableColumn Family AColumn Family B

Phoenix Data ModelPhoenix maps HBase data model to the relational worldHBase TableColumn Family AQualifier 1Qualifier 2Column Family BQualifier 3

Phoenix Data ModelPhoenix maps HBase data model to the relational worldHBase TableColumn Family AQualifier 1Row Key 1ValueQualifier 2Column Family BQualifier 3

Phoenix Data ModelPhoenix maps HBase data model to the relational worldHBase TableColumn Family AQualifier 1Row Key 1Row Key 2Column Family BQualifier 2Qualifier 3ValueValueValue

Phoenix Data ModelPhoenix maps HBase data model to the relational worldHBase TableColumn Family AQualifier 1Row Key 1Qualifier 2Qualifier 3ValueValueValueRow Key 2Row Key 3Column Family BValue

Phoenix Data ModelPhoenix maps HBase data model to the relational worldHBase TableHBase TableColumn Family AColumn Family AQualifier 1Qualifier 2Qualifier 1Qualifier 2Row Key 1Row Key 1Row Key 2Row Key 2Row Key 3Row Key 3Column Family BColumn Family BQualifier 3Qualifier 3ValueValueValueValueValueValueValueValue

Phoenix Data ModelPhoenix maps HBase data model to the relational worldHBase TableHBase TableHBaseColumn FamilyA TableRow Key 1Row Key 1Row Key 1Row Key 2Row Key 2Row Key 2Row Key 3Row Key 3Row Key 3Column Family AColumn Family AQualifier 1Qualifier 2Qualifier 1Qualifier 2Qualifier1Qualifier mn Family BColumn Family BColumn Family BQualifier 3Qualifier 3Qualifier 3ValueValueValue

Phoenix Data ModelPhoenix maps HBase data model to the relational worldHBase TableHBase TableHBaseColumn FamilyA TableRow Key 1Row Key 1Row Key 1Row Key 2Row Key 2Row Key 2Row Key 3Row Key 3Row Key 3Column Family AColumn Family AQualifier 1Qualifier 2Qualifier 1Qualifier 2Qualifier1Qualifier 2ValueValueValueValueValueValueMultiple VersionsValueValueValueColumn Family BColumn Family BColumn Family BQualifier 3Qualifier 3Qualifier 3ValueValueValue

Phoenix Data ModelPhoenix maps HBase data model to the relational worldPhoenix TableHBase TableColumn Family AQualifier 1Row Key 1Qualifier 2Qualifier 3ValueValueValueRow Key 2Row Key 3Column Family BValue

Phoenix Data ModelPhoenix maps HBase data model to the relational worldPhoenix TableHBase TableColumn Family AQualifier 1Row Key 1Qualifier 2Qualifier 3ValueValueValueRow Key 2Row Key 3Column Family BValueKey Value Columns

Phoenix Data ModelPhoenix maps HBase data model to the relational worldPhoenix TableHBase TableColumn Family AQualifier 1Row Key 1Qualifier 2Qualifier 3ValueValueValueRow Key 2Row Key 3Column Family BValueRow Key ColumnsKey Value Columns

Phoenix Metadatal Stored in a Phoenix HBase tableCompleted

Phoenix Metadatal Stored in a Phoenix HBase tablel SYSTEM.TABLECompleted

Phoenix Metadatal l Stored in a Phoenix HBase tableUpdated through DDL commandsCompleted

Phoenix Metadatal l Stored in a Phoenix HBase tableUpdated through DDL commandsl CREATE TABLEl ALTER TABLECompletedl DROP TABLEl CREATE INDEXl DROP INDEX

Phoenix Metadatal l l Stored in a Phoenix HBase tableUpdated through DDL commandsKeeps older versions as schema evolvesCompleted

Phoenix Metadatal l l l Stored in a Phoenix HBase tableUpdated through DDL commandsKeeps older versions as schema evolvesCorrelates timestamps between schema and dataCompleted

Phoenix Metadatal l l l Stored in a Phoenix HBase tableUpdated through DDL commandsKeeps older versions as schema evolvesCorrelates timestamps between schema and dataCompletedl Flashback queries useschema that was in-place then

Phoenix Metadatal l l l l Stored in a Phoenix HBase tableUpdated through DDL commandsKeeps older versions as schema evolvesCorrelates timestamps between schema and dataCompletedAccessible via JDBC metadataAPIs

Phoenix Metadatal l l l l Stored in a Phoenix HBase tableUpdated through DDL commandsKeeps older versions as schema evolvesCorrelates timestamps between schema and dataCompletedAccessible via JDBC metadataAPIsl java.sql.DatabaseMetaDatal Through Phoenix queries!

ExampleOver metrics data for clusters of servers with a schema like this:SERVER METRICSHOSTDATERESPONSE TIMEVARCHARDATEINTEGERGC TIMECPU TIMEIO TIME INTEGERINTEGERINTEGERRow Key

ExampleOver metrics data for clusters of servers with a schema like this:SERVER METRICSHOSTDATERESPONSE TIMEVARCHARDATEINTEGERGC TIMECPU TIMEIO TIME INTEGERINTEGERINTEGERKey Values

ExampleWith 90 days of data that looks like this:SERVER METRICSHOSTDATERESPONSE TIMEsf1.s1sf1.s1 sf3.s1sf3.s1sf7.s9 Jun 5 10:10:10.234Jun 5 11:18:28.4561234Jun 5 10:10:10.234Jun 6 12:46:19.123Jun 4 08:23:23.4562345GC TIME8012500223401234

ExampleWalk through query processing for three scenarios

ExampleWalk through query processing for three scenarios1. Chart Response Time Per Cluster

ExampleWalk through query processing for three scenarios1. Chart Response Time Per Cluster2. Identify 5 Longest GC Times

ExampleWalk through query processing for three scenarios1. Chart Response Time Per Cluster2. Identify 5 Longest GC Times3. Identify 5 Longest GC Times again and again

Scenario 1Chart Response Time Per ClusterSELECT substr(host,1,3), trunc(date,’DAY’), avg(response time)FROM server metricsWHERE date CURRENT DATE() – 7AND substr(host, 1, 3) IN (‘sf1’,‘sf3, ‘sf7’)CompletedGROUP BY substr(host, 1, 3), trunc(date,’DAY’)

Scenario 1Chart Response Time Per ClusterSELECT substr(host,1,3), trunc(date,’DAY’), avg(response time)FROM server metricsWHERE date CURRENT DATE() – 7AND substr(host, 1, 3) IN (‘sf1’,‘sf3, ‘sf7’)CompletedGROUP BY substr(host, 1, 3), trunc(date,’DAY’)

Scenario 1Chart Response Time Per ClusterSELECT substr(host,1,3), trunc(date,’DAY’), avg(response time)FROM server metricsWHERE date CURRENT DATE() – 7AND substr(host, 1, 3) IN (‘sf1’,‘sf3, ‘sf7’)CompletedGROUP BY substr(host, 1, 3), trunc(date,’DAY’)

Scenario 1Chart Response Time Per ClusterSELECT substr(host,1,3), trunc(date,’DAY’), avg(response time)FROM server metricsWHERE date CURRENT DATE() – 7AND substr(host, 1, 3) IN (‘sf1’,‘sf3, ‘sf7’)CompletedGROUP BY substr(host, 1, 3), trunc(date,’DAY’)

Scenario 1Chart Response Time Per ClusterSELECT substr(host,1,3), trunc(date,’DAY’), avg(response time)FROM server metricsWHERE date CURRENT DATE() – 7AND substr(host, 1, 3) IN (‘sf1’,‘sf3, ‘sf7’)CompletedGROUP BY substr(host, 1, 3), trunc(date,’DAY’)

Step 1: ClientIdentify Row Key Ranges from QuerySELECT substr(host,1,3), trunc(date,’DAY’), avg(response time)FROM server metricsWHERE date CURRENT DATE() – 7AND substr(host, 1, 3) IN (‘sf1’,‘sf3’, ‘sf7’)CompletedGROUP BY substr(host, 1, 3), trunc(date,’DAY’)Row Key RangesHOSTDATE

Step 1: ClientIdentify Row Key Ranges from QuerySELECT substr(host,1,3), trunc(date,’DAY’), avg(response time)FROM server metricsWHERE date CURRENT DATE() – 7AND substr(host, 1, 3) IN (‘sf1’,‘sf3’, ‘sf7’)CompletedGROUP BY substr(host, 1, 3), trunc(date,’DAY’)Row Key RangesHOSTDATE

Step 1: ClientIdentify Row Key Ranges from QuerySELECT substr(host,1,3), trunc(date,’DAY’), avg(response time)FROM server metricsWHERE date CURRENT DATE() – 7AND substr(host, 1, 3) IN (‘sf1’,‘sf3’, ‘sf7’)CompletedGROUP BY substr(host, 1, 3), trunc(date,’DAY’)Row Key RangesHOSTDATE

Step 1: ClientIdentify Row Key Ranges from QuerySELECT substr(host,1,3), trunc(date,’DAY’), avg(response time)FROM server metricsWHERE date CURRENT DATE() – 7AND substr(host, 1, 3) IN (‘sf1’,‘sf3’, ‘sf7’)CompletedGROUP BY substr(host, 1, 3), trunc(date,’DAY’)Row Key RangesHOSTDATEsf1

Step 1: ClientIdentify Row Key Ranges from QuerySELECT substr(host,1,3), trunc(date,’DAY’), avg(response time)FROM server metricsWHERE date CURRENT DATE() – 7AND substr(host, 1, 3) IN (‘sf1’,‘sf3’, ‘sf7’)CompletedGROUP BY substr(host, 1, 3), trunc(date,’DAY’)Row Key RangesHOSTDATEsf1sf3

Step 1: ClientIdentify Row Key Ranges from QuerySELECT substr(host,1,3), trunc(date,’DAY’), avg(response time)FROM server metricsWHERE date CURRENT DATE() – 7AND substr(host, 1, 3) IN (‘sf1’,‘sf3’, ‘sf7’)CompletedGROUP BY substr(host, 1, 3), trunc(date,’DAY’)Row Key RangesHOSTDATEsf1sf3sf7

Step 1: ClientIdentify Row Key Ranges from QuerySELECT substr(host,1,3), trunc(date,’DAY’), avg(response time)FROM server metricsWHERE date CURRENT DATE() – 7AND substr(host, 1, 3) IN (‘sf1’,‘sf3’, ‘sf7’)CompletedGROUP BY substr(host, 1, 3), trunc(date,’DAY’)Row Key RangesHOSTDATEsf1t1 - *sf3sf7

Step 2: ClientOverlay Row Key Ranges with RegionsR1sf1R2Completedsf4sf1sf3R3sf6R4sf7

Step 3: ClientExecute Parallel R4sf7

Step 4: ServerFilter using Skip Scansf1.s1SKIPCompletedt0

Step 4: ServerFilter using Skip ScanINCLUDECompletedsf1.s1t1

Step 4: ServerFilter using Skip Scansf1.s2SKIPCompletedt0

Step 4: ServerFilter using Skip ScanINCLUDECompletedsf1.s2t1

Step 4: ServerFilter using Skip ScanSKIPsf1.s3t0

Step 4: ServerFilter using Skip ScanINCLUDEsf1.s3t1

Step 5: ServerIntercept Scan in CoprocessorSERVER METRICSSERVER METRICSHOSTDATEHOSTDATEAGGsf1.s1Jun 2 10:10:10.234sf1Jun 1 sf1.s2Jun 3 23:05:44.975sf1Jun 2 sf1.s2Jun 9 08:10:32.147sf1Jun 3 sf1.s3Jun 1 11:18:28.456sf1Jun 8 sf1.s3Jun 3 22:03:22.142sf1Jun 9 sf1.s4Jun 1 10:29:58.950sf1.s4Jun 2 14:55:34.104sf1.s4Jun 3 12:46:19.123sf1.s5Jun 8 08:23:23.456sf1.s6Jun 1 10:31:10.234

Step 6: ClientPerform Final Merge SortR1R2SERVER METRICSscan1Completedscan2R3R4scan3HOSTDATE AGGsf1Jun 5 sf1Jun 9 sf3Jun 1 sf3Jun 2 sf7Jun 1 sf7Jun 8

Scenario 2Find 5 Longest GC TimesSELECT host, date, gc timeFROM server metricsWHERE date CURRENT DATE() – 7AND substr(host, 1, 3) IN (‘sf1’,‘sf3, ‘sf7’)CompletedORDER BY gc time DESCLIMIT 5

Scenario 2Find 5 Longest GC Times Same client parallelization and server skip scan filtering

Scenario 2Find 5 Longest GC Times Same client parallelization and server skip scan filteringServer holds 5 longest GC TIME value for each scanCompletedR1SERVER METRICSHOSTDATEGC TIMEsf1.s1Jun 2 10:10:10.23422123sf1.s1Jun 3 23:05:44.97519876sf1.s1Jun 9 08:10:32.14711345sf1.s2Jun 1 11:18:28.45610234sf1.s2Jun 3 22:03:22.14210111

Scenario 2Find 5 Longest GC Times Same client parallelization and server skip scan filteringServer holds 5 longest GC TIME value for each scanClient performs final merge sort among parallel scansSERVER METRICSScan1Scan2Scan3HOSTDATEGC TIMEsf1.s1Jun 2 10:10:10.23422123sf1.s1Jun 3 23:05:44.97519876sf1.s1Jun 9 08:10:32.14711345sf1.s2Jun 1 11:18:28.45610234sf1.s2Jun 3 22:03:22.14210111

Scenario 3Find 5 Longest GC TimesCREATE INDEX gc time indexON server metrics (gc time DESC, date DESC)INCLUDE (host, response time)Completed

Scenario 3Find 5 Longest GC TimesCREATE INDEX gc time indexON server metrics (gc time DESC, date DESC)INCLUDE (host, response time)Completed

Scenario 3Find 5 Longest GC TimesCREATE INDEX gc time indexON server metrics (gc time DESC, date DESC)INCLUDE (host, response time)Completed

Scenario 3Find 5 Longest GC TimesCREATE INDEX gc time indexON server metrics (gc time DESC, date DESC)INCLUDE (host, response time)CompletedGC TIME INDEXGC TIMEINTEGERDATEDATEHOSTVARCHARRESPONSE TIME INTEGERRow Key

Scenario 3Find 5 Longest GC TimesCREATE INDEX gc time indexON server metrics (gc time DESC, date DESC)INCLUDE (host, response time)CompletedGC TIME INDEXGC TIMEINTEGERDATEDATEHOSTVARCHARRESPONSE TIME INTEGERKey Value

Scenario 3Find 5 Longest GC TimesSELECT host, date, gc timeFROM server metricsWHERE date CURRENT DATE() – 7AND substr(host, 1, 3) IN (‘sf1’,‘sf3, ‘sf7’)CompletedORDER BY gc time DESCLIMIT 5

DemoPhoenix Stock Analyzerl Fortune 500 companiesl 10 years of historical stock pricesCompletedl Demonstrates Skip Scan in actionl Running locally on my single nodelaptop clusterl

Phoenix RoadmapSecondary Indexingl Count distinct and percentilel Derived tablesCompletedl Hash Joinsl Apache Drill integrationl Cost-based query optimizerl OLAP extensionsl Transactionsl

Thank you!Questions/comments?

Developed as part of Apache Hadoop ! Runs on top of HDFS ! Key/value store Map Sorted Distributed Consistent Sparse Multidimensional. Cluster Architecture . Sharding . Why Use HBase? Completed ! If you have lots of data . Why Use HBase?