SASI, Cassandra On The Full Text Search Ride

Transcription

SASI, Cassandra on the full text search rideDuyHai DOAN – Apache Cassandra Evangelist

12345675 minutes introduction to Apache Cassandra SASI introductionSASI cluster-wideSASI local read/write pathQuery plannerSome benchmarksTake away2@doanduyhai

Trademark PolicyFrom now on Cassandra Apache Cassandra 3@doanduyhai

5 minutes introduction to ApacheCassandra @doanduyhai

The tokensRandom hash of #partition à token hash(#p)C*C*Hash: ] –x, x ]hash range: 264 valuesC*C*C*C*x 264/2C*5C*@doanduyhai

Token ranges 3x A: x, 4 x E : 0, 4 3xB: , 4 x 2x F : , 4 4 2x 4 2x xC : , 4 4 2x 3x G : , 4 4 x D: ,0 4 3x H : ,x 4 BADHEG6CF@doanduyhai

Distributed tablesCREATE TABLE users(user id int, ,PRIMARY KEY(user id)),BCADHEuser id1user id2user id3user id4user id5G7F@doanduyhai

Distributed tablesBuser id1CAHuser id3Duser id4EGF8user id2user id5@doanduyhai

Coordinator node2Responsible for handling requests (read/write)Every node can be coordinator1 masterless3BCADHE no SPOF proxy rolerequestcoordinatorG9F@doanduyhai

!"Q&A10@doanduyhai

SASI introduction@doanduyhai

What is SASI ? SSTable-Attached Secondary Index à new 2nd index impl that followsSSTable life-cycle Objective: provide more performant & capable 2nd index12@doanduyhai

Who created it ?Open-source contribution by an engineers team13@doanduyhai

Why is it better than native 2nd index ? follow SSTable life-cycle (flush, compaction, rebuild ) à more optimizednew data-strutures range query ( , , , ) possiblefull text search options14@doanduyhai

Demo15@doanduyhai

SASI cluster-wide@doanduyhai

Distributed indexOn cluster level, SASI works exactly like native 2nd indexBUKuser87user176 Cuser987AUKuser1user102 user493USuser54user483 user938DUKHuser17user409 user787EGF17@doanduyhai

Distributed search algorithmBCAD1st roundConcurrency factor 1HEcoordinatorGF18@doanduyhai

Distributed search algorithmBCADHENot enoughresults ?coordinatorGF19@doanduyhai

Distributed search algorithmBAC2nd roundConcurrency factor 2HDEcoordinatorGF20@doanduyhai

Distributed search algorithmBCADStill not enoughresults ?HEcoordinatorGF21@doanduyhai

Distributed search algorithmBCAD3rd roundConcurrency factor 4HEcoordinatorGF22@doanduyhai

Concurrency factor formula more details at: http://www.doanduyhai.com/blog/?p 1319123@doanduyhai

Caveat 1: non restrictive filtersBHit allnodeseventuallyLCADHEcoordinatorGF24@doanduyhai

Caveat 1 solution : always use LIMITBSELECT *FROM WHERE .LIMIT 1000CADHEcoordinatorGF25@doanduyhai

Caveat 2: 1-to-1 index (user email)BCADWHERE user email ‘xxx'Not foundHEcoordinatorGF26@doanduyhai

Caveat 2: 1-to-1 index (user email)BCADWHERE user email ‘xxx'Still no resultHEcoordinatorGF27@doanduyhai

Caveat 2: 1-to-1 index (user email)BCADWHERE user email ‘xxx'At best 1 user foundAt worst 0 user foundHEcoordinatorGF28@doanduyhai

Caveat 2 solution: materialized viewsFor 1-to-1 index/relationship, use materialized views insteadCREATE MATERIALIZED VIEW user by email ASSELECT * FROM usersWHERE user id IS NOT NULL and user email IS NOT NULLPRIMARY KEY (user email, user id)But range queries ( , , , ) not possible 29@doanduyhai

Caveat 3: fetch all rows for analytics use-caseBClientCADHEcoordinatorGF30@doanduyhai

Caveat 3 solution: use co-located SparkBCADLocal index queryLocal index filtering in CassandraAggregation in SparkHEGF31@doanduyhai

SASI local read/write path@doanduyhai

SASI Life-cycle: in-memoryACK the Commit log1Commit log2.Commit logn33@doanduyhai

Local write path data structuresIndex mode, data typeData structureUsagePREFIX, textGuava ConcurrentRadixTreename LIKE 'John%'CONTAINS, textGuava ConcurrentSuffixTreename LIKE ’%John%'name LIKE ’%ny’PREFIX, otherJDK ConcurrentSkipListSetage 20age 20 AND age 30SPARSE, otherJDK ConcurrentSkipListSetage 20age 20 AND age 30suitable for 1-to-N index with N 534@doanduyhai

SASI Life-cycle: flush to SSTableMemory1Table1Commit log1Commit mmit lognTable2OnDiskIndex3OnDiskIndex135@doanduyhai

SASI Life-cycle: skIndex2OnDiskIndex3New SSTableNew OnDiskIndex36@doanduyhai

Local write path summaryIndex files are built on memtable flush on compaction flushTo avoid OOM, index files are split into chunk of 1Gb for memtable flushmax compaction flush memory in mb for compaction flushà consequences: SASI has impact on write bandwidth (CPU & disk I/O)37@doanduyhai

Local read path first, optimize query using Query Planer (see later)then load chunks (4k) of index files from disk into memory perform binary search to find the indexed value(s)retrieve the corresponding partition keys and push them into the PartitionKey Cacheà Yes, currently SASI only keep partition key(s) so on wide partition it’s not veryoptimized .38@doanduyhai

OnDiskIndex filesSStable1user id4FRFRuser id1USUSuser id5OnDiskIndex1SStable2user id3UKFRuser id2B Tree-likedata structuresDEOnDiskIndex2UKDE39@doanduyhai

OnDiskIndex LayoutHeaderBlockData Block4kMultiple of 4kMeta Data InfoPointer BlockLevelsCountPointerBlock MetaData BlockMetaLevel IndexOffsetMultiple of 4k40@doanduyhai

Header Block LayoutHeader Block rtshortvariablebyte41@doanduyhai

OnDiskIndex LayoutHeaderBlockData Block4kMultiple of 4kMeta Data InfoPointer BlockLevelsCountPointerBlock MetaData BlockMetaLevel IndexOffsetMultiple of 4k42@doanduyhai

Data Block layout4kTerms CountOffset Array: [0, 10, 22, ]Term BlockPaddingTokenTree BlockPadding4kTerms CountOffset Array: [0, 23, 35, ]Term BlockPaddingTokenTree BlockPaddingTerms CountOffset Array: [0, 17, 34, ]Term Block PaddingTokenTree BlockPaddingTerms CountOffset Array: [0, 12, 28, ]Term BlockPaddingTokenTree BlockPadding43@doanduyhai

OnDiskIndex LayoutHeaderBlockData Block4kMultiple of 4kMeta Data InfoPointer BlockLevelsCountPointerBlock MetaData BlockMetaLevel IndexOffsetMultiple of 4k44@doanduyhai

Pointer Block buildingPointer RootLevel Root Pointer BlockPointer BlockN 1LastTermMLastTermM 1 Pointer BlockN 2LastTermOPointer Level 24kPointer Block1LastTerm1Data Block1LastTerm2Data Block2 Pointer Block2LastTermNData BlockN Pointer BlockNPointer Level 1Data Level4k45@doanduyhai

Binary search using OnDiskIndex filesPointer Root LevelRoot Pointer BlockPointer BlockPointer BlockPointer BlockPointer BlockPointer BlockPointer BlockData Block1Data Block2 Data Block346Pointer BlockPointer Level 3Pointer BlockPointer Level 2Pointer BlockPointer Level 1Data BlockNData Level@doanduyhai

Term Block Binary Searchval Term100 ?Term1Term25Term50Term50val Term50 ?Term75Term50Term63 Term57Term75Term100Term100val Term75 ?Term75val Term57 ?47@doanduyhai

Query Planner@doanduyhai

Query planner build predicates treepredicates push-down & re-ordering predicate fusions for ! operator49@doanduyhai

Query optimization exampleWHERE age 100 AND fname LIKE 'p%' AND fname ! 'pa%' AND age 2150@doanduyhai

Query optimization exampleAND is associativeand commutative51@doanduyhai

Query optimization example! transformed toexclusion on range scan52@doanduyhai

Query optimization exampleAND is associativeand commutative53@doanduyhai

Some benchmarks@doanduyhai

Hardware specs13 bare-metal machines 6 CPU HT (12 vcores) 64Gb RAM 4 SSDs in RAID0 for a total of 1.5TbData set 13 billions of rows 1 numerical index with 36 distinct values 2 text index with 7 distinct values 1 text index with 3 distinct values55@doanduyhai

Benchmark resultsFull table scan using co-located Spark (no LIMIT)Predicate countFetched rowsQuery time in sec136 109 98660922 781 49233031 044 5473724360 33411656@doanduyhai

Benchmark resultsFull table scan using co-located Spark (no LIMIT)Predicate countFetched rowsQuery time in sec136 109 98660922 781 49233031 044 5473724360 33411657@doanduyhai

Benchmark resultsBeware of disk space usage for full text search !!!Table albums with 110 000 records, 6.8Mb data size58@doanduyhai

Take Away@doanduyhai

SASI vs search enginesSASI vs Solr/ElasticSearch ? Cassandra is not a search engine !!! (database durability) always slower because 2 passes (SASI index read original Cassandra data) no scoring no ordering (ORDER BY) no grouping (GROUP BY) à Apache Spark for analyticsIf you don’t need the above features, SASI is for you!60@doanduyhai

SASI sweet spotsSASI is a relevant choice if you need multi criteria search and you don't need ordering/grouping/scoring you mostly need 100 to 10000 of rows for your search queries you always know the partition keys of the rows to be searched for (this one applies tonative secondary index too) you want to index static columns (SASI has no penalty since it indexes the wholepartition)61@doanduyhai

SASI blind spotsSASI is a poor choice if you have strong SLA on search latency, for example few millisecs requirement ordering of the search results is important for you62@doanduyhai

!"Q&A63@doanduyhai

Thank You@doanduyhaiduy hai.doan@datastax.com64@doanduyhai

SASI vs search engines SASI vs Solr/ElasticSearch ? Cassandra is not a search engine !!! (database durability) always slower because 2 passes (SASI index read original Cassandra data) no scoring no ordering (ORDER BY) no grouping (GROUP BY) à Apache Spark for analytics