Structured Big Data 2: NoSQL, NewSQL And Distributed SQL Systems

Transcription

CSIE59830/CSIEM0410 Big Data SystemsLecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLStructured Big Data 2:NoSQL, NewSQL andDistributed SQLSystemsShiow-yang Wu (吳秀陽)CSIE, NDHU, Taiwan, ROCRecap from Last Lecture Why can’t we use traditional RDBMS?oo As data scales, RDBMS cannot handle itThe schema from RDBMS will hinder the scalabilityNeed the data model with loosen schema andhigh scalability NoSQL Want the best of both worlds NewSQLCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 2Note 1

CSIE59830/CSIEM0410 Big Data SystemsLecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLSQL vs NoSQL vs NewSQLSQL: Relies solely on relational tables for storing andaccessing transactional data. Relies on basic SQL as its primary query language. Employs rigid and highly defined data schema. Minimizes redundancies via normalization. Utilizes traditional vertical scalability (up, not out). Popular SQL Databases: IBM Db2, Informix, MariaDB,Microsoft SQL Server, MySQL, Oracle Database,PostgreSQL, and SQLite.CSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 3SQL vs NoSQL vs NewSQLNoSQL: Relies on different models, such as key-value, document,wide-column, or graph. Relies on high performance writes and huge, horizontalscalability for big data. Does not rely on a defined schema for writing data. Supports a large variety of modern programming languages,tools, and applications. Lacks strong consistency (instead, relies on a default“eventual consistency” for higher availability). Popular NoSQL Databases: Amazon DynamoDB, Cassandra,Couchbase Server, CouchDB, HBase, MongoDB, OracleNoSQL, and Redis.CSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 4Note 2

CSIE59830/CSIEM0410 Big Data SystemsLecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLSQL vs NoSQL vs NewSQLNewSQL: Combines relational model of SQL databases with theversatile scalability and speed of NoSQL databases. Uses cluster-native and shared-nothing architecture toprovide low latency, high read/write performance. Favors consistency over availability (thoughconfigurations can be tuned for better balance). Variety in schema management, depending on thevendor. Popular NewSQL Databases: Apache Trafodion,Altibuse, ClusterixDB, MemSQL, VoltDB, NuoDB, andTIBCO ActiveSpaces.CSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 5SQL vs NoSQL vs NewSQLCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 6Note 3

CSIE59830/CSIEM0410 Big Data SystemsLecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLObjectives Deep dive into some NoSQL & NewSQL databases NoSQL systems to be discussed: DynamoDB Cassandra MongoDB NewSQL systems to be discussed: VoltDBNuoDB (if time)ClustrixDB (if time)Vitess (if time)CSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 7What do we REALLY want?What’s wrong with SQL, NoSQL and NewSQL? Back to basics: Re-examine the fundamentalrequirements. What we REALLY want is Distributed SQL!! Google Spanner, the first of its kind Distributed SQL DBs: CockroachDBYugabyteDBSkySQLGoogle F1CSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 8Note 4

CSIE59830/CSIEM0410 Big Data SystemsLecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLNoSQL Databases9CSIE59830/CSIEM0410 Big Data SystemsNoSQL: The Name “SQL” Traditional relational DBMS Recognition over past decade or so:Not every data management/analysis problemis best solved using a traditional relationalDBMS “NoSQL” “No SQL” Not using traditional relational DBMS “No SQL” Don’t use SQL language “NoSQL” “Not Only SQL”CSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 10Note 5

CSIE59830/CSIEM0410 Big Data SystemsLecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLWhat’s Wrong with RDBMS Nothing. One size fits all? Not really.Impedance mismatch. Object Relational Mapping doesn't work quite well. Rigid schema design.Harder to scale.Replication.Joins across multiple nodes? Hard.How does RDMS handle data growth? Hard.Need for a DBA.Many programmers are already familiar with it.Transactions and ACID make development easy.Lots of tools to use.CSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 11NoSQL Systems Alternative to traditional relational DBMS Flexible schema Quicker/cheaper to set up Massive scalability (scale horizontally instead ofvertically) Relaxed consistency higher performance &availability– No declarative query language more programming– Relaxed consistency fewer guaranteesCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 12Note 6

CSIE59830/CSIEM0410 Big Data SystemsLecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLHow did we get here? Explosion of social media sites (Facebook, Twitter)with large data needs Rise of cloud-based solutions such as Amazon S3(Simple Storage Solution) Just as moving to dynamically-typed languages(Ruby/Groovy), a shift to dynamically-typed datawith frequent schema changes Open-source communityCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 13Seeds of the NoSQLMovement Three major development were the seeds of theNoSQL movement BigTable (Google) Dynamo (Amazon) Gossip protocol (discovery and error detection) Distributed key-value data store Eventual consistency CAP TheoremCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 14Note 7

CSIE59830/CSIEM0410 Big Data SystemsLecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLCAP Theorem RevisitedCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 15The Perfect Storm Large datasets, acceptance of alternatives, anddynamically-typed data has come together in aperfect storm Not a backlash/rebellion against RDBMS SQL is a rich query language that cannot be rivaledby the current list of NoSQL offerings “NoSQL” “Not Only SQL”CSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 16Note 8

CSIE59830/CSIEM0410 Big Data SystemsLecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLWhy NoSQL?Example #1: Web log analysisEach record: UserID, URL, timestamp, additional-infoTask: Load into database systemCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 17Why NoSQL?Example #1: Web log analysisEach record: UserID, URL, timestamp, additional-infoTask: Find all records for Given UserIDGiven URLGiven timestampCertain construct appearing in additional-infoCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 18Note 9

CSIE59830/CSIEM0410 Big Data SystemsLecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLWhy NoSQL?Example #1: Web log analysisEach record: UserID, URL, timestamp, additional-infoSeparate records: UserID, name, age, gender, Task: Find average age of user accessing given URLCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 19Why NoSQL?Example #2: Social-network graphEach record: UserID1, UserID2Separate records: UserID, name, age, gender, Task: Find all friends of friends of friends of friends ofgiven userCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 20Note 10

CSIE59830/CSIEM0410 Big Data SystemsLecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLWhy NoSQL?Example #3: Wikipedia pagesLarge collection of documentsCombination of structured and unstructured dataTask: Retrieve introductory paragraph of all pages aboutU.S. presidents before 2015CSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 21Types of NoSQL DBs There are 4 basic types of NoSQL DBs. We will discuss the first three types. Graph DBs will be discussed in next lecture.CSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 22Note 11

CSIE59830/CSIEM0410 Big Data SystemsLecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLDynamo: Outline Background & motivation Implementation Giuseppe DeCandia, Deniz Hastorun, MadanJampani, Gunavardhan Kakulapati, AvinashLakshman, Alex Pilchin, Swami Sivasubramanian,Peter Vosshall and Werner Vogels, “Dynamo:Amazon's Highly Available Key-Value Store”, inthe Proceedings of the 21st ACM Symposium onOperating Systems Principles, Stevenson, WA,October 2007.CSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 23Amazon DynamoDBCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 24Note 12

CSIE59830/CSIEM0410 Big Data SystemsLecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLBackgroundAmazon’s eCommerence platform architecture Composed of highly decentralized, looselycoupled, service-oriented architecture Service based on a well-defined interfaceaccessible over the network hosted in an infrastructure that consists of tens ofthousands of servers located across many datacenters world-wide Need high availability and SLA(Service LevelAgreements) guarantee CSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 25Amazon ServicesMany services store and retrieve data based onkey (called key-value access) Examples of key-value access in Amazon o best seller lists, shopping carts, customer preferences,sales rankTraditional RDBMS as persistent store is notsuitableoooooNo need for strong consistencyNo use of rigid schemaNo need of complex querying and optimizationNo need for complex management functionalitiesScale up v.s. scale outCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 26Note 13

CSIE59830/CSIEM0410 Big Data SystemsLecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLMotivationFocus on reliability and scalability Need a highly-available storage system instead ofconsistency Consistency v.s. Availability oooHigh availability is more importantClient-perceived consistencyTradeoff consistency in favor of higher availabilityCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 27Requirements andAssumptions Query model:o Simple read and write data based on keyo Data stored as a blob (Binary Large Object)o Object size small (less than 1MB) ACID propertieso Weaker consistency: Eventual consistencyo No isolation guaranteeo Only single key updatesCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 28Note 14

CSIE59830/CSIEM0410 Big Data SystemsLecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLEventual Consistency When no updates occur for a long period of time,eventually all updates will propagate through thesystem and all the nodes will be consistent For a given accepted update and a given node,eventually either the update reaches the node orthe node is removed from service Known as BASE (Basically Available, Soft state,Eventual consistency), as opposed to ACIDCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 29Requirements andAssumptions Efficiencyo Based on commodity hardwareo Stringent SLA requirements (next slide)o Tradeoffs: performance, cost efficiency,availability, and durability Other: non-hostile environment, no securityrelated requirements (used only by Amazon’sinternal services)CSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 30Note 15

CSIE59830/CSIEM0410 Big Data SystemsLecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLService Level AgreementsDefinition: a formally negotiated contract where aclient and a service agree on several systemrelated characteristics, which most prominentlyinclude the client’s expected request ratedistribution for a particular API and the expectedservice latency under those conditions Example: response time within 300ms for 99.9% ofits requests for a peak client load of 500 req/sec SLAs expresses as 99.9th percentile of thedistribution o Not the traditional mean or averageo Why? What is the implication of this?CSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 31Amazon’s Service OrientedInfrastructure Decentralized SOA (next slide) a page request to a e-commerce site typicallyrequires the rendering engine to construct itsresponse by sending requests to over 150 services Services often have multiple dependencies (callchains) To ensure a clear bound on page delivery eachservice within the call chain must obey itsperformance contractCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 32Note 16

CSIE59830/CSIEM0410 Big Data SystemsLecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLSOA of Amazon’s platformCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 33Design Consideration Sacrifice strong consistency for availability Conflict resolution is executed during readinstead of write, i.e. “always writeable”. Other principles: Incremental y.CSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 34Note 17

CSIE59830/CSIEM0410 Big Data SystemsLecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed ning ofdataConsistent uresSloppy QuorumProvides high availabilityand durability guaranteewhen some of thereplicas are not available.High availabilityfor writesVector clocks withreconciliationduring readsVersion size is decoupledfrom update ratesCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 35Implementation Partition: must be balanced Why ?Design requirement: to scale incrementallyo Need to partition data over the set ofnodes(e.g storage host) dynamicallyo balanced distribution of datao Consistent hashingoCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 36Note 18

CSIE59830/CSIEM0410 Big Data SystemsLecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLBasic Consistent Hashing Hash keys to a fixed circular space or “ring” Each node is assigned a random position in thering Each data is assigned to a node by hashing its keyand walking clockwise Each node is responsible for the region between itand its predecessor Departure or arrival of a node only affects itsimmediate neighborsCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 37Partition: ConsistentCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 38Note 19

CSIE59830/CSIEM0410 Big Data SystemsLecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLInsert New DataCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 39Insert new data: ReplicationCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 40Note 20

CSIE59830/CSIEM0410 Big Data SystemsLecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLInsert New DataCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 41Adding New NodeCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 42Note 21

CSIE59830/CSIEM0410 Big Data SystemsLecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLLoad BalancingCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 43ImplementationHandling temporary failures Sloppy Quoremo Availability too high will reduce durabilityeven under the simplest failureo Sloppy Quorem is to control the tradeoffbetween availability and consistencyo To get enough durability to handle temporaryfailuresCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 44Note 22

CSIE59830/CSIEM0410 Big Data SystemsLecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLSloppy QuoremR/W is the minimum number of nodes thatmust participate in a successful read/writeoperation. Configurable N, R, W ooo N: number of successful copies in ideal stateR: number of successful reads nodes forsuccessful readW: number of successful writes nodes forsuccessful writeSetting R W N yields a quorum-like system.CSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 45Sloppy QuoremCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 46Note 23

CSIE59830/CSIEM0410 Big Data SystemsLecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLSloppy Quorem: WriteCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 47Sloppy Quorem: WriteCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 48Note 24

CSIE59830/CSIEM0410 Big Data SystemsLecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLSloppy Quorem: ReadCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 49Sloppy Quorem: ReadCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 50Note 25

CSIE59830/CSIEM0410 Big Data SystemsLecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLSloppy Quorum: write afterB failsCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 51Sloppy Quorum: After BRecoverCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 52Note 26

CSIE59830/CSIEM0410 Big Data SystemsLecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLSloppy QuoremN RW Affection3 22 Typical configuration,Consistent,durable, interactive user staten 1n Strong consistency while pooravailabilityn 11 High availability while weakconsistencyCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 53Implementation Data VersionDynamo provides fully availibilityo Consistency eventually consistencyo To guarantee eventually consistencyoCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 54Note 27

CSIE59830/CSIEM0410 Big Data SystemsLecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLData Versioning A put() call may return to its caller before the update hasbeen applied at all the replicaso Put(key, context, object): context contains metadata& versiono Each put operation is a new immutable version A get() call may return many versions of the same object.o Get(key) Challenge: an object having distinct version subhistories, which the system will need to reconcile in thefuture. Solution: uses vector clocks in order to capture causalitybetween different versions of the same object.CSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 55Vector Clock A vector clock is a list of (node, counter) pairs. Every version of every object is associated with onevector clock. If the counters on the first object’s clock are lessthan-or-equal to all of the nodes in the secondclock, then the first is an ancestor of the secondand can be forgotten.CSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 56Note 28

CSIE59830/CSIEM0410 Big Data SystemsLecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLData Versioning with VectorClockD0CSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 57Gossip Admin issue command to join/remove node Serving node records in its local membershiphistory Gossip based protocol used to agree on thememberships Partition and Placement information sentduring gossipCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 58Note 29

CSIE59830/CSIEM0410 Big Data SystemsLecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLREAD Operation Send read requests to nodes Wait for minimum no of responses (R) Too few replies fail within time bound Gather and find conflicting versions Create context (opaque to caller) Read repairCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 59Values of N, R and W N represents durability Typical value 3 W and R affect durability, availability,consistency What if W is low? Durability and Availability go hand-in-hand?CSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 60Note 30

CSIE59830/CSIEM0410 Big Data SystemsLecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLDynamo vs BigTableCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 61Conclusion and Influence Dynamo has provided high availability and faulttolerance Provides owners to customize according to theirSLA requirements Decentralized techniques can provide highlyavailable system Some of the principles used by S3 Open source implementation Cassandra VoldemortCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 62Note 31

CSIE59830/CSIEM0410 Big Data SystemsLecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLApache Cassandra63CSIE59830/CSIEM0410 Big Data SystemsApache CassandraCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 64Note 32

CSIE59830/CSIEM0410 Big Data SystemsLecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLA Picture is worth 1000wordsCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 65Proven The Facebook stores 150TB of data on 150 nodes Used at Twitter, Rackspace, Mahalo, Reddit,Cloudkick, Cisco, Digg, SimpleGeo, Ooyala, OpenX,othersCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 66Note 33

CSIE59830/CSIEM0410 Big Data SystemsLecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLWhat is Cassandra A distributed data store for big data applications A schema free NoSQL distributed DBMS A hybrid between a key-value and a columnoriented data model High availability with no single point of failure Symmetric architecture to scale horizontally withautomatic cluster maintenance Tunable consistency Open sourceCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 67Design Goals High availability Flexible consistency trade-off strong consistency in favor of high availability Incremental scalability Optimistic Replication “Knobs” to tune tradeoffs between consistency,durability and latency Low total cost of ownership Minimal administrationCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 68Note 34

CSIE59830/CSIEM0410 Big Data SystemsLecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLBest of Both Worlds From BigTable From Dynamo Sparse , ”columnar”– Symmetric,P2Parchitecturedata model Optional,2-level maps CalledSuper-Column Families SSTable Disk Storage Append-only Commit Log MemTable (Buffer & Sort) Immutable SSTable Files Hadoop Integration No Special nodes, NoSPOF(Single Point Of Failure)– Gossip Based clustermanagement– Distributed hash table fordata placement Pluggable partitioning Pluggable topology discovery Pluggable placement strategies– Tunable, EventualConsistencyCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 69Data Model The whole cluster contains several keyspacesKeyspace Typically, a cluster has one keyspace per application Data is stored as a multi dimensional map indexed bykey (row key)ColumnFamily Contains several simple columns or super columns SuperColumn Consists of several columns Column Described by name, value, timestampCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 70Note 35

CSIE59830/CSIEM0410 Big Data SystemsLecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLSimple column family column family : columnCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 71Super column family column family : super column : columnCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 72Note 36

Lecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLCSIE59830/CSIEM0410 Big Data SystemsData ModelName : tid1Name : tid2Columns areadded andType : SimpleSort : NamemodifiedName : tid3 dynamicallyName : tid4Value : Binary Value : Binary Value : Binary Value : Binary TimeStamp : t1TimeStamp : t2TimeStamp : t3TimeStamp : t4ColumnFamily1 Name : MailListKEYColumnFamily2Column Familiesare declaredupfrontSuperColumnsare added andmodifiedColumnsaredynamicallyadded andmodifieddynamicallyCSIE59830/CSIEM0410 Big Data SystemsName : WordListType : SuperName : alohaSort : TimeName : ly3 Name : SystemType : SuperSort : NameName : hint1Name : hint2Name : hint3Name : hint4 Column List Column List Column List Column List Structured Big Data 2 – NoSQL, NewSQL & Distributed SQL 73Data Model Example Column Families:– Like SQL tables– but may beunstructured (clientspecified)– Can have index tables Hence “columnorienteddatabases”/“NoSQL”– No schemas– Some columnsmissing from someentries– “Not Only SQL”– Supports get(key) andput(key, value)operations– Often write-heavyworkloadsCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 74Note 37

CSIE59830/CSIEM0410 Big Data SystemsLecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLConsistency Model Consistency level is based on replication factor N(usually 3) Can set read quorum R (usually 2) and writequorum W (usually 2) Different levels of consistency are allowed (nexttwo slides) R W N means strong consistencyCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 75Consistency Levels - WriteLevelDescriptionANYAt least one nodeONEAt least one replica nodeTWOAt least two replica nodesTHREEAt least three replica nodesQUORUMWrite to a quorum of replica nodesLOCAL QUORUM Write to a quorum of the current data center asthe coordinatorEACH QUORUMWrite to quorums of all data centersALLWrite to all replica nodes in the clusterCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 76Note 38

CSIE59830/CSIEM0410 Big Data SystemsLecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLConsistency Levels - ReadLevelDescriptionONERead from the closest replicaTWORead from two of the closest replicasTHREERead from three of the closest replicasQUORUMRead from a quorum of replicasLOCAL QUORUM Read from a quorum of the current data centeras the coordinatorEACH QUORUMRead from quorums of all data centersALLRead from all replicas in the clusterCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 77Write Operations A client issues a write request to a random node inthe Cassandra cluster. The “Partitioner” determines the nodesresponsible for the data. Locally, write operations are logged and thenapplied to an in-memory version. Commit log is stored on a dedicated disk local tothe machine.CSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 78Note 39

Lecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLCSIE59830/CSIEM0410 Big Data SystemsWrite Properties No locks in the critical path Sequential disk access Behaves like a write back cache (vs write through) Append support without read ahead Atomicity guarantee for a key per replica “Always Writable” accept writes during failure scenariosCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 79ReadClientQueryResultCassandra ClusterClosest replicaRead repair ifdigests differResultReplica ADigest QueryDigest ResponseReplica BCSIE59830/CSIEM0410 Big Data SystemsDigest ResponseReplica CStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 80Note 40

CSIE59830/CSIEM0410 Big Data SystemsLecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLGossip Protocols Network Communication protocols inspired for real liferumour spreading. Periodic, Pairwise, inter-node communication. Low frequency communication ensures low cost. Random selection of peers. Example – Node A wish to search for pattern in data Round 1 – Node A searches locally and then gossips with node B. Round 2 – Node A,B gossips with C and D. Round 3 – Nodes A,B,C and D gossips with 4 other nodes Round by round doubling makes protocol very robust.CSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 81Gossip - Initial StateCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 82Note 41

CSIE59830/CSIEM0410 Big Data SystemsLecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLGossip – 1st RoundCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 83Gossip – 2nd RoundCSIE59830/CSIEM0410 Big Data SystemsStructured Big Data 2 – NoSQL, NewSQL & Distributed SQL 84Note 42

CSIE59830/CSIEM0410 Big Data SystemsLecture 07 Structured Big Data 2 – NoSQL, NewSQL,Distributed SQLGo

SQL vs NoSQL vs NewSQL NewSQL: Combines relational model of SQL databases with the versatile scalability and speed of NoSQL databases. Uses cluster-native and shared-nothing architecture to provide low latency, high read/write performance. Favors consistency over availability (though configurations can be tuned for better balance).