Databases : Lecture 11 : Beyond ACID/Relational Databases Timothy G .

Transcription

Databases : Lecture 11 :Beyond ACID/Relational databasesTimothy G. GriffinLent Term 2014 Rise of Web and cluster-based computing“NoSQL” MovementRelationships vs. AggregatesKey-value storeXML or JSON as a data exchange languageNot all applications require ACIDCAP Consistency, Availability, and Partition toleranceThe CAP theorem (pick any two?)Eventual consistencyApologies to Martin Fowler (“NoSQL Distilled”)

Application-specific databaseshave always been with us . . .Two that I am familiar with:But these systemsare proprietary.Daytona (AT&T): “Daytona is a data managementsystem, not a database”. Built on top of the unix filesystem, this toolkit is for building application-specificand highly scalable data stores. Is used at AT&Tfor analysis of 100s of terabytes of call records.http://www2.research.att.com/ daytona/Open source is ahallmark of NoSQLDataBlitz (Bell Labs, 1995) : Main-memorydatabase system designed for embedded systemssuch as telecommunication switches. Optimized forsimple key-driven queries.What’s new? Internet scale, cluster computing, open source . . .

Something big is happening inthe land of databasesThe Internet cluster computing open source systemsmany more points in thedatabase design spaceare being explored anddeployedBroader context helps clarify the strengths and weaknessesof the standard relational/ACID approach.

http://nosql-database.org/

Eric Brewer’s PODC Keynote(July 2000)ACID vs. BASE (Basically Available, Soft-state, Eventually consistent)ACID Strong consistencyIsolationFocus on “commit”Nested transactionsAvailability?Conservative (pessimistic)Difficult evolution (e.g. schema)BASEWeak consistencyAvailability firstBest effortApproximate answers OKAggressive (optimistic)Simpler!FasterEasier evolutionA wide spectrum with many design points“Real internet systems are a careful mixture of ACID and BASE subsystems”

The emerging world of databasesThis classification is notComplete and is a bitfuzzy-wuzzy. For example,drawing a clear distinction betweenKey-value stores andDocument-oriented databasesis not always easy. And this isRapidly evolving with a lot ofcross-fertilization.Often overlooked in thebusiness-oriented hoopla:This is making BigAnalyticsaffordable for many scientificefforts (bioinformatics, astronomy,physics, economics, ) Relational Postgres MySQL Graph databases Neo4j VertexDB Key-Value stores Riak Redis BerkeleyDB Column-oriented databases BigTable, Cassandra Hbase (build on Hadoop) Document-oriented MongoDB CouchDB

The emerging world of databasesAggregatesas a naturalunit ofupdateAttribute-oriented, Relational PostgresACID MySQL Graph databases Neo4j VertexDB Key-Value storesAggregate-oriented, RiakEventual consistency Redis BerkeleyDB Column-oriented databases BigTable, Cassandra Hbase (build on Hadoop) Document-oriented MongoDB CouchDB

Martin Fowler : “Welcome to theworld of polyglot persistence”More and more we will see data-oriented systems do and willcombine traditional Relational DBMS technology with NoSQLtechnology. Must understand what problems each technology solves Use right tool for the jobThis lecture : I will put emphasis on applications of the formTraditional a stores. NoSQLtechnology

Key-Value Stores Mapping Key to blob-of-byte that application must “parse” Example : Riak (modeled on Dynamo, eventual consistency), Cassandra Typically no “query-language” for values Mapping Key to “semi-structured” value Example: RedisHuge advantage: can design data representation so that alldata needed for a given update is present on a single machine.Data can easily be partitioned (say by key ranges) overmany machines. Map-reduce initiated from set of keys . . .Disadvantage: Data retrieved by key only. And it is hard to enforce relationshipsbetween different values. If this is important for your applications, then perhapsLook elsewhere

Tables require joinsFKS(A, B, C)FKR(C, D, E)(FK Foreign Key)T(E, D4E4F4A2B2C2D5E5F5. How couldtables bepartitioned overmultipleservers? Enforcingreferentialintegrity isVERY difficult ina distributeddatabase

The Key-value approachFKS(A, B, C)FKR(C, D, E)(FK Foreign Key)T(E, F)Use this 2B2C2D4E4F4A2B2C2D5E5F5.{A : A1,B : B1,stuff : [{D : D1, F: F1},{D : D2, F: F2},{D : D3, F: F3}]}.The collection of JSON objects (keyed on A) is horizontally partitioned(sharded) across many servers. When accessed, all of the application’sdata is in one object.

Example from Lecture 1

Document-oriented systems can be tomanage the RDBMS “Publishing Problem”Need to share data without exposing internal details of your database.DB 5DB 1Exports .txt filesin ad hoc formatExports ExcelLack of standardexchange formatsrequires theimplementation ofmany ad hoctranslatorsExports printeddocumentsDB 2Exports .txt filesin ad hoc formatDB 4DB 3DB 2Exports Word DocumentsExports HTML13

XML (or JSON) as a dataexchange formatDB 5UsingdocumentorientedNoSQLsoftware fordata exchangeis an attractiveoption.DB 1Exports XMLExports XMLXML/JSON conforming toagreed uponsemanticsDB 2Exports XMLDB 4Exports XMLDB 3DB 2Exports XMLExports XML14

Examples of domain specificXML DTDs (similardevelopments with JSON) There are now lots of DTDs that have beenagreed by groups, including–––––––WML: Wireless markup language (WAP)OFX: Open financial exchangeCML: Chemical markup languageAML: Astronomical markup languageMathML: Mathematics markup languageSMIL: Synchronised Multimedia Integration LanguageThML: Theological markup language15

Fallacies of DistributedComputing (Peter Deutsch)Essentially everyone, when they first build a distributed application,makes the following eight assumptions. All prove to be false in thelong run and all cause big trouble and painful learning experiences.1.The network is reliable2.Latency is zero3.Bandwidth is infinite4.The network is secure5.Topology doesn't change6.There is one administrator7.Transport cost is zero8.The network is allacies.html

Brewer’s CAP conjecture (2000) Consistency Availability Partition toleranceConjecture :You can have at most two.A formal proof:Nancy Lynch and Seth Gilbert,“Brewer's conjecture and the feasibilityof consistent, available, partition-tolerant web services”,ACM SIGACT News, Volume 33 Issue 2 (2002), pg. 51-59.

But what do the CAP terms really mean?There seems to be no consensus . . .Random samples of various definitions found in the literature Consistency The system can guarantee that once you store a state in thesystem, it will report the same state in every subsequent operationuntil the state is explicitly changed by something outside thesystem. Is equivalent to having a single up-to-date copy of the dataAvailability All clients can find some replica of the data, even in the presence offailures A guarantee that every request receives a response about whetherit was successful or failedPartition tolerance The system properties hold even when the system is partitioned The system continues to operate despite arbitrary message loss orfailure of part of the system

Pick any two? A betterformulation.then you must engineertrade-offs betweenSuppose youhave a highlydistributed systemConsistencyAvailability

Apologies to Martin Fowler ("NoSQL Distilled") Application-specific databases have always been with us . . . Daytona (AT&T): "Daytona is a data management system, not a database". Built on top of the unix file system, this toolkit is for building application-specific and highly scalable data stores. .