Introduction To NoSQL Databases

Transcription

Introduction to NoSQL Databases
Databases 2 (VU) (706.711 / 707.030)
Roman Kern
Institute of Interactive Systems and Data Science, Technical University Graz
2018-10-15
Roman Kern (ISDS, TU Graz), NoSQL DBs, 2018-10-15

Intro: Why NoSQL?

Introduction: The birth of NoSQL
- Term appeared in 2009
- Not only SQL
- Common properties (pros)
  - Non-relational
  - Schema-less (schema free)
  - Good scalability
- Potential down-sides (cons)
  - Limited query abilities
  - Not standardised (evolving technology)

Introduction: Motivations for starting NoSQL
1. Growth of data
   - User-generated
   - Machine-generated, e.g. log-files, sensors
   - Higher degree of connectedness
2. Need for flexibility
   - instead of a rigid schema
   - For semi-structured data (schema-free / schema-less)
3. No separation of data management and data processing

Introduction: Data Management vs. Data Processing
- Classic CRUD operations no longer sufficient
  - Advanced data analytics needs to combine both functionalities
- Paradigm shift: Bring the code to the data
  - i.e. the locality of data is taken into consideration for the data processing
- Example applications:
  - Online transaction processing (OLTP) -> relational databases
  - Online analytical processing (OLAP) -> data warehousing
  - High performance, scalability -> NoSQL

Introduction: Scalability
- Scale up (scale vertically) vs scale out (scale horizontally)
  - Scale up: Add more hardware to a single machine
  - Scale out: Add more machines
- Degree of sharing
  - Shared memory (single machine, single storage)
  - Shared disk (multiple machines, single storage)
  - Shared nothing (multiple machines, multiple storage)

Introduction: Replication
- In a distributed system, data is replicated between nodes
  - thus data is stored multiple times
- Types of replication
  1. Synchronous (eager)
     - All data is replicated to all nodes before ending the operation
     - -> complex, even impossible in some configurations
  2. Asynchronous (lazy)
     - Operation is finished before all data has been written by all nodes
     - -> potentially inconsistent
- Access for writing options
  1. Single node accepts writing of data (master/slave, primary copy)
  2. All nodes accept write operations (update anywhere)
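The difference between eager and lazy replication with a primary copy can be sketched in a few lines of Python. This is only an in-memory illustration, not how any particular database implements it; the names `Replica`, `PrimaryCopyStore` and `flush` are assumptions made up for this example.

```python
from collections import deque


class Replica:
    """A node holding a full copy of the data (hypothetical, in-memory)."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class PrimaryCopyStore:
    """Single writable primary that replicates to the other nodes."""

    def __init__(self, replicas, eager):
        self.primary = Replica("primary")
        self.replicas = replicas
        self.eager = eager
        self.pending = deque()  # writes not yet shipped (lazy mode only)

    def write(self, key, value):
        self.primary.data[key] = value
        if self.eager:
            # Synchronous: every replica is updated before the write returns.
            for replica in self.replicas:
                replica.data[key] = value
        else:
            # Asynchronous: return immediately, replicate later.
            self.pending.append((key, value))

    def flush(self):
        """Ship queued writes to the replicas (background work in a real system)."""
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.replicas:
                replica.data[key] = value


eager = PrimaryCopyStore([Replica("r1"), Replica("r2")], eager=True)
eager.write("x", 1)
eager_reads = [r.data.get("x") for r in eager.replicas]

lazy = PrimaryCopyStore([Replica("r1"), Replica("r2")], eager=False)
lazy.write("x", 1)
stale_reads = [r.data.get("x") for r in lazy.replicas]  # replicas lag behind
lazy.flush()
fresh_reads = [r.data.get("x") for r in lazy.replicas]
```

In the lazy case a read from a replica between `write` and `flush` returns stale data, which is the "potentially inconsistent" window named on the slide.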

Introduction: Sharding
- In a distributed system, each node may be responsible for different parts of the full data
  - -> still data is replicated for redundancy
  - Also known as: partitioning, fragmentation
  - Advantage: improved efficiency (fewer resources)
- Types of sharding:
  1. Hash-based
     - Hash-key determines partition -> no data locality
  2. Range-based
     - Assigns range (binning) -> rebalancing needed
  3. Entity-group
     - All data from single transactions assigned to a single partition -> partitions cannot easily change
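Hash-based and range-based sharding can be contrasted with a small Python sketch. The function names, the choice of MD5 as a stable hash, and the example bounds are assumptions for illustration only; real systems use their own partitioners.

```python
import bisect
import hashlib


def hash_shard(key, num_shards):
    """Hash-based: a stable hash of the key picks the partition."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def range_shard(key, upper_bounds):
    """Range-based: upper_bounds[i] is the exclusive upper bound of shard i;
    keys at or above the last bound fall into one final shard."""
    return bisect.bisect_right(upper_bounds, key)


# shard 0: key < "h", shard 1: "h" <= key < "p", shard 2: everything else
bounds = ["h", "p"]
users = ["alice", "bob", "mallory", "zoe"]
range_placement = {u: range_shard(u, bounds) for u in users}
hash_placement = {u: hash_shard(u, 3) for u in users}
```

With range-based placement, neighbouring keys such as "alice" and "bob" land on the same shard, so range scans stay local but a skewed key distribution forces rebalancing. With hash-based placement, neighbouring keys are scattered, which matches the "no data locality" remark.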

Introduction: ACID vs. BASE
- BASE
  - Basically Available
  - Soft state
  - Eventually consistent
- Trade-offs for improved performance
- Some database systems prefer performance over durability
- Redundancy for improved performance (no normalisation)

Introduction: CAP theorem
- Not possible to achieve all three properties:
  - Consistent
    - Reads are guaranteed to incorporate all previous writes (all nodes see the same data at the same time)
  - Available
    - Every query returns an answer, instead of an error (failures do not prevent the remaining system from being operational)
  - Partition tolerant
    - The system runs, even if a part of the system is not reachable (e.g. due to network failure, message loss)
- Implications of CAP
  - One needs to find a trade-off between the properties, e.g. choose availability over consistency (as consistency is a major bottleneck for scalability)

Introduction: Classification scheme of NoSQL systems
1. According to the data model
   - Key-Values
   - Tabular (wide column)
   - Document
   - Graph
   - Specialised, e.g. time-series, triples, objects, XML, files, ...
2. According to the CAP trade-off
   - Available & partition tolerant
   - Consistent & partition tolerant
   - Not partition tolerant
3. According to the replication & sharding types
   - lazy vs. eager
   - hash based vs. range based vs. entity-group

NoSQL Systems: What types of NoSQL systems are out there?

NoSQL Systems: Distributed File System
Data model: Folders & files (plus metadata, e.g. time of creation, ...)
Interface: File system operations
Variations:
- Network File System: (often) single storage
- Cluster File Systems: (multiple) storage
- Distributed File Systems: multiple, independent storage
Examples: NFS, GPFS, HDFS

NoSQL Systems: Key/Value Store
Data model: Key -> Value, where the value is a (binary) opaque blob; similar to hash-tables
Interface: CRUD operations
Properties:
- Excellent scalability
- May support redundant storage
Examples: Amazon Dynamo (AP, lazy, hash-based), Redis (CP, lazy, hash-based), Riak (AP, lazy, hash-based), Memcached (CP), ...
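The key/value data model can be illustrated with a minimal in-memory sketch. The class `KeyValueStore` and its method names are hypothetical and do not correspond to the API of any of the systems listed; the point is only that the store treats the value as bytes it cannot interpret.

```python
import json


class KeyValueStore:
    """Key -> opaque blob, with plain CRUD operations."""

    def __init__(self):
        self._blobs = {}

    def put(self, key, blob):
        """Create or update: the store only accepts uninterpreted bytes."""
        if not isinstance(blob, bytes):
            raise TypeError("values are opaque byte strings")
        self._blobs[key] = blob

    def get(self, key):
        """Read: returns the whole blob, or None if the key is unknown."""
        return self._blobs.get(key)

    def delete(self, key):
        self._blobs.pop(key, None)


store = KeyValueStore()
store.put("user:42", json.dumps({"name": "Ada", "city": "Graz"}).encode("utf-8"))
# The store cannot look inside the blob; the client decodes the whole value.
user = json.loads(store.get("user:42").decode("utf-8"))
store.delete("user:42")
```

Because the value is opaque, any query on a part of it (for example "all users in Graz") has to be done by the client, which is the main limitation compared with document stores.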

NoSQL Systems: Tabular / Wide Column
Data model: (Rowkey, Column, Timestamp) -> Value, where the value is a (binary) opaque blob
Interface: CRUD operations, scan operations
Properties:
- Allow vertical and horizontal partitioning
  - -> adjacent rows are stored close to each other
  - -> certain columns are stored close to each other, e.g. via column families
- Each cell might have multiple versions (timestamps)
Examples: Cassandra (AP, lazy, hash-based), Google BigTable (CP, eager, range-based), HBase (CP, eager, range-based), Parquet, ...
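A rough Python sketch of the (Rowkey, Column, Timestamp) -> Value model follows. It assumes a nested dictionary as storage and uses the made-up names `WideColumnTable`, `put`, `get` and `scan`; real wide-column stores keep rows sorted on disk rather than sorting on every scan.

```python
from collections import defaultdict


class WideColumnTable:
    """(row key, column, timestamp) -> value, kept in a nested dict."""

    def __init__(self):
        # row key -> column -> {timestamp: value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        self._rows[row][column][timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest version, or the newest one at or before timestamp."""
        versions = self._rows.get(row, {}).get(column, {})
        candidates = [t for t in versions if timestamp is None or t <= timestamp]
        return versions[max(candidates)] if candidates else None

    def scan(self, start_row, stop_row, columns):
        """Yield (row, {column: latest value}) for start_row <= row < stop_row."""
        for row in sorted(self._rows):
            if start_row <= row < stop_row:
                yield row, {c: self.get(row, c) for c in columns}


table = WideColumnTable()
table.put("user#001", "profile:name", 1, b"Ada")
table.put("user#001", "profile:name", 2, b"Ada L.")
table.put("user#002", "profile:name", 1, b"Bob")
table.put("user#002", "stats:logins", 1, b"7")

latest = table.get("user#001", "profile:name")
older = table.get("user#001", "profile:name", timestamp=1)
names = list(table.scan("user#001", "user#003", ["profile:name"]))
```

The column names use a `family:qualifier` form to hint at column families; the scan reads only the requested columns, which is the access pattern that makes this model attractive when only certain columns are needed.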

NoSQL Systems: Example of Cassandra Query Language

NoSQL Systems: Document Storage
Data model: (Collection, Key) -> Value, where the value is understood by the system
Interface: CRUD operations, specialised queries (e.g. JavaScript)
Properties:
- Documents are schema free, i.e. no need for schema migrations
- Documents may also be versioned
- Documents are often JSON
Examples: CouchDB (AP, lazy), MongoDB (CP, lazy/eager, range-based), Amazon SimpleDB (AP), Cloudant, Rethink (lazy/eager, range-based), ...
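In contrast to a key/value store, a document store understands the value and can query parts of it. The following Python sketch shows the idea with plain dictionaries as documents; `DocumentStore`, `insert`, `get` and `find` are hypothetical names for this example and not the API of any listed system.

```python
import uuid


class DocumentStore:
    """(collection, key) -> document, where documents are plain dicts."""

    def __init__(self):
        self._collections = {}  # collection name -> {key: document}

    def insert(self, collection, document, key=None):
        key = key or uuid.uuid4().hex
        self._collections.setdefault(collection, {})[key] = document
        return key

    def get(self, collection, key):
        return self._collections.get(collection, {}).get(key)

    def find(self, collection, **criteria):
        """Return documents whose top-level fields equal all given criteria."""
        docs = self._collections.get(collection, {}).values()
        return [d for d in docs
                if all(d.get(field) == value for field, value in criteria.items())]


db = DocumentStore()
db.insert("users", {"name": "Ada", "city": "Graz"}, key="u1")
# Schema free: the second document carries a field the first one lacks.
db.insert("users", {"name": "Bob", "city": "Wien", "tags": ["admin"]}, key="u2")
in_graz = db.find("users", city="Graz")
```

Adding the `tags` field required no migration of existing documents, which is what "schema free" means in practice.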

NoSQL Systems: Key/Value Store vs. Document Storage vs. Tabular Storage
- Key/Value store, if requirements are simple
- Document store, if need to access parts of the value
- Document store, if documents are independent units
- Tabular store, if multiple entries (e.g. rows) are updated at the same time
- Tabular store, if only certain columns need to be retrieved
- Things to watch out for
  - Maximum size of value depends on actual implementation
  - Avoid joins for optimal performance

NoSQL Systems: Consistency vs. Availability vs. Partitioning
See also: ems

NoSQL Systems: Graph Storage
Data model: G = (V, E), where each vertex or edge may have additional properties
Interface: Graph traversals, specialised queries & insert/update methods
Properties:
- Optimised for graph traversal, i.e. no joins needed
- Types of edges can be specified by the user
Examples: Neo4J (CA), OrientDB (CA), TitanDB, Giraph, InfiniteGraph (CA), ...
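A property graph with typed edges and a traversal can be sketched as follows. This is an assumption-laden toy (adjacency lists in memory, a breadth-first search limited by hop count); the names `PropertyGraph` and `reachable` are invented for this example.

```python
from collections import deque


class PropertyGraph:
    """G = (V, E) with properties on vertices and on typed, directed edges."""

    def __init__(self):
        self.vertices = {}  # vertex id -> properties
        self.edges = {}     # vertex id -> list of (edge type, target id, properties)

    def add_vertex(self, vid, **props):
        self.vertices[vid] = props
        self.edges.setdefault(vid, [])

    def add_edge(self, source, edge_type, target, **props):
        # Both endpoints are expected to have been added as vertices first.
        self.edges[source].append((edge_type, target, props))

    def reachable(self, start, edge_type, max_hops):
        """Breadth-first traversal along edges of one type; returns id -> hop count."""
        hops = {start: 0}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            if hops[current] == max_hops:
                continue
            for etype, target, _ in self.edges[current]:
                if etype == edge_type and target not in hops:
                    hops[target] = hops[current] + 1
                    queue.append(target)
        del hops[start]
        return hops


g = PropertyGraph()
for name in ("ada", "bob", "eve", "dan"):
    g.add_vertex(name, kind="person")
g.add_edge("ada", "KNOWS", "bob", since=2015)
g.add_edge("bob", "KNOWS", "eve", since=2017)
g.add_edge("eve", "KNOWS", "dan", since=2018)
g.add_edge("ada", "BLOCKS", "dan")

friends = g.reachable("ada", "KNOWS", max_hops=2)
```

Each hop follows stored references directly instead of joining tables, which is the sense in which graph stores need "no joins" for traversals.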

NoSQL Systems: Search Storage
Data model: documents, metadata; often stored as Vector Space Model
Interface: specialised query languages
Properties:
- Documents may consist of multiple fields (facets)
  - -> field may be structured as well, e.g. date, integer, strings
- Fine control over indexing process, i.e. how each field is indexed
Examples: Solr, ElasticSearch, CrateDB, ...
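To make the Vector Space Model concrete, the sketch below indexes each field of a document as a term-frequency vector and ranks documents by cosine similarity to the query. It is a deliberately simplified assumption (whitespace tokenisation, raw term frequencies, no inverted index), and `FieldedIndex`, `tokenize` and `cosine` are names made up for this example.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def cosine(a, b):
    """Cosine similarity of two sparse term-frequency vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class FieldedIndex:
    def __init__(self):
        self._vectors = {}  # (doc id, field) -> term-frequency vector

    def add(self, doc_id, fields):
        """Index every field of a document separately."""
        for field, text in fields.items():
            self._vectors[(doc_id, field)] = Counter(tokenize(text))

    def search(self, field, query):
        """Rank documents by similarity between the query and one field."""
        q = Counter(tokenize(query))
        scores = [(cosine(q, vec), doc_id)
                  for (doc_id, f), vec in self._vectors.items() if f == field]
        return [doc_id for score, doc_id in sorted(scores, reverse=True) if score > 0]


index = FieldedIndex()
index.add("d1", {"title": "graph databases", "body": "vertices and edges"})
index.add("d2", {"title": "document databases", "body": "json documents"})
index.add("d3", {"title": "time series", "body": "sensor values"})
hits = index.search("title", "graph databases")
```

Searching per field mirrors the slide's point that each field can be indexed and queried in its own way; production search engines add weighting schemes and inverted indexes on top of this basic model.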

NoSQL Systems: Object Oriented Storage
Data model: classes, objects, relations
Interface: CRUD, traversal methods
Properties:
- Known model from OO programming
- Often strong coupling between DB system and programming language
Examples: db4o (CA), Versant (CA), Objectivity (CA), ...

NoSQL Systems: XML Databases
Data model: XML, RDF (triples)
Interface: CRUD, query languages (XQuery, SPARQL, ...)
Properties:
- RDF based systems often called TripleStore
- Often used in combination with semantic technologies
Examples: BaseX, MarkLogic (CA), AllegroGraph (CA), BigData, ...

NoSQL Systems: Timeseries Databases
Data model: (timestamp) -> value
Interface: CRUD, specialised query languages
Variations:
- Type of value is the same for all entries, typically simple, e.g. floating point number
- Complex value type, e.g. JSON
Properties:
- Optimised for time series data, i.e. small storage requirements
- Query for time ranges
- Operations on time series
Examples: InfluxDB, KairosDB, ...
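The time-range query and a simple operation on a series can be sketched in Python as below. This assumes the simple variation (one float per timestamp) and keeps samples sorted so that a range is found by binary search; the class `TimeSeries` and its methods are hypothetical and unrelated to the query languages of the listed systems.

```python
import bisect


class TimeSeries:
    """Series of (timestamp, float) samples, kept sorted by timestamp."""

    def __init__(self):
        self._times = []
        self._values = []

    def insert(self, timestamp, value):
        # Keep both lists aligned and ordered, even for late-arriving samples.
        i = bisect.bisect_right(self._times, timestamp)
        self._times.insert(i, timestamp)
        self._values.insert(i, float(value))

    def range(self, start, stop):
        """Samples with start <= timestamp < stop, located via binary search."""
        lo = bisect.bisect_left(self._times, start)
        hi = bisect.bisect_left(self._times, stop)
        return list(zip(self._times[lo:hi], self._values[lo:hi]))

    def mean(self, start, stop):
        """Example of an operation on the series: average over a time range."""
        values = [v for _, v in self.range(start, stop)]
        return sum(values) / len(values) if values else None


temps = TimeSeries()
for t, v in [(10, 20.0), (30, 22.0), (20, 21.0), (40, 30.0)]:
    temps.insert(t, v)

window = temps.range(10, 40)
average = temps.mean(10, 40)
```

Storing timestamps and values in separate, homogeneous arrays is also what makes compact encodings possible, which relates to the "small storage requirements" property.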

NoSQL Systems: In-Memory Databases
Data model: (key) -> value, but not limited to this model
Interface: CRUD, specialised query languages
Properties:
- Data is stored in RAM
- Often distributed over multiple machines (RAM is the new Disk)
- In its purest form does not satisfy durability criteria
Examples: Hazelcast, Redis, SAP HANA, ...

NoSQL Systems: API & Data Formats
- NoSQL systems often use RESTful APIs
  - Direct match with data model and CRUD operations
- Serialisation of objects
  - Many techniques used
  - e.g. Apache Avro, Protocol Buffers, ...

NoSQL Systems: Features
- Not all NoSQL systems support transactions
  - Instead they support atomic single transactions
  - Therefore not all operations are supported
- Not all NoSQL systems support security features
  - e.g. access control

NoSQL Systems: Cloud Database Solutions
- Storage in the internet (cloud)
- DBaaS - Database as a Service
- Not limited to NoSQL, traditional SQL databases are available as well
- Multi-tenancy as important feature (separation of multiple clients)
  - Private OS - all separate (e.g. Amazon RDS)
  - Private process - same machine (e.g. Compose)
  - Private schema - same database (e.g. Google DataStore)
  - Shared schema - same tables (most SaaS apps)

Current State: Current state of data storage systems
- Depending on the actual requirements select a suitable storage solution
- Or select multiple solutions for each sub-system
  - -> polyglot persistence

Future of NoSQL Systems: Outlook - NewSQL
- Attempt to achieve consistency and availability for distributed systems
  - E.g. Google Spanner, CockroachDB
  - CockroachDB builds on the Raft consensus algorithm
  - Google Spanner relies on specialised hardware
- https://github.com/cockroachdb/cockroach

The End
Next: Graph Databases
Credits: Scalable Data Management: NoSQL Data Stores in Research and Practice, http://icde2016.fi/tutorials.php
