MongoDB Architecture Guide - Apphosting.io

Transcription

A MongoDB White PaperMongoDB Architecture GuideMongoDB 3.4November 2016

Table of ContentsIntroduction1How we Build & Run Modern Apps1The Nexus Architecture2MongoDB Multimodel Architecture3MongoDB Data Model4MongoDB Query Model6MongoDB Data Management8Consistency & Durability9Availability10Performance & Compression12Security12Operational Management13MongoDB Atlas15Conclusion15We Can Help16Resources16

Introduction“MongoDB wasn’t designed in a lab. We built MongoDBfrom our own experiences building large-scale, highavailability, robust systems. We didn’t start from scratch, wereally tried to figure out what was broken, and tackle that.So the way I think about MongoDB is that if you takeMySQL, and change the data model from relational todocument-based, you get a lot of great features:embedded docs for speed, manageability, agiledevelopment with dynamic schemas, easier horizontalscalability because joins aren’t as important. There are a lotof things that work great in relational databases: indexes,dynamic queries and updates to name a few, and wehaven’t changed much there. For example, the way youdesign your indexes in MongoDB should be exactly theway you do it in MySQL or Oracle, you just have the optionof indexing an embedded field.”— Eliot Horowitz, MongoDB CTO and Co-FounderMongoDB is designed for how we build and rundata-driven applications with modern developmenttechniques, programming models, computing resources,and operational automation.How We Build & Run ModernApplicationsRelational databases have a long-standing position in mostorganizations, and for good reason. Relational databasesunderpin existing applications that meet current businessneeds; they are supported by an extensive ecosystem oftools; and there is a large pool of labor qualified toimplement and maintain these systems.But organizations are increasingly considering alternativesto legacy relational infrastructure, driven by challengespresented in building modern applications: Developers are working with applications that createmassive volumes of new, rapidly changing data types —structured, semi-structured, unstructured andpolymorphic data. Long gone is the twelve-to-eighteen month waterfalldevelopment cycle. Now small teams work in agilesprints, iterating quickly and pushing code every weekor two, some even multiple times every day.1

Applications that once served a finite audience are nowdelivered as services that must be always-on, accessiblefrom many different devices on any channel, and scaledglobally to millions of users. Organizations are now turning to scale-out architecturesusing open source software, commodity servers andcloud computing instead of large monolithic servers andstorage infrastructure.The Nexus ArchitectureMongoDB’s design philosophy is focused on combining thecritical capabilities of relational databases with theinnovations of NoSQL technologies. Our vision is toleverage the work that Oracle and others have done overthe last 40 years to make relational databases what theyare today. Rather than discard decades of proven databasematurity, MongoDB is picking up where they left off bycombining key relational database capabilities with thework that Internet pioneers have done to address therequirements of modern applications.FigurFiguree 1: MongoDB Nexus Architecture, blending the bestof relational and NoSQL technologiesRelational databases have reliably served applications formany years, and offer features that remain critical today asdevelopers build the next generation of applications: ExprExpressiveessive query language & secondary IndexesIndexes.Users should be able to access and manipulate theirdata in sophisticated ways to support both operationaland analytical applications. Indexes play a critical role inproviding efficient access to data, supported natively bythe database rather than maintained in application code. StrStrongong consistencyconsistency. Applications should be able toimmediately read what has been written to thedatabase. It is much more complex to build applicationsaround an eventually consistent model, imposingsignificant work on the developer, even for the mostsophisticated engineering teams. Enterprise Management and Integrations.Databases are just one piece of applicationinfrastructure, and need to fit seamlessly into theenterprise IT stack. Organizations need a database thatcan be secured, monitored, automated, and integratedwith their existing technology infrastructure, processes,and staff, including operations teams, DBAs, and dataengineers.However, modern applications impose requirements notaddressed by relational databases, and this has driven thedevelopment of NoSQL databases which offer: Flexible DatDataa Model. NoSQL databases emerged toaddress the requirements for the data we seedominating modern applications. Whether document,graph, key-value, or wide-column, all of them offer aflexible data model, making it easy to store and combinedata of any structure and allow dynamic modification ofthe schema without downtime or performance impact. ScScalabilityalability and PPerformance.erformance. NoSQL databases wereall built with a focus on scalability, so they all includesome form of sharding or partitioning. This allows thedatabase to be scaled out across commodity hardwaredeployed on-premises or in the cloud, enabling almostunlimited growth with higher throughput and lowerlatency than relational databases. AlwaysAlways-OnOn Global Deployments. NoSQL databasesare designed for continuously available systems thatprovide a consistent, high quality experience for usersall over the world. They are designed to run across manynodes, including replication to automatically synchronizedata across servers, racks, andgeographically-dispersed data centers.While offering these innovations, NoSQL systems havesacrificed the critical capabilities that people have come toexpect and rely upon from relational databases. MongoDBoffers a different approach. With its Nexus Architecture,MongoDB is the only database that harnesses the2

FigurFiguree 2: Flexible Storage Architecture, optimising MongoDB for unique application demandsinnovations of NoSQL while maintaining the foundation ofrelational databases.MongoDB MultimodelArchitectureMongoDB embraces two key trends in modern IT: Organizations are rapidly expanding the range ofapplications they deliver to digitally transform thebusiness. CIOs are rationalizing their technology portfolios to astrategic set of vendors they can leverage to moreefficiently support their business.With MongoDB, organizations can address diverseapplication needs, hardware resources, and deploymentdesigns with a single database technology: MongoDB’s flexible document data model presents asuperset of other database models. It allows data berepresented as simple key-value pairs and flat, table-likestructures, through to rich documents and objects withdeeply nested arrays and sub-documents With an expressive query language, documents can bequeried in many ways – from simple lookups to creatingsophisticated processing pipelines for data analyticsand transformations, through to faceted search, JOINsand graph traversals. With a flexible storage architecture, application ownerscan deploy storage engines optimized for differentworkload and operational requirements.MongoDB’s multimodel design significantly reducesdeveloper and operational complexity when compared torunning multiple distinct database technologies to meetdifferent applications needs. Users can leverage the sameMongoDB query language, data model, scaling, security,and operational tooling across different parts of theirapplication, with each powered by the optimal storageengine.Flexible Storage ArchitectureMongoDB uniquely allows users to mix and match multiplestorage engines within a single deployment. This flexibilityprovides a more simple and reliable approach to meetingdiverse application needs for data. Traditionally, multipledatabase technologies would need to be managed to meetthese needs, with complex, custom integration code tomove data between the technologies, and to ensureconsistent, secure access. With MongoDB’s flexiblestorage architecture, the database automatically managesthe movement of data between storage enginetechnologies using native replication.3

MongoDB 3.4 ships with four supported storage engines,all of which can coexist within a single MongoDB replicaset. This makes it easy to evaluate and migrate betweenthem, and to optimize for specific application requirements– for example combining the in-memory engine for ultralow-latency operations with a disk-based engine forpersistence. The supported storage engines include: The default WiredTiger storage engine. For manyapplications, WiredTiger's granular concurrency controland native compression will provide the best all roundperformance and storage efficiency for the broadestrange of applications. The Encrypted storage engine protecting highlysensitive data, without the performance or managementoverhead of separate filesystem encryption. (RequiresMongoDB Enterprise Advanced). The In-Memory storage engine delivering the extremeperformance coupled with real time analytics for themost demanding, latency-sensitive applications.(Requires MongoDB Enterprise Advanced). The MMAPv1 engine, an improved version of thestorage engine used in pre-3.x MongoDB releases.MongoDB Data ModelData As DocumentsMongoDB stores data in a binary representation calledBSON (Binary JSON). The BSON encoding extends thepopular JSON (JavaScript Object Notation) representationto include additional types such as int, long, date, floatingpoint, and decimal128. BSON documents contain one ormore fields, and each field contains a value of a specificdata type, including arrays, binary data and sub-documents.FigurFiguree 3: Example relational data model for a bloggingapplicationDocuments that tend to share a similar structure areorganized as collections. It may be helpful to think ofcollections as being analogous to a table in a relationaldatabase: documents are similar to rows, and fields aresimilar to columns.For example, consider the data model for a bloggingapplication. In a relational database the data model wouldcomprise multiple tables. To simplify the example, assumethere are tables for Categories, Tags, Users, Commentsand Articles.In MongoDB the data could be modeled as two collections,one for users, and the other for articles. In each blogdocument there might be multiple comments, multiple tags,and multiple categories, each expressed as an embeddedarray.4

Dynamic Schema without CompromisingData GovernanceMongoDB documents can vary in structure. For example,all documents that describe users might contain the user idand the last date they logged into the system, but onlysome of these documents might contain the user’s identityfor one or more third party applications. Fields can varyfrom document to document; there is no need to declarethe structure of documents to the system – documents areself describing. If a new field needs to be added to adocument then the field can be created without affectingall other documents in the system, without updating acentral system catalog, and without taking the systemoffline.Developers can start writing code and persist the objectsas they are created. And when developers add morefeatures, MongoDB continues to store the updated objectswithout the need to perform costly ALTER TABLEoperations, or worse – having to re-design the schemafrom scratch.FigurFiguree 4: Data as documents: simpler for developers,faster for users.As this example illustrates, MongoDB documents tend tohave all data for a given record in a single document,whereas in a relational database information for a givenrecord is usually spread across many tables. With theMongoDB document model, data is more localized, whichsignificantly reduces the need to JOIN separate tables.The result is dramatically higher performance andscalability across commodity hardware as a single read tothe database can retrieve the entire document containingall related data. Unlike many NoSQL databases, users don’tneed to give up JOINs entirely. For additional flexibility,MongoDB provides the ability to perform left-outer JOINsthat combine data from multiple collections.MongoDB BSON documents are closely aligned to thestructure of objects in the programming language. Thismakes it simpler and faster for developers to model howdata in the application will map to data stored in thedatabase.Document ValidationDynamic schemas bring great agility, but it is also importantthat controls can be implemented to maintain data quality,especially if the database is shared by multipleapplications. Unlike NoSQL databases that pushenforcement of these controls back into application code,MongoDB provides document validation within thedatabase. Users can enforce checks on documentstructure, data types, data ranges and the presence ofmandatory fields. As a result, DBAs can apply datagovernance standards, while developers maintain thebenefits of a flexible document model.Schema DesignAlthough MongoDB provides schema flexibility, schemadesign is still important. Developers and DBAs shouldconsider a number of topics, including the types of queriesthe application will need to perform, relationships betweendata, how objects are managed in the application code, andhow documents will change over time. Schema design is anextensive topic that is beyond the scope of this document.5

For more information, please see Data ModelingConsiderations.MongoDB Query Modelsubscriptions used with your self-managed instances, orhosted MongoDB Atlas instances. MongoDB Compass isfree to use for evaluation and in developmentenvironments.Idiomatic DriversMongoDB provides native drivers for all popularprogramming languages and frameworks to makedevelopment natural. Supported drivers include Java,Javascript, .NET, Python, Perl, PHP, Scala and others, inaddition to 30 community-developed drivers. MongoDBdrivers are designed to be idiomatic for the given language.One fundamental difference with relational databases isthat the MongoDB query model is implemented asmethods or functions within the API of a specificprogramming language, as opposed to a completelyseparate language like SQL. This, coupled with the affinitybetween MongoDB’s JSON document model and the datastructures used in object-oriented programming, makesintegration with applications simple. For a complete list ofdrivers see the MongoDB Drivers page.Interacting with the DatabaseMongoDB offers developers and administrators a range oftools for interacting with the database, independent of thedrivers.The mongo shell is a rich, interactive JavaScript shell that isincluded with all MongoDB distributions. AdditionallyMongoDB Compass is a sophisticated and intuitive GUI forMongoDB. Offering rich schema exploration andmanagement, Compass allows DBAs to modify documents,create validation rules, and efficiently optimize queryperformance by visualizing explain plans and index usage.Sophisticated queries can be built and executed by simplyselecting document elements from the user interface, withthe results viewed both graphically, and as a set of JSONdocuments. All of these tasks can be accomplished from apoint and click interface, and all with zero knowledge ofMongoDB's query language.MongoDB Compass is included with both MongoDBProfessional and MongoDB Enterprise AdvancedFigurFiguree 5: Interactively build and execute database querieswith MongoDB CompassQuerying and Visualizing DataUnlike NoSQL databases, MongoDB is not limited tosimple Key-Value operations. Developers can build richapplications using complex queries, aggregations andsecondary indexes that unlock the value in structured,semi-structured and unstructured data.A key element of this flexibility is MongoDB's support formany types of queries. A query may return a document, asubset of specific fields within the document or complexaggregations and transformation of many documents: Key-value queries return results based on any field inthe document, often the primary key. Range queries return results based on values definedas inequalities (e.g, greater than, less than or equal to,between). Geospatial queries return results based on proximitycriteria, intersection and inclusion as specified by apoint, line, circle or polygon. SearSearcch queries return results in relevance order and infaceted groups, based on text arguments using Booleanoperators (e.g., AND, OR, NOT), and through bucketing,grouping and counting of query results. With support forcollations, data comparison and sorting order can bedefined for over 100 different languages and locales.6

AggrAggregationegation FFrameworkramework queries return aggregationsand transformations of values returned by the query(e.g., count, min, max, average, similar to a SQL GROUPBY statement). JOIJOINsNs and graph traversals. Through the lookupstage of the aggregation pipeline, documents fromseparate collections can be combined through a leftouter JOIN operation. graphLookup brings nativegraph processing within MongoDB, enabling efficienttraversals across trees, graphs and hierarchical data touncover patterns and surface previously unidentifiedconnections. MapReduce queries execute complex data processingthat is expressed in JavaScript and executed acrossdata in the database.Additionally the MongoDB Connector for Apache Sparkexposes Spark’s Scala, Java, Python, and R libraries.MongoDB data is materialized as DataFrames andDatasets for analysis through machine learning, graph,streaming, and SQL APIs.Data Visualization with BI ToolsMongoDB includes support for many types of secondaryindexes that can be declared on any field in the document,including fields within arrays: Unique IndexesIndexes. By specifying an index as unique,MongoDB will reject inserts of new documents or theupdate of a document with an existing value for the fieldfor which the unique index has been created. By defaultall indexes are not set as unique. If a compound index isspecified as unique, the combination of values must beunique. Compound Indexes. It can be useful to createcompound indexes for queries that specify multiplepredicates For example, consider an application thatstores data about customers. The application may needto find customers based on last name, first name, andcity of residence. With a compound index on last name,first name, and city of residence, queries couldefficiently locate people with all three of these valuesspecified. An additional benefit of a compound index isthat any leading field within the index can be used, sofewer indexes on single fields may be necessary: thiscompound index would also optimize queries looking forcustomers by last name.With the MongoDB Connector for BI included in MongoDBEnterprise Advanced, modern application data can beeasily analyzed with industry-standard SQL-based BI andanalytics platforms. Business analysts and data scientistscan seamlessly analyze semi and unstructured datamanaged in MongoDB, alongside traditional data in theirSQL databases using the same BI tools deployed withinmillions of enterprises. Array Indexes. For fields that contain an array, eacharray value is stored as a separate index entry. Forexample, documents that describe products mightinclude a field for components. If there is an index onthe component field, each component is indexed andqueries on the component field can be optimized by thisindex. There is no special syntax required for creatingarray indexes – if the field contains an array, it will beindexed as a array index.Indexing TTL Indexes. In some cases data should expire out ofthe system automatically. Time to Live (TTL) indexesallow the user to specify a period of time after which thedata will automatically be deleted from the database. Acommon use of TTL indexes is applications thatmaintain a rolling window of history (e.g., most recent100 days) for user actions such as clickstreams.Indexes are a crucial mechanism for optimizing systemperformance and scalability while providing flexible accessto the data. Like most database management systems,while indexes will improve the performance of someoperations by orders of magnitude, they incur associatedoverhead in write operations, disk usage, and memoryconsumption. By default, the WiredTiger storage enginecompresses indexes in RAM, freeing up more of theworking set for documents. Geospatial Indexes. MongoDB provides geospatialindexes to optimize queries related to location within atwo dimensional space, such as projection systems forthe earth. These indexes allow MongoDB to optimizequeries for documents that contain points or a polygon7

that are closest to a given point or line; that are within acircle, rectangle, or polygon; or that intersect with acircle, rectangle, or polygon. Partial Indexes. By specifying a filtering expressionduring index creation, a user can instruct MongoDB toinclude only documents that meet the desiredconditions, for example by only indexing activecustomers. Partial indexes balance delivering lowlatency query performance while reducing systemoverhead. Sparse Indexes. Sparse indexes only contain entriesfor documents that contain the specified field. Becausethe document data model of MongoDB allows forflexibility in the data model from document to document,it is common for some fields to be present only in asubset of all documents. Sparse indexes allow forsmaller, more efficient indexes when fields are notpresent in all documents. Text SearSearcch Indexes. MongoDB provides a specializedindex for text search that uses advanced,language-specific linguistic rules for stemming,tokenization, case sensitivity and stop words. Queriesthat use the text search index will return documents inrelevance order. One or more fields can be included inthe text index.Query OptimizationMongoDB automatically optimizes queries to makeevaluation as efficient as possible. Evaluation normallyincludes selecting data based on predicates, and sortingdata based on the sort criteria provided. The queryoptimizer selects the best index to use by periodicallyrunning alternate query plans and selecting the index withthe best response time for each query type. The results ofthis empirical test are stored as a cached query plan andare updated periodically. Developers can review andoptimize plans using the powerful explain method andindex filters. Using MongoDB Compass, DBAs canvisualize index coverage, enabling them to determine whichspecific fields are indexed, their type, size, and how oftenthey are used. Compass also provides the ability tovisualize explain plans, presenting key information on howa query performed – for example the number of documentsreturned, execution time, index usage, and more. Eachstage of the execution pipeline is represented as a node ina tree, making it simple to view explain plans from queriesdistributed across multiple nodes.Index intersection provides additional flexibility by enablingMongoDB to use more than one index to optimize anad-hoc query at run-time.Covered QueriesQueries that return results containing only indexed fieldsare called covered queries. These results can be returnedwithout reading from the source documents. With theappropriate indexes, workloads can be optimized to usepredominantly covered queries.MongoDB Data ManagementAuto-ShardingMongoDB provides horizontal scale-out for databases onlow cost, commodity hardware or cloud infrastructure usinga technique called sharding, which is transparent toapplications. Sharding distributes data across multiplephysical partitions called shards. Sharding allowsMongoDB deployments to address the hardwarelimitations of a single server, such as bottlenecks in RAMor disk I/O, without adding complexity to the application.MongoDB automatically balances the data in the shardedcluster as the data grows or the size of the clusterincreases or decreases.Unlike relational databases, sharding is automatic and builtinto the database. Developers don't face the complexity ofbuilding sharding logic into their application code, whichthen needs to be updated as shards are migrated.Operations teams don't need to deploy additionalclustering software or expensive shared-disk infrastructureto manage process and data distribution or failure recovery.FigurFiguree 6: Automatic sharding provides horizontal scalabilityin MongoDB.8

Unlike other distributed databases, multiple shardingpolicies are available that enable developers andadministrators to distribute data across a cluster accordingto query patterns or data locality. As a result, MongoDBdelivers much higher scalability across a diverse set ofworkloads:query to all shards, aggregating and sorting the results asappropriate. Multiple query routers can be used with aMongoDB system, and the appropriate number isdetermined based on performance and availabilityrequirements of the application. Range SharSharding.ding. Documents are partitioned acrossshards according to the shard key value. Documentswith shard key values close to one another are likely tobe co-located on the same shard. This approach is wellsuited for applications that need to optimize rangebased queries. Hash SharSharding.ding. Documents are distributed accordingto an MD5 hash of the shard key value. This approachguarantees a uniform distribution of writes acrossshards, but is less optimal for range-based queries. Zone SharSharding.ding. Provides the the ability for DBAs andoperations teams to define specific rules governing dataplacement in a sharded cluster. Zones accommodate arange of deployment scenarios – for example locatingdata by geographic region, by hardware configurationfor tiered storage architectures, or by applicationfeature. Administrators can continuously refine dataplacement rules by modifying shard key ranges, andMongoDB will automatically migrate the data to its newzone.Tens of thousands of organizations use MongoDB to buildhigh-performance systems at scale. You can read moreabout them on the MongoDB scaling page.Query RouterSharding is transparent to applications; whether there isone or one hundred shards, the application code forquerying MongoDB is the same. Applications issue queriesto a query router that dispatches the query to theappropriate shards.For key-value queries that are based on the shard key, thequery router will dispatch the query to the shard thatmanages the document with the requested key. Whenusing range-based sharding, queries that specify ranges onthe shard key are only dispatched to shards that containdocuments with values within the range. For queries thatdon’t use the shard key, the query router will broadcast theFigurFiguree 7: Sharding is transparent to applications.ConsistencyTransaction Model & Configurable WriteAvailabilityMongoDB is ACID compliant at the document level. One ormore fields may be written in a single operation, includingupdates to multiple sub-documents and elements of anarray. The ACID guarantees provided by MongoDB ensurescomplete isolation as a document is updated; any errorscause the operation to roll back and clients receive aconsistent view of the document.MongoDB also allows users to specify write availability inthe system using an option called the write concern. Thedefault write concern acknowledges writes from theapplication, allowing the client to catch network exceptionsand duplicate key errors. Developers can use MongoDB'sWrite Concerns to configure operations to commit to theapplication only after specific policies have been fulfilled –for example only after the operation has been flushed tothe journal on disk. This is the same mode used by manytraditional relational databases to provide durabilityguarantees. As a distributed system, MongoDB presentsadditional flexibility in enabling users to achieve theirdesired durability goals, such as writing to at least tworeplicas in one data center and one replica in a seconddata center. Each query can specify the appropriate writeconcern, ranging from unacknowledged to9

acknowledgement that writes have been committed to allreplicas.AvailabilityReplicationMongoDB maintains multiple copies of data called replicasets using native replication. A replica set is a fullyself-healing shard that helps prevent database downtimeand can be used to scale read operations. Replica failoveris fully automated, eliminating the need for administratorsto intervene manually.A replica set consists of multiple replicas. At any given timeone member acts as the primary replica set member andthe other members act as secondary replica set members.MongoDB is strongly consistent by default: reads andwrites are issued to a primary copy of the data. If theprimary member fails for any reason (e.g., hardware failure,network partition) one of the secondary members isautomatically elected to primary, typically within severalseconds. As discussed below, sophisticated rules governwhich secondary replicas are evaluated for promotion tothe primary member.FigurFiguree 8: Self-Healing MongoDB Replica Sets for HighAvailabilityThe number of replicas in a MongoDB replica set isconfigurable: a larger number of replicas providesincreased data durability and protection against databasedowntime (e.g., in case of multiple machine failures, rackfailures, data center failures, or network partitions). Up to50 members can be provisioned per replica set.Enabling tunable consistency, applications can optionallyread from secondary replicas, where data is eventuallyconsistent by default. Reads from secondaries can beuseful in scenarios where it is acceptable for data to beslightly out of date, such as some reporting and analyticalapplications. Administrators can control which secondarymembers service a query, based on a consistency windowdefined in the driver. For data-center aware reads,applications can also read from

Relational databases have a long-standing position in most organizations, and for good reason. Relational databases underpin existing applications that meet current business needs; they are supported by an extensive ecosystem of tools; and there is a large pool of labor qualified to implement and maintain these systems.