MongoDB Vs. Couchbase Server

Transcription

MongoDB vs. Couchbase Server:Architectural Differences and Their ImpactThis 45-page paper compares two popular NoSQL offerings,diving into their architecture, clustering, replication, and caching.By Vladimir Starostenkov,R&D Engineer at AltorosQ3 2015

Table of Contents1. OVERVIEW . 32. INTRODUCTION . 33. ARCHITECTURE. 44. CLUSTERING . 44.1 MongoDB .44.2 Couchbase Server .54.3 Partitioning .64.4 MongoDB .74.5 Couchbase Server .74.6 Summary .75. REPLICATION . 85.1 MongoDB .85.2 Couchbase Server .105.3 Summary .126. CACHING . 136.1 MongoDB .136.2 Couchbase Server .136.3 Impact on Performance .137. CONCLUSION . 148. APPENDIX A. NOTES ON MAIN TOPICS . 148.1 Notes on Topology .148.2 Notes on Partitioning .158.3 Notes on Couchbase Replication .218.4 Notes on Availability .218.5 Notes on Read Path .258.6 Notes on Durability .289. APPENDIX B. NOTES ON MEMORY . 329.1 MongoDB .329.2 Couchbase Server .3410. APPENDIX C. NOTES ON STORAGE ENGINE . 3810.1 MongoDB .3810.2 Couchbase Server .4111. ABOUT THE AUTHORS . 45 1 (650) 265-2266engineering@altoros.comwww.altoros.com twitter.com/altorosClick for moreNoSQL research!2

1. OverviewOrganizations integrating distributed computing into their operations face the challenge of adopting adatabase that is up to the new challenges presented by a new architecture. Whether virtualizing localresources, scaling horizontally, or both, a distributed computing architecture requires new thinkingabout how to handle flows of un structured and semi-structured data, often in real-time.The development of open source NoSQL database options addresses this challenge. NoSQL—generally meant to stand for “Not only SQL”—provides a method to handle these new data flowsreliably and at scale.Two of the most prominent NoSQL offerings come from MongoDB and Couchbase. Both are popular,both perform well, and both have major customers and strong communities.2. IntroductionMongoDB and Couchbase Server are the two most popular document-oriented databases, designedto handle the semi-structured increasingly found in enterprise cloud computing and Internet of Thingsdeployments. They belong to a new generation of NoSQL databases.They are also both considered to be CP systems within the well-known CAP theorem systempostulated by Professor Eric Brewer of UC-Berkeley. However, they both have options to behave likean AP system.The letters in the CAP acronym represent: Consistency (the latest information is always available everywhere)Availability (every read and write request receives a response)Partitioning Tolerance (which can be thought of as a form of fault tolerance)The CAP theorem states that a database cannot simultaneously provide all three of the aboveguarantees. Practically speaking, NoSQL databases are forced to choose whether to favorconsistency or availability in specific scenarios.Within this context, CP systems, such as MongoDB and Couchbase Server, can be adjusted to favoravailability via settings and configuration options that change them to be AP systems such that data iseventually consistent, often within milliseconds. MongoDB and Couchbase Server rely onasynchronous persistence for performance but support synchronized persistence when necessary.They also have options to weaken consistency for availability.This paper identifies key architectural differences between the two databases. It then highlights theimpact of these architectural decision on the performance, scalability, and availability of each. 1 (650) 265-2266engineering@altoros.comwww.altoros.com twitter.com/altorosClick for moreNoSQL research!3

3. ArchitectureThis white paper is not a performance benchmark report. Instead, it offers an architect’s view as wellas a deep technical comparison of how the two databases go about their tasks.Some conclusions are offered, with the caveat that selection of a NoSQL database for any individualenterprise or project must be based on the unique goals, needs, and expectations faced byexecutives, project managers, technical evaluators and decision makers, product owners, etc.This white paper was written by a joint team of Altoros engineers and technical writers from thecompany’s Sunnyvale, CA headquarters and Minsk Development Center. It starts with an examinationof clustering, which forms the key difference between MongoDB and Couchbase Server. It then looksat the implications of clustering on availability, scalability, performance, and durability.Additional technical information on these topics is provided in Appendix A. Supplemental informationon memory and storage is provided in Appendix B and C.4. ClusteringThe most fundamental feature of a distributed system is its topology—the set of properties that do notchange under system scaling. Whether there are 10 or 1,000 nodes in a cluster, the node types, theirroles, and connection patterns stay the same.By looking at the topology, we can understand the system components and the information flowsamong them.The two databases under examination here offer profoundly different topologies. MongoDB takes ahierarchical, master/slave approach, while Couchbase Server has a peer-to-peer (P2P) topology.4.1 MongoDBMongoDB deployment architecture is very flexible and configurable, with a dominating hierarchicaltopology type. Final deployment properties with respect to system performance and availabilitydepend on the database administrator experience, various community best practices, and the finalcluster purpose.We will consider a sharded cluster here, suggested in the official MongoDB documentation as thedeployment best capable of scaling out. 1 (650) 265-2266engineering@altoros.comwww.altoros.com twitter.com/altorosClick for moreNoSQL research!4

A sharded cluster has the following components: shards, query routers, and config servers. Eachcomponent is installed on a separate server in a production environment.Shards store the data. To provide high availability and data consistency each shard is a replica set—anumber of mongod processes located on different physical machines. The replica set is itself ahierarchical structure having primary and secondary nodes within it.Query Routers, or mongos instances, interface with client applications and direct operations to theappropriate shard or shards. They process queries, target operations to appropriate shards and thenreturn results to the clients. A sharded cluster usually contains more than one query router to dividethe client request load. A client sends requests to one query router.Config servers store the cluster’s metadata. This data contains a mapping of the cluster’s data set tothe shards. The query router uses this metadata to target operations to specific shards. Productionsharded clusters have exactly three config servers.4.2 Couchbase ServerCouchbase Server has a flat topology with a single node type. All the nodes play the same role in acluster, that is, all the nodes are equal and communicate to each other on demand.One node is configured with several parameters as a single-node cluster. The other nodes join thecluster and pull its configuration. After that, the cluster is operated by connecting to any of the nodesvia Web UI or CLI. 1 (650) 265-2266engineering@altoros.comwww.altoros.com twitter.com/altorosClick for moreNoSQL research!5

Going deeper into the details, we see each node runs several processes logically grouped into a DataManager and Cluster Manager. The Data Manager serves client requests and stores data. TheCluster Manager is responsible for node and cluster-wide services management. It monitors eachnode in the cluster, detects node failures, performs node failovers and rebalance operations, etc.There are additional brief notes on this topic in Appendix A.4.3 PartitioningNoSQL databases were created to handle large data sets and high throughput applications thatchallenge the capacity of a single server. Users can scale up vertically by adding more resources to aserver, scale out horizontally by adding more servers, or both.Horizontal scaling is proving popular and treats servers as a commodity. It divides the data set anddistributes the data over multiple servers, with all the servers collectively making up a single logicaldatabase.The partitioning of data thus emerges as a key issue. Both Couchbase Server and MongoDB weredesigned with commodity hardware in mind and strongly rely on data partitioning. The two databaseprograms take different approaches in treating the data logically (viewed as in a single system),physically, and process-wise.This information, along with additional extensive notes, is found in Appendix A. 1 (650) 265-2266engineering@altoros.comwww.altoros.com twitter.com/altorosClick for moreNoSQL research!6

4.4 MongoDBMongoDB partitions data sets into chunks with a chunk storing all data within a specific range, andassigns chunks to shards. By default, it creates two 64 megabyte chunks per shard. When a chunk isfull, MongoDB splits it into two smaller 32 megabyte chunks. To ensure an even distribution of data,MongoDB will automatically migrate chunks between shards so that each shard has the same numberof chunks.This is why sharded clusters require routers and config servers. The config servers store clustermetadata including the mapping between chunks and shards. The routers are responsible formigrating chunks and updating the config servers. All read and write requests are sent to routersbecause only they can determine where the data resides.4.5 Couchbase ServerCouchbase Server partitions data into 1,024 virtual buckets, or vBuckets, and assigns them to nodes.Like MongoDB chunks, vBuckets store all data within a specific range. However, Couchbase Serverassigns all 1,024 virtual buckets when it is started, and it will not reassign vBuckets unless anadministrator initiates the rebalancing process.Couchbase Server clients maintain a cluster map that maps vBuckets to nodes. As a result, there isno need for routers or config servers. Clients communicate directly with nodes.4.6 SummaryImpact on AvailabilityCouchbase Server applications read and write directly to database nodes. However, MongoDBapplications read and write through a router. If the router becomes unavailable, the application canlose access to data.Couchbase Server topology configuration is shared by all nodes. However, MongoDB topologyconfiguration is maintained by config servers. If one or more config servers become unavailable, thetopology becomes static. MongoDB will not automatically rebalance data, nor will you be able to addor remove nodes.Impact on ScalabilityCouchbase Server is scaled by adding one or more nodes on demand. As such, it is easy to add asingle node to increase capacity when necessary. However, MongoDB is scaled by adding one or 1 (650) 265-2266engineering@altoros.comwww.altoros.com twitter.com/altorosClick for moreNoSQL research!7

more shards. Therefore, when it is necessary to increase capacity, administrators must add multiplenodes to create a replica set.Impact on PerformanceMongoDB recommends running routers next to application servers. However, as the number ofapplication servers increases, the number of routers increases and that can be a problem since therouters maintain open connections to MongoDB instances. Eventually, the routers will open too manyconnections and performance will degrade. The solution is to separate routers from applicationservers and reduce the number of them. While this alleviates the problem of too many openconnections, it limits users’ ability to scale routers by adding more.While MongoDB routers can send only one request over a connection, Couchbase Server clients cansend multiple requests over the same connection. Couchbase Server does not rely on the equivalentof routers, rather client send requests directly to nodes that own the data.MongoDB will migrate only one chunk at a time in order minimize the performance impact onapplications. However, migration happens automatically and will have some impact on performance.Couchbase Server will not migrate vBuckets unless an administrator initiates the rebalancing process.The enables administrators to plan when to rebalance data in order to avoid doing so during highworkloads.5. ReplicationWith multiple copies of data on different database servers, replication protects a database from theloss of a single server. Replication thus provides redundancy and increases data availability. It alsoallows recovery from hardware failures and service interruptions.In some cases, replication is used to increase read capacity. Clients have the ability to send read andwrite operations to different servers. Maintaining copies in different data centers increases the localityand availability of data for distributed applications.There are significant differences between how MongoDB and Couchbase Server handle thisfunctionality, stemming from their respective topologies.5.1 MongoDBIncreased data availability is achieved in MongoDB through Replica Set, which provides dataredundancy across multiple servers. Replica Set is a group of mongod instances that host the samedata set. One mongod, the primary, receives all data modification operations (insert / update / delete). 1 (650) 265-2266engineering@altoros.comwww.altoros.com twitter.com/altorosClick for moreNoSQL research!8

All other instances, secondaries, apply operations from the primary so that they have the same dataset.Because only one member can accept write operations, replica sets provide strict consistency for allreads from the primary. To support replication, the primary logs all data set changes in its oplog. Thesecondaries asynchronously replicate the primary’s oplog and apply the operations to their data sets.Secondaries’ data sets reflect the primary’s data set.It is possible for the client to set up Read Preference to read data from secondaries. This way theclient can balance loads from master to replicas, improving throughput and decreasing latency.However, as a result secondaries may not return the most recent data to the clients. Due to theasynchronous nature of the replication process, replica read preference guarantees only eventualconsistency for its data.FailoverWithin the replica set, members are interconnected with each other to exchange heartbeat messages.A crashed server with a missing heartbeat will be detected by other members and removed from thereplica set membership. After the dead secondary recovers, it can rejoin the cluster by connecting tothe primary, then catch up to the latest update.If a crash occurs over a lengthy period of time, where the change log from the primary doesn't coverthe whole crash period, then the recovered secondary needs to reload the whole data from theprimary as if it was a brand new server. 1 (650) 265-2266engineering@altoros.comwww.altoros.com twitter.com/altorosClick for moreNoSQL research!9

In case of a primary DB crash, a leader election protocol will be run among the remaining members tonominate the new primary, based on many factors such as the node priority, node uptime, etc. Aftergetting the majority vote, the new primary server will take its place.Geo-replicationMongoDB relies on standard replication to support multi data center deployments. In fact, a multi datacenter deployment consists of a single cluster that spans multiple data centers. It does so by placingprimaries and/or secondaries in different data centers. However, MongoDB is limited to unidirectionalreplication and does not support multi-master or active/active deployments. It can only replicate froma primary to one or more secondaries, and there can only be one primary per shard (or subset ofdata). While a subset of data can be read from multiple data centers, it can only be written to a singledata center.5.2 Couchbase ServerCouchbase Server nodes store both active and replica data. They execute read and write requests onthe active data, the subset of data they own, while maintaining copies of data owned by other nodes(replica data) to increase availability and durability. Couchbase Server does not implement primaryand secondary roles such that a node can store only active data or only replica data. When a clientwrites to a node in the cluster, Couchbase Server stores the data on that node, then distributes thedata to one or more nodes within a cluster.Replication within a single Couchbase Server cluster is configured through a single per-Bucketparameter—Replication Factor. As described earlier, Bucket is internally split into a number vBuckets;these vBuckets are evenly distributed across all the cluster nodes. Replica Factor is the number ofcopies of each vBucket that are evenly distributed across the cluster as well.The cluster administrator can influence the vMap only by specifying the rack awareness configuration,which gives vBucket system an additional clue on how to distribute the active and replica vBucketsacross the available cluster nodes.The following diagram shows two different nodes in a Couchbase cluster and illustrates how twonodes can store replica data for one another: 1 (650) 265-2266engineering@altoros.comwww.altoros.com twitter.com/altorosClick for moreNoSQL research!10

There are additional notes on Couchbase replication in Appendix A.FailoverFailover is the process in which a node in a Couchbase Server cluster is declared as unavailable, anda replica vBucket is promoted to an active state to start serving write operations.In case of node failure or cluster maintenance, the failover process contacts each server that wasacting as a replica and updates the internal table that maps client requests for documents to anavailable Couchbase Server.Failover can be performed manually or automatically using the built-in automatic failover process.Auto failover acts after a preset time, when a node in the cluster becomes unavailable.On the picture above, one server goes down and three replica vBuckets are promoted for masteringthe data after a failover process is finished. The load balance in this case is broken, as one server ismastering more data and gets more requests from clients. The rebalance process must be triggeredin this scenario. 1 (650) 265-2266engineering@altoros.comwww.altoros.com twitter.com/altorosClick for moreNoSQL research!11

Geo-replicationCouchbase Server implements a specialized protocol for geo-replication. Unlike MongoDB, a multidata center deployment of Couchbase Server consists of independent clusters that replicate data toeach other. It supports both unidirectional and bidirectional replication. In fact, any cluster can operatewithout any. As a result, it can be used not only for high availability and disaster recovery, but for fullread and write locality via multi-master or active/active deployments—all data can be read or writtenfrom any data center.5.3 SummaryImpact on availabilityCouchbase Server supports multi-master, or active/active, multi-data center deployments to providethe highest form of availability. However, MongoDB does not. MongoDB can only execute writes for aspecific subset of data in a single data center. With Couchbase Server, a failed data center wouldhave no impact on the application as requests could be immediately sent to a different data center.However, with MongoDB, applications would have to wait until it promoted one of its secondaries toprimary.Impact on scalabilityTo support a new geographic location, administrators simply deploy a new cluster and replicate itsdata via XDCR. However, with MongoDB, the existing cluster has to be updated by adding a newshard and extending existing shards.Because Couchbase Server deploys an independent cluster for each data center, there are nolimitations to how much data can be written to each data center. The clusters can be scaled out byadding more nodes with all nodes accepting writes.If MongoDB is deployed to multiple data centers by placing a primary node—a node that can performwrites—in each data center. That means the amount of data that can be written to each data center islimited by a single node. It is possible to add more primary nodes by sharding the cluster further, but itadds more maintenance, more complexity, and more overhead.Impact on performanceCouchbase Server can increase both read and write performance with geo-replication. It enables allapplications to be able to read and write all data to nearest data center. However, MongoDB is limitedto increasing read performance. While it can enable local reads, it still relies on remote writes andremote writes suffer from high latency.Detailed additional information on this topic and for the summary can be found in Appendix A. 1 (650) 265-2266engineering@altoros.comwww.altoros.com twitter.com/altorosClick for moreNoSQL research!12

6. CachingA memory cache improves database performance by keeping frequently accessed data in systemmemory for quick retrieval. Requests are received by the server and are checked against the memorycache. With NoSQL databases, issues of how data is stored in memory, the granularity of the data,and how the database communicates with the application and presentation levels are important toconsider.6.1 MongoDBMongoDB’s MMAP storage engine uses employs memory-mapped files that rely on the operatingsystem page cache, mapping 4096-byte pages of file data to memory. Its WiredTiger engine (part ofMongoDB since the acquisition of WiredTiger in December 2014 has an uncompressed block cacheas well as a compressed OS page cache.Writes remain in memory until MongoDB calls “fsync.” With the default storage engine, MMAP, that isevery 100 ms for the journal and 60 s for the data. With the WiredTiger storage engine, that is every60 s or 2 GB of data. If called data is not in memory, then MongoDB uses page faults (soft or hard) toread data.6.2 Couchbase ServerCouchbase Server employs a managed object cache to cache individual documents. Writes godirectly to the cache, with a pool of threads continuously flushing new and modified documents todisk. If data is not in memory, Couchbase reads from disk.An eviction process ensures there is room to read from disk and place in memory. Couchbase hastwo options here: only evict documents and retain all metadata (optimal when low latency is required),or evict documents and metadata (optimal with massive data sets and super-low latency access toworking docs is required).6.3 Impact on PerformanceMongoDB caching is coarse-grained, caching blocks of documents. Removing a block can thereforeremove many documents from memory. Performance drops when data is not in the OS page cache,and it’s possible to cache data twice (in block cache and in WiredTiger’s OS page cache. 1 (650) 265-2266engineering@altoros.comwww.altoros.com twitter.com/altorosClick for moreNoSQL research!13

In contrast, Couchbase Server caching is fine-grained, caching individual documents. Users canremove or add single documents to the cache, and the database does not rely on the OS page cache.Couchbase Server supports the memcached protocol, and there is no need for a separate cachingtier.Thus, there are options to optimize performance for small and large datasets with differing accesspatterns. In addition, because Couchbase Server does not rely on the OS page cache, maintenanceoperations that scan the entire data set will not impact the cache. For example, compaction will notresult in frequently read item being replaced with random items.7. ConclusionThis paper examined architecture and why it is important, from an enterprises’ initial analysis ofNoSQL database options through its actual deployment and day-to-day usage. Architecture clearlyhas a significant and direct impact on performance, scalability, and availability. The two databasesunder examination here may seem similar at first glance, as they are both NoSQL document-orienteddatabases designed to handle a variety of unstructured data.However, it is clear that depending on an enterprise’s unique needs—including the size of itsdatasets, how quickly data must be available and what percentage of the data must be availablequickly, and perhaps most important, the ability for the database to scale with the enterprise—there islikely a mission-critical need for an enterprise to choose the database that best fits the complexities ofits needs.At Altoros, we not only encourage companies to contact the respective technology vendors directlywith a long list of questions and requirements, but we are also available for deeper analysis of specificscenarios and use cases.8. Appendix A: Notes on Main Topics8.1 Notes on TopologyWhen mongos is installed on the MongoDB client machine, the two NoSQL solutions MongoDB andCouchbase begin to look more similar at the high level. In both deployments, for example, clientsappear to be aware of cluster topology.However, the Config Servers and Replica Sets for MongoDB require a much more complicatedinstallation and configuration procedure than a single-package installation for Couchbase Server. 1 (650) 265-2266engineering@altoros.comwww.altoros.com twitter.com/altorosClick for moreNoSQL research!14

8.2 Notes on PartitioningMongoDBLogical. MongoDB documents are logically grouped into collections. A collection can be consideredas an equivalent for an RDBMS table. Collections exist within databases, which are the physicalcontainers for data (sets of files on the filesystem). Collections do not enforce any schema.Besides an ordinary type of document collection, MongoDB also provides capped collections—fixedsize collections that support high-throughput operations that insert and retrieve documents based oninsertion order. Capped collections work like circular buffers: once a collection fills its allocated space,it makes room for new documents by overwriting the oldest documents in the collection. Cappedcollections are actually the only way to limit physical resources dedicated to a particular collection.The collection can be limited both in the maximum number of records and total data set size.Since Mo

The two databases under examination here offer profoundly different topologies. MongoDB takes a hierarchical, master/slave approach, while Couchbase Server has a peer-to-peer (P2P) topology. 4.1 MongoDB MongoDB deployment architecture is very flexible and configurable, with a dominating hierarchical topology type.